Classification, anomaly detection, self-supervised learning, and composite scoring for candidate prioritisation.
Earlier tutorials explained how MitraSETI finds coherent energy in time–frequency data: de-Doppler integration, clustering, and RFI-aware preprocessing turn a vast spectrogram into a list of candidates. This chapter is about what happens next. Machine learning does not replace the physics-first search; it prioritises and interprets what the search returns. The goal is to move from "here are thousands of blobs that passed a threshold" to "here is a short, ranked set of the most interesting things to look at first — and here is why the model thinks so."
A de-Doppler pipeline is excellent at answering: Is there a narrowband feature that moves through frequency in a way consistent with a particular drift model? That is a powerful filter, but it is not a complete story. The sky and the ground are full of structures that look "signal-like" in a spectrogram: RFI from electronics and communications, pulsars and other astrophysical periodic emitters, calibrator artifacts and gain steps, and instrumental quirks that leave faint tracks or stripes.
So the practical question becomes: given a candidate patch, what kind of thing is this — and how much should I care?
Classification asks: Is this RFI, a known astrophysical class, a calibrator artifact, or something that does not fit our labeled buckets? The model maps a 2D patch (and often metadata) to probabilities over classes. That is the familiar "supervised learning" picture: learn from examples that humans (or upstream heuristics) have labeled.
Anomaly detection asks a different question: Does this look like anything I have seen during training? Even when every class label is wrong or incomplete, the geometry of the internal representation can still tell you that a patch is weird — far from the bulk of training examples. For SETI, "weird" is not proof of aliens, but it is a rational hook for human attention: something may be mislabeled, rare, or genuinely new.
Without ML, a typical workflow stalls on human review. A single observation can yield thousands of candidates after clustering. Eyeballing each patch is slow, inconsistent, and does not scale to surveys. With ML in the loop, the pipeline can emit a short ranked list: top N patches by an interestingness score, with attention heatmaps and OOD flags as hooks for interpretation.
Think of ML as a triage nurse: it does not diagnose E.T.; it decides who gets seen first and what questions to ask.
You do not need prior ML coursework to understand a convolutional neural network (CNN). It is a recipe for turning an image into a summary that highlights local patterns and textures.
The input is a 2D array — a spectrogram patch cropped around a candidate. Rows are time (or time steps), columns are frequency channels, and cell values are power or a normalised intensity. For the network, it is just a grayscale image with height and width.
Imagine a tiny 3×3 stencil (a filter or kernel) laid on top of the patch. At each position, you multiply the nine stencil weights by the nine underlying pixel values and add the products. That single number becomes one entry in an output map. Then you slide the stencil one step (or several steps — a stride) and repeat.
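As a concrete illustration, here is a minimal NumPy sketch of that sliding-stencil arithmetic. The `conv2d_valid` helper and the edge-like kernel are made up for illustration; a real pipeline would use an optimised library implementation.

```python
import numpy as np

def conv2d_valid(patch, kernel):
    """Slide a small kernel over a 2D patch (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = patch.shape[0] - kh + 1
    out_w = patch.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the 3x3 window by the stencil weights and sum the products.
            out[i, j] = np.sum(patch[i:i + kh, j:j + kw] * kernel)
    return out

# A kernel that responds to intensity changes along the frequency axis.
edge_kernel = np.array([[-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0]])

patch = np.random.rand(16, 16)      # stand-in for a spectrogram patch
feature_map = conv2d_valid(patch, edge_kernel)
print(feature_map.shape)            # (14, 14)
```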
Intuition: If the stencil is shaped to respond to a bright–dark–bright transition along a direction, sliding it everywhere tells you where that transition occurs. Different stencils emphasise edges, gradients, blobs, or ripples.
One filter produces one 2D map. A typical layer uses dozens or hundreds of filters in parallel. Each map is a feature map: a heatmap of "how much this pattern is present here." Early layers often look like edge detectors; later maps are harder to visualise but encode richer compositions of those edges.
Pooling downsamples each feature map: for example, max pooling replaces each 2×2 block with the maximum value inside the block. Average pooling uses the mean instead.
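A minimal NumPy sketch of 2×2 max pooling (the helper name is made up for illustration):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Replace each non-overlapping 2x2 block with its maximum."""
    h, w = feature_map.shape
    h, w = h - h % 2, w - w % 2                      # trim odd edges
    blocks = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fm))
# [[ 5.  7.]
#  [13. 15.]]
```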
Analogy: Convolution is "look closely with many magnifying glasses"; pooling is "step back and keep only the strongest local evidence." You trade spatial resolution for broader context and fewer parameters downstream.
A deep CNN alternates convolution → nonlinearity → pooling (with modern variants sometimes adjusting this pattern). The stack builds a hierarchy: early layers respond to edges and simple textures, while deeper layers combine them into larger, more abstract structures.
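Assuming a PyTorch-style implementation (the layer widths here are illustrative, not MitraSETI's actual configuration), the alternating pattern looks like this:

```python
import torch
import torch.nn as nn

# Illustrative stack: each block halves spatial resolution and widens channels.
cnn_stack = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # local 3x3 stencils
    nn.ReLU(),
    nn.MaxPool2d(2),                              # keep strongest local evidence
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
)

x = torch.randn(1, 1, 128, 128)    # (batch, channel, time, frequency)
print(cnn_stack(x).shape)          # torch.Size([1, 32, 32, 32])
```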
For spectrograms, a drifting narrowband line often appears as energy aligned along a slanted ridge in time–frequency. Ridges are built from oriented intensity changes — the same family of structures CNNs were designed to detect in photographs.
Drift lines are oriented structures. A CNN does not "know" Doppler; it detects local alignment of energy. A slanted track produces a characteristic pattern of edges at a specific angle repeated along the track. Stacked convolutions integrate that local evidence into longer, coherent structures without hand-crafting a Hough transform for every pipeline variant.
Picture a column of optical lenses, each coated or shaped to emphasise different aspects of a scene: one highlights vertical contrast, another horizontal, another fine speckle. After passing through the stack, you are not looking at the raw scene — you are looking at a stack of abstract views optimised (by training) for the task. A CNN is that stack, implemented as learned filters rather than hand-designed glass.
CNNs are local by construction: a 3×3 filter only ever sees a small neighbourhood at once. Deeper layers grow the effective field of view, but long-range relationships may require many layers or large kernels.
Transformers take a different approach: they let every position talk to every other position in one conceptual step, mediated by attention.
Split the input (after embedding) into a sequence of tokens — here, often time slices or spatial tokens derived from the CNN backbone. For each token, self-attention asks: "When I interpret this slice, which other slices should I read carefully?"
The model learns three linear projections of each token: queries (Q), keys (K), and values (V). Conceptually:
Compatibility scores come from matching queries to keys; those scores become weights that mix the values.
Q·Kᵀ measures similarity between every query position and every key position. Dividing by √dₖ stabilises gradients when the dimension dₖ is large. Softmax turns scores into positive weights that sum to one per query row. Multiplying by V mixes the content vectors according to those weights.
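Put together, that is the standard scaled dot-product attention, softmax(Q·Kᵀ/√dₖ)·V. A minimal NumPy sketch of a single head, with toy shapes chosen only for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # similarity of every query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: positive, sums to 1 per row
    return weights @ V, weights

# Toy example: 6 tokens (e.g. time slices), 8-dimensional embeddings.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
out, attn = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape, attn.shape)   # (6, 8) (6, 6)
```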
Analogy: Imagine each time slice holds a short memo. Every memo simultaneously votes on which other memos are relevant to reading it. The softmax is the committee vote; the output is a new memo that blends others according to that vote.
Multi-head attention runs several attention blocks in parallel with different learned projections. Each head can specialise: one might emphasise slow global envelopes, another rapid on–off structure, another correlations between widely separated frequency bands (depending on how tokens are defined).
A drifting or modulated signal can create dependencies across distant time steps: phase coherence, intermittent on–off patterns, or slow amplitude drift. A CNN must compose many local steps to link far-apart evidence. A Transformer can, in principle, draw a direct link between distant regions if the data support it.
Attention on its own is permutation-invariant in the sense that shuffling tokens would shuffle outputs in the same way — unless you tell the model where each token was. Positional encodings (learned or fixed sinusoidal) are added to token embeddings so that "time order" and "frequency order" are explicit.
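A common fixed choice is the sinusoidal encoding from the original Transformer paper; a minimal sketch, with dimensions chosen only for illustration:

```python
import numpy as np

def sinusoidal_positions(n_tokens, dim):
    """Fixed sinusoidal positional encodings (Vaswani et al., 2017)."""
    pos = np.arange(n_tokens)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

# Added to token embeddings so the model knows where each slice sits in time.
embeddings = np.random.rand(6, 8) + sinusoidal_positions(6, 8)
```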
Analogy: Self-attention is a round-table where everyone can hear everyone — but name tags (positions) prevent the room from confusing who spoke when.
MitraSETI's ML stack is a hybrid: a CNN backbone extracts local feature maps from the spectrogram patch, those maps are turned into tokens for a Transformer encoder, and a small head sits on top of the encoded sequence. It uses the strengths of both families.
The head typically produces: class probabilities over the labeled buckets (including an RFI probability), a confidence estimate, and an embedding that the anomaly-detection and scoring stages reuse.
Big picture: CNNs supply local evidence; Transformers integrate evidence across the whole patch. Together they model both texture and global temporal structure.
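A minimal sketch of how such a hybrid can be wired together, assuming a PyTorch implementation. The channel counts, head count, class count, and mean pooling are illustrative assumptions, not MitraSETI's actual architecture, and positional encodings are omitted for brevity:

```python
import torch
import torch.nn as nn

class HybridClassifier(nn.Module):
    """Sketch: CNN backbone -> tokens -> Transformer encoder -> head."""

    def __init__(self, n_classes=4, d_model=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, d_model, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.class_head = nn.Linear(d_model, n_classes)

    def forward(self, x):                          # x: (batch, 1, time, freq)
        fmap = self.backbone(x)                    # (batch, d_model, t, f)
        tokens = fmap.flatten(2).transpose(1, 2)   # (batch, t*f, d_model)
        encoded = self.encoder(tokens)             # positional encodings omitted here
        embedding = encoded.mean(dim=1)            # pooled representation (reused for OOD)
        return self.class_head(embedding), embedding

logits, emb = HybridClassifier()(torch.randn(2, 1, 64, 64))
print(logits.shape, emb.shape)    # torch.Size([2, 4]) torch.Size([2, 64])
```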
Supervised classification needs labeled examples: "RFI," "narrowband drifting," "pulsar-like," "unknown," and so on. SETI faces a structural difficulty:
We have no verified extraterrestrial engineering signals to use as a positive class. Any pipeline that only learns "E.T. vs not E.T." from real positives is stuck at zero true positives in the training set. This is a structural absence, not a data-collection failure — it shapes the entire ML strategy.
SimCLR (Chen et al., 2020 — "A Simple Framework for Contrastive Learning of Visual Representations") is a clean story you can implement and reason about.
Learn an embedding function f(·) such that two augmented views of the same image land near each other in embedding space, while views from different images land far apart. No class labels are required for that step.
Analogy: Show someone two photos of the same person on different days (glasses on/off, different lighting). A good identity embedding ignores nuisance variation but preserves identity. SimCLR does that for patches, using augmentations as controlled nuisances.
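The objective SimCLR uses for this is the NT-Xent (normalised temperature-scaled cross-entropy) loss. A minimal PyTorch sketch, assuming `z1` and `z2` are embeddings of two augmented views of the same batch of patches:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss used by SimCLR (minimal form)."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # 2N unit-length embeddings
    sim = z @ z.T / temperature                          # cosine similarities
    n = z1.shape[0]
    # For each view, the positive is the other augmented view of the same patch.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    sim.fill_diagonal_(float('-inf'))                    # never match a view with itself
    return F.cross_entropy(sim, targets)

# Toy usage: embeddings of two augmented views of the same 8 patches.
loss = nt_xent_loss(torch.randn(8, 64), torch.randn(8, 64))
```

The augmentations that generate the two views are domain-specific; typical choices for spectrogram patches are summarised below.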
| Augmentation | What it does | Why it helps |
|---|---|---|
| Frequency shift | Shifts content along the frequency axis | Real signals appear at arbitrary RF channels |
| Time crop + resize | Random temporal window, rescaled | Observations differ in length and cadence |
| Gaussian noise | Adds noise to pixel values | Sensitivity and system temperature vary |
| Channel masking | Zeros random frequency channels | Mimics RFI flagging and dropped bands |
| Brightness / contrast | Scales dynamic range | Gain calibration and processing change appearance |
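A minimal NumPy sketch of a few of these augmentations (the shift range, mask counts, and noise level are illustrative, not tuned values):

```python
import numpy as np

rng = np.random.default_rng()

def augment(patch):
    """Cheap augmentations from the table: frequency shift, channel masking, noise."""
    view = np.roll(patch, rng.integers(-8, 9), axis=1)     # shift along frequency axis
    n_masked = rng.integers(1, 4)
    for ch in rng.choice(view.shape[1], size=n_masked, replace=False):
        view[:, ch] = 0.0                                  # mimic flagged / dropped channels
    return view + rng.normal(0.0, 0.05, size=view.shape)   # Gaussian noise

patch = np.random.rand(64, 64)                    # (time, frequency)
view_a, view_b = augment(patch), augment(patch)   # two views of the same candidate
```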
After pre-training, the backbone often encodes semantically useful structure — continuity vs speckle, line-like vs blob-like — before any classifier head is trained.
Add a classification head on top of the backbone and train with labeled data (even if labels are noisy or incomplete). Pre-training tends to reduce the amount of hand-labeling needed to reach usable accuracy.
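A minimal fine-tuning sketch, assuming a PyTorch workflow; `pretrained_backbone`, the class count, and the stand-in data are hypothetical placeholders:

```python
import torch
import torch.nn as nn

# Stand-in for the SimCLR-pretrained encoder; a real backbone would be the CNN/Transformer stack.
pretrained_backbone = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128))
head = nn.Linear(128, 4)                    # e.g. RFI / narrowband / pulsar-like / unknown

for p in pretrained_backbone.parameters():
    p.requires_grad = False                 # optionally freeze the pretrained weights

optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

patches, labels = torch.randn(16, 1, 64, 64), torch.randint(0, 4, (16,))
loss = loss_fn(head(pretrained_backbone(patches)), labels)
loss.backward()
optimizer.step()
```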
If the model outputs "candidate" or a high score, a scientist reasonably asks: on what evidence? Without tools, deep networks feel opaque.
Self-attention weights (or derived relevance maps) approximate where the model looked when mixing information across tokens. For MitraSETI-style models, you can map those weights back onto time–frequency coordinates.
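A minimal sketch of that back-projection, assuming attention weights have already been collected from the encoder and the tokens lie on a regular time–frequency grid (all shapes here are illustrative):

```python
import numpy as np

def attention_heatmap(attn_weights, token_grid, patch_shape):
    """Project per-token attention back onto time-frequency pixels.

    attn_weights: (n_heads, n_tokens, n_tokens) collected from the encoder.
    token_grid:   (t_tokens, f_tokens) layout of the tokens on the patch.
    patch_shape:  (time_bins, freq_channels) of the original spectrogram patch.
    """
    relevance = attn_weights.mean(axis=(0, 1))       # average over heads and queries
    grid = relevance.reshape(token_grid)             # back onto the token layout
    reps = (patch_shape[0] // token_grid[0], patch_shape[1] // token_grid[1])
    return np.kron(grid, np.ones(reps))              # nearest-neighbour upsample

heatmap = attention_heatmap(np.random.rand(4, 256, 256), (16, 16), (64, 64))
print(heatmap.shape)    # (64, 64)
```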
Attention maps are not causal proofs — they are diagnostics, like residuals in a fit.
Standard softmax classifiers assume one of the training classes is correct. OOD detection asks whether the input belongs outside the training manifold at all.
A common implementation theme is: embed the patch with the trained encoder, then compare the embedding to training statistics (class centroids, Gaussian models, kNN density in embedding space, or energy-based scores). High OOD scores mean far from familiar — not "alien," but worth a second look.
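One of the simplest variants is a k-nearest-neighbour distance in embedding space; a minimal sketch with made-up shapes and data:

```python
import numpy as np

def knn_ood_score(embedding, train_embeddings, k=10):
    """Distance to the k-th nearest training embedding: larger = less familiar."""
    dists = np.linalg.norm(train_embeddings - embedding, axis=1)
    return np.sort(dists)[k - 1]

# Hypothetical usage: train_embeddings come from the trained encoder on training patches.
train_embeddings = np.random.rand(5000, 64)
candidate_embedding = np.random.rand(64)
score = knn_ood_score(candidate_embedding, train_embeddings)
```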
OOD complements classification: a patch might be classified as a vague "RFI-like" class yet sit far from all training examples — suggesting novel interference or pipeline edge cases that merit review. In MitraSETI, OOD feeds the interestingness score as an anomaly term.
MitraSETI combines several physics- and ML-derived cues into a 0–100 interestingness score for automated prioritisation: process thousands of observations, then surface the top ten (or top k) for human follow-up.
The interestingness score is a weighted combination of six physics- and ML-derived components, with default weights that sum to 1.0.
| Component | Weight | What it measures |
|---|---|---|
| SNR significance | 0.25 | Log-scaled signal strength relative to noise |
| Drift meaningfulness | 0.20 | Whether drift sits in a physically motivated band |
| RFI cleanliness | 0.20 | 1 − P_RFI from the model or filters |
| OOD anomaly | 0.15 | Novelty of the representation |
| Confidence | 0.10 | Model certainty (used carefully — overconfidence can mislead) |
| Cadence survival | 0.10 | Passed ON/OFF or related cadence tests |
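A minimal sketch of how such a weighted combination can be computed, using the weights from the table. The component names, the assumption that each component is pre-scaled to [0, 1], and the example values are illustrative, not MitraSETI's actual code:

```python
# Weights mirror the table above; names and normalisation are illustrative.
WEIGHTS = {
    "snr": 0.25,         # log-scaled SNR significance
    "drift": 0.20,       # drift in a physically motivated band
    "rfi_clean": 0.20,   # 1 - P_RFI
    "ood": 0.15,         # novelty of the representation
    "confidence": 0.10,  # model certainty
    "cadence": 0.10,     # ON/OFF cadence survival
}

def interestingness(components):
    """Weighted 0-100 score; each component is expected to be pre-scaled to [0, 1]."""
    return 100.0 * sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)

candidate = {"snr": 0.8, "drift": 0.9, "rfi_clean": 0.7, "ood": 0.4,
             "confidence": 0.6, "cadence": 1.0}
print(round(interestingness(candidate), 1))   # 74.0
```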
Drift rate is a strong prior in radio SETI narratives: many hypothesised exoplanet transmitters introduce modest Doppler drift, so the drift-meaningfulness term rewards drift rates that fall within a physically motivated band.
The band edges encode pragmatic triage, not a theorem about E.T.
Treat the overall score as ranking pressure, not a truth ordering. The best use is uniform processing, consistent metrics, and transparent components, so humans can override when domain knowledge disagrees with a weight.
Some astrophysical and human-made emitters are periodic: pulsars, rotating beacons, swept interferers with stable periods. A continuous de-Doppler integrator is tuned for smooth drift along time; it can under-emphasise signals that turn on and off or phase-reset in ways that look like broadband modulation in a naive transform.
Combining de-Doppler narrowband search with FFT-based periodicity in one open-source SETI-oriented tool is relatively novel: many pipelines emphasise one family of models. MitraSETI's integration supports a broader hypothesis class — still not exhaustive, but strictly richer than drift-only integration.
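A minimal sketch of an FFT-based periodicity check on a candidate's power-versus-time series; the significance proxy and the toy signal are illustrative, not MitraSETI's actual test:

```python
import numpy as np

def periodicity_score(power_vs_time, dt):
    """FFT-based check for periodic on/off structure in a power time series."""
    detrended = power_vs_time - power_vs_time.mean()
    spectrum = np.abs(np.fft.rfft(detrended)) ** 2
    freqs = np.fft.rfftfreq(len(detrended), d=dt)
    peak = spectrum[1:].argmax() + 1          # skip the DC bin
    # Peak strength relative to the median of the spectrum: rough significance proxy.
    return freqs[peak], spectrum[peak] / np.median(spectrum[1:])

# Toy example: 1-second samples of a signal that switches on and off every ~32 s.
t = np.arange(512)
series = (np.sin(2 * np.pi * t / 32) > 0).astype(float) + 0.1 * np.random.randn(512)
period_freq, strength = periodicity_score(series, dt=1.0)
print(1.0 / period_freq)    # ~32 s period
```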
De-Doppler and clustering propose candidates; ML characterises them, flags novelty, and ranks them under explicit scoring rules. CNNs are local pattern engines; Transformers relate distant time–frequency evidence; SimCLR bootstraps representation learning when labels are scarce; attention maps illuminate model focus; OOD and periodicity modules widen the net beyond a single signal model.
None of this replaces careful observing, instrument characterization, or skeptical statistics. It channels human attention where it matters — exactly what automated SETI survey pipelines need at scale.
The full CNN + Transformer classifier runs in the cloud on every upload — no GPU setup required.