Classification, anomaly detection, self-supervised learning, and composite scoring for candidate prioritisation.
Earlier tutorials explained how MitraSETI finds coherent energy in time–frequency data: de-Doppler integration, clustering, and RFI-aware preprocessing turn a vast spectrogram into a list of candidates. This chapter is about what happens next. Machine learning does not replace the physics-first search; it prioritises and interprets what the search returns. The goal is to move from "here are thousands of blobs that passed a threshold" to "here is a short, ranked set of the most interesting things to look at first — and here is why the model thinks so."
A de-Doppler pipeline is excellent at answering: Is there a narrowband feature that moves through frequency in a way consistent with a particular drift model? That is a powerful filter, but it is not a complete story. The sky and the ground are full of structures that look "signal-like" in a spectrogram: RFI from electronics and communications, pulsars and other astrophysical periodic emitters, calibrator artifacts and gain steps, and instrumental quirks that leave faint tracks or stripes.
So the practical question becomes: given a candidate patch, what kind of thing is this — and how much should I care?
Classification asks: Is this RFI, a known astrophysical class, a calibrator artifact, or something that does not fit our labeled buckets? The model maps a 2D patch (and often metadata) to probabilities over classes. That is the familiar "supervised learning" picture: learn from examples that humans (or upstream heuristics) have labeled.
Anomaly detection asks a different question: Does this look like anything I have seen during training? Even when every class label is wrong or incomplete, the geometry of the internal representation can still tell you that a patch is weird — far from the bulk of training examples. For SETI, "weird" is not proof of aliens, but it is a rational hook for human attention: something may be mislabeled, rare, or genuinely new.
Without ML, a typical workflow stalls on human review. A single observation can yield thousands of candidates after clustering. Eyeballing each patch is slow, inconsistent, and does not scale to surveys. With ML in the loop, the pipeline can emit a short ranked list: top N patches by an interestingness score, with attention heatmaps and OOD flags as hooks for interpretation.
Think of ML as a triage nurse: it does not diagnose E.T.; it decides who gets seen first and what questions to ask.
You do not need prior ML coursework to understand a convolutional neural network (CNN). It is a recipe for turning an image into a summary that highlights local patterns and textures.
The input is a 2D array — a spectrogram patch cropped around a candidate. Rows are time (or time steps), columns are frequency channels, and cell values are power or a normalised intensity. For the network, it is just a grayscale image with height and width.
Imagine a tiny 3×3 stencil (a filter or kernel) laid on top of the patch. At each position, you multiply the nine stencil weights by the nine underlying pixel values and add the products. That single number becomes one entry in an output map. Then you slide the stencil one step (or several steps — a stride) and repeat.
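As a concrete illustration, here is a minimal NumPy sketch of that sliding-stencil arithmetic. The `conv2d_valid` helper and the edge-like kernel are made up for illustration; a real pipeline would use an optimised library implementation.

```python
import numpy as np

def conv2d_valid(patch, kernel):
    """Slide a small kernel over a 2D patch (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = patch.shape[0] - kh + 1
    out_w = patch.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the 3x3 window by the stencil weights and sum the products.
            out[i, j] = np.sum(patch[i:i + kh, j:j + kw] * kernel)
    return out

# A kernel that responds to intensity changes along the frequency axis.
edge_kernel = np.array([[-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0]])

patch = np.random.rand(16, 16)      # stand-in for a spectrogram patch
feature_map = conv2d_valid(patch, edge_kernel)
print(feature_map.shape)            # (14, 14)
```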
Intuition: If the stencil is shaped to respond to a bright–dark–bright transition along a direction, sliding it everywhere tells you where that transition occurs. Different stencils emphasise edges, gradients, blobs, or ripples.
One filter produces one 2D map. A typical layer uses dozens or hundreds of filters in parallel. Each map is a feature map: a heatmap of "how much this pattern is present here." Early layers often look like edge detectors; later maps are harder to visualise but encode richer compositions of those edges.
Pooling downsamples each feature map: for example, max pooling replaces each 2×2 block with the maximum value inside the block. Average pooling uses the mean instead.
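A minimal NumPy sketch of 2×2 max pooling (the helper name is made up for illustration):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Replace each non-overlapping 2x2 block with its maximum."""
    h, w = feature_map.shape
    h, w = h - h % 2, w - w % 2                      # trim odd edges
    blocks = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fm))
# [[ 5.  7.]
#  [13. 15.]]
```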
Analogy: Convolution is "look closely with many magnifying glasses"; pooling is "step back and keep only the strongest local evidence." You trade spatial resolution for broader context and fewer parameters downstream.
A deep CNN alternates convolution → nonlinearity → pooling (with modern variants sometimes adjusting this pattern). The stack builds a hierarchy: early layers respond to edges and simple textures, while deeper layers combine them into larger, more abstract structures.
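Assuming a PyTorch-style implementation (the layer widths here are illustrative, not MitraSETI's actual configuration), the alternating pattern looks like this:

```python
import torch
import torch.nn as nn

# Illustrative stack: each block halves spatial resolution and widens channels.
cnn_stack = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # local 3x3 stencils
    nn.ReLU(),
    nn.MaxPool2d(2),                              # keep strongest local evidence
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
)

x = torch.randn(1, 1, 128, 128)    # (batch, channel, time, frequency)
print(cnn_stack(x).shape)          # torch.Size([1, 32, 32, 32])
```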
For spectrograms, a drifting narrowband line often appears as energy aligned along a slanted ridge in time–frequency. Ridges are built from oriented intensity changes — the same family of structures CNNs were designed to detect in photographs.
Drift lines are oriented structures. A CNN does not "know" Doppler; it detects local alignment of energy. A slanted track produces a characteristic pattern of edges at a specific angle repeated along the track. Stacked convolutions integrate that local evidence into longer, coherent structures without hand-crafting a Hough transform for every pipeline variant.
Picture a column of optical lenses, each coated or shaped to emphasise different aspects of a scene: one highlights vertical contrast, another horizontal, another fine speckle. After passing through the stack, you are not looking at the raw scene — you are looking at a stack of abstract views optimised (by training) for the task. A CNN is that stack, implemented as learned filters rather than hand-designed glass.
CNNs are local by construction: a 3×3 filter only ever sees a small neighbourhood at once. Deeper layers grow the effective field of view, but long-range relationships may require many layers or large kernels.
Transformers take a different approach: they let every position talk to every other position in one conceptual step, mediated by attention.
Split the input (after embedding) into a sequence of tokens — here, often time slices or spatial tokens derived from the CNN backbone. For each token, self-attention asks: "When I interpret this slice, which other slices should I read carefully?"
The model learns three linear projections of each token: queries (Q), keys (K), and values (V). Conceptually:
Compatibility scores come from matching queries to keys; those scores become weights that mix the values.
Q·Kᵀ measures similarity between every query position and every key position. Dividing by √dₖ stabilises gradients when the dimension dₖ is large. Softmax turns scores into positive weights that sum to one per query row. Multiplying by V mixes the content vectors according to those weights.
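Put together, that is the standard scaled dot-product attention, softmax(Q·Kᵀ/√dₖ)·V. A minimal NumPy sketch of a single head, with toy shapes chosen only for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # similarity of every query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: positive, sums to 1 per row
    return weights @ V, weights

# Toy example: 6 tokens (e.g. time slices), 8-dimensional embeddings.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
out, attn = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape, attn.shape)   # (6, 8) (6, 6)
```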
Analogy: Imagine each time slice holds a short memo. Every memo simultaneously votes on which other memos are relevant to reading it. The softmax is the committee vote; the output is a new memo that blends others according to that vote.
Multi-head attention runs several attention blocks in parallel with different learned projections. Each head can specialise: one might emphasise slow global envelopes, another rapid on–off structure, another correlations between widely separated frequency bands (depending on how tokens are defined).
A drifting or modulated signal can create dependencies across distant time steps: phase coherence, intermittent on–off patterns, or slow amplitude drift. A CNN must compose many local steps to link far-apart evidence. A Transformer can, in principle, draw a direct link between distant regions if the data support it.
Attention on its own is permutation-invariant in the sense that shuffling tokens would shuffle outputs in the same way — unless you tell the model where each token was. Positional encodings (learned or fixed sinusoidal) are added to token embeddings so that "time order" and "frequency order" are explicit.
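A common fixed choice is the sinusoidal encoding from the original Transformer paper; a minimal sketch, with dimensions chosen only for illustration:

```python
import numpy as np

def sinusoidal_positions(n_tokens, dim):
    """Fixed sinusoidal positional encodings (Vaswani et al., 2017)."""
    pos = np.arange(n_tokens)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

# Added to token embeddings so the model knows where each slice sits in time.
embeddings = np.random.rand(6, 8) + sinusoidal_positions(6, 8)
```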
Analogy: Self-attention is a round-table where everyone can hear everyone — but name tags (positions) prevent the room from confusing who spoke when.
MitraSETI's ML stack is a hybrid: a CNN backbone extracts local feature maps from the spectrogram patch, those maps are turned into tokens for a Transformer encoder, and a small head sits on top of the encoded sequence. It uses the strengths of both families.
The head typically produces: class probabilities over the labeled buckets (including an RFI probability), a confidence estimate, and an embedding that the anomaly-detection and scoring stages reuse.
Big picture: CNNs supply local evidence; Transformers integrate evidence across the whole patch. Together they model both texture and global temporal structure.
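A minimal sketch of how such a hybrid can be wired together, assuming a PyTorch implementation. The channel counts, head count, class count, and mean pooling are illustrative assumptions, not MitraSETI's actual architecture, and positional encodings are omitted for brevity:

```python
import torch
import torch.nn as nn

class HybridClassifier(nn.Module):
    """Sketch: CNN backbone -> tokens -> Transformer encoder -> head."""

    def __init__(self, n_classes=4, d_model=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, d_model, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.class_head = nn.Linear(d_model, n_classes)

    def forward(self, x):                          # x: (batch, 1, time, freq)
        fmap = self.backbone(x)                    # (batch, d_model, t, f)
        tokens = fmap.flatten(2).transpose(1, 2)   # (batch, t*f, d_model)
        encoded = self.encoder(tokens)             # positional encodings omitted here
        embedding = encoded.mean(dim=1)            # pooled representation (reused for OOD)
        return self.class_head(embedding), embedding

logits, emb = HybridClassifier()(torch.randn(2, 1, 64, 64))
print(logits.shape, emb.shape)    # torch.Size([2, 4]) torch.Size([2, 64])
```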
Supervised classification needs labeled examples: "RFI," "narrowband drifting," "pulsar-like," "unknown," and so on. SETI faces a structural difficulty:
We have no verified extraterrestrial engineering signals to use as a positive class. Any pipeline that only learns "E.T. vs not E.T." from real positives is stuck at zero true positives in the training set. This is a structural absence, not a data-collection failure — it shapes the entire ML strategy.
SimCLR (Chen et al., 2020 — "A Simple Framework for Contrastive Learning of Visual Representations") is a clean story you can implement and reason about.
Learn an embedding function f(·) such that two augmented views of the same image land near each other in embedding space, while views from different images land far apart. No class labels are required for that step.
Analogy: Show someone two photos of the same person on different days (glasses on/off, different lighting). A good identity embedding ignores nuisance variation but preserves identity. SimCLR does that for patches, using augmentations as controlled nuisances.
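The objective SimCLR uses for this is the NT-Xent (normalised temperature-scaled cross-entropy) loss. A minimal PyTorch sketch, assuming `z1` and `z2` are embeddings of two augmented views of the same batch of patches:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss used by SimCLR (minimal form)."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # 2N unit-length embeddings
    sim = z @ z.T / temperature                          # cosine similarities
    n = z1.shape[0]
    # For each view, the positive is the other augmented view of the same patch.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    sim.fill_diagonal_(float('-inf'))                    # never match a view with itself
    return F.cross_entropy(sim, targets)

# Toy usage: embeddings of two augmented views of the same 8 patches.
loss = nt_xent_loss(torch.randn(8, 64), torch.randn(8, 64))
```

The augmentations that generate the two views are domain-specific; typical choices for spectrogram patches are summarised below.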
| Augmentation | What it does | Why it helps |
|---|---|---|
| Frequency shift | Shifts content along the frequency axis | Real signals appear at arbitrary RF channels |
| Time crop + resize | Random temporal window, rescaled | Observations differ in length and cadence |
| Gaussian noise | Adds noise to pixel values | Sensitivity and system temperature vary |
| Channel masking | Zeros random frequency channels | Mimics RFI flagging and dropped bands |
| Brightness / contrast | Scales dynamic range | Gain calibration and processing change appearance |
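A minimal NumPy sketch of a few of these augmentations (the shift range, mask counts, and noise level are illustrative, not tuned values):

```python
import numpy as np

rng = np.random.default_rng()

def augment(patch):
    """Cheap augmentations from the table: frequency shift, channel masking, noise."""
    view = np.roll(patch, rng.integers(-8, 9), axis=1)     # shift along frequency axis
    n_masked = rng.integers(1, 4)
    for ch in rng.choice(view.shape[1], size=n_masked, replace=False):
        view[:, ch] = 0.0                                  # mimic flagged / dropped channels
    return view + rng.normal(0.0, 0.05, size=view.shape)   # Gaussian noise

patch = np.random.rand(64, 64)                    # (time, frequency)
view_a, view_b = augment(patch), augment(patch)   # two views of the same candidate
```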
After pre-training, the backbone often encodes semantically useful structure — continuity vs speckle, line-like vs blob-like — before any classifier head is trained.
Add a classification head on top of the backbone and train with labeled data (even if labels are noisy or incomplete). Pre-training tends to reduce the amount of hand-labeling needed to reach usable accuracy.
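A minimal fine-tuning sketch, assuming a PyTorch workflow; `pretrained_backbone`, the class count, and the stand-in data are hypothetical placeholders:

```python
import torch
import torch.nn as nn

# Stand-in for the SimCLR-pretrained encoder; a real backbone would be the CNN/Transformer stack.
pretrained_backbone = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128))
head = nn.Linear(128, 4)                    # e.g. RFI / narrowband / pulsar-like / unknown

for p in pretrained_backbone.parameters():
    p.requires_grad = False                 # optionally freeze the pretrained weights

optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

patches, labels = torch.randn(16, 1, 64, 64), torch.randint(0, 4, (16,))
loss = loss_fn(head(pretrained_backbone(patches)), labels)
loss.backward()
optimizer.step()
```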
If the model outputs "candidate" or a high score, a scientist reasonably asks: on what evidence? Without tools, deep networks feel opaque.
Self-attention weights (or derived relevance maps) approximate where the model looked when mixing information across tokens. For MitraSETI-style models, you can map those weights back onto time–frequency coordinates.
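A minimal sketch of that back-projection, assuming attention weights have already been collected from the encoder and the tokens lie on a regular time–frequency grid (all shapes here are illustrative):

```python
import numpy as np

def attention_heatmap(attn_weights, token_grid, patch_shape):
    """Project per-token attention back onto time-frequency pixels.

    attn_weights: (n_heads, n_tokens, n_tokens) collected from the encoder.
    token_grid:   (t_tokens, f_tokens) layout of the tokens on the patch.
    patch_shape:  (time_bins, freq_channels) of the original spectrogram patch.
    """
    relevance = attn_weights.mean(axis=(0, 1))       # average over heads and queries
    grid = relevance.reshape(token_grid)             # back onto the token layout
    reps = (patch_shape[0] // token_grid[0], patch_shape[1] // token_grid[1])
    return np.kron(grid, np.ones(reps))              # nearest-neighbour upsample

heatmap = attention_heatmap(np.random.rand(4, 256, 256), (16, 16), (64, 64))
print(heatmap.shape)    # (64, 64)
```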
Attention maps are not causal proofs — they are diagnostics, like residuals in a fit.
Standard softmax classifiers assume one of the training classes is correct. OOD detection asks whether the input belongs outside the training manifold at all.
A common implementation theme is: embed the patch with the trained encoder, then compare the embedding to training statistics (class centroids, Gaussian models, kNN density in embedding space, or energy-based scores). High OOD scores mean far from familiar — not "alien," but worth a second look.
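One of the simplest variants is a k-nearest-neighbour distance in embedding space; a minimal sketch with made-up shapes and data:

```python
import numpy as np

def knn_ood_score(embedding, train_embeddings, k=10):
    """Distance to the k-th nearest training embedding: larger = less familiar."""
    dists = np.linalg.norm(train_embeddings - embedding, axis=1)
    return np.sort(dists)[k - 1]

# Hypothetical usage: train_embeddings come from the trained encoder on training patches.
train_embeddings = np.random.rand(5000, 64)
candidate_embedding = np.random.rand(64)
score = knn_ood_score(candidate_embedding, train_embeddings)
```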
OOD complements classification: a patch might be classified as a vague "RFI-like" class yet sit far from all training examples — suggesting novel interference or pipeline edge cases that merit review. In MitraSETI, OOD feeds the interestingness score as an anomaly term.
MitraSETI combines several physics- and ML-derived cues into a 0–100 interestingness score for automated prioritisation: process thousands of observations, then surface the top ten (or top k) for human follow-up.
The interestingness score is a weighted combination of six physics- and ML-derived components, with default weights that sum to 1.0.
| Component | Weight | What it measures |
|---|---|---|
| SNR significance | 0.25 | Log-scaled signal strength relative to noise |
| Drift meaningfulness | 0.20 | Whether drift sits in a physically motivated band |
| RFI cleanliness | 0.20 | 1 − P_RFI from the model or filters |
| OOD anomaly | 0.15 | Novelty of the representation |
| Confidence | 0.10 | Model certainty (used carefully — overconfidence can mislead) |
| Cadence survival | 0.10 | Passed ON/OFF or related cadence tests |
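A minimal sketch of how such a weighted combination can be computed, using the weights from the table. The component names, the assumption that each component is pre-scaled to [0, 1], and the example values are illustrative, not MitraSETI's actual code:

```python
# Weights mirror the table above; names and normalisation are illustrative.
WEIGHTS = {
    "snr": 0.25,         # log-scaled SNR significance
    "drift": 0.20,       # drift in a physically motivated band
    "rfi_clean": 0.20,   # 1 - P_RFI
    "ood": 0.15,         # novelty of the representation
    "confidence": 0.10,  # model certainty
    "cadence": 0.10,     # ON/OFF cadence survival
}

def interestingness(components):
    """Weighted 0-100 score; each component is expected to be pre-scaled to [0, 1]."""
    return 100.0 * sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)

candidate = {"snr": 0.8, "drift": 0.9, "rfi_clean": 0.7, "ood": 0.4,
             "confidence": 0.6, "cadence": 1.0}
print(round(interestingness(candidate), 1))   # 74.0
```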
Drift rate is a strong prior in radio SETI narratives: many hypothesised exoplanet transmitters introduce modest Doppler drift, so the drift-meaningfulness term rewards drift rates that fall within a physically motivated band.
The band edges encode pragmatic triage, not a theorem about E.T.
Treat the overall score as ranking pressure, not a truth ordering. The best use is uniform processing, consistent metrics, and transparent components, so humans can override when domain knowledge disagrees with a weight.
Some astrophysical and human-made emitters are periodic: pulsars, rotating beacons, swept interferers with stable periods. A continuous de-Doppler integrator is tuned for smooth drift along time; it can under-emphasise signals that turn on and off or phase-reset in ways that look like broadband modulation in a naive transform.
Combining de-Doppler narrowband search with FFT-based periodicity in one open-source SETI-oriented tool is relatively novel: many pipelines emphasise one family of models. MitraSETI's integration supports a broader hypothesis class — still not exhaustive, but strictly richer than drift-only integration.
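A minimal sketch of an FFT-based periodicity check on a candidate's power-versus-time series; the significance proxy and the toy signal are illustrative, not MitraSETI's actual test:

```python
import numpy as np

def periodicity_score(power_vs_time, dt):
    """FFT-based check for periodic on/off structure in a power time series."""
    detrended = power_vs_time - power_vs_time.mean()
    spectrum = np.abs(np.fft.rfft(detrended)) ** 2
    freqs = np.fft.rfftfreq(len(detrended), d=dt)
    peak = spectrum[1:].argmax() + 1          # skip the DC bin
    # Peak strength relative to the median of the spectrum: rough significance proxy.
    return freqs[peak], spectrum[peak] / np.median(spectrum[1:])

# Toy example: 1-second samples of a signal that switches on and off every ~32 s.
t = np.arange(512)
series = (np.sin(2 * np.pi * t / 32) > 0).astype(float) + 0.1 * np.random.randn(512)
period_freq, strength = periodicity_score(series, dt=1.0)
print(1.0 / period_freq)    # ~32 s period
```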
De-Doppler and clustering propose candidates; ML characterises them, flags novelty, and ranks them under explicit scoring rules. CNNs are local pattern engines; Transformers relate distant time–frequency evidence; SimCLR bootstraps representation learning when labels are scarce; attention maps illuminate model focus; OOD and periodicity modules widen the net beyond a single signal model.
None of this replaces careful observing, instrument characterization, or skeptical statistics. It channels human attention where it matters — exactly what automated SETI survey pipelines need at scale.
The full CNN + Transformer classifier runs in the cloud on every upload — no GPU setup required.