Raw file to ranked candidates — the operational map of every stage, from first byte to final score.
When you point MitraSETI at a telescope file, dozens of algorithms run in a fixed order. This chapter is the map: what goes in, what comes out, and what happens at each stage between ingest and a ranked, scored candidate list. Use it as an operational reference when you read code, tune parameters, or explain the system to collaborators.
Inputs. The pipeline accepts standard radio-survey products in two common shapes:
- Sigproc filterbank (`.fil`) — a widely used binary format with a textual header followed by packed spectral data.
- HDF5 (`.h5`) — hierarchical files that may bundle metadata, arrays, and auxiliary products in one container.

Both represent the same underlying idea: power as a function of time and frequency.
Outputs. After a full run you get a ranked list of candidate signals. Each candidate carries its key measurements and scores — frequency, drift rate, SNR, class label, and the composite interestingness figure described in Stage 5.
Six stages end to end. Think of the pipeline as a funnel:

1. Ingest — parse the header, normalize metadata, load or stream the spectrogram.
2. RFI cleaning — spectral-kurtosis flagging plus the known-source database.
3. De-Doppler search — integrate along trial drift slopes to find coherent lines.
4. Clustering — collapse redundant hits into one candidate per physical emitter.
5. ML feature extraction — CNN + Transformer classification, periodicity, attention, interestingness.
6. Export and persistence — FITS, JSON history, cross-matching, sky maps, reports.
RFI cleaning must precede de-Doppler so integrated power is not dominated by stationary birdies. Clustering must precede the neural network so you pay inference cost once per physical candidate, not once per redundant peak. Export and persistence come last so every artifact — FITS rows, JSON history, HTML figures — reflects the same final scores and labels.
The sections below walk each stage in the order data actually flows.
Header semantics. Before any science step, the reader extracts observational metadata from the file. You should expect at least:
- Start frequency (`fch1`) — anchor of the frequency axis.
- Channel width (`foff`) — bandwidth per spectral bin, often negative if channels run high-to-low (implementation-specific sign conventions still apply; the pipeline uses the header consistently).
- Sample time (`tsamp`) — seconds per time sample.

Together, these define the grid on which every later algorithm operates.
Raw data layout. The body of the file is interpreted as a two-dimensional array of float32 power values, with dimensions time × frequency. Each row is one spectrum; each column is one channel's evolution over time. All downstream steps — kurtosis per channel, de-Doppler integration along slopes, patch extraction for the CNN — assume this layout (or an equivalent streaming chunk that preserves the same indexing).
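The indexing convention matters because every later stage slices the same array in one of two directions. A minimal sketch (not MitraSETI's actual reader; the array here is random data for illustration):

```python
import numpy as np

# Illustrative sketch: a spectrogram as a 2-D float32 array of power values
# with shape (n_time, n_freq), the layout assumed by downstream stages.
n_time, n_freq = 16, 1024
data = np.random.default_rng(0).normal(size=(n_time, n_freq)).astype(np.float32)

spectrum_0 = data[0, :]    # one row    = one full spectrum at a single time
channel_42 = data[:, 42]   # one column = one channel's evolution over time

# Per-channel statistics (kurtosis, medians) reduce along axis 0;
# per-spectrum operations reduce along axis 1.
channel_medians = np.median(data, axis=0)   # shape (n_freq,)
```

Keeping `axis=0` as time and `axis=1` as frequency everywhere is what lets streaming chunks substitute for a fully loaded file without changing any algorithm.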
Format detection. The entry point inspects the path and/or magic bytes to decide between Sigproc .fil and HDF5 .h5. The goal is one internal representation: header fields normalized, data accessible as blocks so memory can be managed on large files.
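A magic-byte sniff of this kind can be sketched as follows. The HDF5 signature is the standard eight-byte sequence from the HDF5 specification; the Sigproc check relies on the header starting with a length-prefixed `HEADER_START` keyword. The helper name and fallback behavior are assumptions, not MitraSETI's actual entry point:

```python
import struct

HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"  # standard HDF5 file signature

def sniff_format(first_bytes: bytes) -> str:
    """Classify a file as 'hdf5', 'sigproc', or 'unknown' from its first bytes.

    Minimal sketch; a real reader may also fall back to the file extension.
    """
    if first_bytes.startswith(HDF5_MAGIC):
        return "hdf5"
    # A Sigproc header is a series of (int32 length, ASCII keyword) pairs,
    # beginning with the keyword HEADER_START.
    if len(first_bytes) >= 16:
        (n,) = struct.unpack("<i", first_bytes[:4])
        if n == len(b"HEADER_START") and first_bytes[4:4 + n] == b"HEADER_START":
            return "sigproc"
    return "unknown"
```

Sniffing bytes rather than trusting extensions is what makes streaming mode robust to files that arrive with temporary or renamed suffixes.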
Streaming mode. For operational surveys, waiting until a night is "finished" is not always ideal. MitraSETI can watch a directory and process new files as they appear (or as they grow, depending on deployment). Streaming reuses the same stages; only the producer changes — from "open this path once" to "enqueue work when a new .fil or .h5 lands." That mode is what enables long runs that accumulate statistics across hundreds of files without manual babysitting.
If ingest fails, verify byte order, nbits, and nchans in the Sigproc header, or the dataset paths inside HDF5 with h5dump -n. Mismatched channel counts are the most common reason downstream stages see "striped" artifacts or crash inside normalization.
Radio-frequency interference (RFI) is the main enemy of narrowband SETI. Stage 2 combines blind statistical cleaning with a curated list of known bad actors.
Per-channel kurtosis summarizes how "Gaussian" the power distribution in each frequency bin looks over time. RFI often produces heavy tails or burstiness compared to thermal noise.
The pipeline computes SK for each channel, then applies adaptive thresholds derived from robust statistics:
Flagged channels are not zeroed arbitrarily in a way that corrupts neighboring science: the usual repair is to replace flagged columns with the column median (or an equivalent stable fill) so de-Doppler still sees a smooth enough background while obvious spectral defects are suppressed.
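The flag-then-fill idea can be sketched end to end. This uses plain excess kurtosis per channel with a median/MAD threshold and a column-median fill; the pipeline's exact SK estimator and thresholds live in the implementation, so treat the function and its defaults as illustrative:

```python
import numpy as np

def flag_and_repair(data, n_sigma=5.0):
    """Flag channels with outlier kurtosis, then fill them with the column median.

    Illustrative sketch of this stage, not MitraSETI's exact estimator:
    per-channel excess kurtosis, a robust (median/MAD) threshold, and a
    stable median fill for flagged columns.
    """
    x = data - data.mean(axis=0)
    m2 = (x ** 2).mean(axis=0)
    m4 = (x ** 4).mean(axis=0)
    kurt = m4 / np.maximum(m2 ** 2, 1e-30) - 3.0   # excess kurtosis per channel

    med = np.median(kurt)
    mad = np.median(np.abs(kurt - med)) * 1.4826    # ~ sigma for Gaussian data
    flagged = np.abs(kurt - med) > n_sigma * np.maximum(mad, 1e-12)

    repaired = data.copy()
    repaired[:, flagged] = np.median(data, axis=0)[flagged]  # smooth background fill
    return repaired, flagged
```

The median fill is the key design choice: it keeps the de-Doppler stage's background statistics smooth instead of punching zero-valued holes that would themselves look like structure.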
Humanity has already cataloged many persistent terrestrial sources: communication bands, radar, navigation signals, and observatory-specific ghosts. MitraSETI maintains a database of 27 known RFI sources (frequencies or patterns worth treating specially).
When a later stage proposes energy near one of these frequencies, the pipeline labels the association. Important: matches are not silently discarded here — they are flagged for downstream scoring so that genuine astrophysical signals that happen to sit near a bad frequency can still be examined with full context, while the classifier and interestingness score can penalize likely RFI.
Narrowband extraterrestrial or spacecraft carriers rarely sit vertically in a spectrogram; Doppler drift from relative motion smears them along a line in frequency–time. Stage 3 searches for such coherent slopes.
Algorithm choice. By default MitraSETI uses the Taylor tree (see the dedicated chapter): an efficient hierarchical scheme that avoids exhaustively re-integrating every slope at full cost. A brute-force integrator remains available for validation, small experiments, or parity checks.
Normalization. Before search, the spectrogram is normalized per channel: typically subtract the median and divide by a robust sigma estimate so sensitivity is comparable across wide bands and gain variations do not dominate.
Search loop. The engine tries drift rates from −max_drift to +max_drift (in the units configured for your setup — commonly Hz/s or equivalent). For each trial drift it effectively asks: "If a signal slid along this slope, how bright would it be if I integrated along it?" Peaks in that search surface become detections characterized by:
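The brute-force validation path makes the search loop concrete: for each trial drift, shift each spectrum so the hypothesized line becomes vertical, then sum. Function name, trial count, and the wrap-around handling are illustrative assumptions, not MitraSETI's API:

```python
import numpy as np

def brute_force_dedoppler(spec, tsamp, foff_hz, max_drift, n_trials=41):
    """Integrate the spectrogram along trial drift slopes (validation-path sketch).

    spec: (n_time, n_freq) normalized power; tsamp: seconds/sample;
    foff_hz: Hz per channel. Returns (drifts, surface) where surface[d, f]
    is the integrated power for trial drift d at starting channel f.
    """
    n_time, n_freq = spec.shape
    drifts = np.linspace(-max_drift, max_drift, n_trials)   # Hz/s
    surface = np.zeros((n_trials, n_freq))
    for d, drift in enumerate(drifts):
        acc = np.zeros(n_freq)
        for t in range(n_time):
            # channel offset a drifting line has accumulated after t samples
            shift = int(round(drift * t * tsamp / foff_hz))
            # np.roll wraps at the band edge; a real search would mask the wrap
            acc += np.roll(spec[t], -shift)
        surface[d] = acc
    return drifts, surface
```

Peaks in `surface` are the raw detections; the Taylor tree computes the same quantity while sharing partial sums across drift trials instead of re-integrating each one from scratch.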
Yield. On real data this stage is deliberately sensitive: you should expect hundreds to thousands of raw hits — many will be RFI remnants, sidelobes, or noise excursions. That is why Stage 4 exists.
Taylor tree versus brute force (when to care). For survey-scale data the Taylor tree is the default because its complexity scales more gently than naïve re-integration at every drift and every pixel. Brute force remains valuable when you suspect a bug in the tree, when the product has very few channels or dumps, or when you are generating gold-standard plots for a paper figure with affordable compute.
Multiple hits often describe one physical emitter (harmonics, leakage across bins, or duplicate peaks from the search grid). Clustering collapses them.
HDBSCAN path. If there are more than five hits, MitraSETI clusters them in a three-dimensional feature space built from each hit's measured properties.
HDBSCAN finds density-connected groups without forcing you to specify the cluster count. Points labeled −1 are noise in the clustering sense — they do not form a dense group and are discarded as standalone hits.
Greedy merge path. If there are five or fewer hits, running a density clusterer is usually unnecessary; a greedy merge combines obviously redundant detections.
Representatives. For each surviving cluster, the pipeline picks the hit with the highest SNR as the single candidate representing that cluster.
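The small-N greedy path, together with representative selection, can be sketched directly. Tolerances and field names here are illustrative, not the pipeline's configured values:

```python
def greedy_merge(hits, freq_tol_hz=2.0, drift_tol=0.05):
    """Greedy merge for small hit lists (the five-or-fewer path described above).

    hits: list of dicts with 'freq_hz', 'drift', 'snr'. Hits within the given
    frequency/drift tolerances are treated as one emitter; because we walk the
    list strongest-first, each surviving entry is automatically the highest-SNR
    representative of its cluster.
    """
    reps = []
    for hit in sorted(hits, key=lambda h: -h["snr"]):   # strongest first
        for rep in reps:
            if (abs(hit["freq_hz"] - rep["freq_hz"]) <= freq_tol_hz
                    and abs(hit["drift"] - rep["drift"]) <= drift_tol):
                break   # redundant: already represented by a stronger hit
        else:
            reps.append(hit)
    return reps
```

Sorting by SNR first is the whole trick: it folds the "pick the highest-SNR representative" step into the merge itself.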
Yield. After clustering you typically land in the tens of candidates per file — often on the order of 10–50, depending on RFI environment, thresholding, and sky content. That is the set passed to the expensive ML stage.
For each clustered candidate, MitraSETI builds a rich feature bundle suitable for both human review and machine ranking.
a. Patch extraction. A 128×128 spectrogram patch is cut centered on the candidate in frequency–time, large enough to capture the drift line and local background.
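Patch extraction needs one subtlety handled explicitly: candidates near the band or scan edges. A common approach, sketched here with an assumed helper name, is to slide the window inward so the model always sees a full-size patch:

```python
import numpy as np

def extract_patch(spec, t_center, f_center, size=128):
    """Cut a size x size patch centered on a candidate, clamped at array edges.

    Near a border the window slides inward rather than shrinking, so the
    patch keeps the full size the model expects. (Products with fewer than
    `size` time samples would need padding instead; not handled here.)
    """
    n_time, n_freq = spec.shape
    t0 = min(max(t_center - size // 2, 0), max(n_time - size, 0))
    f0 = min(max(f_center - size // 2, 0), max(n_freq - size, 0))
    return spec[t0:t0 + size, f0:f0 + size]
```

Clamping (rather than zero-padding) keeps the local noise statistics in the patch real, which matters for a classifier trained on genuine backgrounds.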
b. CNN + Transformer. The patch feeds a hybrid model: convolutional front ends capture local texture; Transformer blocks model longer-range structure along the patch. The head emits multi-class probabilities — categories along the lines of narrowband_drifting, broadband, pulsed, noise, and related labels used in training.
c. RFI probability, OOD, confidence. Alongside the discrete label, the system exposes an RFI probability, an out-of-distribution (OOD) score, and a model confidence for each candidate.
d. Periodicity detection. An FFT on channel time series (or equivalent periodogram step) checks for regular pulses that might not be fully described by a single drifting ridge.
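The periodicity check reduces to a standard periodogram question: does one non-DC frequency bin carry far more power than the typical bin? A sketch under assumed names and an illustrative threshold:

```python
import numpy as np

def periodicity_check(series, tsamp, min_power_ratio=10.0):
    """Look for a dominant periodicity in one channel's time series.

    Sketch of the idea: an rFFT power spectrum; if one non-DC bin carries far
    more power than the median bin, report its frequency in Hz, else None.
    The ratio threshold is illustrative, not the pipeline's configured value.
    """
    power = np.abs(np.fft.rfft(series - series.mean())) ** 2
    power[0] = 0.0                          # ignore the DC bin
    k = int(np.argmax(power))
    med = np.median(power[1:])
    if med <= 0 or power[k] < min_power_ratio * med:
        return None                         # no convincing periodicity
    freqs = np.fft.rfftfreq(len(series), d=tsamp)
    return float(freqs[k])
```

Comparing against the median bin (rather than the mean) keeps a single bright pulse train from inflating its own detection threshold.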
e. Attention heatmap. The Transformer's attention is visualized as a heatmap over the patch so humans can see which time–frequency regions drove the decision — critical for publication figures and debugging false positives.
f. Interestingness score. Finally, a six-component composite combines SNR-like strength, spectral narrowness, drift plausibility, RFI penalties, model agreement, and related terms into a single 0–100-style interestingness figure (exact weighting lives in the implementation; the idea is one number for triage).
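The shape of such a composite is easy to show even though the real weights live in the implementation. Everything below — the weight values, the component names, the sixth "persistence" term — is a hypothetical stand-in for illustration only:

```python
def interestingness(components, weights=None):
    """Fold per-candidate terms into one 0-100 triage number.

    Hypothetical weighting for illustration only; each input component is
    assumed pre-scaled to [0, 1], with RFI likelihood entering inverted
    (as 'not_rfi') so that likely interference is penalized.
    """
    weights = weights or {
        "snr": 0.30,                # SNR-like strength
        "narrowness": 0.15,         # spectral narrowness
        "drift_plausibility": 0.15, # physically sensible drift rate
        "not_rfi": 0.20,            # 1 - RFI likelihood (penalty term)
        "model_agreement": 0.10,    # classifier / search-stage agreement
        "persistence": 0.10,        # hypothetical sixth term
    }
    score = sum(w * components[k] for k, w in weights.items())
    return 100.0 * min(max(score, 0.0), 1.0)
```

Whatever the true weights, the design point stands: one clamped scalar lets a human sort thousands of candidates by a single column.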
Result. After Stage 5, each candidate is no longer "a blob in a search plot" — it is a fully characterized object ready for export, persistence, and cross-survey comparison.
Science does not stop at a Python object in memory. Stage 6 makes results durable, shareable, and contextual.
FITS export. Candidates can be written to a FITS binary table structured for Virtual Observatory (VO) compatibility, so standard tools (TOPCAT, Aladin, custom pipelines) can ingest them.
Persistence tracking. A JSON state file records candidates across epochs — so if the same sky position is re-observed, you can ask whether a line of sight keeps producing a consistent feature or whether it was a one-off glitch.
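A minimal version of that JSON bookkeeping, with an assumed schema and field names (the real state file's layout may differ), looks like this:

```python
import json

def update_persistence(state_json, candidate, tol_hz=5.0):
    """Record a candidate in a cross-epoch JSON state (illustrative schema).

    state_json: serialized state ('' for a fresh file). A candidate within
    tol_hz of a known source extends that source's epoch list; otherwise it
    becomes a new tracked source. Returns the updated serialized state.
    """
    state = json.loads(state_json) if state_json else {"sources": []}
    for src in state["sources"]:
        if abs(src["freq_hz"] - candidate["freq_hz"]) <= tol_hz:
            src["epochs"].append(candidate["epoch"])   # repeat observation
            break
    else:
        state["sources"].append({
            "freq_hz": candidate["freq_hz"],
            "epochs": [candidate["epoch"]],
        })
    return json.dumps(state)
```

The payoff is the query the chapter describes: a source with many epochs in its list is a persistent feature on that line of sight, while a one-entry source may be a one-off glitch.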
Cross-matching. When AstroLens (or related optical anomaly products) is available, the pipeline can match radio candidates to optical anomalies by sky position and metadata, supporting multi-wavelength hypotheses.
Unified sky map. For communication with collaborators and the public, MitraSETI can produce a single map overlaying radio candidate locations with optical detections — one visual story for "what the machine found, where."
HTML report. A publication-ready HTML summary bundles plots, tables, and key statistics so a run can be archived or attached to a lab notebook without re-running notebooks.
CLI output. For quick checks, the terminal prints a ranked table of candidates with the most important columns (frequency, drift, SNR, class, scores).
MitraSETI exposes a Click-based command-line interface. Typical entry points:
| Command | Role |
|---|---|
| `mitraseti search <file>` | Run the full pipeline on a single filterbank or HDF5 file. |
| `mitraseti stream --dir <dir>` | Continuous mode: watch a directory and process new files as they arrive. |
| `mitraseti benchmark` | Run speed benchmarks to compare configurations or hardware. |
| `mitraseti export --format fits` | Export persisted results to FITS (and related options as implemented). |
| `mitraseti crossmatch` | Cross-match radio candidates with AstroLens (or configured optical catalogs). |
| `mitraseti report` | Generate the publication-style HTML (or linked) report. |
| `mitraseti rfi` | Manage the RFI database (add notes, enable/disable entries, inspect sources). |
| `mitraseti persistence` | Inspect persistent sources tracked across epochs in the JSON state. |
| `mitraseti paths` | Print configured paths (models, catalogs, output roots) so deployments are debuggable. |
Exact flags (--help on each subcommand) may evolve between releases; treat this table as the conceptual CLI surface.
Typical workflows. A lab member runs mitraseti search on a suspect file during debugging; the observatory daemon runs mitraseti stream overnight; after a campaign, mitraseti export and mitraseti report produce artifacts for the team drive, while mitraseti crossmatch ties the radio list to optical cuts. mitraseti paths is the first command to run on a new machine when imports succeed but files are "not found."
The Voyager 1 carrier is a famous sanity check: a known drifting narrow line from a distant spacecraft. Here is a plausible narrative of what MitraSETI does to one representative observation file (numbers illustrative of a high-resolution FFT spectrometer product).
MitraSETI successfully identifies the Voyager 1 carrier at ~8,419,296,991.5 Hz with a drift of 0.287 Hz/s and SNR 47.18. The model classifies it as narrowband_drifting with 99.63% confidence — a textbook detection that aligns with spacecraft ephemerides and human expertise.
The detailed walkthrough culminates in classification: narrowband_drifting with 99.63% confidence — the model recognizes the classic diagonal carrier morphology. This example ties the abstract stages to numbers you can sanity-check against spacecraft ephemerides and telescope logs.
On a large streaming campaign (representative MitraSETI deployment figures):
From 288,864 flagged spectral features down to just 11 surviving candidates: the pipeline achieves an effective RFI rejection rate of ~99.996%. This figure emphasizes the pipeline's role — not to prove ETI, but to wrestle terabytes of contaminated spectra down to a small, ranked set worth expert time — transparently, stage by stage.
How to read the percentages. The 99.996% figure is a useful order-of-magnitude for how aggressively the combined stages reject structured interference relative to the volume of flagged spectral content — not a formal detection probability for technosignatures. Always pair headline rates with false alarm checks on off-source fields and injections of synthetic carriers when you tune thresholds.
When you open the source code, keep this chapter beside you: each function name should map to one of the six stages, and each CLI subcommand should map to one of the Stage 6 integration paths.