Chapter 08

The Complete MitraSETI Pipeline

Raw file to ranked candidates — the operational map of every stage, from first byte to final score.

Saman Tabatabaeian — Deep Field Labs Intermediate MitraSETI Tutorial Series

When you point MitraSETI at a telescope file, dozens of algorithms run in a fixed order. This chapter is the map: what goes in, what comes out, and what happens at each stage between ingest and a ranked, scored candidate list. Use it as an operational reference when you read code, tune parameters, or explain the system to collaborators.

1. Overview: What Happens When You Run MitraSETI?

Inputs. The pipeline accepts standard radio-survey products in two common shapes:

  1. Sigproc filterbank (.fil): a binary header followed by a time × frequency power array.
  2. HDF5 (.h5): the same spectrogram stored as HDF5 datasets.

Both represent the same underlying idea: power as a function of time and frequency.

Outputs. After a full run you get a ranked list of candidate signals. Each candidate carries:

  1. Detected frequency and best-fit drift rate.
  2. Integrated signal-to-noise ratio (SNR).
  3. A class label with model confidence, plus RFI and out-of-distribution indicators.
  4. A composite interestingness score used for ranking.

[Figure 1 — The MitraSETI pipeline funnel: six stages progressively narrow millions of spectral channels down to a handful of ranked candidates. Stage 1 Ingest (parse .fil/.h5 into a 2D power array, ~1M channels) → Stage 2 RFI excision (spectral kurtosis + known RFI DB, ~950K clean channels) → Stage 3 de-Doppler search (Taylor tree, 100s–1000s raw hits) → Stage 4 clustering (HDBSCAN, 10s of candidates) → Stage 5 classify & score (CNN + Transformer) → Stage 6 output (10–50 candidates, scored and ranked; FITS · JSON · HTML).]

Key Concept — The Six Stages

Six stages end to end. Think of the pipeline as a funnel:

  1. Ingest — Parse the file, validate headers, load (or stream) the 2D power array.
  2. RFI filter — Statistically excise bad channels and flag known terrestrial emitters.
  3. De-Doppler search — Find energy that drifts coherently in frequency–time, producing raw "hits."
  4. Cluster — Merge redundant hits into one representative per physical signal.
  5. Classify & score — Run deep models, periodicity checks, and composite scoring on each candidate.
  6. Output — Export FITS, JSON persistence, cross-matches, maps, HTML reports, and CLI summaries.

Key Concept — Why the Stage Order Is Fixed

RFI cleaning must precede de-Doppler so integrated power is not dominated by stationary birdies. Clustering must precede the neural network so you pay inference cost once per physical candidate, not once per redundant peak. Export and persistence come last so every artifact — FITS rows, JSON history, HTML figures — reflects the same final scores and labels.

The sections below walk through each stage in the order the data actually flows.

2. Stage 1: Data Ingestion

Header semantics. Before any science step, the reader extracts observational metadata from the file. You should expect at least:

  1. fch1 and foff: the frequency of the first channel and the channel width (whose sign encodes band direction).
  2. nchans and nbits: channel count and sample bit depth.
  3. tsamp and tstart: sampling interval and observation start time (MJD).
  4. Source name and sky coordinates (src_raj / src_dej in Sigproc headers).

Together, these define the grid on which every later algorithm operates.

Raw data layout. The body of the file is interpreted as a two-dimensional array of float32 power values, with dimensions time × frequency. Each row is one spectrum; each column is one channel's evolution over time. All downstream steps — kurtosis per channel, de-Doppler integration along slopes, patch extraction for the CNN — assume this layout (or an equivalent streaming chunk that preserves the same indexing).

Format detection. The entry point inspects the path and/or magic bytes to decide between Sigproc .fil and HDF5 .h5. The goal is one internal representation: header fields normalized, data accessible as blocks so memory can be managed on large files.
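As an illustration, the two formats can be told apart from their first bytes alone. The function name sniff_format below is hypothetical, but the magic numbers are real: the 8-byte HDF5 signature, and the Sigproc header prologue (a 4-byte little-endian string length, 12, followed by "HEADER_START").

```python
from pathlib import Path

# HDF5 files begin with this fixed 8-byte signature.
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"
# Sigproc .fil files begin with len("HEADER_START") as a little-endian
# 4-byte integer, then the string itself.
SIGPROC_MAGIC = b"\x0c\x00\x00\x00HEADER_START"

def sniff_format(path):
    """Return 'h5' or 'fil' by inspecting the first bytes of the file."""
    head = Path(path).open("rb").read(16)
    if head.startswith(HDF5_MAGIC):
        return "h5"
    if head.startswith(SIGPROC_MAGIC):
        return "fil"
    raise ValueError(f"unrecognized file format: {path}")
```

Sniffing bytes rather than trusting the extension is what lets the same entry point handle misnamed or pipeline-renamed files.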

Streaming mode. For operational surveys, waiting until a night is "finished" is not always ideal. MitraSETI can watch a directory and process new files as they appear (or as they grow, depending on deployment). Streaming reuses the same stages; only the producer changes — from "open this path once" to "enqueue work when a new .fil or .h5 lands." That mode is what enables long runs that accumulate statistics across hundreds of files without manual babysitting.
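A minimal polling producer along these lines might look like the following sketch. The function watch_directory, the poll interval, and the max_polls test hook are all illustrative; a production deployment might prefer inotify-style filesystem events.

```python
import time
from pathlib import Path

def watch_directory(root, process, poll_seconds=5.0, max_polls=None):
    """Poll `root` and hand each new .fil/.h5 file to `process` exactly once.

    `process` stands in for enqueueing a full pipeline run. `max_polls`
    exists so the loop can be bounded in tests; in production it runs
    indefinitely.
    """
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for path in sorted(Path(root).glob("*")):
            if path.suffix in (".fil", ".h5") and path not in seen:
                seen.add(path)       # never process the same file twice
                process(path)
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(poll_seconds)
```

Because only this producer changes between one-shot and streaming modes, every downstream stage stays identical.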

💡

Practical Tip

If ingest fails, verify byte order, nbits, and nchans in the Sigproc header, or the dataset paths inside HDF5 with h5dump -n. Mismatched channel counts are the most common reason downstream stages see "striped" artifacts or crash inside normalization.

3. Stage 2: RFI Excision

Radio-frequency interference (RFI) is the main enemy of narrowband SETI. Stage 2 combines blind statistical cleaning with a curated list of known bad actors.

3a. Spectral Kurtosis (SK)

Per-channel kurtosis summarizes how "Gaussian" the power distribution in each frequency bin looks over time. RFI often produces heavy tails or burstiness compared to thermal noise.

The pipeline computes SK for each channel, then applies adaptive thresholds derived from robust statistics:

Threshold = median(SK) ± 3 × 1.4826 × MAD(SK)

Flagged channels are not zeroed arbitrarily in a way that corrupts neighboring science: the usual repair is to replace flagged columns with the column median (or an equivalent stable fill) so de-Doppler still sees a smooth enough background while obvious spectral defects are suppressed.
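A compact sketch of this stage, assuming the (time, frequency) float32 layout from Stage 1; the function name and estimator details are illustrative, not the actual implementation.

```python
import numpy as np

def excise_rfi(data, n_sigma=3.0):
    """Flag channels with outlier excess kurtosis; fill them with the
    column median. `data` has shape (time, frequency)."""
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    # Excess kurtosis per channel: 0 for Gaussian noise, large for bursts.
    k = ((data - mean) ** 4).mean(axis=0) / std ** 4 - 3.0
    med = np.median(k)
    mad = np.median(np.abs(k - med))
    sigma = 1.4826 * mad                     # MAD -> Gaussian-equivalent sigma
    bad = np.abs(k - med) > n_sigma * sigma  # the threshold from the text
    cleaned = data.copy()
    col_medians = np.median(data, axis=0)
    cleaned[:, bad] = col_medians[bad]       # stable fill, not zeroing
    return cleaned, bad
```

The median/MAD pair keeps the threshold itself robust: even heavily contaminated bands cannot drag the cut toward the RFI.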

3b. Known RFI Database

Humanity has already cataloged many persistent terrestrial sources: communication bands, radar, navigation signals, and observatory-specific ghosts. MitraSETI maintains a database of 27 known RFI sources (frequencies or patterns worth treating specially).

When a later stage proposes energy near one of these frequencies, the pipeline labels the association. Important: matches are not silently discarded here — they are flagged for downstream scoring so that genuine astrophysical signals that happen to sit near a bad frequency can still be examined with full context, while the classifier and interestingness score can penalize likely RFI.

4. Stage 3: De-Doppler Search

Narrowband extraterrestrial or spacecraft carriers rarely sit vertically in a spectrogram; Doppler drift from relative motion smears them along a line in frequency–time. Stage 3 searches for such coherent slopes.

Algorithm choice. By default MitraSETI uses the Taylor tree (see the dedicated chapter): an efficient hierarchical scheme that avoids exhaustively re-integrating every slope at full cost. A brute-force integrator remains available for validation, small experiments, or parity checks.

Normalization. Before search, the spectrogram is normalized per channel: typically subtract the median and divide by a robust sigma estimate so sensitivity is comparable across wide bands and gain variations do not dominate.
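A minimal version of that normalization, with a MAD-based sigma; the exact estimator choices are an assumption, not the verified implementation.

```python
import numpy as np

def normalize(data):
    """Per-channel robust normalization of a (time, frequency) array:
    subtract the channel median, divide by a MAD-derived sigma."""
    med = np.median(data, axis=0)
    mad = np.median(np.abs(data - med), axis=0)
    sigma = 1.4826 * mad
    sigma[sigma == 0] = 1.0   # guard flat or flagged channels
    return (data - med) / sigma
```

After this step a "bright" pixel means the same thing in every channel, so one SNR threshold applies across the whole band.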

Search loop. The engine tries drift rates from −max_drift to +max_drift (in the units configured for your setup, commonly Hz/s or equivalent). For each trial drift it effectively asks: "If a signal slid along this slope, how bright would it be if I integrated along it?" Peaks in that search surface become detections characterized by:

  1. Start frequency of the drifting line.
  2. Best-fit drift rate.
  3. Integrated SNR along that slope.

Yield. On real data this stage is deliberately sensitive: you should expect hundreds to thousands of raw hits — many will be RFI remnants, sidelobes, or noise excursions. That is why Stage 4 exists.

Taylor tree versus brute force (when to care). For survey-scale data the Taylor tree is the default because its complexity scales more gently than naïve re-integration at every drift and every pixel. Brute force remains valuable when you suspect a bug in the tree, when the product has very few channels or dumps, or when you are generating gold-standard plots for a paper figure and the compute is affordable.
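For intuition, a brute-force integrator can be sketched in a few lines. Here drifts are expressed in whole channels across the observation rather than Hz/s, a simplification for illustration only.

```python
import numpy as np

def brute_force_dedoppler(data, max_drift_channels):
    """Integrate a (time, frequency) array along every trial slope.

    Returns (drifts, surface): surface[i, c] is the power summed along a
    line starting at channel c with total drift drifts[i] channels over
    the observation. Peaks in `surface` are the raw hits.
    """
    n_time, n_chan = data.shape
    drifts = range(-max_drift_channels, max_drift_channels + 1)
    surface = np.zeros((len(drifts), n_chan))
    for i, d in enumerate(drifts):
        acc = np.zeros(n_chan)
        for t in range(n_time):
            # Shift spectrum t so a signal on this slope lines up vertically.
            shift = round(d * t / max(n_time - 1, 1))
            acc += np.roll(data[t], -shift)
        surface[i] = acc
    return np.array(list(drifts)), surface
```

This is O(n_time × n_chan) per trial drift; the Taylor tree's shared partial sums are exactly what removes that per-drift cost at scale.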

5. Stage 4: Signal Clustering

Multiple hits often describe one physical emitter (harmonics, leakage across bins, or duplicate peaks from the search grid). Clustering collapses them.

HDBSCAN path. If there are more than five hits, MitraSETI clusters in a 3D feature space built from each hit's frequency, drift rate, and SNR (scaled so no single axis dominates the distance metric).

HDBSCAN finds density-connected groups without forcing you to specify the cluster count. Points labeled −1 are noise in the clustering sense: they do not form a dense group, and they are discarded rather than promoted to candidates.

Greedy merge path. If there are five or fewer hits, running a density clusterer is usually unnecessary; a greedy merge combines obviously redundant detections.

Representatives. For each surviving cluster, the pipeline picks the hit with the highest SNR as the single candidate representing that cluster.
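The small-n path plus representative selection can be sketched as follows; the field names, tolerances, and the greedy_merge helper itself are illustrative, not the actual implementation.

```python
def greedy_merge(hits, freq_tol_hz=2.0, drift_tol=0.05):
    """Merge hits within a frequency/drift tolerance; return one
    representative per group, chosen as the highest-SNR member.

    Each hit is a dict with 'freq', 'drift', and 'snr' keys (illustrative
    schema). Processing in descending SNR means every group is seeded by
    its brightest hit, which then serves as the representative.
    """
    groups = []
    for hit in sorted(hits, key=lambda h: -h["snr"]):
        for g in groups:
            rep = g[0]   # the seed (brightest) hit of this group
            if (abs(hit["freq"] - rep["freq"]) <= freq_tol_hz
                    and abs(hit["drift"] - rep["drift"]) <= drift_tol):
                g.append(hit)
                break
        else:
            groups.append([hit])
    return [g[0] for g in groups]
```

For a handful of hits this O(n²) loop is faster and simpler than fitting a density model.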

Yield. After clustering you typically land in the tens of candidates per file — often on the order of 10–50, depending on RFI environment, thresholding, and sky content. That is the set passed to the expensive ML stage.

6. Stage 5: ML Classification & Scoring

For each clustered candidate, MitraSETI builds a rich feature bundle suitable for both human review and machine ranking.

a. Patch extraction. A 128×128 spectrogram patch is cut out, centered on the candidate in frequency–time and large enough to capture the drift line and the local background.
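A patch cutter with edge padding might look like this; the function name and the zero-padding policy are assumptions.

```python
import numpy as np

def extract_patch(data, center_chan, size=128):
    """Cut a (size, size) window from a (time, frequency) array, centered
    on `center_chan`, zero-padding where the spectrogram runs out of
    channels or time steps so the model always sees a fixed shape."""
    n_time, n_chan = data.shape
    half = size // 2
    patch = np.zeros((size, size), dtype=data.dtype)
    lo = max(0, center_chan - half)
    hi = min(n_chan, center_chan + half)
    col0 = lo - (center_chan - half)   # left padding when near the band edge
    patch[:min(n_time, size), col0:col0 + (hi - lo)] = data[:size, lo:hi]
    return patch
```

A fixed input shape is what lets one CNN forward pass serve every candidate, regardless of where it sits in the band.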

b. CNN + Transformer. The patch feeds a hybrid model: convolutional front ends capture local texture; Transformer blocks model longer-range structure along the patch. The head emits multi-class probabilities — categories along the lines of narrowband_drifting, broadband, pulsed, noise, and related labels used in training.

c. RFI probability, OOD, confidence. Alongside the discrete label, the system exposes:

  1. An RFI probability: how strongly the patch resembles known interference morphologies.
  2. An out-of-distribution (OOD) score: how far the patch sits from the training distribution, guarding against overconfident labels on unfamiliar structure.
  3. A confidence value for the predicted class.

d. Periodicity detection. An FFT on channel time series (or equivalent periodogram step) checks for regular pulses that might not be fully described by a single drifting ridge.
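A crude stand-in for that periodogram step, using a plain FFT of one channel's time series (the function name is illustrative):

```python
import numpy as np

def dominant_period(series, dt=1.0):
    """Return the period (in units of dt) of the strongest non-DC Fourier
    component of a single channel's time series."""
    series = np.asarray(series, dtype=float) - np.mean(series)
    spec = np.abs(np.fft.rfft(series)) ** 2
    spec[0] = 0.0                        # ignore the DC bin
    k = int(np.argmax(spec))
    freqs = np.fft.rfftfreq(len(series), d=dt)
    return 1.0 / freqs[k]
```

A real pulse search would also weigh harmonic power and test significance, but the core move, peak-picking a power spectrum of the time axis, is the same.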

e. Attention heatmap. The Transformer's attention is visualized as a heatmap over the patch so humans can see which time–frequency regions drove the decision — critical for publication figures and debugging false positives.

f. Interestingness score. Finally, a six-component composite combines SNR-like strength, spectral narrowness, drift plausibility, RFI penalties, model agreement, and related terms into a single 0–100-style interestingness figure (exact weighting lives in the implementation; the idea is one number for triage).
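As a sketch only: the component names and weights below are invented for illustration, since, as noted, the real weighting lives in the implementation.

```python
# Hypothetical six-component weighting; each component is expected in
# [0, 1], with rfi_penalty equal to 1 when the candidate looks clean.
WEIGHTS = {
    "snr_strength": 0.25, "narrowness": 0.20, "drift_plausibility": 0.20,
    "rfi_penalty": 0.15, "model_agreement": 0.10, "persistence": 0.10,
}

def interestingness(components):
    """Combine component scores into a single 0-100 triage figure."""
    score = sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)
    return round(100.0 * score, 1)
```

The value of a composite like this is ordinal, not absolute: it exists to sort candidates for expert attention, not to assert a probability.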

Result. After Stage 5, each candidate is no longer "a blob in a search plot" — it is a fully characterized object ready for export, persistence, and cross-survey comparison.

7. Stage 6: Output & Integration

Science does not stop at a Python object in memory. Stage 6 makes results durable, shareable, and contextual.

FITS export. Candidates can be written to a FITS binary table structured for Virtual Observatory (VO) compatibility, so standard tools (TOPCAT, Aladin, custom pipelines) can ingest them.

Persistence tracking. A JSON state file records candidates across epochs — so if the same sky position is re-observed, you can ask whether a line of sight keeps producing a consistent feature or whether it was a one-off glitch.
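One plausible shape for such a state file, keyed by rounded frequency (the key scheme, filename, and record_candidate helper are hypothetical):

```python
import json
from pathlib import Path

def record_candidate(state_path, freq_hz, epoch):
    """Append an epoch to the JSON record for this frequency and return
    how many epochs have now seen it. A count > 1 means the same line of
    sight keeps producing a consistent feature."""
    path = Path(state_path)
    state = json.loads(path.read_text()) if path.exists() else {}
    key = f"{round(freq_hz):d}"          # nearest-Hz bucket (illustrative)
    entry = state.setdefault(key, {"epochs": [], "count": 0})
    entry["epochs"].append(epoch)
    entry["count"] += 1
    path.write_text(json.dumps(state, indent=2))
    return entry["count"]
```

Any keying scheme must tolerate small frequency jitter between epochs; a real implementation would likely match within a drift-aware tolerance rather than an exact bucket.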

Cross-matching. When AstroLens (or related optical anomaly products) is available, the pipeline can match radio candidates to optical anomalies by sky position and metadata, supporting multi-wavelength hypotheses.

Unified sky map. For communication with collaborators and the public, MitraSETI can produce a single map overlaying radio candidate locations with optical detections — one visual story for "what the machine found, where."

HTML report. A publication-ready HTML summary bundles plots, tables, and key statistics so a run can be archived or attached to a lab notebook without re-running notebooks.

CLI output. For quick checks, the terminal prints a ranked table of candidates with the most important columns (frequency, drift, SNR, class, scores).

[Figure 2 — Hit counts narrow dramatically at each pipeline stage, illustrating the funnel from raw spectral data to ranked candidates: INGEST 1,048,576 channels → RFI ~996,000 clean channels → DEDOPPLER ~1,200 raw hits → CLUSTER ~30 → CLASSIFY ~30 → OUTPUT ~11.]

8. The Click CLI

MitraSETI exposes a Click-based command-line interface. Typical entry points:

Command Role
mitraseti search <file> Run the full pipeline on a single filterbank or HDF5 file.
mitraseti stream --dir <dir> Continuous mode: watch a directory and process new files as they arrive.
mitraseti benchmark Run speed benchmarks to compare configurations or hardware.
mitraseti export --format fits Export persisted results to FITS (and related options as implemented).
mitraseti crossmatch Cross-match radio candidates with AstroLens (or configured optical catalogs).
mitraseti report Generate the publication-style HTML (or linked) report.
mitraseti rfi Manage the RFI database (add notes, enable/disable entries, inspect sources).
mitraseti persistence Inspect persistent sources tracked across epochs in the JSON state.
mitraseti paths Print configured paths (models, catalogs, output roots) so deployments are debuggable.

Exact flags (--help on each subcommand) may evolve between releases; treat this table as the conceptual CLI surface.

Typical workflows. A lab member runs mitraseti search on a suspect file during debugging; the observatory daemon runs mitraseti stream overnight; after a campaign, mitraseti export and mitraseti report produce artifacts for the team drive, while mitraseti crossmatch ties the radio list to optical cuts. mitraseti paths is the first command to run on a new machine when imports succeed but files are "not found."

9. Concrete Example: Processing Voyager 1

The Voyager 1 carrier is a famous sanity check: a known drifting narrow line from a distant spacecraft. Here is a plausible narrative of what MitraSETI does to one representative observation file (numbers illustrative of a high-resolution FFT spectrometer product).

Fun Fact — Voyager 1 Detection

MitraSETI successfully identifies the Voyager 1 carrier at ~8,419,296,991.5 Hz with a drift of 0.287 Hz/s and SNR 47.18. The model classifies it as narrowband_drifting with 99.63% confidence — a textbook detection that aligns with spacecraft ephemerides and human expertise.

[Figure 3 — Voyager 1 processing narrative: from 1M channels to a confirmed narrowband_drifting carrier at 8.4 GHz; the six numbered steps are detailed in the walkthrough below.]

The detailed walkthrough:

  1. File read. The header reports roughly 1,048,576 channels and 16 time steps — an enormous frequency dimension with a short temporal span, consistent with a single FFT dump or stacked spectra style product. Start frequency sits near 8419.xxx MHz (the Voyager downlink neighborhood in X-band).
  2. SK filter. Spectral kurtosis flags about 5% of channels — often concentrated around known RFI and spectral ripples, not randomly uniform noise.
  3. Taylor tree search. With normalization applied, the tree search finds two coherent features above SNR 10 — candidates worth clustering.
  4. Clustering. With only two hits, the pipeline does not need HDBSCAN's density model; greedy merge (or equivalent small-n path) leaves two distinct candidates.
  5. ML classification. The brighter ridge is classified as narrowband_drifting with 99.63% confidence — the model recognizes the classic diagonal carrier morphology.
  6. Interestingness. Composite scoring returns on the order of 85/100: high SNR, physically plausible drift, and low RFI indicators after database and statistical steps.
  7. Output. The ranked list identifies the Voyager 1 carrier near 8419296991.5 Hz, drift 0.287 Hz/s, SNR 47.18 — a detection that lines up with ephemeris expectations and human expertise.

This example ties the abstract stages to numbers you can sanity-check against spacecraft ephemerides and telescope logs.

10. Performance Numbers

On a large streaming campaign (representative MitraSETI deployment figures):

Key Concept — 99.996% RFI Rejection

From 288,864 flagged spectral features down to just 11 surviving candidates: the pipeline achieves an effective RFI rejection rate of ~99.996%. This figure emphasizes the pipeline's role — not to prove ETI, but to wrestle terabytes of contaminated spectra down to a small, ranked set worth expert time — transparently, stage by stage.

How to read the percentages. The 99.996% figure is a useful order-of-magnitude for how aggressively the combined stages reject structured interference relative to the volume of flagged spectral content — not a formal detection probability for technosignatures. Always pair headline rates with false alarm checks on off-source fields and injections of synthetic carriers when you tune thresholds.

Further Reading in This Series

When you open the source code, keep this chapter beside you: each function name should map to one of the six stages, and each CLI subcommand should map to one of the Stage 6 integration paths.