Chapter 05

Clustering with HDBSCAN & Anomaly Detection

Grouping redundant de-Doppler hits into single candidates with density-based clustering, then scoring survivors for novelty.

Author: Saman Tabatabaeian — Deep Field Labs MitraSETI Tutorial Series

You already know how a de-Doppler search turns a spectrogram into a list of hits: each hit is a local maximum along some trial drift, with a frequency, drift rate, and signal-to-noise ratio (SNR). You also know that RFI filtering strips away much of the obvious garbage—but what remains is still far too redundant for humans or downstream models to treat as independent detections. This chapter explains why one physical signal spawns many hits, why simple merge rules fail, and how HDBSCAN (Hierarchical DBSCAN)—a density-based clustering method—groups those hits into single representative candidates. We close with anomaly / out-of-distribution (OOD) scoring, which asks whether a surviving candidate looks like anything the machine has seen before, and we preview how that feeds the interestingness score in the next chapter.

Throughout, keep one picture in mind: a map of houses. Suburbs are clusters—many homes close together. The open countryside is low density—scattered farmhouses with long gaps between them. Clustering is the act of deciding which houses "belong to the same town" and which are just isolated noise on the landscape.

Figure 5.1 — The houses-on-a-map analogy. Dense suburbs are clusters; isolated farmhouses are noise. Density-based clustering draws boundaries between crowded regions and the gaps between them.

1. The Clustering Problem

A single narrowband transmission, after de-Doppler search and thresholding, rarely appears as one row in a candidate table. In practice it produces dozens or hundreds of separate hits. Why?

First, channelization. Spectral data are binned into narrow channels. A carrier that is not infinitely narrow in frequency spreads power across several adjacent bins. Each bin can cross the detection threshold independently, so one emitter becomes a comb of correlated peaks at nearly the same drift.

Second, the drift grid. De-Doppler search evaluates many trial drift rates. The true drift is unlikely to land exactly on one grid point; nearby trial drifts can all produce similar integrated SNR. You therefore get a ridge of hits in (frequency, drift) space—not one point, but a tight swarm.

Third, noise and sidelobes. Real pipelines integrate over finite time, apply windowing, and sit in a sea of residual RFI and statistical fluctuations. Weak copies, harmonics, and threshold "fringes" add extra hits around the same physical feature.

Without clustering, a downstream system might report 500 "detections" when the sky actually contained three distinct emitters (or one emitter seen in three places). That breaks human triage, follow-up scheduling, and machine learning—models trained on "one example per true signal" would see a distorted label distribution.

Goal: merge hits that are spatially and physically associated in the search space, and output one representative candidate per cluster (plus a principled way to discard isolated junk).

2. Naive Approaches (and Their Problems)

Before density-based clustering, pipelines often used ad hoc geometry.

Fixed-radius merge

Idea: two hits belong together if their frequency channels differ by at most Δf and their drift rates differ by at most Δd.

Problem: there is no universal (Δf, Δd) that works. A very narrow window splits extended features: wide carriers or drift ridges stay fragmented, and you still over-count. A very wide window merges distinct signals that happen to sit near each other in parameter space—classic over-merging. The "right" scale depends on resolution, SNR, and local source density, not one global constant.

On the houses map: a fixed-radius rule says "everyone within 500 meters of me is in my neighborhood." In a dense city that might lump two different blocks; in rural areas it might miss a loose village whose houses are 600 meters apart along a road.
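A minimal sketch of the naive fixed-radius rule makes the failure mode concrete. The function name and hit representation below are illustrative, not MitraSETI's actual API; hits are (channel, drift_rate) tuples, merged greedily under fixed tolerances.

```python
def fixed_radius_merge(hits, df=2.0, dd=0.05):
    """Greedy merge with global tolerances df (channels) and dd (Hz/s)."""
    groups = []  # each group is a list of hits
    for f, d in hits:
        attached = None
        for g in groups:
            # link to any group containing a hit within the fixed window
            if any(abs(f - gf) <= df and abs(d - gd) <= dd for gf, gd in g):
                if attached is None:
                    g.append((f, d))
                    attached = g
                else:
                    attached.extend(g)  # window touches two groups: chain-merge
                    g.clear()
        if attached is None:
            groups.append([(f, d)])
    return [g for g in groups if g]

# A tight ridge fragments when the window is too narrow...
ridge = [(100 + 3 * i, 0.0) for i in range(5)]        # spacing 3 > df=2
print(len(fixed_radius_merge(ridge)))                  # 5 "candidates"
# ...and widening the window bridges genuinely distinct signals.
two_signals = [(100, 0.0), (104, 0.0), (108, 0.0)]
print(len(fixed_radius_merge(two_signals, df=5.0)))    # 1
```

The same data, two window sizes, two wrong answers: exactly the "no universal (Δf, Δd)" problem described above.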

K-means

Idea: partition points into k groups by minimizing distance to k centroids.

Problem: you must choose k in advance. In SETI you do not know how many independent signals produced today's hit list—k is unknown and time-varying. K-means also tends to favor round, equal-sized blobs; real hit swarms can be elongated along drift or frequency.

DBSCAN (preview)

DBSCAN improves on k-means: it is density-based and does not need k. But it still needs a fixed neighborhood radius ε (epsilon). That single global scale is fragile when cluster density varies—we return to this in §4–5.

3. What Is Density-Based Clustering?

Key Concept — Density-Based Clustering

Density-based clustering defines a cluster as a region where many points live close together, separated by regions where points are sparse. Unlike k-means (which assumes spherical clusters around centroids), density-based methods can follow arbitrary shapes—curved ridges, banana-shaped swarms, multiple arms—as long as local density stays high along the structure.

Points in low-density areas are typically labeled noise: they are not close enough to enough neighbors to be part of any "town." That is powerful for SETI: spurious single-pixel spikes and one-off threshold glitches should not force the algorithm to invent a cluster.

Houses on a map (again): walk the landscape and, at each house, count how many other houses fall within a short walking distance. In a suburb, counts are high—you are surrounded. On an empty plain, counts are low. Density-based clustering draws boundaries between high-count regions and the gaps between them. A lone farmhouse miles from anyone is noise, not a "cluster of size one" unless you deliberately lower your standards.

This intuition—local crowding defines structure—is the foundation of DBSCAN and HDBSCAN.

4. DBSCAN: Precursor to HDBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) uses two main parameters:

Parameter      Role
ε (epsilon)    Radius of the neighborhood around each point
min_samples    Minimum number of points (including the center) required inside that radius for a point to be "dense enough"

Point roles

Figure 5.2 — DBSCAN point classification (min_samples = 4, ε shown as dashed circles). Core points (blue) have ≥ min_samples neighbors within ε. Border points (green) touch a core but aren't dense themselves. Noise points (pink) are too isolated for any cluster.

Clusters

A cluster is a connected component of core points, plus all border points reachable from them. Different clusters are separated by low-density gaps wider than ε.

Operationally, DBSCAN starts from an unvisited core point, expands the cluster by pulling in all points density-reachable through a chain of core points, attaches border neighbors, then moves to the next unvisited core. The process is deterministic given ε and min_samples (up to ordering tie-breaks). On the map, once you stand on a core house, you flood-fill through other core houses within walking distance until you hit the suburb's edge; border houses touch the flood but are not themselves hubs.
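The flood-fill just described fits in a few dozen lines. This is a compact, pure-Python sketch for intuition, not a production implementation; it labels points via the core/border/noise rules of Figure 5.2 and expands clusters only through core points.

```python
from math import dist

def dbscan(points, eps, min_samples):
    n = len(points)
    # neighborhood lists; each point counts itself, matching the
    # min_samples convention in the table above
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_samples for nb in neighbors]
    labels = [-1] * n  # -1 = noise until proven otherwise
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        stack, labels[i] = [i], cluster
        while stack:  # expand through density-reachable core points
            for j in neighbors[stack.pop()]:
                if labels[j] == -1:
                    labels[j] = cluster  # border or core joins the cluster
                    if core[j]:
                        stack.append(j)  # only cores keep the expansion going
        cluster += 1
    return labels

pts = [(0, 0), (0.5, 0), (1, 0), (0.5, 0.5), (10, 10)]
print(dbscan(pts, eps=1.0, min_samples=3))  # → [0, 0, 0, 0, -1]
```

Note that border points are attached to whichever cluster reaches them first, and the isolated point at (10, 10) stays at −1: no cluster of size one.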

The ε problem

ε is global: one ruler for the entire dataset. If your map has a compact city and a sprawling, loose exurb with the same true semantic "town," no single ε fits both. Tight ε fragments the exurb; loose ε bridges separate cities. SETI hit lists often mix tight high-SNR knots and looser faint ridges—exactly the failure mode of fixed ε.

So we need a method that adapts to varying density. That is where HDBSCAN enters.

5. HDBSCAN — The Upgrade

HDBSCAN stands for Hierarchical DBSCAN. The name reflects the core trick: instead of committing to one ε, the algorithm considers many scales at once and builds a hierarchy of what would be clusters if you shrank or grew ε.

You do not need to implement it by hand to use MitraSETI, but the conceptual pipeline (as in Campello, Moulavi, and Sander, building on earlier hierarchical density work) is:

Key Concept — Mutual Reachability Distance

For each pair of points, HDBSCAN defines a distance that blends ordinary separation with how "deep" each point sits inside a dense region. This density-aware metric is what lets the hierarchy built in the steps below reflect cohesive structure rather than accidental proximity.

  1. Mutual reachability distance. For each pair of points, define a distance that blends ordinary separation with how "deep" each point must be inside a dense region. This dampens the effect of outliers pulling ε too wide and makes the next step stable.
  2. Minimum spanning tree (MST). Treat points as vertices; mutual reachability distances are edge weights. Build an MST—think of connecting all houses with roads of minimum total length where "length" encodes density-aware separation.
  3. Cluster hierarchy (dendrogram). Sort MST edges by decreasing distance and remove them one by one. Each cut splits a component into smaller pieces. This yields a tree of nested clusterings: at coarse scales you see only the densest agglomerations; at fine scales you see substructure.
  4. Condensation with min_cluster_size. Very small branches are unstable: a handful of points should not count as a robust cluster. min_cluster_size prunes the hierarchy so only groups with enough members survive as candidates for selection.
  5. Stable cluster selection. From the condensed tree, HDBSCAN picks clusters that persist across meaningful density levels—intuitively, "suburbs" that do not evaporate when you slightly tighten or loosen the notion of neighborhood.
Figure 5.3 — HDBSCAN builds a hierarchy of clusterings across all density scales (build a density-aware MST, cut edges by decreasing distance, prune branches below min_cluster_size), then selects stable clusters. Small branches below min_cluster_size are pruned as noise. The user sets min_cluster_size, not a single ε—much more robust for mixed-density SETI hit lists.

Result: clusters can have different densities in different parts of the space. Points that never join a sufficiently stable, large group are labeled −1 (noise).
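Step 1 above can be sketched directly. The following is an illustration of mutual reachability distance, not the hdbscan library's internals: core_k(p) is the distance from p to its k-th nearest neighbor, and the mutual reachability between a and b is max(core_k(a), core_k(b), d(a, b)).

```python
import numpy as np

def mutual_reachability(X, k=3):
    # pairwise Euclidean distances (row 0 of each sorted row is the self-distance)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    core = np.sort(d, axis=1)[:, k]  # distance to the k-th nearest neighbor
    # inflate each edge to at least both endpoints' core distances
    return np.maximum(d, np.maximum(core[:, None], core[None, :]))

# Two isolated "farmhouses" sit only 0.5 apart, but their core distances
# are huge, so the density-aware edge between them is inflated.
X = np.array([[0.0, 0], [0.1, 0], [0.2, 0], [0.3, 0],   # dense suburb
              [10.0, 0], [10.5, 0]])                     # two isolated points
m = mutual_reachability(X, k=3)
print(m[4, 5] > np.linalg.norm(X[4] - X[5]))  # True: 0.5 becomes > 10
```

Inside the suburb, core distances are tiny, so mutual reachability barely differs from Euclidean distance; the metric only punishes edges that pass through sparse territory.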

Key Concept — min_cluster_size

The key user-facing parameter is min_cluster_size, not a single ε. This aligns much better with SETI, where you care about minimum evidence ("don't call three pixels a cluster unless you must") rather than picking one magical radius in Hz or channels. In the houses analogy: min_cluster_size says "don't promote a cul-de-sac with two houses to city status."


Normalization in practice: Real pipelines (including MitraSETI variants) often standardize channel index, drift, and log-SNR to zero mean and unit variance before clustering so no single axis dominates purely because of units or dynamic range. The physics lives in the original headers; the metric lives in a balanced feature space.

6. How MitraSETI Uses HDBSCAN

MitraSETI clusters de-Doppler hits in a feature space designed to reflect how duplicates arise:

(frequency_channel, drift_rate, log₁₀(SNR))

Why log(SNR)? SNR can span orders of magnitude—say 10 to tens of thousands in arbitrary units. Raw SNR would let a few screaming outliers dominate the distance metric; weaker but physically related hits would look artificially far away. Taking log₁₀ (or an equivalent log transform) compresses that dynamic range so bright and moderate members of the same physical cluster can sit near each other in the feature vector.
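Combining this with the normalization noted in §5, the feature construction can be sketched as follows. Column names and the exact transform are illustrative (a given MitraSETI release may use log1p, different scaling, etc.); the point is log compression followed by per-axis standardization.

```python
import numpy as np

def hit_features(channel, drift, snr):
    """Build standardized (channel, drift, log10 SNR) vectors for clustering."""
    X = np.column_stack([channel, drift, np.log10(snr)])  # compress SNR range
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma[sigma == 0] = 1.0          # guard: a constant column stays at zero
    return (X - mu) / sigma          # zero mean, unit variance per axis

channel = np.array([1000, 1002, 1001, 5000])
drift   = np.array([0.10, 0.11, 0.10, -2.0])
snr     = np.array([12.0, 15.0, 30000.0, 25.0])  # ~4 orders of magnitude
X = hit_features(channel, drift, snr)
print(np.allclose(X.mean(axis=0), 0.0))  # True: balanced feature space
```

After this transform, the 30000-SNR hit is a near neighbor of its 12-SNR siblings along the SNR axis instead of an outlier that drags the whole metric.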

min_cluster_size is chosen adaptively, for example:

min_cluster_size = max(3, int(len(hits) × 0.01))

So larger hit lists demand larger minimum groups, reducing spurious micro-clusters when the sky is crowded with threshold crossings; tiny lists keep a floor so you do not over-cluster noise into singleton "clusters."
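The adaptive rule above is a one-liner; as the note at the end of this section says, check your release's configuration for the exact formula.

```python
def adaptive_min_cluster_size(n_hits, floor=3, fraction=0.01):
    """min_cluster_size = max(floor, 1% of the hit list), as described above."""
    return max(floor, int(n_hits * fraction))

print(adaptive_min_cluster_size(500))    # 5: 1% of a busy hit list
print(adaptive_min_cluster_size(50))     # 3: the floor protects tiny lists
```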

Representative selection: within each non-noise cluster, MitraSETI keeps the highest-SNR hit as the single candidate that stands for the whole group. That matches the scientific preference: the strongest measurement is usually the best constrained for follow-up and for ML feature extraction.

Noise (label −1): hits that HDBSCAN cannot attach to any stable dense region are treated as spurious or insufficiently corroborated and are discarded from the clustered candidate list (subject to any additional pipeline stages you enable).
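Representative selection and noise handling together amount to a short reduction over (hit, label) pairs. The dict-based hit records below are illustrative, not MitraSETI's actual schema.

```python
def select_representatives(hits, labels):
    """Keep the highest-SNR hit per cluster; drop the noise label (-1)."""
    best = {}
    for hit, label in zip(hits, labels):
        if label == -1:
            continue  # noise: not corroborated by any stable dense region
        if label not in best or hit["snr"] > best[label]["snr"]:
            best[label] = hit
    return list(best.values())

hits = [{"chan": 100, "snr": 12.0}, {"chan": 101, "snr": 48.0},
        {"chan": 102, "snr": 30.0}, {"chan": 900, "snr": 9.0}]
labels = [0, 0, 0, -1]  # one three-hit cluster plus one noise point
print(select_representatives(hits, labels))  # [{'chan': 101, 'snr': 48.0}]
```

One cluster in, one row out: the strongest measurement stands for the group, and the isolated hit never reaches the candidate list.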

Note: In a given MitraSETI version, exact formulas (e.g. log1p vs log10, or the precise min_cluster_size rule) may differ slightly from the illustrative values above—always check the pipeline configuration for the release you run.

7. Before vs After: A Concrete Example

Consider a dense post–de-Doppler hit list—roughly 500 threshold crossings after RFI masking.

Figure 5.4 — Before clustering: hundreds of redundant hits from three physical signals plus noise. After HDBSCAN: one representative per cluster, noise discarded. 500 hits → 12 cluster representatives + ~200 noise discarded. One signal, one row.

Greedy fixed-window merge (sort and merge by proximity only) might collapse them to roughly 45 "candidates." That sounds better than 500, but many entries can still be duplicate representations of the same ridge, or weak fringe hits that a rigid tolerance failed to separate cleanly from true clusters.

HDBSCAN, by respecting local density and stability, might yield roughly 12 cluster representatives, with ~200 points classified as noise and dropped. Those 12 are the distinct, corroborated groupings; the 200 are isolated or loosely coupled flukes that should not consume reviewer attention.

The exact numbers move with dataset, thresholds, and band, but the pattern is what matters: density-aware clustering attacks multiplicity at the source—one signal, one row—while noise labels provide a second line of defense beyond SNR alone.

8. Graceful Fallback

HDBSCAN needs enough points to estimate density. For very small hit lists (for example fewer than five detections), the hierarchy is degenerate and distance estimates unstable—the algorithm has little to work with, much like trying to define a "suburb" from three houses total.

MitraSETI therefore uses a hybrid strategy: when the hit list is large enough to support density estimation, it runs HDBSCAN as described above; when it is too small, it falls back to a simpler rule rather than forcing a degenerate hierarchy.

This combination is typical of production pipelines: sophisticated when data support it, robust when they do not.
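The shape of that decision can be sketched as below. The threshold constant, the fallback rule (keep the strongest hit), and the helper names are all hypothetical; the actual fallback behavior is release-specific.

```python
MIN_HITS_FOR_HDBSCAN = 5  # below this, density estimates are unreliable

def run_hdbscan(hits):
    """Stand-in for the density-aware path described above (identity here
    so the sketch is self-contained)."""
    return hits

def cluster_hits(hits):
    """Hybrid dispatch: HDBSCAN when data support it, trivial rule otherwise."""
    if len(hits) >= MIN_HITS_FOR_HDBSCAN:
        return run_hdbscan(hits)
    # hypothetical fallback: keep only the strongest hit from a tiny list
    return [max(hits, key=lambda h: h["snr"])] if hits else []

print(cluster_hits([{"snr": 5.0}, {"snr": 9.0}]))  # falls back: [{'snr': 9.0}]
```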

9. Anomaly Detection and Out-of-Distribution (OOD) Scoring

Clustering answers "which hits are the same thing?" It does not, by itself, answer "is that thing familiar?"

Insight — OOD Scoring Concept

MitraSETI's ML stages include out-of-distribution (OOD) detection: the model assigns an OOD score—how far this example lies from the manifold of training examples, not just which labeled class wins a softmax. When high OOD coincides with high SNR and low RFI confidence, the triage system surfaces maximum-interest targets: not just loud, but qualitatively different from the zoo of known false positives.

Think of OOD as a cartographer's question: clustering told you which houses form a hamlet; OOD asks whether that hamlet's architecture matches any town you've surveyed before, or whether you should flag it for a site visit.

MitraSETI's machine learning stages (detailed in the next chapter) include classification against known signal morphologies and interference types. Out-of-distribution (OOD) detection complements that classification: instead of asking which labeled class wins a softmax, it measures how far the example lies from the manifold of training examples.

Interpretation: when high OOD coincides with high SNR and low RFI confidence (having survived spectral kurtosis, catalogs, cadence tests, etc.), the triage system surfaces maximum-interest targets: not just loud, but qualitatively different from the zoo of known false positives. That directly connects to the interestingness score in Chapter 06 — Machine Learning Pipeline, where multiple evidence channels are fused into a single ranking for humans and telescopes.
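A minimal OOD-style score illustrates the concept (this is a toy nearest-neighbor distance, not MitraSETI's actual model): score a candidate by its distance to the closest training example. Familiar candidates score low; candidates off the training manifold score high.

```python
import numpy as np

def ood_score(x, X_train):
    """Distance from feature vector x to its nearest training example."""
    return float(np.min(np.linalg.norm(X_train - x, axis=1)))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))       # "known signal" feature cloud
familiar = np.zeros(3)                     # sits inside the cloud
strange = np.array([8.0, 8.0, 8.0])        # far off the training manifold
print(ood_score(strange, X_train) > ood_score(familiar, X_train))  # True
```

Real OOD estimators (density models, deep-feature distances, ensembles) are more sophisticated, but they answer the same cartographer's question: does this hamlet's architecture match any town we have surveyed?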

10. Other Clustering Approaches (Comparison)

GLOBULAR (2025, Astronomical Journal) describes a modern narrowband search that also employs HDBSCAN-style density clustering to collapse redundant detections, extended with geographic / site metadata to reason about interference environment. The philosophical alignment with MitraSETI—density-based dedup after sensitive search—is strong; MitraSETI's feature space emphasizes channel, drift, and SNR rather than globe-aware context, but the clustering motivation is the same.

turboSETI historically relied on simpler, tree- or merge-based grouping of hits along frequency and drift—fast and transparent, but more sensitive to fixed tolerances and uneven density than HDBSCAN.

hyperseti (and some other experimental stacks) may omit explicit clustering as a separate stage, folding deduplication into other logic or leaving multiplicity for downstream consumers. The tradeoff is implementation simplicity versus explicit control over false multiplicity.

Summary

De-Doppler search turns each physical signal into a swarm of correlated hits across adjacent channels and trial drifts. Fixed-radius merges and k-means fail because no single scale or cluster count fits a mixed-density hit list, and DBSCAN's global ε inherits the same fragility. HDBSCAN sweeps all density scales, prunes branches below min_cluster_size, and selects stable clusters, letting MitraSETI cluster hits in (channel, drift, log₁₀ SNR) space, keep the highest-SNR representative per cluster, and discard noise. OOD scoring then asks whether each surviving candidate resembles anything seen before, feeding the interestingness score of the next chapter.

Try it in the Cloud

HDBSCAN clustering and anomaly scoring run on every job. Try it with your own observation data.

Open MitraSETI Cloud →