Splitting a Forensic-Finance Monolith into Four Repos: Why and How

Source: kr-forensic-core · kr-dart-pipeline · kr-anomaly-scoring · kr-stat-tests

For most of 2025 these four repositories were one. The single project — then called kr-forensic-finance — held the DART ingestion code, the M-Score and CB/BW scoring functions, the 14 statistical validation scripts, and a growing pile of constants that everything else imported. It worked. It also accumulated the predictable monolith failure modes: every change touched everything; the constants module became a circular-import minefield; running the test suite required the full data pipeline to have completed first; and external users who wanted just the M-Score code had to install the entire dependency stack to get it.

In March 2026 it was split into four. Each piece is independently installable, independently tested, independently versioned, and importable without dragging the rest of the stack along.

This post is about why the split produced those four pieces rather than three or five, and what each one does.


The Dependency Graph, Honestly Drawn

The split was driven by the actual dependency relationships rather than by surface-level grouping. The pre-split monolith had hundreds of internal imports; the question was where the natural cut points were.

Three cut points held up under inspection:

  1. Constants and schemas vs. everything else. The Beneish thresholds, the parquet column contracts, the path conventions — all of them were imported by every other layer. Pulling them into a zero-dependency leaf package eliminated the worst circular import patterns and made the constants citable on their own.

  2. Data ingestion vs. data analysis. The ETL layer (DART API calls, KRX fetches, parquet writes) had nothing analytical in it. It produces files; that is its entire contract. Splitting it into its own repo meant the data files became the interface between ingestion and analysis, which is much cleaner than function-call coupling.

  3. Scoring vs. statistical validation. Scoring is “compute a flag for each company.” Statistical validation is “compute the methodological evidence for the scoring choices.” They shared no code paths in practice — they only shared input parquets. Splitting them meant the validation suite could evolve methodologically without forcing scoring releases.

The result is a directed graph with kr-forensic-core at the bottom, three siblings depending on it, and the two analysis repos reading from kr-dart-pipeline’s parquet outputs:

kr-forensic-core  ← constants, schemas, paths (zero deps)
    ├── kr-dart-pipeline       (ETL: writes parquets)
    ├── kr-anomaly-scoring     (reads parquets, scores)
    └── kr-stat-tests          (reads parquets, validates)

The delivery layer (krff-shell) sits above all four and is documented separately.


kr-forensic-core: The Foundation

Zero external dependencies. The package is roughly 166 lines of Python — constants modules, a schema registry (PARQUET_TABLES), a path-resolution function (data_dir()), and the canonical column lists for each downstream parquet (BENEISH_SCORES_COLUMNS, CB_BW_EVENTS_COLUMNS, etc.).

It exists because every other repository in the platform — kr-dart-pipeline, kr-anomaly-scoring, kr-stat-tests, and krff-shell — pulls it in as a transitive dependency. When kr-anomaly-scoring needs to know the Beneish threshold (-1.78), it imports BENEISH_THRESHOLD from kr_forensic_core. When kr-dart-pipeline writes the cb_bw_events.parquet file, it validates the column set against kr_forensic_core.schemas.CB_BW_EVENTS_COLUMNS. When krff-shell needs to find the data directory, it calls data_dir() from kr_forensic_core.
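
In code, the consumption pattern looks something like the following sketch. The exact module layout inside kr_forensic_core is assumed here, as is data_dir() returning a pathlib.Path; the public names are the ones described above.

    # Illustrative only: module paths inside kr_forensic_core are assumed,
    # but the public names match those described in the text.
    import pandas as pd

    from kr_forensic_core import BENEISH_THRESHOLD, data_dir
    from kr_forensic_core.schemas import CB_BW_EVENTS_COLUMNS

    def load_cb_bw_events() -> pd.DataFrame:
        # Fail fast if the parquet written upstream breaks the column contract.
        df = pd.read_parquet(data_dir() / "cb_bw_events.parquet")
        missing = set(CB_BW_EVENTS_COLUMNS) - set(df.columns)
        if missing:
            raise ValueError(f"cb_bw_events.parquet missing columns: {sorted(missing)}")
        return df

    def beneish_flag(m_score: float) -> bool:
        # M-Scores above the threshold (-1.78) indicate likely manipulation.
        return m_score > BENEISH_THRESHOLD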

That centralization is the entire point. Constants and schemas are exactly the kind of thing that drifts when every consumer keeps its own copy. Putting them in a leaf package with zero dependencies makes the consequences of a constant change visible everywhere it is used, in a single version bump.

10 tests. Mostly schema validation and constant invariants — nothing complex, but enough to catch a renamed column or a typo’d threshold before the change cascades downstream.


kr-dart-pipeline: The 15 Extractors

This is the data ingestion layer. The interface is files: it produces standardized parquets in a conventional directory layout, and that is the entire contract with downstream consumers.

15 extractors, each writing one parquet, drawing from five Korean data sources:

  • DART (FSS): extract_dart (annual financials), extract_officer_holdings (board-member shareholdings), extract_major_holders (5%+ ownership reports), extract_disclosures (filing history), extract_corp_actions (capital structure events), extract_corp_ticker_map (corp_code ↔ ticker), and three sub-document parsers — extract_bondholder_register, extract_revenue_schedule, extract_depreciation_schedule — that pull structured tables out of DART annual business report (사업보고서) sub-documents via regex and BeautifulSoup.
  • KRX: extract_krx and extract_price_volume for OHLCV time series.
  • SEIBRO (KSD): extract_seibro and extract_seibro_repricing for CB/BW events and repricing histories. Currently blocked — the public API has been returning resultCode=99 since early 2026.
  • KFTC: extract_kftc for chaebol cross-shareholding data.
  • FSC: build_isin_map for the bond ISIN ↔ corp_code mapping.

The two interesting design choices: each extractor is independently runnable (no implicit pipeline ordering), and each writes its parquet atomically (a temp file that gets renamed on success). That combination means the pipeline can be re-run partially after individual failures without redoing the expensive successful stages.
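
The atomic write is the standard temp-file-then-rename idiom. A minimal sketch of that pattern with pandas (not the pipeline’s actual code):

    import os
    import tempfile
    from pathlib import Path

    import pandas as pd

    def write_parquet_atomic(df: pd.DataFrame, dest: Path) -> None:
        # Write into a temp file on the same filesystem, then rename into
        # place. os.replace is atomic on POSIX, so a crashed extractor can
        # never leave a half-written parquet at the destination path.
        fd, tmp = tempfile.mkstemp(dir=dest.parent, suffix=".parquet.tmp")
        os.close(fd)  # pandas opens the path itself
        try:
            df.to_parquet(tmp)
            os.replace(tmp, dest)
        except BaseException:
            os.unlink(tmp)  # discard the partial file; dest is untouched
            raise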

29 tests. One known failure (test_cli_module_importable — a typer dependency issue that does not block the actual pipeline runs). The SEIBRO extractors have stub implementations that fail loudly when the API is broken; the rest are operational.


kr-anomaly-scoring: The 4 CB/BW Flags Plus Network Centrality

This is where the parquets become signals. The package reads from the data directory kr-dart-pipeline writes to and produces per-company anomaly scores.

The CB/BW screen has four flags, each tied to a published threshold:

Flag                     Description                                     Threshold
repricing_below_market   Repricing adjusted below 95% of market          0.95
exercise_at_peak         Exercise within 5 calendar days of price peak   5 days
volume_surge             Volume above 3x pre-event baseline              3.0
holdings_decrease        Officer holdings drop ≥5% post-exercise         0.95

Each flag returns a per-event boolean; the screen aggregates them per company. Companies with multiple flags are the priority queue for human review — the four flags individually are noisy, but their conjunction is a strong forensic signal.
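
To make the aggregation concrete, here is an illustrative per-company rollup. The flag column names follow the table above; corp_code as the company key is an assumption, and the real score_cb_bw_events signature may differ.

    # Sketch of the per-company rollup; not the package's actual API.
    import pandas as pd

    FLAG_COLS = ["repricing_below_market", "exercise_at_peak",
                 "volume_surge", "holdings_decrease"]

    def aggregate_flags(events: pd.DataFrame) -> pd.DataFrame:
        # events: one row per CB/BW event, one boolean column per flag.
        per_company = events.groupby("corp_code")[FLAG_COLS].any()
        per_company["n_flags"] = per_company[FLAG_COLS].sum(axis=1)
        # Multi-flag companies rise to the top of the human-review queue.
        return per_company.sort_values("n_flags", ascending=False)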

There is also an officer-network module that builds a cross-company directorship graph (nodes are people, edges are simultaneous board seats) and computes graph centrality metrics. Companies whose officers occupy unusually central positions in the network are flagged for review as a separate signal — the rationale ties to the JFIA literature on chaebol governance.
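
A sketch of the directorship-graph construction using networkx. The package’s actual implementation is not shown here, and betweenness as the metric is an assumption (the text above only says "graph centrality metrics").

    from itertools import combinations

    import networkx as nx

    def officer_graph(directorships: list[tuple[str, str]]) -> nx.Graph:
        # directorships: (person_id, corp_code) pairs for concurrent seats.
        by_company: dict[str, set[str]] = {}
        for person, corp in directorships:
            by_company.setdefault(corp, set()).add(person)
        g = nx.Graph()
        for corp, people in by_company.items():
            # Officers sharing a board are pairwise connected; the edge
            # remembers which company created the tie.
            for a, b in combinations(sorted(people), 2):
                g.add_edge(a, b, corp=corp)
        return g

    def most_central(g: nx.Graph, top_n: int = 20) -> list[tuple[str, float]]:
        centrality = nx.betweenness_centrality(g)
        return sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:top_n]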

13 tests. The public API is two functions: score_cb_bw_events and score_disclosures. The interesting integration testing happens in kr-stat-tests.


kr-stat-tests: The 14 Methodology Validation Scripts

This is the layer that exists so the scoring choices are defensible. Every threshold, every flag definition, every detectlet has methodological validation backing it — and those validations need to be re-runnable, version-controlled, and citable.

The 14 scripts cover eight core methodologies plus four supporting analyses:

  • PCA (pca_beneish.py) on the 8-dimensional Beneish ratio space.
  • Bootstrap (bootstrap_threshold.py, bootstrap_centrality.py) for confidence intervals on the Korean M-Score threshold and on officer-network centrality scores.
  • LASSO (lasso_beneish.py) for sparse feature selection over the M-Score components.
  • Random forest (rf_feature_importance.py) for non-parametric importance ranking.
  • Permutation tests (permutation_repricing_peak.py) on the repricing-at-peak signal.
  • FDR correction (fdr_timing_anomalies.py, fdr_disclosure_leakage.py) using Benjamini-Hochberg to control the false discovery rate across the ~3,600 tests generated by 900 companies × 4 flags (sketched after this list).
  • Survival analysis (survival_repricing.py) on time-to-repricing in CB/BW events.
  • Imputation (impute_financials.py) for the missing-financial pattern that K-IFRS nature-of-expense filers create.
  • Plus clustering (cluster_peers.py), outlier classification (classify_extreme_outliers.py), cross-screen analysis (cross_screen_analysis.py), and label coverage (label_coverage_analysis.py) — the last of which matches the 86 enforcement cases from kr-enforcement-cases against the 900+ KOSDAQ universe to estimate per-signal recall.
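
For reference, the Benjamini-Hochberg step-up procedure the two FDR scripts rely on is small enough to show in full. This is a textbook sketch, not the scripts’ actual code:

    import numpy as np

    def benjamini_hochberg(p_values: np.ndarray, q: float = 0.05) -> np.ndarray:
        # Step-up procedure: find the largest k with p_(k) <= (k/m) * q and
        # reject the k smallest p-values. Controls the expected fraction of
        # false discoveries among the rejections at level q.
        m = len(p_values)
        order = np.argsort(p_values)
        thresholds = (np.arange(1, m + 1) / m) * q
        below = p_values[order] <= thresholds
        k = int(np.max(np.nonzero(below)[0])) + 1 if below.any() else 0
        discoveries = np.zeros(m, dtype=bool)
        discoveries[order[:k]] = True
        return discoveries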

Each script writes a standalone CSV. The audit trail for any methodological claim made by the platform points to one of these output files.

5 tests. The low test count is deliberate: each of the 14 scripts requires real parquet inputs to run, and the upstream parquets are produced by the live pipeline (not synthetic fixtures). Validating the scripts themselves is a downstream integration concern; the unit tests only verify package imports, script registry consistency, and the methodology DAG.


Why Four Repos Instead of One Monorepo

Three reasons that mattered in practice.

Independent install paths. A researcher who only wants the Beneish threshold and the parquet column contracts installs kr-forensic-core (zero deps). A journalist who only wants a CB/BW dilution screen installs kr-anomaly-scoring. Neither of them needs the full ETL stack or the statistical validation suite. The monolith forced everyone to take everything.

Independent release cycles. The methodology validation scripts in kr-stat-tests evolve as new statistical questions arise. The ETL layer in kr-dart-pipeline evolves as DART exposes new endpoints. The scoring functions in kr-anomaly-scoring evolve as forensic literature publishes new signals. None of those evolution rates should constrain each other.

Honest dependency boundaries. The monolith had implicit data-ordering dependencies between modules that broke silently when something was reordered. Splitting the data layer into a separate repo whose interface is files (not Python imports) made those dependencies explicit. If kr-anomaly-scoring reads cb_bw_events.parquet and the column it needs is missing, the failure is at parquet read time, not at import time, and the diagnosis is immediate.

The four repos together are the same code that was in the monolith. They are also a much better test of whether the dependency model claimed in the architecture document actually holds — because if it did not hold, the split would not have worked.

The repositories are at the GitHub links above. All MIT-licensed. Together they are roughly 700 lines of public-API code, supported by a few thousand lines of pipeline scripts and validation tests.