Korean Forensic Accounting Toolkit
A 13-repo, 749-test forensic accounting platform that screens Korean DART filings for earnings manipulation and convertible bond dilution — from raw government data to investigative-grade output, automated end to end. Each component is independently published, MIT-licensed, and accompanied by a stand-alone write-up.
Overview
A coordinated ecosystem of 13 Python projects built for systematic forensic analysis of Korean capital markets. The toolkit covers four layers: foundation libraries for identity resolution, trading calendar math, and earnings manipulation scoring; analysis libraries for CB/BW option pricing and forensic signal detection; a platform ETL pipeline ingesting five government data sources into 15 standardized Parquet files; and a delivery shell with CLI, DuckDB query layer, FastAPI/MCP server, and HTML reports with AI-generated narrative. A 4-tier GitHub Actions agent system runs continuous quality gates — convention audits, doc drift detection, cross-repo test validation, and autonomous fix PRs.
Problem
Two problems run in parallel. First, Korean convertible bonds (전환사채, CB) and bonds with warrants (신주인수권부사채, BW) are a known vector for minority shareholder dilution on KOSDAQ. When a company issues a CB with a conversion price below the current stock price, the bondholder holds an in-the-money option at issuance, before any repricing. Detecting this systematically across all DART-listed companies requires both option pricing capability and a cross-market data join that no open tool provided. Second, Korean capital markets data is fragmented across DART (FSS), KRX, NTS, and the Ministry of Justice, each of which assigns a different identifier to the same company, with no official crosswalk. Without an identifier resolution layer, connecting a company's financial disclosures to its procurement contracts or corporate registry entry requires commercial data licenses.
Constraints
- No commercial data sources — all inputs must be free or public APIs to remain reproducible and MIT-licensable
- DART rate limits and response size require a caching strategy across 3,949+ companies and 15 endpoint types; brute-force re-fetching is not viable
- SEIBRO's public API (the bondholder register) was returning errors with no fix timeline — core CB/BW analysis could not wait for an external dependency to stabilize
- Must run on a single machine without cloud infrastructure; output must be usable by researchers without any running server
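The caching constraint above can be illustrated with a minimal on-disk cache keyed by company and endpoint. This is a sketch under stated assumptions, not the pipeline's actual code: `cached_fetch`, the directory layout, the `cb_issuance` endpoint name, the placeholder corp_code `00000000`, and the `fetch` callable are all hypothetical names introduced for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def cached_fetch(
    cache_dir: Path,
    corp_code: str,
    endpoint: str,
    fetch: Callable[[str, str], dict],
) -> dict:
    """Return the cached response for (corp_code, endpoint); fetch only on a miss."""
    path = cache_dir / endpoint / f"{corp_code}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    payload = fetch(corp_code, endpoint)  # the only place a network call would happen
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
    return payload


# Stand-in fetcher (hypothetical) that records how often it is called.
calls = []


def fake_fetch(corp_code: str, endpoint: str) -> dict:
    calls.append((corp_code, endpoint))
    return {"corp_code": corp_code, "endpoint": endpoint}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch(Path(tmp), "00000000", "cb_issuance", fake_fetch)
    second = cached_fetch(Path(tmp), "00000000", "cb_issuance", fake_fetch)
```

The second call is served from disk, so the fetcher runs once per (company, endpoint) pair. A real cache would also need an invalidation rule for amended filings, which this sketch leaves out.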
Approach
Layered multi-repo architecture with strict dependency boundaries. Foundation libraries (kr-company-registry, kr-trading-calendar, kr-beneish, jfia-catalog) have zero cross-imports and are independently installable. Analysis libraries consume data files, not code imports. The platform layer (kr-forensic-core, kr-dart-pipeline, kr-anomaly-scoring, kr-stat-tests, krff-shell) forms a directed graph with shared constants and schemas at the base. The SEIBRO dependency was bypassed by modeling the options embedded in CBs and BWs as European calls, priced with Black-Scholes from DART-disclosed conversion prices and KRX market prices. This covers the same dilution signal without proprietary bondholder data. A coordination hub repo runs ecosystem-wide CI and a 4-tier agent system (Haiku for classification, Sonnet for synthesis) that operates autonomously between sessions.
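One way to check dependency boundaries like these is a topological sort over declared dependencies. The sketch below uses the standard-library `graphlib`. The specific edges are assumptions made for illustration (the write-up says only that shared constants and schemas sit at the base), and `build_order` and `foundation_violations` are hypothetical helper names.

```python
from graphlib import CycleError, TopologicalSorter

# Assumed edges: each key depends on the repos in its value set.
PLATFORM_DEPS = {
    "kr-forensic-core": set(),
    "kr-dart-pipeline": {"kr-forensic-core"},
    "kr-anomaly-scoring": {"kr-forensic-core"},
    "kr-stat-tests": {"kr-forensic-core"},
    "krff-shell": {"kr-dart-pipeline", "kr-anomaly-scoring", "kr-stat-tests"},
}
FOUNDATION = {"kr-company-registry", "kr-trading-calendar", "kr-beneish", "jfia-catalog"}


def build_order(deps: dict[str, set[str]]) -> list[str]:
    """Return a valid build/test order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(deps).static_order())


def foundation_violations(deps: dict[str, set[str]], foundation: set[str]) -> list[str]:
    """List foundation repos that declare a dependency on another foundation repo."""
    return sorted(r for r in foundation if deps.get(r, set()) & (foundation - {r}))


order = build_order(PLATFORM_DEPS)
violations = foundation_violations(PLATFORM_DEPS, FOUNDATION)
```

A hub-level CI job could run a check of this kind on each repo's declared dependencies and fail when a cycle or a foundation cross-import appears.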
Key Decisions
Multi-repo over monorepo
Each repo can be independently tested, published, and used. A journalist who only needs the company identifier crosswalk installs one package. A researcher who needs M-Score computation installs another. Dependencies are explicit and versioned. Cross-repo blockers surface in a dedicated hub rather than disappearing into a monorepo's noise.
- Monorepo with namespace packages — simpler coordination, but forces the full dependency set on every user
- Single package with optional extras — familiar pattern but makes independent release cycles impossible
Black-Scholes for CB/BW dilution screening rather than waiting for SEIBRO
SEIBRO's public API (KSD bondholder register) was returning resultCode=99 with no ETA for a fix. Waiting would have blocked the core use case indefinitely. DART discloses conversion prices and terms at issuance; KRX provides daily prices. Black-Scholes on these inputs detects whether a CB was issued in-the-money — the primary dilution signal — without needing the bondholder register at all.
- Wait for SEIBRO API stabilization — unacceptable timeline dependency on an external agency
- Purchase commercial SEIBRO data access — incompatible with MIT licensing and reproducibility requirements
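The screening idea above can be sketched with the textbook Black-Scholes formula for a European call, using only the standard library. This is a minimal illustration, not the toolkit's pricing code: it assumes no dividends, a constant volatility and risk-free rate, and prices on a per-share basis, ignoring refixing clauses and dilution effects. The function names and the example inputs are hypothetical.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call(spot: float, strike: float, rate: float, sigma: float, years: float) -> float:
    """Black-Scholes price of a European call on a non-dividend-paying stock."""
    if min(spot, strike, sigma, years) <= 0:
        raise ValueError("spot, strike, sigma and years must be positive")
    d1 = (math.log(spot / strike) + (rate + 0.5 * sigma**2) * years) / (
        sigma * math.sqrt(years)
    )
    d2 = d1 - sigma * math.sqrt(years)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)


def moneyness_at_issuance(spot: float, conversion_price: float) -> float:
    """Positive when the conversion price sits below the market price at issuance."""
    return spot / conversion_price - 1.0


# Hypothetical issuance: market price 10,000 KRW, conversion price 8,000 KRW.
spot, conversion_price = 10_000.0, 8_000.0
option_value = bs_call(spot, conversion_price, rate=0.03, sigma=0.40, years=3.0)
in_the_money = moneyness_at_issuance(spot, conversion_price) > 0
```

Here the strike is the DART-disclosed conversion price and the spot is the KRX closing price near the issuance date. The volatility input has to be estimated separately, and the result is sensitive to that estimate.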
corp_code as the canonical primary key across all 13 components
corp_code is the only identifier that is both permanent and freely accessible through an API. It survives corporate restructuring, relisting, and SPAC mergers. BRN changes on restructuring; KRX tickers change on relisting and backdoor listings; CRN is permanent but requires paid court registry access. corp_code is also assigned by DART, the primary data source.
- BRN (National Tax Service): stable for most purposes, but it changes on corporate restructuring and is invalid for foreign-listed companies
- KRX ticker: changes at every SPAC merger, relisting, and backdoor listing (우회상장)
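The reason for keying on corp_code can be shown with a small date-aware crosswalk lookup. The sketch below is illustrative only: the `TickerSpan` record, the `resolve_corp_code` helper, and every ticker, corp_code, and date in it are made up, and the real kr-company-registry schema may look different.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TickerSpan:
    ticker: str
    corp_code: str
    valid_from: date
    valid_to: Optional[date]  # None means the ticker is still active


# Made-up example: one company whose ticker changed after a relisting.
CROSSWALK = [
    TickerSpan("111110", "00000001", date(2015, 1, 2), date(2020, 6, 30)),
    TickerSpan("222220", "00000001", date(2020, 7, 1), None),
]


def resolve_corp_code(
    ticker: str, on: date, crosswalk: list[TickerSpan] = CROSSWALK
) -> Optional[str]:
    """Map a (ticker, date) pair to the stable corp_code, or None if no span matches."""
    for span in crosswalk:
        ended = span.valid_to is not None and on > span.valid_to
        if span.ticker == ticker and span.valid_from <= on and not ended:
            return span.corp_code
    return None


before = resolve_corp_code("111110", date(2018, 1, 1))
after = resolve_corp_code("222220", date(2021, 1, 1))
stale = resolve_corp_code("111110", date(2021, 1, 1))
```

Price rows from before and after the ticker change resolve to the same corp_code, so the company's history stays joined. A lookup on the old ticker after its end date returns nothing instead of silently matching the wrong rows.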
Tech Stack
- Python ≥3.11
- uv, hatchling, pytest
- DART OpenAPI (primary filings data)
- pykrx (KRX price and volume)
- DuckDB, pandas, Parquet
- Black-Scholes / scipy / numpy
- FastAPI, MCP protocol
- GitHub Actions (4-tier CI/CD)
- Claude API (Haiku + Sonnet)
- Pydantic
Result & Impact
- 13 repositories
- 749 total tests
- 5 data sources
- 15 ETL extractors
A complete open-source forensic finance platform for Korean capital markets — the first to screen the full DART dataset for CB/BW dilution using Black-Scholes without commercial SEIBRO data. Covers the full analytical stack: company identity resolution, trading calendar math, Beneish M-Score earnings manipulation detection, CB/BW option pricing, enforcement case labeling, statistical validation, and a delivery shell for human-in-the-loop review.
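For readers unfamiliar with the M-Score, the sketch below implements the original eight-variable Beneish (1999) model. It is not the kr-beneish implementation, which applies K-IFRS adjustments that are not reproduced here. The function names are hypothetical, and the -1.78 cutoff is a commonly cited convention passed in as a parameter, not a value taken from the toolkit.

```python
def beneish_m_score(
    dsri: float,  # days sales in receivables index
    gmi: float,   # gross margin index
    aqi: float,   # asset quality index
    sgi: float,   # sales growth index
    depi: float,  # depreciation index
    sgai: float,  # SG&A expense index
    tata: float,  # total accruals to total assets
    lvgi: float,  # leverage index
) -> float:
    """Eight-variable Beneish (1999) M-Score from precomputed year-over-year indices."""
    return (
        -4.84
        + 0.920 * dsri
        + 0.528 * gmi
        + 0.404 * aqi
        + 0.892 * sgi
        + 0.115 * depi
        - 0.172 * sgai
        + 4.679 * tata
        - 0.327 * lvgi
    )


def is_flagged(m_score: float, cutoff: float = -1.78) -> bool:
    """Scores above the cutoff are treated as a screening flag, not a finding."""
    return m_score > cutoff


# Baseline case: every index unchanged year over year and zero accruals.
baseline = beneish_m_score(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0)
```

With all indices at 1 and no accruals the score is -2.48, below the cutoff. The score rises when receivables, sales growth, or accruals grow out of line with the prior year.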
Learnings
- Multi-repo ecosystems need a coordination hub from day one. Cross-repo blockers do not surface in any single repo's CI — they require a hub that scans across the ecosystem. Adding the hub late means retroactively discovering problems that accumulated silently.
- External API failures are a design constraint, not an incident. SEIBRO's breakdown forced the Black-Scholes approach, which turned out to be analytically preferable to waiting. Design each component so data source failures degrade gracefully rather than blocking the pipeline.
- Automated agents reduce maintenance overhead but require careful model routing. Haiku is sufficient for classification tasks (triage ranking, doc drift detection); Sonnet is needed for synthesis (convention audit, fix generation). Misrouting adds cost without improving output.
- Identifier stability determines join reliability across the entire pipeline. Any join on KRX tickers silently breaks at every SPAC merger and backdoor listing. Establishing corp_code as the canonical key at the start prevented a class of data integrity bugs that would have been expensive to diagnose later.
Component Write-Ups
Every component repository ships with its own stand-alone write-up. Read whichever matches the question you arrived with.
Foundation libraries
- "3,949 Companies. Four Numbering Systems. One Table." (kr-company-registry): The DART/KRX/BRN/CRN crosswalk that joins the four Korean government identifier systems no agency officially links.
- "60 Calendar Days Is 38 Trading Days." (kr-trading-calendar): Three functions for KRX trading-day arithmetic, fixing the off-by-37% bug that recurs in Korean market code.
- "The Beneish M-Score, Reimplemented for Korean IFRS." (kr-beneish): The earnings-manipulation screen, with the structural adjustments K-IFRS demands.
- "Sixteen Years of Forensic Accounting Research, in One JSON File." (jfia-catalog): 469 JFIA articles indexed for programmatic search.
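The calendar-versus-trading-day gap named in the kr-trading-calendar write-up can be seen in a few lines. This sketch is not that library's API: `trading_days_between` is a hypothetical helper, and the holiday set is supplied by the caller because the real KRX holiday calendar is not reproduced here.

```python
from datetime import date, timedelta


def trading_days_between(
    start: date, end: date, holidays: frozenset[date] = frozenset()
) -> int:
    """Count days in (start, end] that are weekdays and not in `holidays`."""
    count = 0
    day = start
    while day < end:
        day += timedelta(days=1)
        if day.weekday() < 5 and day not in holidays:
            count += 1
    return count


start = date(2024, 1, 1)
end = start + timedelta(days=60)  # 60 calendar days later: 2024-03-01

weekdays_only = trading_days_between(start, end)  # 44 weekdays before holidays
with_one_holiday = trading_days_between(
    start, end, holidays=frozenset({date(2024, 2, 9)})  # illustrative holiday
)
```

Weekends alone cut this 60-day window to 44 candidate sessions, and each exchange holiday removes one more. The exact trading-day count for any window depends on the holiday calendar in force, which is why code that treats calendar days as trading days overstates the window.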
Analysis libraries
- "Pricing Convertible-Bond Dilution Without SEIBRO." (kr-derivatives): Black-Scholes on the embedded conversion option, with honest accounting of the sigma-fallback and board-date caveats.
- "Detectlets: Compiling Forensic Accounting Research Into Computable Detection Modules." (jfia-forensic): The schema and four reference detectlets that turn published rules into runnable screens.
- "240 Korean Accounting Violations, Coded Once So Researchers Don’t Have to Code Them Again." (kr-enforcement-cases): FSS and SFC enforcement decisions extracted, taxonomized, and DART-linked.
Platform layer
- "Splitting a Forensic-Finance Monolith into Four Repos." (kr-forensic-core, kr-dart-pipeline, kr-anomaly-scoring, kr-stat-tests): The architecture story behind the platform repositories.
- "An MCP Server for Korean Forensic Finance." (krff-shell): The delivery layer, with eleven MCP tools, per-company HTML reports, and a DuckDB query layer.
All thirteen repositories live under the pon00050 GitHub account and are MIT-licensed. The orchestration hub — cross-repo CI, ecosystem status, dependency graph — is forensic-accounting-toolkit.