Ongoing

Korean Forensic Accounting Toolkit

Builder · 2026 · 6 min read

A 13-repo, 749-test forensic accounting platform that screens Korean DART filings for earnings manipulation and convertible bond dilution — from raw government data to investigative-grade output, automated end to end. Each component is independently published, MIT-licensed, and accompanied by a stand-alone write-up.

Overview

A coordinated ecosystem of 13 Python projects built for systematic forensic analysis of Korean capital markets. The toolkit covers four layers: foundation libraries for identity resolution, trading calendar math, and earnings manipulation scoring; analysis libraries for CB/BW option pricing and forensic signal detection; a platform ETL pipeline ingesting five government data sources into 15 standardized Parquet files; and a delivery shell with CLI, DuckDB query layer, FastAPI/MCP server, and HTML reports with AI-generated narrative. A 4-tier GitHub Actions agent system runs continuous quality gates — convention audits, doc drift detection, cross-repo test validation, and autonomous fix PRs.

Problem

Two problems run in parallel. First: Korean convertible bonds (전환사채) and bonds with warrants (신주인수권부사채) are a known vector for minority shareholder dilution on KOSDAQ. A company issues a CB with a conversion price set below the current stock price, meaning the bondholder profits at issuance before any repricing — but detecting this systematically across all DART-listed companies requires both option pricing capability and a cross-market data join that no open tool provided. Second: Korean capital markets data is fragmented across agencies — DART (FSS), KRX, NTS, and the Ministry of Justice — each assigning different identifiers to the same company, with no official crosswalk. Without an identifier resolution layer, connecting a company's financial disclosures to its procurement contracts or corporate registry entry requires commercial data licenses.

Constraints

  • No commercial data sources — all inputs must be free or public APIs to remain reproducible and MIT-licensable
  • DART rate limits and response size require a caching strategy across 3,949+ companies and 15 endpoint types; brute-force re-fetching is not viable
  • SEIBRO's public API (the bondholder register) was returning errors with no fix timeline — core CB/BW analysis could not wait for an external dependency to stabilize
  • Must run on a single machine without cloud infrastructure; output must be usable by researchers without any running server
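The caching constraint above can be met with a simple content-addressed disk cache. This is an illustrative sketch, not the pipeline's actual implementation — `cached_fetch`, `CACHE_DIR`, and the endpoint naming are hypothetical:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".dart_cache")  # hypothetical cache location

def cached_fetch(corp_code: str, endpoint: str, fetch_fn):
    """Return a cached DART response if present; otherwise call fetch_fn
    once and persist the result, so re-runs across ~3,949 companies and
    15 endpoint types never re-hit the rate-limited API."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(f"{corp_code}:{endpoint}".encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = fetch_fn(corp_code, endpoint)
    path.write_text(json.dumps(result))
    return result
```

Keying on (corp_code, endpoint) means a full re-run after an interruption costs only the requests that never completed.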

Approach

Layered multi-repo architecture with strict dependency boundaries. Foundation libraries (kr-company-registry, kr-trading-calendar, kr-beneish, jfia-catalog) have zero cross-imports and are independently installable. Analysis libraries consume data files, not code imports. The platform layer (kr-forensic-core, kr-dart-pipeline, kr-anomaly-scoring, kr-stat-tests, krff-shell) forms a directed graph with shared constants and schemas at the base. The SEIBRO dependency was bypassed by modeling CB/BW embedded options as European calls using Black-Scholes on DART-disclosed conversion prices and KRX market prices — covering the same signal without proprietary bondholder data. A coordination hub repo runs ecosystem-wide CI and a 4-tier agent system (Haiku for classification, Sonnet for synthesis) that operates autonomously between sessions.
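The pricing core described above is standard Black-Scholes. A minimal stdlib-only sketch (the repos list scipy/numpy; this stand-in uses `statistics.NormalDist`, and the function name `bs_call` is hypothetical):

```python
from math import exp, log, sqrt
from statistics import NormalDist

def bs_call(spot: float, strike: float, t: float,
            r: float, sigma: float) -> float:
    """Black-Scholes value of a European call — the model applied to a
    CB/BW embedded option, with the DART-disclosed conversion price as
    the strike and the KRX market price as the spot."""
    d1 = (log(spot / strike) + (r + 0.5 * sigma**2) * t) / (sigma * sqrt(t))
    d2 = d1 - sigma * sqrt(t)
    n = NormalDist().cdf
    return spot * n(d1) - strike * exp(-r * t) * n(d2)

# At-the-money check: S=K=100, 1 year, r=5%, sigma=20% -> ~10.45
```

Treating the conversion right as a plain European call ignores refixing clauses, so the value is a conservative floor — enough for a screen that only needs to rank candidates.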

Key Decisions

Multi-repo over monorepo

Reasoning:

Each repo can be independently tested, published, and used. A journalist who only needs the company identifier crosswalk installs one package. A researcher who needs M-Score computation installs another. Dependencies are explicit and versioned. Cross-repo blockers surface in a dedicated hub rather than disappearing into a monorepo's noise.

Alternatives considered:
  • Monorepo with namespace packages — simpler coordination, but forces the full dependency set on every user
  • Single package with optional extras — familiar pattern but makes independent release cycles impossible

Black-Scholes for CB/BW dilution screening rather than waiting for SEIBRO

Reasoning:

SEIBRO's public API (KSD bondholder register) was returning resultCode=99 with no ETA for a fix. Waiting would have blocked the core use case indefinitely. DART discloses conversion prices and terms at issuance; KRX provides daily prices. Black-Scholes on these inputs detects whether a CB was issued in-the-money — the primary dilution signal — without needing the bondholder register at all.

Alternatives considered:
  • Wait for SEIBRO API stabilization — unacceptable timeline dependency on an external agency
  • Purchase commercial SEIBRO data access — incompatible with MIT licensing and reproducibility requirements
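The primary dilution signal described in this decision reduces to a moneyness check at issuance. A hedged sketch — the function name and default threshold are illustrative, not the shipped API:

```python
def cb_issued_in_the_money(conversion_price: float,
                           spot_at_issuance: float,
                           min_discount: float = 0.0) -> bool:
    """Flag a CB/BW whose DART-disclosed conversion price sits below the
    KRX market price on the issuance date — the bondholder is in profit
    before any downward repricing occurs."""
    discount = (spot_at_issuance - conversion_price) / spot_at_issuance
    return discount > min_discount

# A 10,000 KRW stock with an 8,000 KRW conversion price is a 20% discount.
```

Raising `min_discount` trades recall for precision when ranking candidates for human review.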

corp_code as the canonical primary key across all 13 components

Reasoning:

corp_code is the only identifier that is both permanent and freely resolvable: it survives corporate restructuring, relisting, and SPAC mergers. BRN changes on restructuring; KRX tickers change on relisting and backdoor listings; CRN is permanent but requires paid court registry access. corp_code is also assigned by DART, the primary data source, so the canonical key comes from the same place as the filings themselves.

Alternatives considered:
  • BRN (National Tax Service) — stable for most purposes but changes on corporate restructuring and invalid for foreign-listed companies
  • KRX ticker — changes at every SPAC merger, relisting, and backdoor listing (우회상장)
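In practice this decision means every dataset carries corp_code and joins never run on tickers or BRN. A sketch with pandas — the column names and values are illustrative, not real companies:

```python
import pandas as pd

# Hypothetical crosswalk: corp_code is the join key; BRN and ticker are
# carried as mutable attributes, never used as keys themselves.
crosswalk = pd.DataFrame({
    "corp_code": ["00000001", "00000002"],
    "brn": ["110-81-00001", "110-81-00002"],
    "ticker": ["000001", "000002"],
})

# Filing-level data keyed the same way joins cleanly even if the company
# later changes its ticker in a SPAC merger or backdoor listing.
filings = pd.DataFrame({"corp_code": ["00000002"], "m_score": [-1.20]})

joined = filings.merge(crosswalk, on="corp_code", how="left")
```

A left join here can also surface crosswalk gaps: any row whose ticker comes back null is a company the registry has not yet resolved.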

Tech Stack

  • Python ≥3.11
  • uv, hatchling, pytest
  • DART OpenAPI (primary filings data)
  • pykrx (KRX price and volume)
  • DuckDB, pandas, Parquet
  • Black-Scholes / scipy / numpy
  • FastAPI, MCP protocol
  • GitHub Actions (4-tier CI/CD)
  • Claude API (Haiku + Sonnet)
  • Pydantic

Result & Impact

  • 13 repositories
  • 749 total tests
  • 5 data sources
  • 15 ETL extractors

A complete open-source forensic finance platform for Korean capital markets — the first to screen the full DART dataset for CB/BW dilution using Black-Scholes without commercial SEIBRO data. Covers the full analytical stack: company identity resolution, trading calendar math, Beneish M-Score earnings manipulation detection, CB/BW option pricing, enforcement case labeling, statistical validation, and a delivery shell for human-in-the-loop review.

Learnings

  • Multi-repo ecosystems need a coordination hub from day one. Cross-repo blockers do not surface in any single repo's CI — they require a hub that scans across the ecosystem. Adding the hub late means retroactively discovering problems that accumulated silently.
  • External API failures are a design constraint, not an incident. SEIBRO's breakdown forced the Black-Scholes approach, which turned out to be analytically preferable to waiting. Design each component so data source failures degrade gracefully rather than blocking the pipeline.
  • Automated agents reduce maintenance overhead but require careful model routing. Haiku is sufficient for classification tasks (triage ranking, doc drift detection); Sonnet is needed for synthesis (convention audit, fix generation). Misrouting adds cost without improving output.
  • Identifier stability determines join reliability across the entire pipeline. Any join on KRX tickers silently breaks at every SPAC merger and backdoor listing. Establishing corp_code as the canonical key at the start prevented a class of data integrity bugs that would have been expensive to diagnose later.

Component Write-Ups

Every component repository ships with its own stand-alone write-up. Read whichever matches the question you arrived with.

Foundation libraries

Analysis libraries

Platform layer

All thirteen repositories live under the pon00050 GitHub account and are MIT-licensed. The orchestration hub — cross-repo CI, ecosystem status, dependency graph — is forensic-accounting-toolkit.