Ongoing

Korean Forensic Accounting Toolkit

Builder · 2026 · 6 min read

A 13-repo, 749-test forensic accounting platform that screens Korean DART filings for earnings manipulation and convertible bond dilution — from raw government data to investigative-grade output, automated end to end. Each component is independently published, MIT-licensed, and accompanied by a stand-alone write-up.

Overview

A coordinated ecosystem of 13 Python projects built for systematic forensic analysis of Korean capital markets. The toolkit covers four layers: foundation libraries for identity resolution, trading calendar math, and earnings manipulation scoring; analysis libraries for CB/BW option pricing and forensic signal detection; a platform ETL pipeline ingesting five government data sources into 15 standardized Parquet files; and a delivery shell with CLI, DuckDB query layer, FastAPI/MCP server, and HTML reports with AI-generated narrative. A 4-tier GitHub Actions agent system runs continuous quality gates — convention audits, doc drift detection, cross-repo test validation, and autonomous fix PRs.

Problem

Two problems run in parallel. First: Korean convertible bonds (전환사채) and bonds with warrants (신주인수권부사채) are a known vector for minority shareholder dilution on KOSDAQ. A company issues a CB with a conversion price set below the current stock price, meaning the bondholder profits at issuance before any repricing — but detecting this systematically across all DART-listed companies requires both option pricing capability and a cross-market data join that no open tool provided. Second: Korean capital markets data is fragmented across agencies — DART (FSS), KRX, NTS, and the Ministry of Justice — each assigning different identifiers to the same company, with no official crosswalk. Without an identifier resolution layer, connecting a company's financial disclosures to its procurement contracts or corporate registry entry requires commercial data licenses.

Constraints

  • No commercial data sources — all inputs must be free or public APIs to remain reproducible and MIT-licensable
  • DART rate limits and response size require a caching strategy across 3,949+ companies and 15 endpoint types; brute-force re-fetching is not viable
  • SEIBRO's public API (the bondholder register) was returning errors with no fix timeline — core CB/BW analysis could not wait for an external dependency to stabilize
  • Must run on a single machine without cloud infrastructure; output must be usable by researchers without any running server
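The caching constraint above can be met with a simple content-addressed disk cache. This is an illustrative sketch, not the pipeline's actual implementation — `cached_fetch`, `CACHE_DIR`, and the endpoint naming are hypothetical:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".dart_cache")  # hypothetical cache location

def cached_fetch(corp_code: str, endpoint: str, fetch_fn):
    """Return a cached DART response if present; otherwise call fetch_fn
    once and persist the result, so re-runs across ~3,949 companies and
    15 endpoint types never re-hit the rate-limited API."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(f"{corp_code}:{endpoint}".encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = fetch_fn(corp_code, endpoint)
    path.write_text(json.dumps(result))
    return result
```

Keying on (corp_code, endpoint) means a full re-run after an interruption costs only the requests that never completed.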

Approach

Layered multi-repo architecture with strict dependency boundaries. Foundation libraries (kr-company-registry, kr-trading-calendar, kr-beneish, jfia-catalog) have zero cross-imports and are independently installable. Analysis libraries consume data files, not code imports. The platform layer (kr-forensic-core, kr-dart-pipeline, kr-anomaly-scoring, kr-stat-tests, krff-shell) forms a directed graph with shared constants and schemas at the base. The SEIBRO dependency was bypassed by modeling CB/BW embedded options as European calls using Black-Scholes on DART-disclosed conversion prices and KRX market prices — covering the same signal without proprietary bondholder data. A coordination hub repo runs ecosystem-wide CI and a 4-tier agent system (Haiku for classification, Sonnet for synthesis) that operates autonomously between sessions.
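The pricing core described above is standard Black-Scholes. A minimal stdlib-only sketch (the repos list scipy/numpy; this stand-in uses `statistics.NormalDist`, and the function name `bs_call` is hypothetical):

```python
from math import exp, log, sqrt
from statistics import NormalDist

def bs_call(spot: float, strike: float, t: float,
            r: float, sigma: float) -> float:
    """Black-Scholes value of a European call — the model applied to a
    CB/BW embedded option, with the DART-disclosed conversion price as
    the strike and the KRX market price as the spot."""
    d1 = (log(spot / strike) + (r + 0.5 * sigma**2) * t) / (sigma * sqrt(t))
    d2 = d1 - sigma * sqrt(t)
    n = NormalDist().cdf
    return spot * n(d1) - strike * exp(-r * t) * n(d2)

# At-the-money check: S=K=100, 1 year, r=5%, sigma=20% -> ~10.45
```

Treating the conversion right as a plain European call ignores refixing clauses, so the value is a conservative floor — enough for a screen that only needs to rank candidates.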

Key Decisions

Multi-repo over monorepo

Reasoning:

Each repo can be independently tested, published, and used. A journalist who only needs the company identifier crosswalk installs one package. A researcher who needs M-Score computation installs another. Dependencies are explicit and versioned. Cross-repo blockers surface in a dedicated hub rather than disappearing into a monorepo's noise.

Alternatives considered:
  • Monorepo with namespace packages — simpler coordination, but forces the full dependency set on every user
  • Single package with optional extras — familiar pattern but makes independent release cycles impossible

Black-Scholes for CB/BW dilution screening rather than waiting for SEIBRO

Reasoning:

SEIBRO's public API (KSD bondholder register) was returning resultCode=99 with no ETA for a fix. Waiting would have blocked the core use case indefinitely. DART discloses conversion prices and terms at issuance; KRX provides daily prices. Black-Scholes on these inputs detects whether a CB was issued in-the-money — the primary dilution signal — without needing the bondholder register at all.

Alternatives considered:
  • Wait for SEIBRO API stabilization — unacceptable timeline dependency on an external agency
  • Purchase commercial SEIBRO data access — incompatible with MIT licensing and reproducibility requirements
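The primary dilution signal described in this decision reduces to a moneyness check at issuance. A hedged sketch — the function name and default threshold are illustrative, not the shipped API:

```python
def cb_issued_in_the_money(conversion_price: float,
                           spot_at_issuance: float,
                           min_discount: float = 0.0) -> bool:
    """Flag a CB/BW whose DART-disclosed conversion price sits below the
    KRX market price on the issuance date — the bondholder is in profit
    before any downward repricing occurs."""
    discount = (spot_at_issuance - conversion_price) / spot_at_issuance
    return discount > min_discount

# A 10,000 KRW stock with an 8,000 KRW conversion price is a 20% discount.
```

Raising `min_discount` trades recall for precision when ranking candidates for human review.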

corp_code as the canonical primary key across all 13 components

Reasoning:

corp_code is the only identifier that is both permanent and freely resolvable: it survives corporate restructuring, relisting, and SPAC mergers. BRN changes on restructuring; KRX tickers change on relisting and backdoor listings; CRN is permanent but requires paid court registry access. corp_code is also assigned by DART, the primary data source, so the canonical key comes from the same place as the filings themselves.

Alternatives considered:
  • BRN (National Tax Service) — stable for most purposes but changes on corporate restructuring and invalid for foreign-listed companies
  • KRX ticker — changes at every SPAC merger, relisting, and backdoor listing (우회상장)
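In practice this decision means every dataset carries corp_code and joins never run on tickers or BRN. A sketch with pandas — the column names and values are illustrative, not real companies:

```python
import pandas as pd

# Hypothetical crosswalk: corp_code is the join key; BRN and ticker are
# carried as mutable attributes, never used as keys themselves.
crosswalk = pd.DataFrame({
    "corp_code": ["00000001", "00000002"],
    "brn": ["110-81-00001", "110-81-00002"],
    "ticker": ["000001", "000002"],
})

# Filing-level data keyed the same way joins cleanly even if the company
# later changes its ticker in a SPAC merger or backdoor listing.
filings = pd.DataFrame({"corp_code": ["00000002"], "m_score": [-1.20]})

joined = filings.merge(crosswalk, on="corp_code", how="left")
```

A left join here can also surface crosswalk gaps: any row whose ticker comes back null is a company the registry has not yet resolved.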

Tech Stack

  • Python ≥3.11
  • uv, hatchling, pytest
  • DART OpenAPI (primary filings data)
  • pykrx (KRX price and volume)
  • DuckDB, pandas, Parquet
  • Black-Scholes / scipy / numpy
  • FastAPI, MCP protocol
  • GitHub Actions (4-tier CI/CD)
  • Claude API (Haiku + Sonnet)
  • Pydantic

Result & Impact

  • 13 repositories
  • 749 total tests
  • 5 data sources
  • 15 ETL extractors

A complete open-source forensic finance platform for Korean capital markets — the first to screen the full DART dataset for CB/BW dilution using Black-Scholes without commercial SEIBRO data. Covers the full analytical stack: company identity resolution, trading calendar math, Beneish M-Score earnings manipulation detection, CB/BW option pricing, enforcement case labeling, statistical validation, and a delivery shell for human-in-the-loop review.

Learnings

  • Multi-repo ecosystems need a coordination hub from day one. Cross-repo blockers do not surface in any single repo's CI — they require a hub that scans across the ecosystem. Adding the hub late means retroactively discovering problems that accumulated silently.
  • External API failures are a design constraint, not an incident. SEIBRO's breakdown forced the Black-Scholes approach, which turned out to be analytically preferable to waiting. Design each component so data source failures degrade gracefully rather than blocking the pipeline.
  • Automated agents reduce maintenance overhead but require careful model routing. Haiku is sufficient for classification tasks (triage ranking, doc drift detection); Sonnet is needed for synthesis (convention audit, fix generation). Misrouting adds cost without improving output.
  • Identifier stability determines join reliability across the entire pipeline. Any join on KRX tickers silently breaks at every SPAC merger and backdoor listing. Establishing corp_code as the canonical key at the start prevented a class of data integrity bugs that would have been expensive to diagnose later.

Component Write-Ups

Every component repository ships with its own stand-alone write-up. Read whichever matches the question you arrived with.

Foundation libraries

Analysis libraries

Platform layer

All thirteen repositories live under the pon00050 GitHub account and are MIT-licensed. The orchestration hub — cross-repo CI, ecosystem status, dependency graph — is forensic-accounting-toolkit.