240 Korean Accounting Violations, Coded Once So Researchers Don't Have to Code Them Again

Source data → github.com/pon00050/kr-enforcement-cases

Korean accounting fraud enforcement is unusually well documented and unusually hard to use. The Financial Supervisory Service (FSS, 금융감독원) publishes quarterly reviews of accounting violations it finds during routine company examinations. The Securities & Futures Commission (SFC, 증권선물위원회) publishes quarterly decisions imposing sanctions on companies the FSS recommends for action. Both are open. Both are in the public record. Neither is structured.

The FSS publishes its quarterly enforcement reports as PDFs in Korean. The longer-form FSS named-company sanctions arrive as HWP (Hangul word-processor, 한글) files. The SFC publishes its quarterly meeting minutes as ZIP files containing PDFs. Anyone wanting to do empirical research on Korean accounting fraud has to extract the violations from these documents themselves — once for the PDF parsing, once for the Korean text normalization, once for the violation taxonomy, and once more for the cross-reference to DART filings if they want to compute Beneish ratios for the named companies.

This dataset is the result of doing that work once and publishing it.


What Is in the Data

Three CSVs, each with a specific role.

reports/violations.csv is the primary output: 240 rows, one per violation per case. Each row captures the violation type from a closed-set taxonomy, the scheme type, the forensic signals (linked to detectlet vocabulary), the relevant Beneish components, and the underlying FSS or SFC document the violation was extracted from.

reports/beneish_ratios.csv is the analytical layer: 60 rows of the seven Beneish components computed from DART financials for the named-company subset where DART data was available. Joining this to violations.csv on company gives “what did the regulator find” alongside “what did the financial statements look like in the relevant year.”

data/curated/dart_matches.csv is the linkage layer: 86 named companies matched to DART corp_codes, with a roughly 90% match rate against the original named-company list.
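The join across these files can be sketched with pandas. This is a toy example, not the repo's code: the column names (company, DSRI, AQI) are assumptions, so check the actual CSV headers before relying on them.

```python
import pandas as pd

# Toy frames mimicking the assumed schemas of violations.csv and
# beneish_ratios.csv; real column names in the repo may differ.
violations = pd.DataFrame({
    "company": ["Acme Co", "Beta Co", "Acme Co"],
    "violation_type": ["asset_inflation", "revenue_fabrication", "disclosure_fraud"],
})
beneish = pd.DataFrame({
    "company": ["Acme Co"],
    "DSRI": [1.42],
    "AQI": [1.08],
})

# "What did the regulator find" next to "what did the statements look like".
joined = violations.merge(beneish, on="company", how="left")

# Anonymized or unmatched companies keep NaN Beneish columns; the named,
# DART-matched subset is whatever survives the dropna.
named = joined.dropna(subset=["DSRI"])
```

The left join keeps all violation rows, so the gap between the 240-row violation set and the 60-row Beneish set stays visible as NaN columns rather than silently dropped rows.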

Behind these three artifacts is a fourth file — reports/scored_index.csv, 229 rows — which scores all FSS-anonymized cases by forensic relevance into Tier 1, Tier 2, and Tier 3 buckets. The PDFs for Tier 1 and Tier 2 cases were prioritized for full-text extraction; Tier 3 cases were enriched from metadata only.
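The tier-based prioritization amounts to a simple filter over the scored index. A minimal sketch, assuming scored_index.csv has a numeric tier column (the column name and encoding are assumptions):

```python
import pandas as pd

# Toy rows mimicking the assumed schema of scored_index.csv;
# the real file may name or encode the tier column differently.
scored = pd.DataFrame({
    "case_id": ["S3-001", "S3-002", "S3-003", "S3-004"],
    "tier": [1, 3, 2, 3],
})

# Tier 1 and Tier 2 cases go to full-text PDF extraction;
# Tier 3 cases are enriched from metadata only.
full_text_queue = scored.loc[scored["tier"] <= 2, "case_id"].tolist()
```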


How the Violations Are Coded

Six violation types, applied consistently across both regulators:

Type                     Count
asset_inflation             71
revenue_fabrication         47
disclosure_fraud            43
liability_suppression       20
related_party               12
cost_distortion              6
Unlabeled                   41
Total                      240

The 41 unlabeled rows are cases where the LLM enrichment pass did not yield a confident type assignment — typically because the source text was vague, the case involved multiple violation types, or the violation pattern fell outside the canonical six. These rows are preserved with their full extracted text rather than dropped, so they are reviewable.
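Because the unlabeled rows are preserved rather than dropped, a tally of violations.csv should count them explicitly. A toy sketch (the violation_type column name is an assumption):

```python
import pandas as pd

# Toy violation_type column; None rows stand in for the unlabeled cases,
# which stay in the data rather than being dropped.
types = pd.Series(
    ["asset_inflation", "asset_inflation", "cost_distortion", None]
)

# dropna=False keeps the unlabeled bucket visible in the tally,
# so the counts sum to the full row count.
counts = types.value_counts(dropna=False)
```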

The taxonomy was not picked arbitrarily. It went through a five-phase bias validation protocol — cohort splitting (do similar cases get coded similarly?), blind prompt stripping (does the LLM rely on cues it should not?), cross-model replication (do Sonnet and Haiku agree?), and two phases of repair on systematic disagreements. The methodology lives in docs/model_delegation_matrix.md for anyone who wants to audit it.

Where the taxonomy proved most reliable — defensible “post-repair” mappings — the precision against held-out validation cases was: SGI to revenue_fabrication at 95%, AQI to asset_inflation at 74%, LVGI to liability_suppression at 73%, DSRI to revenue_fabrication at 86% (supporting). Other Beneish components have weaker mappings and the README documents them as such.
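The defensible mappings above are small enough to carry as a lookup table. This dict is illustrative, built only from the figures quoted above, and is not an API from the repo:

```python
# Post-repair component-to-type mappings and their reported precision
# against held-out validation cases; structure is illustrative only.
COMPONENT_TO_TYPE = {
    "SGI":  ("revenue_fabrication",   0.95),
    "AQI":  ("asset_inflation",       0.74),
    "LVGI": ("liability_suppression", 0.73),
    "DSRI": ("revenue_fabrication",   0.86),  # supporting signal
}

def expected_type(component):
    """Return (violation_type, precision), or None for weakly mapped components."""
    return COMPONENT_TO_TYPE.get(component)
```

Components absent from the table fall through to None, mirroring the README's treatment of the weaker mappings.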


The Three Source Datasets

The 240 violations come from three regulator publications, each with a different document format and a different level of company identification.

FSS 심사·감리지적사례 (examination and audit review findings; Source 3, 229 cases, anonymized). The FSS’s quarterly review of audit-quality findings. Companies are anonymized as “A사”, “B사”, “C사” (Company A, Company B, Company C) — there is no way to map these back to named entities. 200 of the 229 were enriched via LLM; 65 had full PDF text extracted; the remainder were enriched from metadata only. This is the largest source by case count and the richest source for taxonomy work.

FSS 회계감리결과제재 (accounting audit result sanctions; Source 2, 71 cases, named). The FSS’s named-company sanction decisions. Published as HWP files. 64 of the 71 were matched to DART corp_codes; 49 had Beneish ratios computed. This is the source that makes the dataset cross-referenceable to DART.

SFC 증선위의결정보 (SFC resolution information; Source 1, 28 cases, mixed identification). The Securities & Futures Commission’s quarterly decisions. Published as ZIP files containing PDFs. 15 of the 28 are redacted; 13 are named. Six were DART-matched; 11 had Beneish ratios computed.


What the Data Cannot Do

Several limitations that should be visible to anyone using the dataset.

The anonymized cases stay anonymized. Source 3’s 229 cases use “A사” rather than company names. There is no key. Cross-referencing them to DART or to specific company financials is not possible. The richer taxonomy work is necessarily on the anonymized cohort; the company-level analytical work is on the smaller named cohort.

The dataset is not a complete enforcement census. It captures three of the eight enforcement data sources identified during scoping. The remaining five — data.go.kr structured CSV, the third-party CaseNote database, auditor-side findings, audit-firm context data, and FSC press releases — are documented in docs/data_sources.md as v2.0 candidates. Using this dataset for “all Korean accounting enforcement” would understate the universe.

Beneish coverage is partial. Only 60 company-years have computed Beneish ratios, against 240 violations. The constraint is twofold: many cases are anonymized (no DART link possible), and many DART-linked companies have insufficient consecutive-year financial history for the ratio computation. For ML work that requires a labeled dataset with both violation outcomes and financial features, the effective sample size is the 60-row Beneish set.

The named-cohort sample is too small for matched-control inference. A proper supervised model needs a non-fraud control group with comparable Beneish ratios. Building that control group requires joining kr-company-registry (the DART corp_code crosswalk) and kr-beneish (the Beneish implementation) to compute Beneish for a representative non-fraud sample. The methodology and effort estimate are documented in reports/ml-feasibility-and-next-steps.md. The dataset is enough for descriptive empirical work today; supervised classification work needs the matched-control construction first.


Why Code It Once

Most empirical research on Korean accounting fraud lives in published academic papers whose authors built their own enforcement dataset, used it for one paper, and never released it. The next researcher does the work again. The next research group does the work again. The aggregate cost is enormous; the marginal benefit of doing the work well once is high.

This dataset is that “do it well once” attempt. The taxonomy is documented and bias-validated. The pipeline that produced the CSVs is fully reproducible — every stage idempotent, every step documented in the README. When the FSS publishes its next quarterly report, re-running scrape_fss_cases through build_violation_db updates the dataset. When new cases need the violation taxonomy applied, the LLM enrichment is a single command. The dataset is meant to be a living artifact, not a 2026 snapshot.

The data is at github.com/pon00050/kr-enforcement-cases. MIT license. 65 tests. Documented at length in docs/ and reports/. The full reproduction pipeline, including DART API setup, is in the README.