Attribution

Last updated: 2026-05-10

Tadorimichi separates its data into two layers: an originally compiled core covering the words and characters Chinese-speaking JLPT learners use most, and a JMdict / KANJIDIC2 fallback that catches long-tail words pasted into the reader. This page lays out exactly what comes from where, and under which license.

Originally compiled by Tadorimichi

vocab_entries: the 8,047-entry vocabulary core. Each entry carries Chinese and English meanings, learner notes, examples, collocations, register, domain, and 30+ metadata fields, generated by Claude (Anthropic) from the Japanese surface forms and reviewed for quality. Entry IDs are our own (tdrm-XXXXX), not inherited from any external source.
kanji: the 2,682-character kanji core. Chinese and English meanings, phonetic group analysis, component decomposition, mnemonics, look-alikes, compound word selections, and learner notes, all originally generated.
grammar_points: 723 JLPT N1–N5 grammar points. Pattern, usage rules, similar-pattern disambiguation, and examples, all written from scratch.
articles: JLPT-graded reading material. Every article body, summary, vocabulary annotation, grammar annotation, and metadata field is original Tadorimichi work.
UI, design, code, learning algorithms, mock exam logic: all owned by Mason AI Lab.

Factual data layer

Some of our data consists of objective grammatical or linguistic facts, which are not protected by copyright in any jurisdiction we operate in. These facts are compiled into our tables alongside the originally generated content:

Kanji surface forms (the characters themselves), stroke counts, on/kun readings
Kangxi radical assignments (a public-domain classical Chinese reference)
Part-of-speech classifications (verb / adjective / noun, etc.)
Word frequency rankings derived from public corpora (JPDB, Mainichi)
JLPT level tags compiled from publicly available exam syllabi

JMdict / KANJIDIC2 fallback

For Japanese surface forms outside our 8,047-word core, the reader's hover lookup falls back to the JMdict (Japanese-Multilingual Dictionary) and KANJIDIC2 community projects. This covers long-tail words such as proper nouns, archaic forms, technical jargon, and words readers paste in from the wild.
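
As an illustration, the lookup order described above can be sketched in Python. This is a minimal sketch under stated assumptions: in-memory dictionaries stand in for the core and fallback tables, and the names Entry, lookup, CORE, and FALLBACK, along with the sample entries, are hypothetical rather than the production schema or code.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    surface: str
    gloss: str
    source: str  # "core" or "jmdict"; lets the UI show the right attribution


def lookup(surface: str, core: dict[str, Entry], fallback: dict[str, Entry]) -> Optional[Entry]:
    """Return the core entry if one exists, else the fallback entry, else None."""
    entry = core.get(surface)
    if entry is not None:
        return entry
    return fallback.get(surface)


# Hypothetical sample data, for illustration only.
CORE = {"食べる": Entry("食べる", "to eat", "core")}
FALLBACK = {"蜃気楼": Entry("蜃気楼", "mirage", "jmdict")}
```

Carrying a source tag on each result is one way to keep CC BY-SA attribution attached to fallback entries wherever they are displayed.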

JMdict — © James William Breen and the Electronic Dictionary Research and Development Group, licensed under CC BY-SA 4.0. We import the dictionary in full and store the long-tail subset (208,286 entries not in our core) in our fallback tables.
KANJIDIC2 — © EDRDG, also licensed under CC BY-SA 4.0. Used for kanji factual data (strokes, radical, on/kun readings, JLPT tier).

Modifications to the imported data: only the long-tail subset is retained (entries already covered by our compiled core are not duplicated). No other transformations are applied to JMdict / KANJIDIC2 records as stored.
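
The subsetting step can be sketched as a plain filter. This is an assumption-laden illustration, not the actual import script: it assumes coverage is decided by matching surface forms, and the record shape (an "ent_seq" id plus a "surfaces" list) and the function name long_tail_subset are hypothetical.

```python
from __future__ import annotations

from typing import Iterable, Iterator


def long_tail_subset(records: Iterable[dict], core_surfaces: set[str]) -> Iterator[dict]:
    """Yield only the records that share no surface form with the core.

    Records are passed through unmodified, mirroring the statement that no
    other transformation is applied to the imported data.
    """
    for record in records:
        if not any(surface in core_surfaces for surface in record["surfaces"]):
            yield record


# Hypothetical records; the ids are placeholders, not real JMdict sequence numbers.
SAMPLE = [
    {"ent_seq": 1, "surfaces": ["食べる", "たべる"]},
    {"ent_seq": 2, "surfaces": ["蜃気楼"]},
]
```

For example, list(long_tail_subset(SAMPLE, {"食べる"})) keeps only the second record, and the kept record is the same object that was passed in.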

Other community resources

JLPT vocabulary level tagging — derived from yomitan-jlpt-vocab (community-maintained, attribution preserved).
JLPT N4 kanji level backfill — sourced from davidluzgouveia/kanji-data (MIT, jlpt_new field). Used to populate the new-system N4 tier that KANJIDIC2's legacy 4-tier system doesn't expose.
Pitch accent dictionary — derived from mifunetoshiro/kanjium (NHK 日本語発音アクセント新辞典 community compilation, freely redistributable).
Word frequency — JPDB Mainichi-derived corpus.
Tokenization — kuromoji.js (Apache 2.0).

Audio synthesis (TTS)

Vocabulary, kanji, grammar, and reader audio is pre-generated with VOICEVOX, an open-source Japanese speech synthesis engine. The voice character used is VOICEVOX:四国めたん (Shikoku Metan, normal style). VOICEVOX exposes per-mora accent control that aligns with the pitch numbers shown on each card. Browser Web Speech API voices cannot do this, which is why we pre-generate.

Engine — VOICEVOX/voicevox_engine, LGPL-3.0.
Voice character — VOICEVOX:四国めたん. The character has its own terms of use which permit commercial redistribution of generated audio with credit. The character itself remains the property of its rights holders; we make no claim to the character or its likeness.

License of derived data

Per the share-alike obligation in CC BY-SA 4.0, the JMdict / KANJIDIC2 long-tail subset stored in our jmdict_fallback* tables remains under CC BY-SA 4.0, and any redistribution of it is made under those same terms. Our originally compiled vocab_entries, kanji, grammar_points, and articles tables are not redistributed and remain proprietary.

Contact

Questions about attribution, licensing, or commercial use of our originally compiled content: [email protected].