dictpert

AI-Assisted Research Across Czech Language Dictionaries

Martin Povolný — 2026-06-21

The Problem

Seven major Czech language resources exist — primarily dialectological, with standard and historical dictionaries for full context:

| Resource | Scope |
| --- | --- |
| ČJA | Czech Language Atlas: 477 survey points, geographic forms |
| SNČJ | Czech dialect dictionary: definitions, examples, attestations (A–D only) |
| SPJMS | Minor place names: Moravia & Silesia (field names, pronunciation variants) |
| SPJČ | Minor place names: Bohemia |
| PSJČ | Historical dictionary of Czech (1935–1957) |
| ASSC | Contemporary academic Czech: phraseology |
| SSJČ | Standard Czech: 114 000+ entries |

A researcher answering one question about Moravian dialect forms needs to open seven browser tabs, issue queries in seven different interfaces, and synthesize the results by hand.

Why Not Just Ask an LLM?

General-purpose LLMs (GPT-4, Claude) have broad knowledge but unreliable dialect geography:

  • Confuse geographic distributions — Silesian forms attributed to Brněnsko
  • Invent survey-point attributions with false confidence
  • Cannot report what is absent from dictionaries
  • Cannot cite a specific dictionary entry, only training-data impressions

A model that searches dictionaries has a fundamentally different error profile than one that recalls from training. When it finds nothing, it can say so.

System Architecture

[Figure: system architecture diagram]

AI/ML — Agentic Tool-Use Loop

The core AI technique: LLM as a retrieval-directing agent.

The model outputs structured directives in free text:

I will look up dialectal forms for "holka" in Moravian sources.

[LOOKUP: holka | cja]
[LOOKUP: holka | sncj]
[FTSEARCH: dívka děvče holka | spjms]

The system:

  1. Intercepts directives with regex parsers
  2. Executes against SQLite / ChromaDB
  3. Injects results as a new turn
  4. Repeats — up to 5 rounds

No native function-calling API required. The protocol is text-based, model-agnostic, and fully auditable.
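
A minimal sketch of how such a text protocol could be parsed and looped, assuming the directive syntax shown above. The names `Directive`, `DIRECTIVE_RE`, `parse_directives`, and `run_agent` are hypothetical, and the `model` and `execute` callables stand in for the real LLM client and the SQLite / ChromaDB backends.

```python
import re
from typing import Callable, List, NamedTuple, Optional


class Directive(NamedTuple):
    kind: str                # LOOKUP, FTSEARCH or VSEARCH
    query: str
    dict_id: Optional[str]   # None means "no dictionary given"


# Hypothetical pattern: [KIND: query] or [KIND: query | dict_id]
DIRECTIVE_RE = re.compile(
    r"\[(LOOKUP|FTSEARCH|VSEARCH):\s*([^|\]]+?)\s*(?:\|\s*([^\]\s]+)\s*)?\]"
)


def parse_directives(text: str) -> List[Directive]:
    """Extract every directive embedded in free model output."""
    return [Directive(k, q, d or None) for k, q, d in DIRECTIVE_RE.findall(text)]


def run_agent(model: Callable[[List[str]], str],
              execute: Callable[[Directive], str],
              question: str,
              max_rounds: int = 5) -> str:
    """Alternate model turns and retrieval turns for at most max_rounds rounds."""
    turns = [question]
    reply = model(turns)
    for _ in range(max_rounds):
        directives = parse_directives(reply)
        if not directives:
            break  # no directives left: treat the reply as the final answer
        results = "\n".join(
            f"{d.kind} {d.query!r}: {execute(d)}" for d in directives
        )
        turns += [reply, results]  # inject results as a new turn
        reply = model(turns)
    return reply
```

Because the directives are plain text, every round can be logged verbatim and replayed, which is what makes the loop auditable.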

AI/ML — Three Retrieval Modes

[Figure: the three retrieval modes]

[ optional ] What is Hugging Face?

Hugging Face is an open-source AI platform and model hub — the GitHub of machine learning models.

  • Hosts 500 000+ pre-trained models, freely downloadable
  • Home of the sentence-transformers library we use
  • Models are published with training code, datasets, and benchmarks
  • Standard way to share and reuse NLP research

What we use from it:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "paraphrase-multilingual-mpnet-base-v2"
)
embeddings = model.encode(["holka", "děvče", "dívka"])
# → three 768-dimensional vectors, close together

The model is downloaded once, runs locally.

Why it matters for research:

  • Models trained by top labs (Google, Microsoft, Meta) published openly
  • Researchers can use state-of-the-art NLP without training from scratch
  • The multilingual model we use was published by UKP Lab (TU Darmstadt) and has been downloaded 15 million+ times

Hugging Face is to NLP models what CRAN is to R packages, or PyPI is to Python libraries — a trusted, versioned registry.

[ optional ] What Are Embeddings?

An embedding maps a word or sentence to a point in high-dimensional space, such that meaning ≈ proximity.

"holka"   → [0.12, -0.34, 0.87, … ]  (768 numbers)
"děvče"   → [0.11, -0.31, 0.85, … ]  ← nearby
"auto"    → [0.92,  0.54, -0.23, … ] ← far away

You can do arithmetic on meaning:

king − man + woman ≈ queen

This is not a trick — it reflects statistical regularities learned from billions of words of text.

Sentence embeddings extend the same idea to whole sentences and paragraphs, enabling semantic search: "find entries that mean something similar to this query."
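
To make "meaning ≈ proximity" concrete, here is a small sketch of cosine similarity, the usual proximity measure for embeddings. The three-number vectors are just the first three illustrative values from the example above, not real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional stand-ins for the 768-dimensional vectors shown above.
holka = [0.12, -0.34, 0.87]
devce = [0.11, -0.31, 0.85]
auto = [0.92, 0.54, -0.23]

near = cosine_similarity(holka, devce)  # close to 1: similar meaning
far = cosine_similarity(holka, auto)    # negative: unrelated meaning
```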

Origin — Tomáš Mikolov, Brno, 2013

Word2Vec (Mikolov et al., 2013) made dense word embeddings practical to train at scale. It was published while Mikolov was at Google Brain; his doctoral research was done in Brno.

Mikolov did his MSc (2007) and PhD (2012) at FIT VUT — Faculty of Information Technology, Brno University of Technology.

From words to sentences:

| Year | Method | What changed |
| --- | --- | --- |
| 2013 | Word2Vec | Word-level embeddings |
| 2018 | BERT | Context-aware; same word, different vector per context |
| 2019 | Sentence-BERT | Full sentence → single vector; enables fast similarity search |
| 2020 | Multilingual SBERT | 50+ languages in one shared space |

We use Multilingual SBERT (the 2020 step) — that's paraphrase-multilingual-mpnet-base-v2.

Reference: Mikolov et al. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.

[ optional ] What is a Vector Database?

A vector database stores and searches high-dimensional numerical vectors — not text or rows.

Core operation — similarity search:

query: "animal sounds"
  → embed → [0.21, -0.43, 0.87, …]  (768 dims)
  → find nearest vectors in the index
  → return top-k most similar entries

Unlike SQL (WHERE text LIKE '…'), similarity search finds conceptually related items even if they share no words.

How it works internally:

  • Approximate Nearest Neighbour (ANN) index (HNSW, IVF…)
  • Cosine or dot-product distance
  • Scales to millions of vectors with millisecond-scale queries
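
For intuition, the exact (brute-force) version of this search fits in a few lines; an ANN index such as HNSW approximates the same result without scoring every vector. This is an illustrative sketch with toy vectors and hypothetical names (`normalize`, `top_k`), not how ChromaDB is implemented.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query, index, k=2):
    """Exact nearest-neighbour search: score every stored vector, keep the best k."""
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(vec))), key)
        for key, vec in index.items()
    )
    return [key for _, key in heapq.nlargest(k, scored)]


# Toy index: headword -> made-up 3-dimensional embedding.
index = {
    "hý": [0.9, 0.1, 0.0],
    "prr": [0.8, 0.2, 0.1],
    "auto": [-0.5, 0.9, 0.2],
}
nearest = top_k([1.0, 0.0, 0.0], index, k=2)
```

The brute-force scan costs one dot product per stored vector; that linear cost is what the ANN index avoids.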

Typical use cases:

| Use case | What's stored |
| --- | --- |
| Semantic search | Document / paragraph embeddings |
| Recommendation | User & item embeddings |
| RAG retrieval | Knowledge-base chunk embeddings |
| Image search | CNN feature vectors |

What we use: ChromaDB — lightweight, embedded, no server required. Runs in-process alongside the app; vectors stored on disk as a persistent collection.

collection.query(
    query_embeddings=[query_vec],
    n_results=10,
    where={"dict_id": "cja"},   # metadata filter
)

163 000+ dictionary entries indexed; per-dictionary filtering via metadata so a VSEARCH can target one or all sources.

AI/ML — Dense Retrieval

Vector index built with paraphrase-multilingual-mpnet-base-v2:

  • 768-dimensional embeddings
  • Trained on 50+ languages
  • 163 000+ entries embedded at index time
  • Stored in ChromaDB with dict_id metadata for filtered search
  • Query embedded at runtime → cosine similarity → top-k

VSEARCH → LOOKUP chain:

[VSEARCH: povel pro koně]
  → hý, hyja, čmel, hop, br, prr
[LOOKUP: hý | cja]
[LOOKUP: hyja | sncj]
  → confirmed entries with full text

Why multilingual?

Czech dialect entries mix:

  • Czech base forms
  • German loanwords (fěrtoch, lajbl, šněrovačka)
  • Slovak cognates
  • Latin glosses

A multilingual model handles these in a shared embedding space without language-switching.

See optional slides: What is Hugging Face / What are Embeddings

AI/ML — Grounding & Citation Verification

Every [[DICT:word]] link in the final answer is verified against the lookup log before delivery.

Citation check pass

After the main loop, the system:

  1. Extracts all [[DICT:word]] links from the draft answer
  2. Cross-references each against the session's lookup results
  3. If any link has no lookup hit → model is asked to correct or remove it

unverified = [link for link in citations if link not in lookup_hits]
if unverified:
    inject_citation_check_message(unverified)
    # model revises before final delivery
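
A self-contained sketch of the extraction and cross-reference steps, under two assumptions that are not confirmed by the source: citations have the shape `[[DICT:word]]` with a display name such as ČJA, and the lookup log is available as a set of (dictionary, headword) pairs that returned at least one hit. `CITATION_RE` and `find_unverified` are hypothetical names.

```python
import re

# Assumed citation shape: [[DICT:word]], with no nested brackets.
CITATION_RE = re.compile(r"\[\[([^:\[\]]+):([^\[\]]+)\]\]")


def find_unverified(draft, lookup_hits):
    """Return (dict, word) citations in the draft that no lookup confirmed.

    lookup_hits: set of (dict, word) pairs with at least one hit this session.
    """
    citations = CITATION_RE.findall(draft)
    return [c for c in citations if c not in lookup_hits]


draft = "Tvar [[ČJA:holka]] je doložen, podobně [[SNČJ:děvče]]."
hits = {("ČJA", "holka")}
unverified = find_unverified(draft, hits)
```

Here only the SNČJ link would be sent back to the model for correction or removal.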

Cumulative lookup history

Every round receives the full history of (term, dict) → hit-count pairs from all prior rounds, preventing the model from repeating lookups it already issued.
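
One possible shape for that history, sketched with hypothetical helper names: a mapping from (term, dict) to hit count that is rendered into the prompt each round and can also be checked before issuing a lookup.

```python
from typing import Dict, Tuple

History = Dict[Tuple[str, str], int]


def record_lookup(history: History, term: str, dict_id: str, hits: int) -> None:
    """Remember how many hits a (term, dict) lookup returned."""
    history[(term, dict_id)] = hits


def already_issued(history: History, term: str, dict_id: str) -> bool:
    """True if this exact lookup was made in an earlier round."""
    return (term, dict_id) in history


def render_history(history: History) -> str:
    """Plain-text summary suitable for injecting into the next round."""
    return "\n".join(
        f"{term} | {dict_id}: {hits} hits"
        for (term, dict_id), hits in history.items()
    )


history: History = {}
record_lookup(history, "holka", "cja", 3)
record_lookup(history, "koc", "sncj", 0)
summary = render_history(history)
```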

AI/ML — Leakage Detection

Four categories of output defects detected automatically after each answer:

| Category | Pattern | Example |
| --- | --- | --- |
| Tool leak | [LOOKUP:…] in final answer | Model forgot to convert directive to prose |
| Broken citation | [[ASSC:[stoupnout…]]] | ASSC headwords contain [; breaks markup parser |
| Instruction leak | "v tomto sezení", "kolo N/N" | System-prompt internals surfacing in user text |
| ID leak | `cja` instead of "ČJA" | Internal dict key used instead of display name |

import re

# representative patterns; full alternation omitted
TOOL_RE        = re.compile(r'\[(LOOKUP|FTSEARCH|VSEARCH):[^\]]*\]')
BROKEN_CITE_RE = re.compile(r'\[\[(ČJA|SNČJ|…):\[')
INSTR_RE       = re.compile(r'v tomto sezení|kolo\s+\d+/\d+|…')
ID_LEAK_RE     = re.compile(r'`(cja|sncj|spjms|…)`')
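
A sketch of how the four checks could be combined into one scan. The patterns below are deliberately reduced to alternatives that appear in this deck (the full alternations are elided in the source and are not reconstructed here), and `PATTERNS` and `scan_leakage` are hypothetical names.

```python
import re

# Reduced, illustrative patterns: only alternatives visible in the slides.
PATTERNS = {
    "tool_leak": re.compile(r"\[(LOOKUP|FTSEARCH|VSEARCH):[^\]]*\]"),
    "broken_citation": re.compile(r"\[\[(ČJA|SNČJ|ASSC):\["),
    "instruction_leak": re.compile(r"v tomto sezení|kolo\s+\d+/\d+"),
    "id_leak": re.compile(r"`(cja|sncj|spjms)`"),
}


def scan_leakage(answer):
    """Return the names of all leakage categories found in a final answer."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(answer)]


flags = scan_leakage("Viz [LOOKUP: holka | cja] a zdroj `cja`.")
```

An empty list means the answer passed all four checks; anything else is recorded as a leakage flag for that run.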

AI/ML — Evaluation Methodology

[Figure: evaluation methodology]

Per-run metrics

  • Zero-result rate — fraction of lookups returning 0 hits (lower = better)
  • Phrase search rate — multi-word LOOKUPs (should be 0)
  • Rounds used — efficiency (fewer rounds for same quality = better)
  • Leakage flags — tool / broken-cite / instruction / id-leak

Prompt changes are labelled P1–P16. Example: after P7, phrase-search violations fell from 11 to 1 across the batch.
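
The first two metrics are simple ratios over the lookup log. A minimal sketch, assuming the log is available as a list of (term, hit count) pairs for LOOKUP directives; the function names and the sample log are illustrative, not taken from the project.

```python
from typing import List, Tuple

LookupLog = List[Tuple[str, int]]  # (lookup term, number of hits)


def zero_result_rate(log: LookupLog) -> float:
    """Fraction of lookups that returned no hits (lower is better)."""
    if not log:
        return 0.0
    return sum(1 for _, hits in log if hits == 0) / len(log)


def phrase_search_rate(log: LookupLog) -> float:
    """Fraction of lookups whose term has more than one word (should be 0)."""
    if not log:
        return 0.0
    return sum(1 for term, _ in log if len(term.split()) > 1) / len(log)


sample_log = [("holka", 3), ("povel pro koně", 0), ("děvče", 2), ("koc", 0)]
zero_rate = zero_result_rate(sample_log)
phrase_rate = phrase_search_rate(sample_log)
```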

[ optional ] Claude Code — Custom Slash Commands

Commands live in .claude/commands/ as plain Markdown files:

.claude/commands/
  automated-eval.md   ← /automated-eval
  analyze-conversation.md
  run-assc-scraper.md

Typing /automated-eval in Claude Code:

  1. Reads the .md file into the agent's context
  2. The Markdown IS the prompt
  3. Agent executes autonomously using its tools:
    • Bash — run shell commands, query SQLite
    • Read / Write / Edit — manage files and reports
    • WebSearch — fetch external references

Version-controlled alongside the codebase. Evolve the procedure by editing the file.

## Step 4 — Analyze each conversation

For each conversation ID, run the analysis
block below. Synthesise into a report
in Step 5.

python
import sqlite3, json
CONV_ID = "<conversation_id>"
conn = sqlite3.connect(DB)
...

**Leakage scan — run on every final
assistant message:**
...

The Markdown IS the agent's prompt: 350 lines of structured procedure, a coding agent's SOP.

AI-SDLC in practice: development workflows encoded as version-controlled agent procedures — eval, scraping, analysis — invokable on demand, improving alongside the codebase.

Agentic Eval Loop — /automated-eval

Human reviews report → approves fixes → re-run.

16 prompt iterations (P1–P16) driven by successive eval runs.
The command file itself evolves too — leakage detection added after finding output defects in eval results.

Agent executes 7 steps autonomously:

  1. Read query corpus → find answerable questions
  2. Run all queries: test_prompt.py --dict-mode --save
  3. Export PDF per conversation
  4. Analyze: zero-result rate · phrase searches · leakage scan · rounds used
  5. Write report.md with issue catalogue
  6. Update eval question table with new conv IDs
  7. Sanity check output directory

The procedure is the same every run — reproducible, comparable across versions.

LLM as a Judge — Agents Evaluating Agents

The paradigm (Zheng et al., 2023):
Use a capable LLM to evaluate LLM outputs at scale — faster and more consistent than human annotation.

Our implementation — hybrid approach:

| Layer | Method | LLM? |
| --- | --- | --- |
| Zero-result rate | DB log counts | No |
| Phrase search rate | Regex on lookup terms | No |
| Leakage detection | 4 regex patterns | No |
| Coverage & grounding | Qualitative analysis | ✅ Claude |
| Issue catalogue | Pattern synthesis | ✅ Claude |

Structured metrics ground the LLM's qualitative judgment — it cannot rate an answer highly if the DB logs show 60% zero-result lookups.

The meta-angle:

The evaluator (Claude Code running /automated-eval) and the system being evaluated (dictpert's multi-round LLM retrieval loop) are both large language models.

We use one LLM to systematically improve another.

Prompt changes are proposed by Claude, reviewed by a human, implemented by Claude, and verified by Claude. The human's role is judgment and direction, not execution.

Reference: Zheng et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.

Technique — Geographic Correlation

Problem: ČJA encodes survey points as numeric codes (84, 818, 701). The LLM had no way to map these to villages, leading to geographic hallucination (Silesian forms attributed to Brněnsko).

Solution: Inline locality substitution

# Before: raw entry text
"ďevčo 84, 818; děvčo 701"

# After: _substitute_localities() replaces codes inline
"ďevčo Český Těšín/Karviná, Velké Petrovice/Náchod; děvčo Šumperk/Šumperk"

How codes were resolved:

  1. SPJMS obce table → fuzzy village-name join (70% coverage)
  2. OSM Nominatim geocoding API (rate-limited 1 req/s) → remaining 30%
  3. Manual correction of 5 Nominatim errors using ČJA geographic context

Result: Geographic error eliminated. The model now places these forms in the districts given by the resolved survey points (Karviná, Náchod) instead of Brněnsko.
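
The substitution itself can be sketched as a single regex pass over the entry text. This is an illustrative re-implementation, not the project's `_substitute_localities()`; the mapping holds only the three codes from the example above, and codes missing from the mapping are left unchanged.

```python
import re

# Survey-point code -> "Obec/Okres", taken from the example in this section.
LOCALITIES = {
    "84": "Český Těšín/Karviná",
    "818": "Velké Petrovice/Náchod",
    "701": "Šumperk/Šumperk",
}

# Assumption: survey-point codes are standalone runs of one to three digits.
CODE_RE = re.compile(r"\b\d{1,3}\b")


def substitute_localities(text, localities):
    """Replace each known numeric survey-point code with its locality name."""
    return CODE_RE.sub(lambda m: localities.get(m.group(0), m.group(0)), text)


resolved = substitute_localities("ďevčo 84, 818; děvčo 701", LOCALITIES)
```

Doing the replacement inline, before the text reaches the model, means the model never has to guess what a code refers to.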

Technique — Data Enrichment & Denormalization

SPJMS field cards

Raw database: cards → obce (village codes) → objekty (informant categories)

Denormalized at query time:

BOUDA, bóda  [búda]  (střední vrstva)  — Lhota u Vsetína  (pozn: starší)
BOUDA        [bouda] (inteligence)     — Zlín

Effect on Q17 (bouda variants in Moravia): 4 rounds → 2 rounds, 26% → 0% zero-result rate.
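
A sketch of query-time denormalization with an in-memory SQLite database. The schema is a simplified assumption (two tables, `cards` and `obce`, with hypothetical column names); the real SPJMS schema is not shown in this deck and is likely richer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE obce (obec_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE cards (
    card_id INTEGER PRIMARY KEY,
    headword TEXT, variant TEXT, pronunciation TEXT, informant TEXT,
    obec_id INTEGER REFERENCES obce(obec_id)
);
INSERT INTO obce VALUES (1, 'Lhota u Vsetína'), (2, 'Zlín');
INSERT INTO cards VALUES
    (1, 'BOUDA', 'bóda', 'búda', 'střední vrstva', 1),
    (2, 'BOUDA', NULL, 'bouda', 'inteligence', 2);
""")


def denormalized_cards(conn, headword):
    """Join cards to village names and flatten each card into one readable line."""
    rows = conn.execute(
        """
        SELECT c.headword, c.variant, c.pronunciation, c.informant, o.name
        FROM cards AS c
        JOIN obce AS o ON o.obec_id = c.obec_id
        WHERE c.headword = ?
        ORDER BY c.card_id
        """,
        (headword,),
    )
    lines = []
    for head, variant, pron, informant, obec in rows:
        label = f"{head}, {variant}" if variant else head
        lines.append(f"{label} [{pron}] ({informant}), {obec}")
    return lines


bouda_lines = denormalized_cards(conn, "BOUDA")
```

Returning every variant with its village in one call is what lets the model stop after fewer rounds.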

FTS content enrichment

SPJMS FTS index JOINs village names into the indexed text via correlated subquery — village names are searchable even without knowing headwords.

Features

  • Natural language queries across all 7 dictionaries simultaneously
  • Three retrieval modes per query: exact, full-text, semantic vector
  • Geographic synthesis: ČJA survey points automatically resolved to Obec/Okres
  • Cross-dictionary comparison: one answer draws from ČJA, SNČJ, SPJMS, ASSC in parallel
  • Grounded citations: every [[DICT:word]] link verified before delivery
  • Gap discovery: zero-result queries logged; identifies undocumented lexical territory
  • PDF export of conversations with live dictionary links
  • Conversation history with per-user isolation (Google OAuth)
  • Automated eval framework: structured quality measurement across query corpus
  • LLM backend: Claude (claude-sonnet-4-6 / claude-haiku-4-5) via Anthropic; Qwen3 via OpenCode.ai

Results — Eval Comparison

Improvement from baseline (June 2026) across 11 dialectological queries:

| Query | Change | Key improvement |
| --- | --- | --- |
| Q8: horse commands | 4 rounds → 3, 39% → 0% zero | VSEARCH found interjection headwords semantically |
| Q15: girl in Brněnsko | Geographic error eliminated | Inline locality substitution placed Silesian forms correctly |
| Q16: Brněnsko attestations | 33% → 9% zero-result | 3 rounds instead of 1; locality codes resolved |
| Q17: bouda variants | 4 rounds → 2, 26% → 0% zero | SPJMS card denorm returned all variants in 1 call |
| Q11: German loanwords | 27% → 22% zero | VSEARCH surfaced loanword headwords by meaning |

Phrase search violations (multi-word LOOKUPs that always return 0):
11 → 1 across the batch after P7 prompt change.

Gap Discovery

The zero-result log is structured data about what is missing from Czech dialectological documentation.

| Term | All dicts | Finding |
| --- | --- | --- |
| fěrtoch | 0 hits | Folk-costume terms not independently mapped in ČJA |
| koc (Brno slang for girl) | 0 hits | Hantec register absent from all 7 dictionaries |
| SNČJ entries E–Z | N/A | SNČJ dump covers only A–D (~4 000 of 45 000 entries) |

A funded project could take the zero-result map and use it to commission targeted editorial work in the dictionaries with incomplete coverage. The system identifies where the gaps are; domain experts fill them.

What's Next

Near term

  • R3 — HyDE query rewriting: Generate a hypothetical dictionary entry and embed it for vector search — embeds closer to real entries than keyword queries (Gao et al. 2022)
  • SNČJ complete scrape: Full 45 000 entries (currently A–D only)
  • Coverage audit: Systematic per-dict comparison of web UI ↔ SQLite ↔ LLM output ↔ vector index ↔ FTS

Research opportunities

  • ÚJČ collaboration: API access to IJP, MJČ, Vokabulář webový
  • Crowdsourced gap-filling: Surface zero-result queries to domain experts as missing-entry candidates
  • Annotation layer: Allow linguists to rate and correct LLM answers; build a fine-tuning dataset for Czech dialectology

Summary

dictpert is a grounded, multi-source RAG system for Czech dialectological research.

The key insight is not the LLM — it's the retrieval and verification layer that makes LLM output auditable and traceable to sources.

Techniques: agentic tool-use loop · dense retrieval · FTS5 · geographic enrichment · citation verification · leakage detection · systematic evaluation

The entire project — scrapers, backends, eval framework, prompt iteration, this presentation — was built with Claude Code. The underlying dictionaries are the work of decades of human linguists; the AI amplifies access to that knowledge.

It doesn't replace expertise — it reduces the friction between a researcher's question and the evidence that exists to answer it, and it makes the limits of that evidence explicit.

Source code: Python · Streamlit · SQLite · ChromaDB · sentence-transformers · Playwright

B1 — Model: paraphrase-multilingual-mpnet-base-v2

Architecture: an XLM-RoBERTa student model distilled from the MPNet-based teacher paraphrase-mpnet-base-v2. MPNet (Microsoft, 2020) combines masked language modelling with permuted language modelling for stronger token dependency modelling than BERT.

Fine-tuning: sentence-transformers multilingual knowledge distillation on 50+ languages, ~1 billion sentence pairs (Reimers & Gurevych 2020). The model learns to map meaning-equivalent sentences to nearby points in 768-dimensional space regardless of surface form or language.

Why this model for Czech dialectology:

| Property | Value | Relevance |
| --- | --- | --- |
| Embedding dim | 768 | Good capacity for domain-specific semantic space |
| Languages | 50+ | Czech, Slovak, German, Latin all in shared space |
| Max tokens | 128 | Sufficient for dictionary entry snippets |
| Model size | 278M parameters | Runs on CPU without GPU hardware |

Alternative considered: czert-b (Czech-only BERT). It handles Czech morphology better but has no multilingual embedding space, which rules out cross-lingual loanword matching.

Reference: Reimers & Gurevych (2020). Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation. EMNLP 2020.

B2 — HyDE: Hypothetical Document Embeddings (planned)

Problem with keyword queries for vector search:
The user asks "povel pro koně" (command for horses). Embedding this question puts it near other questions, not near dictionary entries about horse commands.

HyDE approach (Gao et al. 2022):

Step 1 — Generate hypothetical entry:
  "hý: citoslovce. Povel pro koně k zastavení nebo otočení. 
   Varianty: hýja, híja (Morava). Doklady: ..."

Step 2 — Embed the hypothetical entry (not the question)

Step 3 — Retrieve real entries by cosine similarity to the hypothesis

The hypothesis embeds much closer to real dictionary entries than a plain question does — because it has the same syntactic structure, register, and vocabulary as the target.

Expected gain: Higher precision in top-k retrieval for semantic queries, reducing the need for follow-up LOOKUP rounds.
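
Since this feature is planned rather than built, the following is only a sketch of the three steps as a function over injected components. `hyde_search` and its parameters are hypothetical: `generate_entry` stands in for an LLM call, `embed` for the sentence-embedding model, and `search` for the vector index query.

```python
from typing import Callable, List, Sequence


def hyde_search(question: str,
                generate_entry: Callable[[str], str],
                embed: Callable[[str], Sequence[float]],
                search: Callable[[Sequence[float], int], List[str]],
                k: int = 10) -> List[str]:
    """Retrieve entries by similarity to a hypothetical entry, not to the question."""
    hypothesis = generate_entry(question)  # Step 1: write a plausible dictionary entry
    vector = embed(hypothesis)             # Step 2: embed the hypothesis
    return search(vector, k)               # Step 3: nearest real entries


# Usage with trivial stand-ins, to show the data flow only.
embedded_texts = []


def fake_generate(question):
    return "hý: citoslovce. Povel pro koně."


def fake_embed(text):
    embedded_texts.append(text)
    return [float(len(text))]


def fake_search(vector, k):
    return ["hý", "prr"][:k]


results = hyde_search("povel pro koně", fake_generate, fake_embed, fake_search, k=1)
```

The hypothetical entry may contain invented details; that is acceptable here because it is used only as a search key, and the answer is still built from the real entries it retrieves.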

Reference: Gao et al. (2022). Precise Zero-Shot Dense Retrieval Without Relevance Labels. arXiv:2212.10496.