dictpert

Resource	Scope
ČJA	Czech Language Atlas — 477 survey points, geographic forms
SNČJ	Czech dialect dictionary — definitions, examples, attestations (A–D only)
SPJMS	Minor place names — Moravia & Silesia (field names, pronunciation variants)
SPJČ	Minor place names — Bohemia
PSJČ	Historical dictionary of Czech (1935–1957)
ASSC	Contemporary academic Czech — phraseology
SSJČ	Standard Czech — 114 000+ entries

[ optional ] What Are Embeddings?

An embedding maps a word or sentence to a point in high-dimensional space, such that meaning ≈ proximity.

"holka"   → [0.12, -0.34, 0.87, … ]  (768 numbers)
"děvče"   → [0.11, -0.31, 0.85, … ]  ← nearby
"auto"    → [0.92,  0.54, -0.23, … ] ← far away
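A minimal sketch of "meaning ≈ proximity", assuming cosine similarity as the proximity measure (the slide does not name one) and using only the three components shown above rather than the full 768. The function and variable names are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Truncated toy vectors: only the first three components from the slide.
holka = [0.12, -0.34, 0.87]
devce = [0.11, -0.31, 0.85]
auto = [0.92, 0.54, -0.23]

sim_near = cosine_similarity(holka, devce)  # close to 1: similar meaning
sim_far = cosine_similarity(holka, auto)    # negative here: unrelated meaning
print(f"holka~děvče: {sim_near:.3f}, holka~auto: {sim_far:.3f}")
```

With real 768-dimensional vectors the numbers differ, but the ordering (synonyms score higher than unrelated words) is the property a semantic search relies on.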

You can do arithmetic on meaning:

king − man + woman ≈ queen

This is not a trick — it reflects statistical regularities learned from billions of words of text.
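The analogy can be reproduced on a hand-made toy space. The sketch below assumes two invented dimensions (roughly "royalty" and "gender") and made-up coordinates; real Word2Vec vectors are learned from text and only approximately satisfy the relation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical 2-d vectors: [royalty, gender]. Values are invented for illustration.
toy = {
    "king": [0.9, 0.1],
    "queen": [0.9, 0.9],
    "man": [0.1, 0.1],
    "woman": [0.1, 0.9],
    "apple": [0.5, 0.0],
}

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

best = analogy("king", "man", "woman", toy)
print(best)  # queen
```

Excluding the three input words from the candidates mirrors how such analogies are usually evaluated; without that step the nearest vector is often one of the inputs.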

Sentence embeddings extend the same idea to whole sentences and paragraphs, enabling semantic search: "find entries that mean something similar to this query."
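Semantic search then reduces to ranking stored vectors by similarity to the query vector. A minimal in-memory sketch, assuming cosine similarity and hand-written toy vectors; in a real pipeline both the entries and the query would be encoded by the embedding model, and the `index` structure here is hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(query_vec, index, k=2):
    """Return the ids of the k entries most similar to the query vector."""
    scored = sorted(
        ((cosine(query_vec, vec), entry_id) for entry_id, vec in index.items()),
        reverse=True,
    )
    return [entry_id for _, entry_id in scored[:k]]

# Toy index: entry id -> truncated vector (values from the slide above).
index = {
    "holka": [0.12, -0.34, 0.87],
    "děvče": [0.11, -0.31, 0.85],
    "auto": [0.92, 0.54, -0.23],
}

# Stand-in for an embedded query about "girl"; the numbers are invented.
query_vec = [0.10, -0.30, 0.90]
top = search(query_vec, index, k=2)
print(top)
```

A linear scan like this is adequate for small collections; larger ones typically use an approximate nearest-neighbour index instead.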

Origin — Tomáš Mikolov, Brno, 2013

Word2Vec (Mikolov et al., 2013) made dense word embeddings practical to train on very large corpora. It was published while Mikolov was at Google Brain; his doctoral research was done in Brno.

Mikolov did his MSc (2007) and PhD (2012) at FIT VUT — Faculty of Information Technology, Brno University of Technology.

From words to sentences:

Year	Method	What changed
2013	Word2Vec	Word-level embeddings
2018	BERT	Context-aware; same word, different vector per context
2019	Sentence-BERT	Full sentence → single vector; enables fast similarity search
2020	Multilingual SBERT	50+ languages in one shared space

We use Multilingual SBERT (the 2020 step): `paraphrase-multilingual-mpnet-base-v2`.

Reference: Mikolov et al. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.

Use case	What's stored
Semantic search	Document / paragraph embeddings
Recommendation	User & item embeddings
RAG retrieval	Knowledge-base chunk embeddings
Image search	CNN feature vectors

Category	Pattern	Example
Tool leak	`[LOOKUP:…]` in final answer	Model forgot to convert directive to prose
Broken citation	`[[ASSC:[stoupnout…]]]`	ASSC headwords contain `[`; breaks markup parser
Instruction leak	"v tomto sezení", "kolo N/N"	System-prompt internals surfacing in user text
ID leak	`cja` instead of "ČJA"	Internal dict key used instead of display name
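The four categories above lend themselves to one regular expression each. The sketch below is an assumption about how such a check could look, not the project's actual patterns: the category keys are invented, "N/N" is read as two integers, and only the `cja` key from the table is covered for ID leaks.

```python
import re

# Hypothetical patterns, one per category in the table above.
LEAK_PATTERNS = {
    "tool_leak": re.compile(r"\[LOOKUP:[^\]]*\]"),
    # A citation opener immediately followed by another "[" (bracketed headword).
    "broken_citation": re.compile(r"\[\[\w+:\["),
    # Assumes "kolo N/N" means "kolo <int>/<int>".
    "instruction_leak": re.compile(r"v tomto sezení|kolo \d+/\d+"),
    # Lowercase internal key; the display name "ČJA" does not match.
    "id_leak": re.compile(r"\bcja\b"),
}

def find_leaks(text):
    """Return the names of all leak categories detected in the text."""
    return [name for name, pattern in LEAK_PATTERNS.items() if pattern.search(text)]

print(find_leaks("Odpověď: [LOOKUP:holka]"))
print(find_leaks("Podle ČJA je tvar doložen na Moravě."))
```

Because the check is plain regex matching, it can run on every answer without an LLM call, which matches the "LLM? No" entry for leakage detection in the evaluation layers.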

Layer	Method	LLM?
Zero-result rate	DB log counts	No
Phrase search rate	Regex on lookup terms	No
Leakage detection	4 regex patterns	No
Coverage & grounding	Qualitative analysis	Claude
Issue catalogue	Pattern synthesis	Claude
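The two count-based layers can be computed directly from lookup logs. A minimal sketch, assuming each log record carries a `term` and a `hits` count (hypothetical field names) and treating any term with internal whitespace as a phrase search; the hit counts for the non-zero records are invented.

```python
import re

# Assumption: a "phrase search" is a lookup term containing internal whitespace.
PHRASE_RE = re.compile(r"\S\s+\S")

def zero_result_rate(lookups):
    """Fraction of lookups that returned no hits."""
    if not lookups:
        return 0.0
    return sum(1 for rec in lookups if rec["hits"] == 0) / len(lookups)

def phrase_search_rate(lookups):
    """Fraction of lookups whose term is a multi-word phrase."""
    if not lookups:
        return 0.0
    return sum(1 for rec in lookups if PHRASE_RE.search(rec["term"])) / len(lookups)

# Toy log records for illustration only.
lookups = [
    {"term": "holka", "hits": 3},
    {"term": "fěrtoch", "hits": 0},
    {"term": "koc", "hits": 0},
    {"term": "stará bouda", "hits": 1},
]

print(zero_result_rate(lookups), phrase_search_rate(lookups))  # 0.5 0.25
```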

Query	Change	Key improvement
Q8 — horse commands	4 rounds → 3; zero-result rate 39% → 0%	VSEARCH found interjection headwords semantically
Q15 — girl in Brněnsko	Geographic error eliminated	Inline locality sub placed Silesian forms correctly
Q16 — Brněnsko attestations	Zero-result rate 33% → 9%	3 rounds instead of 1; locality codes resolved
Q17 — bouda variants	4 rounds → 2; zero-result rate 26% → 0%	SPJMS card denormalization returned all variants in 1 call
Q11 — German loanwords	Zero-result rate 27% → 22%	VSEARCH surfaced loanword headwords by meaning

Term	All dicts	Finding
`fěrtoch`	0 hits	Folk-costume terms not independently mapped in ČJA
`koc` (Brno slang for girl)	0 hits	Hantec register absent from all 7 dictionaries
SNČJ entries E–Z	N/A	SNČJ dump covers only A–D (~4 000 of 45 000 entries)

dictpert

AI-Assisted Research Across Czech Language Dictionaries

The Problem

Why Not Just Ask an LLM?

System Architecture

AI/ML — Agentic Tool-Use Loop

AI/ML — Three Retrieval Modes

[ optional ] What is Hugging Face?

[ optional ] What Are Embeddings?

[ optional ] What is a Vector Database?

AI/ML — Dense Retrieval

AI/ML — Grounding & Citation Verification

Citation check pass

Cumulative lookup history

AI/ML — Leakage Detection

AI/ML — Evaluation Methodology

Per-run metrics

[ optional ] Claude Code — Custom Slash Commands

Agentic Eval Loop — `/automated-eval`

LLM as a Judge — Agents Evaluating Agents

Technique — Geographic Correlation

Technique — Data Enrichment & Denormalization

SPJMS field cards

FTS content enrichment

Features

Results — Eval Comparison

Gap Discovery

What's Next

Near term

Research opportunities

Summary

B1 — Model: paraphrase-multilingual-mpnet-base-v2

B2 — HyDE: Hypothetical Document Embeddings (planned)

Property	Value	Relevance
Embedding dim	768	Good capacity for domain-specific semantic space
Languages	50+	Czech, Slovak, German, Latin all in shared space
Max tokens	128	Sufficient for dictionary entry snippets
Model size	~278M parameters	Runs on CPU without GPU hardware
