PLainBench - Polish Text Simplification Leaderboard
This benchmark evaluates how well LLMs simplify difficult Polish texts - drawn from legal/administrative (BIP/GOV), finance, and science domains - while preserving the original meaning. Each model simplifies 210 source texts under 5 simplification prompts (1050 outputs per model). Outputs are scored on readability indices, fine-grained difficulty markers (lexical, syntactic, morphological), meaning preservation (NLI entailment, QuestEval QA consistency, named-entity retention), and instruction following (IFEval include/exclude). The per-category scores are fused into an overall Final RRF ranking.
Final model ranking via Reciprocal Rank Fusion (k=60) over per-category RRF scores. Each category ranks models by its own RRF score; those ranks are then fused into a single Final RRF score. Higher = better overall simplification. The PLCC column shows the model's score on the external PLCC Polish-language-competence benchmark for reference only - it does not affect the ranking (blank where unavailable).
N = number of prompt × text evaluations per model.
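The two-stage fusion described above can be sketched as follows. This is a minimal illustration, not the leaderboard's actual code: the model names, category names, per-category scores and the tie-breaking rule are all hypothetical, and only the constant k=60 comes from the text.

```python
from collections import defaultdict

K = 60  # RRF constant stated in the leaderboard description


def ranks_from_scores(scores):
    """Map model -> 1-based rank, highest score first (ties broken by name; an assumption)."""
    ordered = sorted(scores, key=lambda m: (-scores[m], m))
    return {model: i + 1 for i, model in enumerate(ordered)}


def rrf(rankings, k=K):
    """Fuse several rank dicts: score(model) = sum over rankings of 1 / (k + rank)."""
    fused = defaultdict(float)
    for ranking in rankings:
        for model, rank in ranking.items():
            fused[model] += 1.0 / (k + rank)
    return dict(fused)


# Hypothetical per-category RRF scores for three made-up models.
category_scores = {
    "readability": {"model_a": 0.048, "model_b": 0.049, "model_c": 0.047},
    "similarity": {"model_a": 0.049, "model_b": 0.048, "model_c": 0.047},
    "ifeval": {"model_a": 0.049, "model_b": 0.047, "model_c": 0.048},
}

# Stage 1: each category ranks the models by its own RRF score.
category_ranks = [ranks_from_scores(s) for s in category_scores.values()]
# Stage 2: those ranks are fused into a single Final RRF score.
final_rrf = rrf(category_ranks)
leaderboard = sorted(final_rrf, key=final_rrf.get, reverse=True)
```

Because RRF uses only ranks, a model that is narrowly ahead in a category gains exactly as much as one that is far ahead.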
Metrics
Readability indices
All indices are adapted for Polish syllable counting via pyhyphen (pl_PL dictionary) and counted on surface (orthographic) word forms.
Δ is the absolute change (after − before); Δ% is the average percentage change from the original text to the simplified text.
| Metric | Formula | Interpretation |
|---|---|---|
| Flesch Reading Ease | 206.835 − 1.015 × (words/sentences) − 84.6 × (syllables/words) | Higher → easier text (0–100 typical range). Desired Δ%: positive (+) |
| Gunning Fog | 0.4 × [(words/sentences) + 100 × (complex_words/words)] | School years needed (complex = ≥ 4 syllables). Lower → easier. Desired Δ%: negative (−) |
| Coleman-Liau | 0.0588 × L − 0.296 × S − 15.8 | Character-based grade level (L = letters per 100 words, S = sentences per 100 words). Lower → easier. Desired Δ%: negative (−) |
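The three formulas in the table can be written directly as functions of raw counts. This sketch assumes the counts (words, sentences, syllables, letters, complex words) are already available; how the benchmark tokenizes text and counts syllables is not shown here, and the example numbers are made up.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Higher = easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def gunning_fog(words, sentences, complex_words):
    """Lower = easier; complex_words = words with >= 4 syllables, per the table."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


def coleman_liau(letters, words, sentences):
    """Lower = easier; L and S are letters and sentences per 100 words."""
    L = letters / words * 100
    S = sentences / words * 100
    return 0.0588 * L - 0.296 * S - 15.8


# Hypothetical counts for one text: 100 words in 5 sentences.
fre = flesch_reading_ease(words=100, sentences=5, syllables=200)
fog = gunning_fog(words=100, sentences=5, complex_words=20)
cli = coleman_liau(letters=600, words=100, sentences=5)
```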
Difficulty markers
Fine-grained syntactic, morphological, and lexical features. Δ is absolute change; Δ% is percentage change. A difficult word is a word of ≥ 4 syllables that is not a named entity; syllables are counted on the surface (orthographic) form.
| Marker | Description | Desired Δ% |
|---|---|---|
| Avg word syllables | Mean syllable count per word | − (shorter words) |
| Difficult word ratio (orth) | Difficult words / all words (surface, excl. NEs) | − |
| Difficult noun ratio (orth) | Difficult nouns / all tokens (surface, excl. NEs) | − |
| Verb ratio | Verbs / all tokens | + (more verbal, less nominal) |
| Avg sentence length | Mean tokens per sentence | − (shorter sentences) |
| Mean dep. distance | Avg linear head-dependent distance (syntax complexity) | − (flatter syntax) |
| Subordination index | Subordinate clauses / total clauses | − |
| Adverbial participle ratio | Adverbial participles (converbs, e.g. czytając, przeczytawszy) / all tokens | − |
| Gerund ratio | Gerunds / all tokens | − |
| Impersonal verb ratio | Impersonal verb forms (modals należy/trzeba, -no/-to passives, reflexive się, infinitives) / all verbs | − |
| Genitive noun ratio | Nouns in genitive case / all tokens | − |
| Avg genitive chain | Mean length of consecutive genitive noun phrases | − |
| Verbo-nominal ratio | Light-verb + noun periphrases (dokonać wpłaty, podjąć decyzję); administrative style | − |
| OSC noun ratio | Abstract -ość nouns (możliwość, konieczność) / all nouns | − |
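The Δ and Δ% columns can be illustrated with a small sketch. It assumes that Δ% is the mean of per-text percentage changes (rather than the percentage change of the means) and that texts with a zero original value are not present; both are assumptions, and the before/after values are made up.

```python
from statistics import mean


def delta(before, after):
    """Absolute change: after - before."""
    return after - before


def delta_pct(before, after):
    """Percentage change relative to the original text (before must be non-zero)."""
    return 100.0 * (after - before) / before


# Hypothetical (before, after) average sentence lengths for three texts.
pairs = [(20.0, 15.0), (30.0, 24.0), (10.0, 9.0)]

mean_delta = mean(delta(b, a) for b, a in pairs)
mean_delta_pct = mean(delta_pct(b, a) for b, a in pairs)
```

For a marker whose desired Δ% is negative, such as average sentence length, both values here point in the desired direction.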
Similarity metrics
Reference-based metrics comparing simplified text against the original.
| Metric | Description | Direction |
|---|---|---|
| NLI P / R / F1 | NLI consistency via stella embeddings + mDeBERTa cross-encoder | Higher = stronger entailment |
| NE Retention | Fraction of named entities from the original kept in the simplified text | Higher = more entities preserved |
Only NLI F1 feeds the RRF score; P and R are shown for context.
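NE Retention is a simple fraction, sketched below. The matching rule here (case-insensitive substring match on the exact surface form) is an assumption for illustration; the benchmark's own entity extraction and matching, which would need to cope with Polish inflection, are not described in this section.

```python
def ne_retention(original_entities, simplified_text):
    """Fraction of unique original entities found verbatim in the simplified text.

    Returns None when the original contains no entities (nothing to retain).
    """
    unique = {e.casefold() for e in original_entities}
    if not unique:
        return None
    text = simplified_text.casefold()
    kept = sum(1 for entity in unique if entity in text)
    return kept / len(unique)


# Hypothetical example: one of three entities is dropped.
entities = ["Warszawa", "ZUS", "Jan Kowalski"]
simplified = "Jan Kowalski składa wniosek do ZUS."
retention = ne_retention(entities, simplified)
```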
QuestEval - QA consistency
| Metric | Description | Direction |
|---|---|---|
| QuestEval P | Backward precision - grounding of simplified claims | Higher = fewer hallucinations |
| QuestEval R | Forward recall - information preserved | Higher = less content dropped |
| QuestEval F1 | Harmonic mean of P and R | Higher = better meaning preservation |
| Answerable (fwd) | Fraction of forward questions answerable | Higher = stays on-topic |
| Answerable (bwd) | Fraction of backward questions answerable | Higher = claims traceable to original |
Only QuestEval F1 feeds the RRF score; the other rows are shown for context.
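QuestEval F1 is the harmonic mean of P and R, as the table states. A minimal sketch follows; the zero-division convention and the example values are assumptions.

```python
def harmonic_f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero (a convention)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical scores: well-grounded output that drops some content.
f1 = harmonic_f1(precision=0.8, recall=0.6)
```

The harmonic mean penalizes imbalance, so a model cannot reach a high F1 by preserving content while hallucinating, or the reverse.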
IFEval - instruction following
| Metric | Description | Direction |
|---|---|---|
| IFEval include | Fraction of include constraints (terms the simplification must keep) satisfied | Higher = better |
| IFEval exclude | Fraction of exclude constraints (terms the simplification must avoid) satisfied | Higher = better |
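The two IFEval scores can be sketched as constraint-satisfaction fractions for a single output. Case-insensitive substring matching and the example terms are assumptions; the benchmark may match terms differently (for example on lemmas).

```python
def ifeval_scores(output, include_terms, exclude_terms):
    """Return (include_score, exclude_score); None where no constraints are given."""
    text = output.casefold()
    include_ok = [term.casefold() in text for term in include_terms]
    exclude_ok = [term.casefold() not in text for term in exclude_terms]
    include_score = sum(include_ok) / len(include_ok) if include_ok else None
    exclude_score = sum(exclude_ok) / len(exclude_ok) if exclude_ok else None
    return include_score, exclude_score


# Hypothetical simplified output and constraints.
output = "Złóż wniosek w urzędzie do 30 dni."
include_score, exclude_score = ifeval_scores(
    output,
    include_terms=["wniosek", "30 dni"],      # both kept
    exclude_terms=["niniejszym", "urzędzie"],  # "urzędzie" still appears
)
```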
Simplification prompts
The five prompt templates every model is run with; these are the values of the Simplification prompt filter above. Each source text is simplified once per prompt. The prompts range from a bare one-line instruction to full plain-language guidelines. <text> marks where the source text is inserted.