PLainBench - Polish Text Simplification Leaderboard

This benchmark evaluates how well LLMs simplify difficult Polish texts - drawn from legal/administrative (BIP/GOV), finance, and science domains - while preserving the original meaning. Each model simplifies 210 source texts under 5 simplification prompts (1050 outputs per model). Outputs are scored on readability indices, fine-grained difficulty markers (lexical, syntactic, morphological), meaning preservation (NLI entailment, QuestEval QA consistency, named-entity retention), and instruction following (IFEval include/exclude). The per-category scores are fused into an overall Final RRF ranking.

Text category: filter the RRF rankings to one source-text category.
Simplification prompt: filter the RRF rankings to one simplification prompt.
Size limit: keep only models up to this many parameters.
Model type: filter by open- vs closed-weights models.

The final model ranking uses Reciprocal Rank Fusion (k=60) over per-category RRF scores. Each category ranks models by its own RRF score; those ranks are then fused into a single Final RRF score. Higher = better overall simplification. The PLCC column shows the model's score on the external PLCC Polish-language-competence benchmark for reference only; it does not affect the ranking and is blank where no score is available.
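The two-stage fusion described above can be sketched as follows. This is a minimal illustration, assuming the standard RRF formula (score = sum of 1 / (k + rank) with 1-based ranks); the model names, category names, scores, and tie handling are hypothetical and not taken from the leaderboard.

```python
from collections import defaultdict

K = 60  # RRF constant stated in the text


def ranks_from_scores(scores, higher_is_better=True):
    """Map each model to its 1-based rank (ties broken by sort order, an assumption)."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {model: position for position, model in enumerate(ordered, start=1)}


def rrf(rankings, k=K):
    """Fuse several {model: rank} dicts into one {model: RRF score} dict."""
    fused = defaultdict(float)
    for ranking in rankings:
        for model, rank in ranking.items():
            fused[model] += 1.0 / (k + rank)
    return dict(fused)


# Hypothetical per-category RRF scores for three made-up models.
category_scores = {
    "readability": {"model_a": 0.048, "model_b": 0.046, "model_c": 0.045},
    "similarity": {"model_a": 0.045, "model_b": 0.048, "model_c": 0.046},
    "ifeval": {"model_a": 0.048, "model_b": 0.046, "model_c": 0.045},
}

# Stage 1: turn each category's scores into ranks.
category_ranks = [ranks_from_scores(s) for s in category_scores.values()]
# Stage 2: fuse the per-category ranks into the Final RRF score.
final_rrf = rrf(category_ranks)
final_order = sorted(final_rrf, key=final_rrf.get, reverse=True)
```

Because each rank contributes 1 / (k + rank), a large k such as 60 flattens the differences between neighbouring ranks, so a model needs consistently good ranks across categories rather than a single first place.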

N = number of prompt × text evaluations per model.

Metrics

Readability indices

All indices are adapted for Polish syllable counting via pyhyphen (pl_PL dictionary) and counted on surface (orthographic) word forms.

Δ is the absolute change (after − before); Δ% is the average percentage change from the original text to the simplified text.
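A small sketch of the two change measures, assuming that the "average percentage change" is the mean of per-text percentage changes and that texts with a zero baseline are skipped. Both assumptions are illustrative; the benchmark may aggregate differently.

```python
def delta(before, after):
    """Absolute change: after minus before."""
    return after - before


def delta_pct(before, after):
    """Percentage change relative to the original text's value."""
    return 100.0 * (after - before) / before


def mean_delta_pct(pairs):
    """Mean of per-text percentage changes; zero baselines are skipped (assumption)."""
    pcts = [delta_pct(before, after) for before, after in pairs if before != 0]
    return sum(pcts) / len(pcts)


# Hypothetical (before, after) Flesch Reading Ease values for two texts.
example_pairs = [(40.0, 50.0), (20.0, 30.0)]
example_mean = mean_delta_pct(example_pairs)  # mean of +25% and +50%
```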

| Metric | Formula | Interpretation |
| --- | --- | --- |
| Flesch Reading Ease | 206.835 − 1.015 × (words/sentences) − 84.6 × (syllables/words) | Higher → easier text (0–100 typical range). Desired Δ%: positive (+) |
| Gunning Fog | 0.4 × [(words/sentences) + 100 × (complex_words/words)] | School years needed (complex = ≥ 4 syllables). Lower → easier. Desired Δ%: negative (−) |
| Coleman-Liau | 0.0588 × L − 0.296 × S − 15.8, where L = letters per 100 words and S = sentences per 100 words | Character-based grade level. Lower → easier. Desired Δ%: negative (−) |
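The three formulas can be written directly as functions of raw counts. This sketch takes the counts as inputs and does not reproduce the pyhyphen-based syllable counting; for Coleman-Liau it uses the standard definitions L = letters per 100 words and S = sentences per 100 words. The example counts are invented.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Higher means easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def gunning_fog(words, sentences, complex_words):
    """complex_words: words with at least 4 syllables, as defined in the table."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


def coleman_liau(letters, words, sentences):
    """Character-based grade level; lower means easier text."""
    letters_per_100 = 100 * letters / words      # L
    sentences_per_100 = 100 * sentences / words  # S
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


# Hypothetical counts for one text.
counts = {"words": 100, "sentences": 5, "syllables": 220,
          "complex_words": 15, "letters": 600}

fre = flesch_reading_ease(counts["words"], counts["sentences"], counts["syllables"])
fog = gunning_fog(counts["words"], counts["sentences"], counts["complex_words"])
cli = coleman_liau(counts["letters"], counts["words"], counts["sentences"])
```

With these counts the Flesch score is close to zero while both grade-level indices are high, which is the pattern expected of a dense administrative text.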

Difficulty markers

Fine-grained syntactic, morphological, and lexical features. Δ is the absolute change; Δ% is the percentage change. A word counts as difficult if it is not a named entity and has ≥ 4 syllables, counted on its surface (orthographic) form.

| Marker | Description | Desired Δ% |
| --- | --- | --- |
| Avg word syllables | Mean syllable count per word | − (shorter words) |
| Difficult word ratio (orth) | Difficult words / all words (surface, excl. NEs) | |
| Difficult noun ratio (orth) | Difficult nouns / all tokens (surface, excl. NEs) | |
| Verb ratio | Verbs / all tokens | + (more verbal, less nominal) |
| Avg sentence length | Mean tokens per sentence | − (shorter sentences) |
| Mean dep. distance | Avg linear head-dependent distance (syntax complexity) | − (flatter syntax) |
| Subordination index | Subordinate clauses / total clauses | |
| Adverbial participle ratio | Adverbial participles (converbs, e.g. czytając, przeczytawszy) / all tokens | |
| Gerund ratio | Gerunds / all tokens | |
| Impersonal verb ratio | Impersonal verb forms (modals należy/trzeba, -no/-to passives, reflexive się, infinitives) / all verbs | |
| Genitive noun ratio | Nouns in genitive case / all tokens | |
| Avg genitive chain | Mean length of consecutive genitive noun phrases | |
| Verbo-nominal ratio | Light-verb + noun periphrases (dokonać wpłaty, podjąć decyzję); administrative style | |
| OSC noun ratio | Abstract -ość nouns (możliwość, konieczność) / all nouns | |
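As an illustration of one marker, mean dependency distance can be computed from a parsed sentence as below. The sketch assumes 1-based head indices with 0 marking the root, and it excludes the root token from the average; both conventions are assumptions, and the parse of the example sentence is written by hand rather than produced by a parser.

```python
def mean_dependency_distance(heads):
    """heads[i] is the 1-based index of the head of token i + 1; 0 marks the root."""
    distances = [
        abs(head - position)
        for position, head in enumerate(heads, start=1)
        if head != 0  # the root has no head, so it contributes no distance
    ]
    return sum(distances) / len(distances) if distances else 0.0


# "Urząd wydał decyzję wczoraj": every other token attaches to the verb "wydał".
example_heads = [2, 0, 2, 2]
example_mdd = mean_dependency_distance(example_heads)  # distances 1, 1, 2
```

Shorter sentences with dependents close to their heads lower this value, which matches the "flatter syntax" direction in the table.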

Similarity metrics

Reference-based metrics comparing simplified text against the original.

| Metric | Description | Direction |
| --- | --- | --- |
| NLI P / R / F1 | NLI consistency via stella embeddings + mDeBERTa cross-encoder | Higher = stronger entailment |
| NE Retention | Fraction of named entities from the original kept in the simplified text | Higher = more entities preserved |

Only NLI F1 feeds the RRF score; P and R are shown for context.
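NE Retention can be approximated as below. This is a rough sketch that assumes the entities have already been extracted from the original and that matching is a case-insensitive substring test. Polish inflection means a real implementation would likely need lemma-based matching; returning 1.0 when the original has no entities is also an assumption.

```python
def ne_retention(original_entities, simplified_text):
    """Fraction of distinct original entities that still appear in the simplified text."""
    unique = {entity.casefold() for entity in original_entities}
    if not unique:
        return 1.0  # nothing to retain (assumption)
    text = simplified_text.casefold()
    kept = sum(1 for entity in unique if entity in text)
    return kept / len(unique)


# Hypothetical entities and simplified text.
example_entities = ["ZUS", "Warszawa", "Jan Kowalski"]
example_text = "Jan Kowalski złożył wniosek do ZUS."
example_retention = ne_retention(example_entities, example_text)  # 2 of 3 kept
```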

QuestEval - QA consistency

| Metric | Description | Direction |
| --- | --- | --- |
| QuestEval P | Backward precision - grounding of simplified claims | Higher = fewer hallucinations |
| QuestEval R | Forward recall - information preserved | Higher = less content dropped |
| QuestEval F1 | Harmonic mean of P and R | Higher = better meaning preservation |
| Answerable (fwd) | Fraction of forward questions answerable | Higher = stays on-topic |
| Answerable (bwd) | Fraction of backward questions answerable | Higher = claims traceable to original |

Only QuestEval F1 feeds the RRF score; the other rows are shown for context.
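Since F1 is the harmonic mean of P and R, a model cannot score well by maximizing only one of them. A minimal sketch follows, with invented P and R values and a zero result when both inputs are zero (an assumption about edge-case handling).

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero (assumption)
    return 2 * precision * recall / (precision + recall)


# Hypothetical scores: a model that drops content (low R) versus a balanced one.
lossy = f1(0.95, 0.40)
balanced = f1(0.70, 0.65)
```

Here the balanced model gets the higher F1 even though its precision is lower, because the harmonic mean is pulled toward the smaller of the two values.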

IFEval - instruction following

| Metric | Description | Direction |
| --- | --- | --- |
| IFEval include | Fraction of include constraints (terms the simplification must keep) satisfied | Higher = better |
| IFEval exclude | Fraction of exclude constraints (terms the simplification must avoid) satisfied | Higher = better |
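The two fractions can be sketched as follows, assuming each constraint is a single term checked by case-insensitive substring matching. The matching rule, the example terms, and the `None` result for an empty constraint list are illustrative assumptions, not the benchmark's documented behaviour.

```python
def constraint_scores(output, include_terms, exclude_terms):
    """Return (include fraction, exclude fraction); None where no constraints exist."""
    text = output.casefold()
    include_ok = [term.casefold() in text for term in include_terms]
    exclude_ok = [term.casefold() not in text for term in exclude_terms]
    include = sum(include_ok) / len(include_ok) if include_ok else None
    exclude = sum(exclude_ok) / len(exclude_ok) if exclude_ok else None
    return include, exclude


# Hypothetical model output and constraints.
example_output = "Złóż wniosek do urzędu w ciągu 14 dni."
example_include, example_exclude = constraint_scores(
    example_output,
    include_terms=["wniosek", "termin"],      # "termin" is missing from the output
    exclude_terms=["niniejszym", "urzędu"],   # "urzędu" appears in the output
)
```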

Simplification prompts

The five prompt templates every model is run with; these are the values of the Simplification prompt filter above. Each source text is simplified once per prompt. The prompts range from a bare one-line instruction to full plain-language guidelines. <text> marks where the source text is inserted.