PLainBench - Polish Text Simplification Leaderboard

This benchmark evaluates how well LLMs simplify difficult Polish texts - drawn from legal/administrative (BIP/GOV), finance, and science domains - while preserving the original meaning. Each model simplifies 210 source texts under 5 simplification prompts (1050 outputs per model). Outputs are scored on readability indices, fine-grained difficulty markers (lexical, syntactic, morphological), meaning preservation (NLI entailment, QuestEval QA consistency, named-entity retention), and instruction following (IFEval include/exclude). The per-category scores are fused into an overall Final RRF ranking.

Text category

Filter the RRF rankings to one source-text category.

Simplification prompt

Filter the RRF rankings to one simplification prompt.

Size limit

Keep only models up to this many parameters.

Model type

Filter by open- vs closed-weights models.

Final model ranking via Reciprocal Rank Fusion (k=60) over per-category RRF scores. Each category ranks models by its own RRF score; those ranks are then fused into a single Final RRF score. Higher = better overall simplification. The PLCC column shows the model's score on the external PLCC Polish-language-competence benchmark for reference only - it does not affect the ranking (blank where unavailable).

Rank	Model	Params	N	Final RRF	PLCC	Readability	Lexical Difficulty	Syntactic	Morphological	Meaning Preservation
10	google/gemma-4-26b-a4b-it [reasoning: high]	120.4B	1050	0.1185	32.33	18	15	18	18	16

N = number of prompt × text evaluations per model (5 prompts × 210 texts = 1050 for a complete run).
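The two-stage fusion described above can be sketched in a few lines. This is a minimal illustration of standard Reciprocal Rank Fusion with k=60, not the leaderboard's actual code: the model names, the category rankings, and the helper names `rrf_fuse` and `rank_by_score` are all hypothetical, and any weighting, normalization, or missing-score handling the leaderboard may apply is not reproduced here.

```python
from collections import defaultdict

K = 60  # RRF constant stated in the leaderboard description


def rrf_fuse(rankings, k=K):
    """Fuse rankings (lists of model names, best first) into RRF scores.

    Each ranking contributes 1 / (k + rank) to a model's score, rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, model in enumerate(ranking, start=1):
            scores[model] += 1.0 / (k + rank)
    return dict(scores)


def rank_by_score(scores):
    """Return model names by descending score (ties broken alphabetically)."""
    return sorted(scores, key=lambda m: (-scores[m], m))


# Hypothetical per-category rankings for three made-up models.
category_rankings = {
    "Readability": ["model-a", "model-b", "model-c"],
    "Syntactic": ["model-b", "model-a", "model-c"],
    "Meaning Preservation": ["model-a", "model-c", "model-b"],
}

final_scores = rrf_fuse(category_rankings.values())
final_ranking = rank_by_score(final_scores)
print(final_ranking)
```

Because RRF uses only ranks, a model that is consistently near the top in every category beats one that wins a single category and trails elsewhere.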

Metrics

Readability indices

All indices are adapted for Polish syllable counting via pyhyphen (pl_PL dictionary) and counted on surface (orthographic) word forms.

Δ is the absolute change (after − before); Δ% is the average percentage change from the original text to the simplified text.

Metric	Formula	Interpretation
Flesch Reading Ease	`206.835 − 1.015 × (words/sentences) − 84.6 × (syllables/words)`	Higher → easier text (0–100 typical range). Desired Δ%: positive (+)
Gunning Fog	`0.4 × [(words/sentences) + 100 × (complex_words/words)]`	School years needed (complex = ≥ 4 syllables). Lower → easier. Desired Δ%: negative (−)
Coleman-Liau	`0.0588 × L − 0.296 × S − 15.8`	Character-based grade level (L = letters per 100 words, S = sentences per 100 words). Lower → easier. Desired Δ%: negative (−)
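As a rough illustration, the three formulas in the table can be computed from raw counts as below. The counts are invented. Syllable counting (done with pyhyphen in the benchmark) is assumed to have happened upstream. The `delta_pct` helper shows the percentage change for a single text; averaging it over texts, and dividing by the absolute baseline, are assumptions rather than documented behaviour.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease from raw counts (higher = easier)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def gunning_fog(words, sentences, complex_words):
    """Gunning Fog from raw counts; complex words are those with >= 4 syllables."""
    return 0.4 * ((words / sentences) + 100.0 * (complex_words / words))


def coleman_liau(letters, words, sentences):
    """Coleman-Liau index; L and S are letters and sentences per 100 words."""
    L = letters / words * 100.0
    S = sentences / words * 100.0
    return 0.0588 * L - 0.296 * S - 15.8


def delta(before, after):
    """Absolute change: after minus before."""
    return after - before


def delta_pct(before, after):
    """Percentage change relative to the original value (assumes before != 0)."""
    return 100.0 * (after - before) / abs(before)


# Invented counts for one original text and its simplification.
original = dict(words=100, sentences=4, syllables=250, complex_words=20, letters=600)
simplified = dict(words=100, sentences=10, syllables=200, complex_words=8, letters=500)

fog_before = gunning_fog(original["words"], original["sentences"], original["complex_words"])
fog_after = gunning_fog(simplified["words"], simplified["sentences"], simplified["complex_words"])
fre_after = flesch_reading_ease(simplified["words"], simplified["sentences"], simplified["syllables"])
cli_after = coleman_liau(simplified["letters"], simplified["words"], simplified["sentences"])

print(round(fog_before, 2), round(fog_after, 2), round(delta_pct(fog_before, fog_after), 1))
```

In this made-up case Gunning Fog falls, so its Δ% is negative, which is the desired direction for that index.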

Difficulty markers

Fine-grained syntactic, morphological, and lexical features. Δ is the absolute change; Δ% is the percentage change. A difficult word is a word of ≥ 4 syllables that is not a named entity, counted on the surface (orthographic) form.

Marker	Description	Desired Δ%
Avg word syllables	Mean syllable count per word	− (shorter words)
Difficult word ratio (orth)	Difficult words / all words (surface, excl. NEs)	−
Difficult noun ratio (orth)	Difficult nouns / all tokens (surface, excl. NEs)	−
Verb ratio	Verbs / all tokens	+ (more verbal, less nominal)
Avg sentence length	Mean tokens per sentence	− (shorter sentences)
Mean dep. distance	Avg linear head-dependent distance (syntax complexity)	− (flatter syntax)
Subordination index	Subordinate clauses / total clauses	−
Adverbial participle ratio	Adverbial participles (converbs, e.g. czytając, przeczytawszy) / all tokens	−
Gerund ratio	Gerunds / all tokens	−
Impersonal verb ratio	Impersonal verb forms (modals należy/trzeba, -no/-to passives, reflexive się, infinitives) / all verbs	−
Genitive noun ratio	Nouns in genitive case / all tokens	−
Avg genitive chain	Mean length of consecutive genitive noun phrases	−
Verbo-nominal ratio	Light-verb + noun periphrases (dokonać wpłaty, podjąć decyzję); administrative style	−
OSC noun ratio	Abstract -ość nouns (możliwość, konieczność) / all nouns	−
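Two of the markers above, the difficult word ratio and the mean dependency distance, can be sketched on a toy sentence. The `Token` structure, the hand-assigned syllable counts, and the head indices are hypothetical. The benchmark's actual parser and tokenization are not specified here, and using all tokens as the denominator is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str
    syllables: int       # assumed to come from an upstream syllable counter
    head: int            # 1-based index of the syntactic head, 0 for the root
    is_ne: bool = False  # True if the token belongs to a named entity


def difficult_word_ratio(tokens, min_syllables=4):
    """Share of tokens that are not named entities and have >= min_syllables."""
    if not tokens:
        return 0.0
    difficult = sum(1 for t in tokens if not t.is_ne and t.syllables >= min_syllables)
    return difficult / len(tokens)


def mean_dependency_distance(tokens):
    """Average absolute linear distance between each non-root token and its head."""
    distances = [abs(i - t.head) for i, t in enumerate(tokens, start=1) if t.head != 0]
    return sum(distances) / len(distances) if distances else 0.0


# Toy sentence "Wnioskodawca składa oświadczenie" with hand-assigned annotations.
sentence = [
    Token("Wnioskodawca", syllables=4, head=2),
    Token("składa", syllables=2, head=0),
    Token("oświadczenie", syllables=4, head=2),
]

ratio = difficult_word_ratio(sentence)
mdd = mean_dependency_distance(sentence)
print(ratio, mdd)
```

Replacing the two four-syllable words with shorter ones would lower the first marker, while splitting long sentences tends to lower the second.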

Similarity metrics

Metrics that compare the simplified text against the original, which serves as the reference.

Metric	Description	Direction
NLI P / R / F1	NLI consistency via stella embeddings + mDeBERTa cross-encoder	Higher = stronger entailment
NE Retention	Fraction of named entities from the original kept in the simplified text	Higher = more entities preserved

Only NLI F1 feeds the RRF score; P and R are shown for context.
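The NE Retention row can be read as a simple fraction. The sketch below assumes the named entities have already been extracted from the original and matches them by case-insensitive substring, which is an assumption: Polish inflection (for example Warszawa vs. Warszawie) means a real implementation would likely need lemmatization or fuzzy matching. Returning 1.0 when the original has no entities is also an assumption.

```python
def ne_retention(original_entities, simplified_text):
    """Fraction of distinct original named entities found in the simplified text."""
    entities = {e.strip().lower() for e in original_entities if e.strip()}
    if not entities:
        return 1.0  # assumption: nothing to retain counts as full retention
    text = simplified_text.lower()
    kept = sum(1 for e in entities if e in text)
    return kept / len(entities)


# Hypothetical entities and simplified output.
entities = ["ZUS", "Warszawa", "Jan Kowalski"]
simplified = "Jan Kowalski składa wniosek do ZUS."

retention = ne_retention(entities, simplified)
print(round(retention, 3))
```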

QuestEval - QA consistency

Metric	Description	Direction
QuestEval P	Backward precision - grounding of simplified claims	Higher = fewer hallucinations
QuestEval R	Forward recall - information preserved	Higher = less content dropped
QuestEval F1	Harmonic mean of P and R	Higher = better meaning preservation
Answerable (fwd)	Fraction of forward questions answerable	Higher = stays on-topic
Answerable (bwd)	Fraction of backward questions answerable	Higher = claims traceable to original

Only QuestEval F1 feeds the RRF score; the other rows are shown for context.
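Since QuestEval F1 is the harmonic mean of P and R, a low value on either side pulls the score down sharply. A minimal sketch with invented values follows; whether the benchmark computes F1 per output and then averages, or from averaged P and R, is not stated here.

```python
def harmonic_f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented scores: a faithful but lossy output vs. a balanced one.
lossy = harmonic_f1(0.95, 0.40)     # few hallucinations, much content dropped
balanced = harmonic_f1(0.70, 0.65)  # moderate on both sides

print(round(lossy, 3), round(balanced, 3))
```

With these numbers the balanced output scores higher, even though its precision is lower.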

IFEval - instruction following

Metric	Description	Direction
IFEval include	Fraction of include constraints (terms the simplification must keep) satisfied	Higher = better
IFEval exclude	Fraction of exclude constraints (terms the simplification must avoid) satisfied	Higher = better
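The include and exclude scores can be sketched as fractions of satisfied constraints for one output. The terms and text below are invented, case-insensitive substring matching is an assumption (inflected Polish forms would need more care), and how the leaderboard aggregates across outputs is not specified here.

```python
def include_score(text, must_include):
    """Fraction of required terms that appear in the text."""
    if not must_include:
        return 1.0  # assumption: no constraints means fully satisfied
    lowered = text.lower()
    return sum(term.lower() in lowered for term in must_include) / len(must_include)


def exclude_score(text, must_exclude):
    """Fraction of forbidden terms that are absent from the text."""
    if not must_exclude:
        return 1.0  # assumption: no constraints means fully satisfied
    lowered = text.lower()
    return sum(term.lower() not in lowered for term in must_exclude) / len(must_exclude)


# Hypothetical simplified output and constraints.
output = "Zapłać podatek do 30 kwietnia."
inc = include_score(output, ["podatek", "30 kwietnia"])
exc = exclude_score(output, ["uiścić", "zapłać"])

print(inc, exc)
```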

Simplification prompts

The five prompt templates every model is run with - these are the values of the Simplification prompt filter above. Each source text is simplified once per prompt. The prompts range from a bare one-line instruction to full plain-language guidelines. <text> marks where the source text is inserted.