
Four papers from the ItaliaNLP Lab of the CNR Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC) have been accepted at ACL 2025, the 63rd Annual Meeting of the Association for Computational Linguistics.
“From Human Reading to NLM Understanding: Evaluating the Role of Eye-Tracking Data in Encoder-Based Models” (Luca Dini, Lucia Domenichelli, Dominique Brunato, Felice Dell’Orletta)
This work, accepted at the main conference, explores how integrating human eye-tracking data into encoder-based language models affects task performance, attention mechanisms, and representation space, examining all three dimensions together for the first time.
Key insights:
- eye-tracking (ET) signals make model attention more aligned with human gaze, even after downstream fine-tuning
- they compress the model’s representation space, yielding more compact representations that remain strong on downstream tasks
- full fine-tuning strategies are the most robust across all dimensions, while partial tuning maximizes attention alignment
This study highlights the potential of cognitive signals to build more interpretable, efficient, and human-aligned AI systems.
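To make the attention–gaze alignment insight more concrete, here is a minimal sketch (not taken from the paper) of one common way such alignment can be measured: correlating per-token fixation durations with the attention mass each token receives. All numbers are invented placeholders.

```python
# Minimal sketch (assumptions, not the paper's actual evaluation): quantify how well
# a model's attention aligns with human gaze via the Spearman correlation between
# per-token attention mass and per-token fixation duration.
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-token total fixation durations (ms) from an eye-tracking corpus
fixation_durations = np.array([180.0, 95.0, 240.0, 60.0, 310.0, 120.0])

# Hypothetical attention received by each token, averaged over heads and layers
attention_mass = np.array([0.21, 0.10, 0.26, 0.05, 0.28, 0.10])

rho, p_value = spearmanr(fixation_durations, attention_mass)
print(f"attention-gaze alignment (Spearman rho): {rho:.3f} (p={p_value:.3f})")
```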
“Evaluating Lexical Proficiency in Neural Language Models” (Cristiano Ciaccio, Alessio Miaschi, Felice Dell’Orletta)
This study, accepted at the main conference, proposes a novel and unified framework to evaluate lexical proficiency in Transformer-based language models, testing their ability to generate, define, and use words across a range of lexical categories: commonly lexicalized words, recent neologisms, and nonce words, with an emphasis on the creative aspects of this last category.
Key contributions:
- proposal of a novel framework to assess lexical abilities across diverse tasks and word types.
- development of a new lexical resource for Italian, with curated definitions and usage examples.
- evaluation of how model size, multilinguality and linguistic features affect lexical generalization.
- a human evaluation based on the Optimal Innovation Hypothesis to assess the plausibility and creativity of generated nonce words.
The findings show that Transformer-based models can handle lexical composition and meaning inference to some extent, effectively producing and interpreting plausible lexical innovations, although with a substantial drop in performance compared to standard lexical items.
Code & dataset: https://github.com/snizio/Lexical-Proficiency
“Beyond the Spelling Miracle: Investigating Substring Awareness in Character-Blind Language Models” (Cristiano Ciaccio, Marta Sartor, Alessio Miaschi, Felice Dell’Orletta).
Language Models (LMs) typically operate on subword tokens and lack explicit access to characters. Despite this, they show a limited but surprising ability to recognize spelling-level patterns, a phenomenon known as the Spelling Miracle.
This work, accepted at the Findings of ACL, takes a systematic look at when, where, and how such character-level awareness emerges. The study proposes a controlled binary task, with no prompting and no probing, asking models whether a substring appears in a word (a minimal sketch of such examples follows the list below). Using the MorphoLex database, it evaluates models from the Pythia family across:
- substring position and length
- morphemic vs. non-morphemic substrings (prefixes, suffixes, roots)
- pre-training checkpoints, including from scratch
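As a rough illustration of the controlled binary task described above, here is a minimal sketch (assumptions only, not the paper's actual pipeline) of how substring-membership examples of this kind could be constructed. The word and substring lists are invented; the paper draws its material from MorphoLex.

```python
# Minimal sketch (hypothetical data): build binary "does this substring occur in
# this word?" examples, keeping track of the position where the substring appears.
import random

words = ["unhappiness", "teacher", "productivity", "tokenization"]
substrings = ["un", "ness", "ity", "zzq"]  # mix of morphemic and non-morphemic strings

random.seed(0)
examples = []
for word in words:
    for sub in substrings:
        examples.append({
            "word": word,
            "substring": sub,
            "label": int(sub in word),   # 1 if the substring occurs in the word, else 0
            "position": word.find(sub),  # -1 when the substring is absent
        })

random.shuffle(examples)
for ex in examples[:5]:
    print(ex)
```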
Key findings:
- larger models develop more robust substring awareness
- morphemes – especially suffixes and roots – are recognized better than meaningless substrings
- awareness emerges early for suffixes and roots, later for non-morphemic units, especially in the middle of words
- linguistic features like productivity, word frequency and tokenization shape this ability.
This work opens up new directions for analyzing character knowledge in LMs, providing both conceptual and empirical grounding for the Spelling Miracle.
“Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors” (Andrea Pedrotti, Michele Papucci, Cristiano Ciaccio, Alessio Miaschi, Giovanni Puccetti, Felice Dell’Orletta, Andrea Esuli)
As LLMs become increasingly capable of producing human-like text, the task of detecting Machine-Generated Text (MGT) is more important, and more difficult, than ever.
This work, accepted at the Findings of ACL, shows just how fragile current state-of-the-art MGT detectors really are. The authors fine-tune LLMs using Direct Preference Optimization (DPO) to subtly shift the style of synthetic text toward human-written text (HWT), creating adversarial examples that significantly reduce detection accuracy. The result? Detection performance drops dramatically, even with minor stylistic alignment.
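For readers unfamiliar with the setup, here is a minimal, hypothetical sketch of the kind of preference data that DPO fine-tuning consumes, with human-written text as the preferred ("chosen") continuation and machine-generated text as the dispreferred ("rejected") one. The function name and all strings are invented for illustration; the authors' released code documents the actual procedure.

```python
# Minimal sketch (assumptions, not the authors' released code): shape preference
# pairs so that human-written text (HWT) is "chosen" and the generator's own
# machine-generated text (MGT) is "rejected". A DPO trainer (e.g., TRL's DPOTrainer)
# would then fine-tune the generator on records in this prompt/chosen/rejected format.

def make_dpo_pairs(prompts, human_texts, machine_texts):
    """Build prompt/chosen/rejected records in the format DPO trainers expect."""
    pairs = []
    for prompt, hwt, mgt in zip(prompts, human_texts, machine_texts):
        pairs.append({
            "prompt": prompt,
            "chosen": hwt,    # preferred continuation: human-written style
            "rejected": mgt,  # dispreferred continuation: detectable machine style
        })
    return pairs

# Invented placeholder example
pairs = make_dpo_pairs(
    prompts=["Write a short news paragraph about a local festival."],
    human_texts=["Crowds filled the main square on Saturday as the festival opened..."],
    machine_texts=["The festival was a vibrant celebration that brought the community together..."],
)
print(pairs[0]["chosen"][:40])
```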
Other contributions:
- analysis of the linguistic “shortcuts” used by detectors (and how easily they can be bypassed)
- comparison of how humans vs. machines detect MGTs – and how little they overlap
- release of code, models, and data to help the community build more robust and realistic benchmarks.
This work underscores the urgent need to move beyond shallow heuristics in detection – and to build systems that generalize across domains and stylistic shifts.
The preprint of the paper is available at the following link: https://arxiv.org/pdf/2505.24523