Linguistic preprocessing
Texts collected within the project will be linguistically annotated by automatic means. Existing linguistic annotation tools will be adapted to handle the language varieties involved (students’ language, educational materials).
Feature extraction
Texts enriched with multi-level linguistic annotation will be further inspected to extract and weigh the features of interest. We will start from a battery of tools developed by project partners and already tested in different projects and application scenarios, namely:
- the Morpho Complexity Tool
- TSOM (Temporal Self-Organizing Map) [TBC]
- Profiling-UD
These tools will be specialized to meet the project requirements in terms of both the set of features to be tracked and the metrics used to weigh them.
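As an illustration of what extracting features from annotated text can look like, the following minimal sketch computes three simple features from a CoNLL-U fragment. It is not the implementation of any of the tools listed above: the sample sentences, the choice of features, and all function names (`parse_conllu`, `token_depth`, `extract_features`) are assumptions made for this example.

```python
# Hypothetical CoNLL-U fragment (two sentences). Columns:
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
SAMPLE = (
    "1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tstudent\tstudent\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "3\treads\tread\tVERB\t_\t_\t0\troot\t_\t_\n"
    "4\tbooks\tbook\tNOUN\t_\t_\t3\tobj\t_\t_\n"
    "\n"
    "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tsleeps\tsleep\tVERB\t_\t_\t0\troot\t_\t_\n"
)

# Assumption: these UPOS tags count as content words.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}


def parse_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        current.append({"id": int(cols[0]), "upos": cols[3], "head": int(cols[6])})
    if current:
        sentences.append(current)
    return sentences


def token_depth(token_id, heads):
    """Number of dependency arcs between a token and the root."""
    depth = 0
    # Bounded loop so a malformed (cyclic) tree cannot hang the sketch.
    while heads[token_id] != 0 and depth < len(heads):
        token_id = heads[token_id]
        depth += 1
    return depth


def extract_features(sentences):
    """Compute a few illustrative document-level features."""
    n_tokens = sum(len(s) for s in sentences)
    n_content = sum(1 for s in sentences for t in s if t["upos"] in CONTENT_POS)
    max_depths = []
    for s in sentences:
        heads = {t["id"]: t["head"] for t in s}
        max_depths.append(max(token_depth(i, heads) for i in heads))
    return {
        "tokens_per_sentence": n_tokens / len(sentences),
        "lexical_density": n_content / n_tokens,
        "avg_max_depth": sum(max_depths) / len(max_depths),
    }


features = extract_features(parse_conllu(SAMPLE))
print(features)
```

Features of this kind span several levels of annotation (raw text, part-of-speech, dependency syntax), which is why the extraction step depends on the multi-level annotation described above.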
Modelling LC/PD
The features identified during the preliminary data analysis, their interactions, and their relationships with Processing Difficulty (PD) for different user profiles will be further investigated by specializing READ-IT.
READ-IT follows a machine learning approach: based on a wide range of linguistic features automatically extracted from texts, it assigns a weight to each feature to produce an overall estimate of the difficulty of a text, as well as a fine-grained estimate of the level of difficulty within each linguistic sub-domain considered.
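The idea of combining weighted features into an overall score and into per-sub-domain scores can be sketched as follows. This is only one possible instance (a linear model squashed by a logistic function), not READ-IT's actual model: the feature names, the grouping into sub-domains, the weights, and the bias are all hypothetical values chosen for illustration, whereas in a machine learning setting the weights would be learned from labelled data.

```python
import math

# Hypothetical weights, grouped by linguistic sub-domain.
# In practice these would be learned from training data, not set by hand.
WEIGHTS = {
    "lexical": {"type_token_ratio": 0.8, "avg_word_length": 0.3},
    "syntactic": {"avg_sentence_length": 0.05, "avg_max_depth": 0.4},
}
BIAS = -3.0  # hypothetical intercept


def sigmoid(z):
    """Map a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def subdomain_scores(features):
    """Weighted sum of the features belonging to each sub-domain."""
    return {
        domain: sum(weight * features[name] for name, weight in weights.items())
        for domain, weights in WEIGHTS.items()
    }


def overall_difficulty(features):
    """Overall difficulty estimate in (0, 1), combining all sub-domains."""
    return sigmoid(BIAS + sum(subdomain_scores(features).values()))


# Two hypothetical feature vectors: a simpler and a more complex text.
easy_text = {
    "type_token_ratio": 0.4,
    "avg_word_length": 4.0,
    "avg_sentence_length": 10.0,
    "avg_max_depth": 2.0,
}
hard_text = {
    "type_token_ratio": 0.7,
    "avg_word_length": 5.5,
    "avg_sentence_length": 28.0,
    "avg_max_depth": 5.0,
}

for label, feats in (("easy", easy_text), ("hard", hard_text)):
    print(label, subdomain_scores(feats), round(overall_difficulty(feats), 3))
```

Keeping the sub-domain sums separate from the overall score makes it possible to report where the estimated difficulty comes from (for example, lexical versus syntactic), in addition to a single global value.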
To unravel the connection between PD and Linguistic Complexity (LC), multi-task learning approaches will be tested to explore how online cognitive metrics collected from the target populations, such as gaze behaviour, can help improve models for readability assessment.
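A minimal sketch of the multi-task idea, under strong simplifying assumptions: a single shared linear representation of the linguistic features feeds two task-specific outputs, one for a readability label and one for a gaze-based measure, and both are trained jointly by gradient descent on a weighted sum of their squared errors. The toy data, the model shape, and the hyperparameters (`lr`, `lam`, `steps`) are hypothetical and do not reflect the architecture the project will adopt.

```python
# Hypothetical toy data: two linguistic features per text, a readability
# label, and a normalised gaze measure (e.g. mean fixation duration).
X = [(0.2, 0.1), (0.5, 0.3), (0.8, 0.6), (1.0, 0.9)]
Y_READ = [0.2, 0.5, 0.8, 1.0]
Y_GAZE = [0.3, 0.5, 0.9, 1.1]


def joint_loss(w, a, b, lam):
    """Mean squared error on readability plus lam times the error on gaze."""
    total = 0.0
    for x, yr, yg in zip(X, Y_READ, Y_GAZE):
        h = w[0] * x[0] + w[1] * x[1]  # shared representation
        total += (a * h - yr) ** 2 + lam * (b * h - yg) ** 2
    return total / len(X)


def train(steps=500, lr=0.05, lam=0.5):
    """Full-batch gradient descent on the joint loss."""
    w, a, b = [0.1, 0.1], 0.1, 0.1  # shared weights and the two task heads
    n = len(X)
    history = [joint_loss(w, a, b, lam)]
    for _ in range(steps):
        gw, ga, gb = [0.0, 0.0], 0.0, 0.0
        for x, yr, yg in zip(X, Y_READ, Y_GAZE):
            h = w[0] * x[0] + w[1] * x[1]
            err_read, err_gaze = a * h - yr, b * h - yg
            # Each head receives gradient from its own task only.
            ga += 2 * err_read * h / n
            gb += 2 * lam * err_gaze * h / n
            # The shared weights receive gradient from both tasks.
            shared = 2 * (err_read * a + lam * err_gaze * b) / n
            gw[0] += shared * x[0]
            gw[1] += shared * x[1]
        w = [w[0] - lr * gw[0], w[1] - lr * gw[1]]
        a -= lr * ga
        b -= lr * gb
        history.append(joint_loss(w, a, b, lam))
    return (w, a, b), history


(w, a, b), history = train()
print("initial loss:", history[0], "final loss:", history[-1])
```

The point of the design is that the shared weights `w` are updated by both error signals, so the gaze data can shape the representation used for readability prediction; setting `lam` to zero reduces the sketch to a single-task readability model for comparison.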