Annotated corpora

ISST-TANL Corpus

It is a manually annotated corpus, encoded in the standard CoNLL format, with PoS tagging and syntactic dependency annotation. Jointly developed by CNR-ILC and the University of Pisa, it represents general language use and consists of articles drawn from newspapers and periodicals, selected to cover a wide variety of topics. The corpus was used for training and testing in the EVALITA 2011 shared task “Domain Adaptation for Dependency Parsing”.
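As an illustration of what working with CoNLL-encoded data can look like, the following is a minimal sketch of a reader for tab-separated dependency annotation. It assumes the ten-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL); the exact layout of the distributed files should be checked against the corpus documentation. The sample sentence, its tags and the names `Token` and `parse_conll` are hypothetical and are not taken from the corpus.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One token line, keeping the first eight CoNLL-X columns."""
    id: int
    form: str
    lemma: str
    cpos: str
    pos: str
    feats: str
    head: int      # 0 marks the root of the sentence
    deprel: str


def parse_conll(text: str) -> List[List[Token]]:
    """Split CoNLL-X text into sentences (blank-line separated) of Tokens."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        current.append(
            Token(int(cols[0]), cols[1], cols[2], cols[3],
                  cols[4], cols[5], int(cols[6]), cols[7])
        )
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample; tags and relation labels are illustrative only.
SAMPLE = (
    "1\tIl\til\tR\tRD\tnum=s|gen=m\t2\tdet\t_\t_\n"
    "2\tgatto\tgatto\tS\tS\tnum=s|gen=m\t3\tsubj\t_\t_\n"
    "3\tdorme\tdormire\tV\tV\tnum=s|per=3\t0\tROOT\t_\t_\n"
    "4\t.\t.\tF\tFS\t_\t3\tpunc\t_\t_\n"
)

if __name__ == "__main__":
    for sentence in parse_conll(SAMPLE):
        for tok in sentence:
            # Print each dependent together with the form of its head.
            head = "ROOT" if tok.head == 0 else sentence[tok.head - 1].form
            print(f"{tok.form} --{tok.deprel}--> {head}")
```

The sketch keeps only the columns needed to walk the dependency tree. It does not handle comment lines or token ranges, which appear in later variants such as CoNLL-U.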

Unannotated corpora

CLIC

Lexica

PAROLE-SIMPLE-CLIPS

It is a four-level, general-purpose lexicon developed over three different projects. The core of the morphological and syntactic lexicon was built within the European project “Preparatory Action for the Organisation of Language Resources for Language Engineering” (LE-PAROLE). The linguistic model and the core of the semantic lexicon were developed within the European project “Semantic Information for Multifunctional Multilingual Lexicons” (LE-SIMPLE). The phonological level of description and the extension of the lexical coverage were produced within the Italian project “Corpora e Lessici dell’Italiano Parlato e Scritto” (CLIPS). The lexicon comprises a total of 387,267 phonetic units, 53,044 morphological units (53,044 lemmas), 37,406 syntactic units (28,111 lemmas) and 28,346 semantic units (19,216 lemmas). It has been semantically encoded in full compliance with the international standards set out in the PAROLE-SIMPLE model and based on EAGLES. The syntactic and semantic encoding was carried out in collaboration with Thamus (Consortium for Multilingual Documentary Engineering), which is responsible for 25,000 additional entries.

SIMPLE LOD

It is the RDF serialisation of all the nouns extracted from the PAROLE-SIMPLE-CLIPS lexicon. Lexical entries are serialised in Lemon, while semantic relations are modelled according to the SIMPLE OWL ontology.

ItalWordNet LOD

datahub; ILC

FrameNet

GeoDomainWordNet

datahub; ILC for English; ILC for Italian

The concepts of the GeoNames ontology, with their labels and glosses in English and Italian, have been transformed into a WordNet-like resource and linked to the generic WordNets of both languages. The resource is published in RDF in accordance with W3C standards and the Lemon schema.

AncientGreekWordNet LOD

Linked open data related to the ‘AncientGreekWordNet’ section of CoPhiWordNet.

Sentiment Lexicon LOD

The Italian Sentiment Lexicon (in LMF format) was developed semi-automatically from ItalWordNet, starting from a manually checked list of 1,000 keywords. It contains 24,293 lexical entries annotated with positive, negative or neutral polarity.

Domain Terminologies

IMAG-Act

It is an interlingual action ontology. Working from speech corpora, 1,010 high-frequency action concepts were identified and represented visually by prototypical scenes. The ontology makes it possible to define interlingual correspondences between verbs and actions in English, Italian, Chinese and Spanish. Because the action concepts are represented visually, IMAG-Act can potentially be extended to any language.

FiscalDB

SindacDB

Mariterm

Biolessico

Ontologies

Other resources

The ILC4CLARIN repository hosts a constantly updated collection of language resources developed by the Cnr-Istituto di Linguistica Computazionale “Antonio Zampolli”. These resources are deposited and made available in accordance with the FAIR (Findable, Accessible, Interoperable, Reusable) principles.
