Annotated corpora

Unannotated corpora

CLIC

Lexica

PAROLE-SIMPLE-CLIPS

It is a four-level general-purpose lexicon developed over three projects. The core of the morphological and syntactic lexicon was built within the European project “Preparatory Action for the Organisation of Language Resources for Language Engineering” (LE-PAROLE). The language model and the core of the semantic lexicon were developed within the European project “Semantic Information for Multifunctional Multilingual Lexicons” (LE-SIMPLE). The phonological level of description and the extension of the lexical coverage were carried out within the Italian project “Corpora e Lessici dell’Italiano Parlato e Scritto” (CLIPS). The lexicon comprises 387,267 phonetic units, 53,044 morphological units (53,044 lemmas), 37,406 syntactic units (28,111 lemmas) and 28,346 semantic units (19,216 lemmas). Its semantic encoding fully complies with the international standards set out in the PAROLE-SIMPLE model and based on EAGLES. The syntactic and semantic encoding was carried out in collaboration with Thamus (Consortium for Multilingual Documentary Engineering), which is responsible for 25,000 additional entries.

SIMPLE LOD

It is the RDF serialisation of all nouns extracted from the PAROLE-SIMPLE-CLIPS lexicon. Lexical entries are serialised in Lemon, while semantic relations are modelled according to SIMPLE’s OWL.

ItalWordNet LOD

datahub; ilc

Italian Word Embeddings

Two sets of word embeddings, each trained on a different corpus: itWaC and Twitter.
Learn more: Italian Word Embeddings.

FrameNet

GeoDomainWordNet

datahub; ILC for English; ILC for Italian

The concepts of the GeoNames ontology, with their labels and glosses in English and Italian, have been transformed into a WordNet-like resource and linked to the generic WordNets of both languages. This resource is published in RDF in accordance with W3C standards and the Lemon schema.

AncientGreekWordNet LOD

Linked open data related to the ‘AncientGreekWordNet’ section of CoPhiWordNet.

Sentiment Lexicon LOD

The Italian Sentiment Lexicon (in LMF format) was developed semi-automatically from ItalWordNet, starting from a manually checked list of 1,000 keywords. It contains 24,293 lexical entries annotated with positive, negative or neutral polarity.

Twitter for Sentiment Analysis

The “Twitter for Sentiment Analysis” corpus is a collection of tweets containing text and images, collected from July to December 2016. Each tweet has been labeled according to the sentiment polarity of its text. The tweets with the most confident textual sentiment predictions were selected to build the Twitter for Sentiment Analysis (T4SA) dataset.
Learn more: Twitter for Sentiment Analysis

Domain Terminologies

IMAG-Act

It is an interlingual action ontology. Using speech corpora, 1,010 high-frequency action concepts were identified and visually represented with prototypical scenes. The ontology allows the definition of interlingual correspondences between verbs and actions in English, Italian, Chinese and Spanish. Thanks to the visual representation of the identified action concepts, IMAG-Act can potentially be extended to any language.

FiscalDB

SindacDB

Mariterm

Biolessico

Ontologies

Other resources

The ILC4CLARIN repository hosts a constantly updated collection of language resources developed by the Cnr-Istituto di Linguistica Computazionale “Antonio Zampolli”. These resources are deposited and made available in accordance with the FAIR (Findable, Accessible, Interoperable, Reusable) principles.
