Resources

Corpora
   Annotated Corpora

ISST-TANL Corpus
A manually annotated corpus, encoded in the CoNLL standard format, with PoS tagging and syntactic dependency annotation. Jointly developed by ILC-CNR and the University of Pisa, it exemplifies general language usage and consists of articles from newspapers and periodicals, selected to cover a wide variety of topics. The corpus was used for training and testing in the shared task "Domain Adaptation for Dependency Parsing" of EVALITA 2011.
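
As an illustration of how a corpus in this kind of format can be read, here is a minimal Python sketch. It assumes the ten-column, tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), with blank lines separating sentences; the exact column layout of the distributed files should be checked against the corpus documentation. The sample sentence, its tags and the names Token and read_conll are made up for the example and are not taken from the corpus.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Token:
    """One token line; only the first eight CoNLL-X columns are kept."""
    id: int
    form: str
    lemma: str
    cpos: str
    pos: str
    feats: str
    head: int      # index of the syntactic head, 0 for the root
    deprel: str    # dependency relation to the head


def read_conll(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of tokens; a blank line ends a sentence."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                              cols[4], cols[5], int(cols[6]), cols[7]))
    if sentence:  # last sentence when the input has no trailing blank line
        yield sentence


# Illustrative sentence only (hypothetical tags, not copied from the corpus).
SAMPLE = (
    "1\tIl\til\tR\tRD\t_\t2\tdet\t_\t_\n"
    "2\tgatto\tgatto\tS\tS\t_\t3\tsubj\t_\t_\n"
    "3\tdorme\tdormire\tV\tV\t_\t0\tROOT\t_\t_\n"
    "\n"
)

sentences = list(read_conll(SAMPLE.splitlines()))
for tok in sentences[0]:
    print(tok.id, tok.form, tok.head, tok.deprel)
```

Each token keeps the index of its head, so the dependency tree of a sentence can be rebuilt by following the head field from any token up to the root.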

   Non-annotated Corpora

CLIC

Lexica

PAROLE-SIMPLE-CLIPS
A four-level, general-purpose lexicon developed over three different projects. The kernel of the morphological and syntactic lexicons was built in the framework of the European project "Preparatory Action for Linguistic Resources Organisation for Language Engineering" (LE-PAROLE). The linguistic model and the core of the semantic lexicon were developed within the European project "Semantic Information for Multifunctional Plurilingual Lexica" (LE-SIMPLE). The phonological level of description and the extension of the lexical coverage were produced in the context of the Italian project "Corpora e Lessici dell'Italiano Parlato e Scritto" (CLIPS). The lexicon comprises a total of 387,267 phonetic units, 53,044 morphological units (53,044 lemmas), 37,406 syntactic units (28,111 lemmas) and 28,346 semantic units (19,216 lemmas). It was encoded at the semantic level in full accordance with the international standards set out in the PAROLE-SIMPLE model, which is based on EAGLES. Syntactic and semantic encoding was carried out jointly with Thamus (Consortium for Multilingual Documentary Engineering), which is responsible for 25,000 additional entries.

SIMPLE LOD
It is the RDF serialization of all nouns extracted from the PAROLE-SIMPLE-CLIPS lexicon. Lexical entries are serialized in Lemon, while semantic relations are modeled according to the SIMPLE OWL.

ItalWordNet LOD
 - datahub: http://datahub.io/dataset/iwn
 - ilc: http://www.languagelibrary.eu/owl/italWordNet15/schema/synset

FrameNet

GeodomainWordNet
 - datahub: http://datahub.io/dataset/geodomainwn
 - ilc for English: http://www.languagelibrary.eu/owl/geodomainWN/eng/geonames-synset
 - ilc for Italian: http://www.languagelibrary.eu/owl/geodomainWN/ita/geonames-synset
GeoNames ontology concepts, with their labels and glosses in English and in Italian, have been transformed into a WordNet-like resource and linked to the generic WordNets of both languages. The resource is published in RDF according to the W3C and Lemon schemas.

AncientGreekWordNet LOD
Linked Open Data related to the section "AncientGreekWordNet" of CoPhiWordNet.

Sentiment Lexicon LOD
https://github.com/opener-project/public-sentiment-lexicons/tree/master/propagation_lexicons/it (in LMF format)
The Italian Sentiment Lexicon was developed in a semi-automated way from ItalWordNet starting from a list of 1,000 manually checked seeds. It contains 24,293 lexical entries annotated with positive/negative/neutral polarity.
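
As a rough illustration of how a lexicon of this kind could be inspected, the following Python sketch counts entries per polarity in an LMF-style XML document. The element and attribute names used here (LexicalEntry, Lemma writtenForm, Sense, Sentiment polarity) are assumptions modeled on typical LMF serializations and have not been checked against the files in the repository; the three sample entries and the function name polarity_counts are likewise invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical LMF-style fragment; element and attribute names are assumed.
SAMPLE_LMF = """<LexicalResource>
  <Lexicon>
    <LexicalEntry id="e1" partOfSpeech="adj">
      <Lemma writtenForm="bello"/>
      <Sense><Sentiment polarity="positive"/></Sense>
    </LexicalEntry>
    <LexicalEntry id="e2" partOfSpeech="adj">
      <Lemma writtenForm="brutto"/>
      <Sense><Sentiment polarity="negative"/></Sense>
    </LexicalEntry>
    <LexicalEntry id="e3" partOfSpeech="noun">
      <Lemma writtenForm="tavolo"/>
      <Sense><Sentiment polarity="neutral"/></Sense>
    </LexicalEntry>
  </Lexicon>
</LexicalResource>"""


def polarity_counts(xml_text: str) -> Counter:
    """Count lexical entries by the polarity of their first Sentiment element."""
    root = ET.fromstring(xml_text)
    counts: Counter = Counter()
    for entry in root.iter("LexicalEntry"):
        sentiment = entry.find(".//Sentiment")
        if sentiment is not None:
            counts[sentiment.get("polarity", "unknown")] += 1
    return counts


print(polarity_counts(SAMPLE_LMF))
```

If the actual files use different element names or XML namespaces, the two lookups in polarity_counts are the only places that would need to change.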

Domain Terminologies

FiscalDB

SindacDB

MARITERM

Biolessico

Ontologies

IMAG-Act
A cross-linguistic ontology of action. Using spoken corpora, 1,010 high-frequency action concepts have been identified and visually represented with prototypical scenes. The ontology allows cross-linguistic correspondences to be defined between verbs and actions in English, Italian, Chinese and Spanish. Because the action concepts are represented visually, IMAG-Act can potentially be extended to any language.


[work in progress]