OLAC Record oai:dspace-clarin-it.ilc.cnr.it:20.500.11752/ILC-984 |
Metadata | ||
Title: | TrAVaSI_GDLI-quotation corpus | |
Bibliographic Citation: | http://hdl.handle.net/20.500.11752/ILC-984 | |
Creator: | Favaro, Manuel | |
Guadagnini, Elisa | ||
Sassolini, Eva | ||
Biffi, Marco | ||
Montemagni, Simonetta | ||
Date (W3CDTF): | 2023-01-09T08:40:48Z | |
Date Available: | 2023-01-09T08:40:48Z | |
Description: | The TrAVaSI_GDLI-quotation corpus (TrAVaSI_GDLI-QC) is a first nucleus of a diachronic corpus for Italian. It collects a sample of the quotations from a historical dictionary, the "Grande Dizionario della Lingua Italiana" (GDLI) by Salvatore Battaglia, which includes a huge collection of quotations covering the entire history of the Italian language, from the Middle Ages to the present day. Several criteria guided the composition of the corpus. Among the most cited authors, those who guaranteed coverage of the widest chronological span were selected. Representativeness of different text types (e.g. chronicle, literary prose, poetry, treatises) was also taken into account. The resulting TrAVaSI_GDLI-QC consists of two balanced sub-corpora, with quotations from works written between the 14th and 20th centuries: one collecting 1,500 prose quotes from 15 authors (100 each) for a total of about 35,000 tokens, and the other gathering 500 poetry quotes from 10 authors (50 each) for a total of about 10,000 tokens. TrAVaSI_GDLI-QC is morpho-syntactically annotated and lemmatized. The annotation, which conforms to the Universal Dependencies standard (UD, De Marneffe et al. 2021), was carried out semi-automatically. First, both sub-corpora were automatically annotated with the Stanza “combined” model for Italian. The automatic annotation was then manually revised. The resulting corpus has also been used to retrain Stanza to handle historical varieties of Italian; the results achieved are encouraging. | |
Identifier (URI): | http://hdl.handle.net/20.500.11752/ILC-984 | |
Language: | Italian | |
Language (ISO639): | ita | |
Publisher: | Istituto di Linguistica Computazionale “A. Zampolli” - Consiglio Nazionale delle Ricerche (ILC-CNR) | |
Accademia della Crusca | ||
Rights: | Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) | |
http://creativecommons.org/licenses/by-nc-nd/4.0/ | ||
Subject: | historical annotated corpora | |
linguistic annotation | ||
Universal Dependencies | ||
Type: | corpus | |
Type (DCMI): | Text | |
Type (OLAC): | primary_text | |
OLAC Info | ||
Archive: | ILC-CNR for CLARIN-IT repository hosted at Institute for Computational Linguistics "A. Zampolli", National Research Council, in Pisa | |
Description: | http://www.language-archives.org/archive/dspace-clarin-it.ilc.cnr.it | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info | ||
OaiIdentifier: | oai:dspace-clarin-it.ilc.cnr.it:20.500.11752/ILC-984 | |
DateStamp: | 2023-01-09 | |
GetRecord: | OAI-PMH request for simple DC format | |
Search Info | ||
Citation: | Favaro, Manuel; Guadagnini, Elisa; Sassolini, Eva; Biffi, Marco; Montemagni, Simonetta. 2023. TrAVaSI_GDLI-quotation corpus. Istituto di Linguistica Computazionale “A. Zampolli” - Consiglio Nazionale delle Ricerche (ILC-CNR). | |
Terms: | area_Europe country_IT dcmi_Text iso639_ita olac_primary_text |