Sample Metadata Record

oai:clarin.eurac.edu:20.500.12124/3


XML format

<olac:olac>
<dc:title>PAISÀ Corpus of Italian Web Text</dc:title>
<dc:creator>Lyding, Verena</dc:creator>
<dc:creator>Stemle, Egon</dc:creator>
<dc:creator>Borghetti, Claudia</dc:creator>
<dc:creator>Brunello, Marco</dc:creator>
<dc:creator>Castagnoli, Sara</dc:creator>
<dc:creator>Dell’Orletta, Felice</dc:creator>
<dc:creator>Dittmann, Henrik</dc:creator>
<dc:creator>Lenci, Alessandro</dc:creator>
<dc:creator>Pirrelli, Vito</dc:creator>
<dc:date xsi:type="dcterms:W3CDTF">2018-05-29T11:06:34Z</dc:date>
<dcterms:available>2018-05-29T11:06:34Z</dcterms:available>
<dc:description>The Paisà corpus is a large collection of Italian web texts, licensed under Creative Commons (Attribution-ShareAlike and Attribution-Noncommercial-ShareAlike). It has been created in the context of the project PAISÀ. All documents were selected in two different ways. A part of the corpus was constructed using a method inspired by the WaCky project. We created 50,000 word pairs by randomly combining terms from an Italian basic vocabulary list, and used the pairs as queries to the Yahoo! search engine in order to retrieve candidate pages. We limited hits to pages in Italian with a Creative Commons license of type: CC-Attribution, CC-Attribution-Sharealike, CC-Attribution-Sharealike-Non-commercial, and CC-Attribution-Non-commercial. Pages that were wrongly tagged as CC-licensed were eliminated using a black-list that was populated by manual inspection of earlier versions of the corpus. The retrieved pages were automatically cleaned using the KrdWrd system. The remaining pages in the PAISÀ corpus come from the Italian versions of various Wikimedia Foundation projects, namely: Wikipedia, Wikinews, Wikisource, Wikibooks, Wikiversity, Wikivoyage. The official Wikimedia Foundation dumps were used, extracting text with Wikipedia Extractor. Once all materials were downloaded, the collection was filtered discarding empty documents or documents containing less than 150 words. The corpus contains approximately 380,000 documents coming from about 1,000 different websites, for a total of about 250 million words. Approximately 260,000 documents are from Wikipedia, approx. 5,600 from other Wikimedia Foundation projects. About 9,300 documents come from Indymedia, and we estimate that about 65,000 documents come from blog services.</dc:description>
<dc:identifier xsi:type="dcterms:URI">http://hdl.handle.net/20.500.12124/3</dc:identifier>
<dcterms:bibliographicCitation>http://hdl.handle.net/20.500.12124/3</dcterms:bibliographicCitation>
<dc:language xsi:type="olac:language" olac:code="ita"/>
<dc:publisher>Institute for Applied Linguistics, Eurac Research</dc:publisher>
<dc:rights>Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)</dc:rights>
<dc:rights>https://creativecommons.org/licenses/by-nc-sa/4.0/</dc:rights>
<dc:subject>web corpus</dc:subject>
<dc:subject>language learning</dc:subject>
<dc:type>corpus</dc:type>
<dc:type xsi:type="dcterms:DCMIType">Text</dc:type>
<dc:type xsi:type="olac:linguistic-type" olac:code="primary_text"/>
</olac:olac>
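
The record above is printed without namespace declarations, so it will not parse as a standalone document. As a minimal sketch, assuming the standard OLAC 1.1, Dublin Core, DCMI Terms, and XML Schema instance namespaces (the declarations in the harvested original are not shown here, so these bindings are an assumption), the root element could bind the prefixes like this:

```xml
<!-- Assumed namespace bindings; the harvested record may declare them differently -->
<olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           xmlns:dcterms="http://purl.org/dc/terms/"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <!-- One element from the record above, shown only to keep the sketch complete -->
  <dc:title>PAISÀ Corpus of Italian Web Text</dc:title>
</olac:olac>
```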

Display format

 Title  PAISÀ Corpus of Italian Web Text
 Creator  Lyding, Verena
 Creator  Stemle, Egon
 Creator  Borghetti, Claudia
 Creator  Brunello, Marco
 Creator  Castagnoli, Sara
 Creator  Dell’Orletta, Felice
 Creator  Dittmann, Henrik
 Creator  Lenci, Alessandro
 Creator  Pirrelli, Vito
 Date  (W3CDTF)  2018-05-29T11:06:34Z
 Available  2018-05-29T11:06:34Z
 Description  The Paisà corpus is a large collection of Italian web texts, licensed under Creative Commons (Attribution-ShareAlike and Attribution-Noncommercial-ShareAlike). It has been created in the context of the project PAISÀ. All documents were selected in two different ways. A part of the corpus was constructed using a method inspired by the WaCky project. We created 50,000 word pairs by randomly combining terms from an Italian basic vocabulary list, and used the pairs as queries to the Yahoo! search engine in order to retrieve candidate pages. We limited hits to pages in Italian with a Creative Commons license of type: CC-Attribution, CC-Attribution-Sharealike, CC-Attribution-Sharealike-Non-commercial, and CC-Attribution-Non-commercial. Pages that were wrongly tagged as CC-licensed were eliminated using a black-list that was populated by manual inspection of earlier versions of the corpus. The retrieved pages were automatically cleaned using the KrdWrd system. The remaining pages in the PAISÀ corpus come from the Italian versions of various Wikimedia Foundation projects, namely: Wikipedia, Wikinews, Wikisource, Wikibooks, Wikiversity, Wikivoyage. The official Wikimedia Foundation dumps were used, extracting text with Wikipedia Extractor. Once all materials were downloaded, the collection was filtered discarding empty documents or documents containing less than 150 words. The corpus contains approximately 380,000 documents coming from about 1,000 different websites, for a total of about 250 million words. Approximately 260,000 documents are from Wikipedia, approx. 5,600 from other Wikimedia Foundation projects. About 9,300 documents come from Indymedia, and we estimate that about 65,000 documents come from blog services.
 Identifier (URI)  http://hdl.handle.net/20.500.12124/3
 Bibliographic Citation  http://hdl.handle.net/20.500.12124/3
 Language (ISO639-3)  Italian [ita]
 Publisher  Institute for Applied Linguistics, Eurac Research
 Rights  Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
 Rights  https://creativecommons.org/licenses/by-nc-sa/4.0/
 Subject  web corpus
 Subject  language learning
 Type  corpus
 Type (DCMI)  Text
 Type (OLAC)  Linguistic type: Primary text

Metadata quality analysis

OLAC metadata records are scored for metadata quality on a 10-point scale, as explained in OLAC Metadata Metrics. Each of the ten components listed below contributes up to one point. The score for the above record, with comments on changes that could improve it, is as follows:

Component          +      -      Comments
Title              1      0
Date               1      0
Agent              1      0
About              1      0
Depth              1      0
Content Language   1      0
Subject Language   1      0
OLAC Type          1      0
DCMI Type          1      0
Precision          0.67   0.33   For the full score, use at least one more encoding scheme in addition to the ones counted explicitly in other components of the score. For instance:
  • use olac:role on dc:creator or dc:contributor
  • use dcterms:URI when the value of an element is a URL
  • use dcterms:IMT on dc:format
Quality score      9.67
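
The Precision comment can be illustrated with a short fragment. This is only a sketch: the role code and the media type below are placeholder values, not facts about this record, and which encoding schemes the scorer counts is defined in OLAC Metadata Metrics, not here. It assumes the same namespace prefixes as the record above.

```xml
<!-- Hypothetical values for illustration only -->

<!-- olac:role on dc:creator; "author" is an assumed role for this person -->
<dc:creator xsi:type="olac:role" olac:code="author">Lyding, Verena</dc:creator>

<!-- dcterms:IMT on dc:format; "text/plain" is an assumed media type -->
<dc:format xsi:type="dcterms:IMT">text/plain</dc:format>
```

Either element would introduce an encoding scheme that the record does not currently use, which is what the comment asks for.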