OLAC Record: PAISÀ Corpus of Italian Web Text

OLAC Record
oai:clarin.eurac.edu:20.500.12124/3

Metadata

Title: PAISÀ Corpus of Italian Web Text

Bibliographic Citation: http://hdl.handle.net/20.500.12124/3

Creator: Lyding, Verena

Stemle, Egon

Borghetti, Claudia

Brunello, Marco

Castagnoli, Sara

Dell’Orletta, Felice

Dittmann, Henrik

Lenci, Alessandro

Pirrelli, Vito

Date (W3CDTF): 2018-05-29T11:06:34Z

Date Available: 2018-05-29T11:06:34Z

Description: The PAISÀ corpus is a large collection of Italian web texts, licensed under Creative Commons (Attribution-ShareAlike and Attribution-Noncommercial-ShareAlike). It was created in the context of the PAISÀ project. The documents were selected in two different ways. One part of the corpus was constructed using a method inspired by the WaCky project: we created 50,000 word pairs by randomly combining terms from an Italian basic vocabulary list, and used the pairs as queries to the Yahoo! search engine to retrieve candidate pages. We limited hits to pages in Italian with a Creative Commons license of one of the following types: CC-Attribution, CC-Attribution-Sharealike, CC-Attribution-Sharealike-Non-commercial, and CC-Attribution-Non-commercial. Pages that were wrongly tagged as CC-licensed were eliminated using a blacklist populated by manual inspection of earlier versions of the corpus. The retrieved pages were automatically cleaned using the KrdWrd system. The remaining pages in the PAISÀ corpus come from the Italian versions of various Wikimedia Foundation projects, namely Wikipedia, Wikinews, Wikisource, Wikibooks, Wikiversity, and Wikivoyage. The official Wikimedia Foundation dumps were used, and the text was extracted with Wikipedia Extractor. Once all materials were downloaded, the collection was filtered by discarding empty documents and documents containing fewer than 150 words. The corpus contains approximately 380,000 documents from about 1,000 different websites, for a total of about 250 million words. Approximately 260,000 documents are from Wikipedia and approximately 5,600 from other Wikimedia Foundation projects. About 9,300 documents come from Indymedia, and we estimate that about 65,000 documents come from blog services.
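The two mechanical steps in the description, building random word-pair queries and discarding short documents, can be illustrated with a minimal Python sketch. This is not the project's actual code: the function names, the seeding, the choice to keep pairs distinct, and whitespace-based word counting are all assumptions made for illustration. Only the pair count (50,000) and the 150-word threshold come from the description.

```python
import random


def make_word_pairs(vocabulary, n_pairs, seed=0):
    """Return n_pairs distinct two-word queries drawn at random.

    Assumption: pairs are unordered and must not repeat; the record
    does not say how duplicates were handled.
    """
    vocab = sorted(set(vocabulary))
    max_pairs = len(vocab) * (len(vocab) - 1) // 2
    if n_pairs > max_pairs:
        raise ValueError("vocabulary too small for the requested number of pairs")
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < n_pairs:
        first, second = rng.sample(vocab, 2)  # two different terms
        pairs.add(tuple(sorted((first, second))))
    return [f"{first} {second}" for first, second in sorted(pairs)]


def keep_document(text, min_words=150):
    """Keep a document only if it has at least min_words words.

    Assumption: words are counted by splitting on whitespace. Empty
    documents have zero words and are therefore discarded as well.
    """
    return len(text.split()) >= min_words
```

With a real basic vocabulary list, `make_word_pairs(vocabulary, 50_000)` would produce the query set, and `keep_document` would be applied to each cleaned page before it enters the collection.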

Identifier (URI): http://hdl.handle.net/20.500.12124/3

Language: Italian

Language (ISO639): ita

Publisher: Institute for Applied Linguistics, Eurac Research

Rights: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

https://creativecommons.org/licenses/by-nc-sa/4.0/

Subject: web corpus

language learning

Type: corpus

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: Eurac Research CLARIN Centre

Description: http://www.language-archives.org/archive/clarin.eurac.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:clarin.eurac.edu:20.500.12124/3

DateStamp: 2023-03-17

GetRecord: OAI-PMH request for simple DC format
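
The GetRecord entries above correspond to OAI-PMH requests whose link targets are not reproduced in this record. As a hedged sketch, such a request URL can be assembled from the OaiIdentifier as shown below. The base URL is a placeholder, not the archive's real endpoint, and the function name is hypothetical. `oai_dc` is the metadata prefix OAI-PMH defines for simple Dublin Core; `olac` is assumed here to be the prefix for the OLAC format.

```python
from urllib.parse import urlencode


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL.

    base_url is the repository's OAI-PMH endpoint, which this record
    does not state; callers must supply it themselves.
    """
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Placeholder endpoint for illustration only (assumption).
EXAMPLE_BASE_URL = "https://example.org/oai"
RECORD_ID = "oai:clarin.eurac.edu:20.500.12124/3"

olac_url = get_record_url(EXAMPLE_BASE_URL, RECORD_ID, "olac")
dc_url = get_record_url(EXAMPLE_BASE_URL, RECORD_ID, "oai_dc")
```

`urlencode` percent-encodes the colons and the slash in the identifier, which keeps the identifier intact as a single query parameter.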

Search Info
Citation: Lyding, Verena; Stemle, Egon; Borghetti, Claudia; Brunello, Marco; Castagnoli, Sara; Dell’Orletta, Felice; Dittmann, Henrik; Lenci, Alessandro; Pirrelli, Vito. 2018. Institute for Applied Linguistics, Eurac Research.
Terms: area_Europe country_IT dcmi_Text iso639_ita olac_primary_text

http://www.language-archives.org/item.php/oai:clarin.eurac.edu:20.500.12124/3
Up-to-date as of: Fri Oct 17 1:18:43 EDT 2025
