OLAC Record
oai:lindat.mff.cuni.cz:11372/LRT-4807

Metadata
Title:esCorpius: A Massive Spanish Crawling Corpus
Bibliographic Citation:http://hdl.handle.net/11372/LRT-4807
Creator:Asier, Gutiérrez-Fandiño
David, Pérez-Fernández
Jordi, Armengol-Estapé
David, Griol
Zoraida, Callejas
Date (W3CDTF):2022-08-03T13:32:00Z
Date Available:2022-08-03T13:32:00Z
Description:In recent years, Transformer-based models have led to significant advances in language modelling for natural language processing. However, they require a vast amount of data to be (pre-)trained, and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the results for Spanish have important shortcomings: they are either too small in comparison with other languages or of low quality owing to sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from nearly 1 Pb of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel, highly parallel cleaning pipeline and encompasses a series of deduplication mechanisms that together ensure the integrity of both document and paragraph boundaries. Additionally, we maintain both the source web page URL and the WARC shard origin URL in order to comply with EU regulations. esCorpius has been released under the CC BY-NC-ND 4.0 license.
Identifier (URI):http://hdl.handle.net/11234/1-4807
http://hdl.handle.net/11372/LRT-4807
Language:Spanish
Language (ISO639):spa
Publisher:LHF Labs
Rights:Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
http://creativecommons.org/licenses/by-nc-nd/4.0/
Subject:spanish crawling corpus
crawling corpus
spanish corpus
massive corpus
large corpus
clean
deduplicated
Type:corpus
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Description:  http://www.language-archives.org/archive/lindat.mff.cuni.cz
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:lindat.mff.cuni.cz:11372/LRT-4807
DateStamp:  2022-08-03
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Asier, Gutiérrez-Fandiño; David, Pérez-Fernández; Jordi, Armengol-Estapé; David, Griol; Zoraida, Callejas. 2022. LHF Labs.
Terms: area_Europe country_ES dcmi_Text iso639_spa olac_primary_text


http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11372/LRT-4807
Up-to-date as of: Thu Oct 5 0:43:20 EDT 2023