OLAC Record
oai:lindat.mff.cuni.cz:11372/LRT-2209

Metadata
Title:C4Corpus (publicdomain part)
Bibliographic Citation:http://hdl.handle.net/11372/LRT-2209
Creator:Gurevych, Iryna
Habernal, Ivan
Zayed, Omnia
Date (W3CDTF):2017-06-07T13:10:23Z
Date Available:2017-06-07T13:10:23Z
Description:A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family. It was extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Identifier (URI):http://hdl.handle.net/11372/LRT-2209
Language:Afrikaans
Arabic
Bulgarian
Czech
Danish
German
Modern Greek (1453-)
English
Estonian
Persian
Finnish
French
Croatian
Hungarian
Indonesian
Italian
Japanese
Korean
Latvian
Lithuanian
Dutch
Norwegian
Polish
Portuguese
Russian
Slovenian
Somali
Spanish
Swahili (macrolanguage)
Swedish
Tagalog
Thai
Turkish
Ukrainian
Undetermined
Vietnamese
Language (ISO639):afr
ara
bul
ces
dan
deu
ell
eng
est
fas
fin
fra
hrv
hun
ind
ita
jpn
kor
lav
lit
nld
nor
pol
por
rus
slv
som
spa
swa
swe
tgl
tha
tur
ukr
und
vie
Publisher:Technische Universität Darmstadt
Rights:Public Domain Mark (PD)
http://creativecommons.org/publicdomain/mark/1.0/
Subject:CommonCrawl
Creative Commons
Web corpus
Amazon Web Services
Type:corpus
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Description:  http://www.language-archives.org/archive/lindat.mff.cuni.cz
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:lindat.mff.cuni.cz:11372/LRT-2209
DateStamp:  2021-06-29
GetRecord:  OAI-PMH request for simple DC format
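
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch of how such a request URL is typically assembled, the snippet below combines the protocol's `verb`, `identifier`, and `metadataPrefix` query parameters. The base URL is a hypothetical placeholder, since this record does not state the repository's OAI-PMH endpoint. The prefix `oai_dc` is the standard OAI-PMH name for simple Dublin Core; `olac` is assumed here as the prefix for the OLAC format and may differ on a given server.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the record does not give the repository's
# actual OAI-PMH base URL, so substitute the real one before use.
BASE_URL = "https://example.org/oai/request"

# Identifier copied from the OaiIdentifier field of this record.
IDENTIFIER = "oai:lindat.mff.cuni.cz:11372/LRT-2209"


def build_get_record_url(base_url, identifier, metadata_prefix):
    """Return an OAI-PMH GetRecord URL for one record in one format."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Simple Dublin Core (standard OAI-PMH prefix).
dc_url = build_get_record_url(BASE_URL, IDENTIFIER, "oai_dc")

# OLAC format (prefix name assumed, see the note above).
olac_url = build_get_record_url(BASE_URL, IDENTIFIER, "olac")

print(dc_url)
print(olac_url)
```

The function only builds the URL and performs no network access, so it can be checked without a live server. Fetching the result with any HTTP client should return an XML document if the endpoint and prefix are valid for the repository.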

Search Info

Citation: Gurevych, Iryna; Habernal, Ivan; Zayed, Omnia. 2017. Technische Universität Darmstadt.
Terms: area_Africa area_Asia area_Europe country_BG country_CZ country_DE country_DK country_ES country_FI country_FR country_GB country_GR country_HR country_HU country_ID country_IT country_JP country_KR country_LT country_NL country_NO country_PH country_PL country_PT country_RU country_SE country_SI country_SO country_TH country_TR country_UA country_VN country_ZA dcmi_Text iso639_afr iso639_ara iso639_bul iso639_ces iso639_dan iso639_deu iso639_ell iso639_eng iso639_est iso639_fas iso639_fin iso639_fra iso639_hrv iso639_hun iso639_ind iso639_ita iso639_jpn iso639_kor iso639_lav iso639_lit iso639_nld iso639_nor iso639_pol iso639_por iso639_rus iso639_slv iso639_som iso639_spa iso639_swa iso639_swe iso639_tgl iso639_tha iso639_tur iso639_ukr iso639_und iso639_vie olac_primary_text


http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11372/LRT-2209
Up-to-date as of: Thu Oct 5 0:40:45 EDT 2023