OLAC Record
oai:lindat.mff.cuni.cz:11234/1-4635

Metadata
Title:SYN v9: large corpus of written Czech
Bibliographic Citation:http://hdl.handle.net/11234/1-4635
Creator:Křen, Michal
Cvrček, Václav
Henyš, Jan
Hnátková, Milena
Jelínek, Tomáš
Kocek, Jan
Kováříková, Dominika
Křivan, Jan
Milička, Jiří
Petkevič, Vladimír
Procházka, Pavel
Skoumalová, Hana
Šindlerová, Jana
Škrabal, Michal
Date (W3CDTF):2022-01-11T16:52:48Z
Date Available:2022-01-11T16:52:48Z
Description:Corpus of contemporary written (printed) Czech sized 4.7 GW (gigawords), i.e. 5.7 billion tokens. It mostly covers the period 1990-2019 and features rich metadata, including detailed bibliographical information and text-type classification. SYN v9 contains a wide variety of text types (fiction, non-fiction, newspapers), though newspapers clearly predominate. The corpus is lemmatized and morphologically tagged using the new CNC tagset, first used for the annotation of the SYN2020 corpus. SYN v9 is provided in a CoNLL-U-like vertical format used as input to the Manatee query engine. The data thus correspond to the corpus available to registered CNC users via the KonText query interface at http://www.korpus.cz, with one important exception: the corpus is shuffled, i.e. divided into blocks of at most 100 words (respecting sentence boundaries), with the order of the blocks randomized within each document.
Identifier (URI):http://hdl.handle.net/11234/1-4635
Language:Czech
Language (ISO639):ces
Publisher:Charles University, Faculty of Arts, Institute of the Czech National Corpus
Replaces (URI):http://hdl.handle.net/11234/1-1846
Rights:Czech National Corpus (Shuffled Corpus Data)
https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc
Subject:corpus
written language
Type:corpus
Type (DCMI):Text
Type (OLAC):primary_text
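
The shuffling mentioned in the Description field can be illustrated with a minimal Python sketch. It is not the CNC's actual procedure, only one plausible reading of "blocks of at most 100 words, respecting sentence boundaries, randomized within a document". The function names (make_blocks, shuffle_document) are hypothetical, every token is counted as a word, and a sentence longer than the limit is assumed to form a block of its own.

```python
import random
from typing import List, Optional

Sentence = List[str]          # one sentence = list of tokens
Block = List[Sentence]        # one block = consecutive sentences


def make_blocks(sentences: List[Sentence], max_words: int = 100) -> List[Block]:
    """Group consecutive sentences into blocks of at most max_words tokens.

    Sentences are never split. Assumption: a single sentence longer than
    max_words becomes a block of its own.
    """
    blocks: List[Block] = []
    current: Block = []
    current_len = 0
    for sent in sentences:
        # Close the current block if adding this sentence would exceed the limit.
        if current and current_len + len(sent) > max_words:
            blocks.append(current)
            current, current_len = [], 0
        current.append(sent)
        current_len += len(sent)
    if current:
        blocks.append(current)
    return blocks


def shuffle_document(sentences: List[Sentence], max_words: int = 100,
                     seed: Optional[int] = None) -> List[Block]:
    """Return the document's blocks in random order (hypothetical helper)."""
    blocks = make_blocks(sentences, max_words)
    random.Random(seed).shuffle(blocks)  # reorder blocks only, within one document
    return blocks
```

Because only whole blocks are reordered, sentence-internal context is preserved while the original running text of a document cannot be read in sequence.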

OLAC Info

Archive:  LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Description:  http://www.language-archives.org/archive/lindat.mff.cuni.cz
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:lindat.mff.cuni.cz:11234/1-4635
DateStamp:  2022-01-11
GetRecord:  OAI-PMH request for simple DC format
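
The GetRecord entries above refer to OAI-PMH requests. As a hedged sketch, such a request URL could be assembled as follows in Python. The verb, identifier and metadataPrefix parameters come from the OAI-PMH protocol, and oai_dc is its standard prefix for simple Dublin Core. The repository's endpoint is not given in this record, so the base URL below is a placeholder, and the helper name get_record_url and the "olac" prefix are assumptions.

```python
from urllib.parse import urlencode

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:lindat.mff.cuni.cz:11234/1-4635"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL (hypothetical helper).

    base_url must be the repository's OAI-PMH endpoint, which this record
    does not state; callers have to supply it.
    """
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Placeholder endpoint, NOT the real LINDAT address.
PLACEHOLDER_ENDPOINT = "https://example.org/oai/request"

# Simple DC format; "oai_dc" is the prefix every OAI-PMH repository must support.
dc_url = get_record_url(PLACEHOLDER_ENDPOINT, OAI_IDENTIFIER, "oai_dc")

# OLAC format; the prefix name "olac" is assumed here.
olac_url = get_record_url(PLACEHOLDER_ENDPOINT, OAI_IDENTIFIER, "olac")
```

The function only constructs the URL; fetching and parsing the XML response is left out because the endpoint is unknown.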

Search Info

Citation: Křen, Michal; Cvrček, Václav; Henyš, Jan; Hnátková, Milena; Jelínek, Tomáš; Kocek, Jan; Kováříková, Dominika; Křivan, Jan; Milička, Jiří; Petkevič, Vladimír; Procházka, Pavel; Skoumalová, Hana; Šindlerová, Jana; Škrabal, Michal. 2022. SYN v9: large corpus of written Czech. Charles University, Faculty of Arts, Institute of the Czech National Corpus.
Terms: area_Europe country_CZ dcmi_Text iso639_ces olac_primary_text


http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11234/1-4635
Up-to-date as of: Thu Oct 5 0:43:09 EDT 2023