OLAC Record

Title:Prague Czech-English Dependency Treebank 2.0
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Hajič, Jan, et al. Prague Czech-English Dependency Treebank 2.0 LDC2012T08. Web Download. Philadelphia: Linguistic Data Consortium, 2012
Contributor:Hajič, Jan
Hajičová, Eva
Panevová, Jarmila
Sgall, Petr
Cinková, Silvie
Fučíková, Eva
Mikulová, Marie
Pajas, Petr
Popelka, Jan
Semecký, Jiří
Šindlerová, Jana
Štěpánek, Jan
Toman, Josef
Urešová, Zdeňka
Žabokrtský, Zdeněk
Date (W3CDTF):2012
Date Issued (W3CDTF):2012-06-15
Description:*Introduction* Prague Czech-English Dependency Treebank (PCEDT) 2.0 was developed by the Institute of Formal and Applied Linguistics at Charles University in Prague, Czech Republic. It is a corpus of Czech-English parallel resources translated, aligned and manually annotated for dependency structure, semantic labeling, argument structure, ellipsis and anaphora resolution. This release updates Prague Czech-English Dependency Treebank 1.0 (LDC2004T25) by adding English newswire texts so that it now contains over two million words in close to 100,000 sentences.
*Data* The principal new material in PCEDT 2.0 is the entire Wall Street Journal data from Treebank-3 (LDC99T42). Not included from PCEDT 1.0 are the Reader's Digest material, the Czech monolingual corpus, and the English-Czech dictionary. Each section is enhanced with a comprehensive manual linguistic annotation in the Prague Dependency Treebank style (LDC2006T01, Prague Dependency Treebank 2.0). The main features of this annotation style are:
* dependency structure of the content words and of coordinating and similar structures (function words are attached as their attribute values)
* semantic labeling of content words and types of coordinating structures
* argument structure, including an argument structure (valency) lexicon for both languages
* ellipsis and anaphora resolution
This annotation style is called tectogrammatical annotation, and it constitutes the tectogrammatical layer in the corpus. Please consult the PCEDT website for more information and documentation.
*Samples* Please follow this link for a sample of the data included.
*Updates* None at this time.
Extent:Corpus size: 4446421 KB
ISBN: 1-58563-616-9
ISLRN: 443-974-834-414-7
DOI: 10.35111/mv82-j246
Language (ISO639):eng
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Rights Holder: Portions © 1987-1989 Dow Jones & Company, Inc., © 2002-2012 Charles University in Prague, Institute of Formal and Applied Linguistics, © 1999, 2004, 2012 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text


Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2012T08
DateStamp:  2020-11-30
GetRecord:  OAI-PMH request for simple DC format
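
The GetRecord entries in this record correspond to OAI-PMH requests for the identifier above. As a rough sketch of how such a request URL is typically formed, the Python below builds GetRecord URLs for the OLAC and simple DC formats. The base URL is a hypothetical placeholder, since this record does not state the repository endpoint, and the `olac` and `oai_dc` metadata prefixes are assumed from common OAI-PMH and OLAC practice rather than taken from this record.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: this record does not give the repository base URL.
BASE_URL = "https://example.org/oai"

# Identifier as listed under "OAI Info" in this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2012T08"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for a single record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Assumed prefixes: "olac" for the OLAC format, "oai_dc" for simple DC.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")
print(olac_url)
print(dc_url)
```

Substituting the repository's actual base URL for the placeholder should be the only change needed if the assumed prefixes hold.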

Search Info

Citation: Hajič, Jan; Hajičová, Eva; Panevová, Jarmila; Sgall, Petr; Cinková, Silvie; Fučíková, Eva; Mikulová, Marie; Pajas, Petr; Popelka, Jan; Semecký, Jiří; Šindlerová, Jana; Štěpánek, Jan; Toman, Josef; Urešová, Zdeňka; Žabokrtský, Zdeněk. 2012. Linguistic Data Consortium.
Terms: area_Europe country_CZ country_GB dcmi_Text iso639_ces iso639_eng olac_primary_text

Up-to-date as of: Sun Jun 16 7:34:23 EDT 2024