OLAC Record: Prague Czech-English Dependency Treebank 2.0

OLAC Record
oai:www.ldc.upenn.edu:LDC2012T08

Metadata

Title: Prague Czech-English Dependency Treebank 2.0

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Hajič, Jan, et al. Prague Czech-English Dependency Treebank 2.0 LDC2012T08. Web Download. Philadelphia: Linguistic Data Consortium, 2012

Contributor: Hajič, Jan

Hajičová, Eva

Panevová, Jarmila

Sgall, Petr

Cinková, Silvie

Fučíková, Eva

Mikulová, Marie

Pajas, Petr

Popelka, Jan

Semecký, Jiří

Šindlerová, Jana

Štěpánek, Jan

Toman, Josef

Urešová, Zdeňka

Žabokrtský, Zdeněk

Date (W3CDTF): 2012

Date Issued (W3CDTF): 2012-06-15

Description:

*Introduction*

Prague Czech-English Dependency Treebank (PCEDT) 2.0 was developed by the Institute of Formal and Applied Linguistics at Charles University in Prague, Czech Republic. It is a corpus of Czech-English parallel resources translated, aligned and manually annotated for dependency structure, semantic labeling, argument structure, ellipsis and anaphora resolution. This release updates Prague Czech-English Dependency Treebank 1.0 (LDC2004T25) by adding English newswire texts so that it now contains over two million words in close to 100,000 sentences.

*Data*

The principal new material in PCEDT 2.0 is the entire Wall Street Journal data from Treebank-3 (LDC99T42). The Reader's Digest material, the Czech monolingual corpus, and the English-Czech dictionary from PCEDT 1.0 are not included. Each section is enhanced with a comprehensive manual linguistic annotation in the Prague Dependency Treebank style (LDC2006T01, Prague Dependency Treebank 2.0). The main features of this annotation style are:

* dependency structure of the content words and coordinating and similar structures (function words are attached as their attribute values)
* semantic labeling of content words and types of coordinating structures
* argument structure, including an argument structure (valency) lexicon for both languages
* ellipsis and anaphora resolution

This annotation style is called tectogrammatical annotation, and it constitutes the tectogrammatical layer in the corpus. Please consult the PCEDT website for more information and documentation.

*Samples*

Please follow this link for a sample of the data included.

*Updates*

None at this time.

Extent: Corpus size: 4446421 KB

Identifier: LDC2012T08

https://catalog.ldc.upenn.edu/LDC2012T08

ISBN: 1-58563-616-9

ISLRN: 443-974-834-414-7

DOI: 10.35111/mv82-j246

Language: English

Czech

Language (ISO639): eng

ces

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Rights Holder: Portions © 1987-1989 Dow Jones & Company, Inc., © 2002-2012 Charles University in Prague, Institute of Formal and Applied Linguistics, © 1999, 2004, 2012 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2012T08

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Hajič, Jan; Hajičová, Eva; Panevová, Jarmila; Sgall, Petr; Cinková, Silvie; Fučíková, Eva; Mikulová, Marie; Pajas, Petr; Popelka, Jan; Semecký, Jiří; Šindlerová, Jana; Štěpánek, Jan; Toman, Josef; Urešová, Zdeňka; Žabokrtský, Zdeněk. 2012. Prague Czech-English Dependency Treebank 2.0. Linguistic Data Consortium.
Terms: area_Europe country_CZ country_GB dcmi_Text iso639_ces iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2012T08
Up-to-date as of: Wed Oct 29 7:01:20 EDT 2025
