OLAC Record: Prague Czech-English Dependency Treebank 1.0

OLAC Record
oai:www.ldc.upenn.edu:LDC2004T25

Metadata

Title: Prague Czech-English Dependency Treebank 1.0

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Curin, Jan, et al. Prague Czech-English Dependency Treebank 1.0 LDC2004T25. Web Download. Philadelphia: Linguistic Data Consortium, 2004

Contributor: Curin, Jan

Cmejrek, Martin

Havelka, Jiří

Hajič, Jan

Kubon, Vladislav

Žabokrtský, Zdeněk

Date (W3CDTF): 2004

Date Issued (W3CDTF): 2004-11-19

Description: *Introduction*

Prague Czech-English Dependency Treebank (PCEDT) 1.0 was produced by the Linguistic Data Consortium (LDC) and contains 74,600 parallel sentences in Czech and English, 21,600 of which are morphologically annotated and parsed into dependency structures. It also includes a large monolingual corpus of Czech with 2.4 million sentences and three dictionaries for translation between Czech and English. This corpus was developed at the Center for Computational Linguistics in cooperation with the Institute of Formal and Applied Linguistics. PCEDT 1.0 is a corpus of Czech-English parallel resources suitable for experiments in machine translation, with a special emphasis on dependency-based (structural) translation; evaluation data is provided for Czech-to-English systems.

*Data*

The core part of PCEDT 1.0 is a Czech translation of 21,600 English sentences from the Wall Street Journal, which are part of Treebank-3 (LDC99T42). Sentences of the Czech translation were automatically morphologically annotated and parsed into two levels of dependency structures (analytical and tectogrammatical), which were introduced in the theory of Functional Generative Description and are closely related to Prague Dependency Treebank 1.0 (LDC2001T10). The original English sentences were transformed from the Penn Treebank phrase-structure trees into dependency representations. A held-out (development and evaluation) set of 515 sentence pairs was selected and manually annotated at the tectogrammatical level in both Czech and English; for the purposes of quantitative evaluation, this set was retranslated from Czech into English by four different translation companies.

PCEDT 1.0 also contains a parallel Czech-English corpus of plain text from Reader's Digest 1993-1996 consisting of 53,000 parallel sentences, and a large monolingual corpus of Czech (2.4 million sentences).

The included Czech-English translation dictionary consists of 46,150 translation pairs in its lemmatized version and 496,673 pairs of word forms, where all corresponding word form pairs have been generated for each entry-translation pair. Also included is an English-Czech dictionary provided by Milan Svoboda under the GNU/FDL license; this dictionary contains multi-word translations in 115,929 translation pairs.

Prague Czech-English Dependency Treebank 2.0 (LDC2012T08) translates the whole Wall Street Journal part of the Penn Treebank. Please consult the PCEDT 2.0 website for more information and documentation.

*Samples*

For an example of the data in this corpus, please view this sample (TXT).

*Sponsorship*

PCEDT 1.0 has been supported by the following grants and projects:

* Ministry of Education of the Czech Republic project No. LN00A063 (Center for Computational Linguistics)
* National Science Foundation grant No. IIS-0121285

*Updates*

None at this time.

Identifier: LDC2004T25

https://catalog.ldc.upenn.edu/LDC2004T25

ISBN: 1-58563-321-6

ISLRN: 557-838-231-104-8

DOI: 10.35111/yn25-st18

Language: Czech

English

Language (ISO639): ces

eng

License: Prague Czech-English Dependency Treebank 1.0: https://catalog.ldc.upenn.edu/license/prague-czech-english-dependency-treebank-1.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2004T25

Rights Holder: Portions © 1988-1989 Dow Jones & Company, Inc., © 1993-1996 Reader's Digest, © 1991-1995 Lidové noviny, © 2004 Milan Svoboda, © 2002-2004 Center for Computational Linguistics, Charles University in Prague, © 2004 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): lexicon

primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2004T25

DateStamp: 2022-03-11

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Curin, Jan; Cmejrek, Martin; Havelka, Jiří; Hajič, Jan; Kubon, Vladislav; Žabokrtský, Zdeněk. 2004. Linguistic Data Consortium.
Terms: area_Europe country_CZ country_GB dcmi_Text iso639_ces iso639_eng olac_lexicon olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2004T25
Up-to-date as of: Wed Oct 29 7:00:24 EDT 2025
