OLAC Record: Prague Dependency Treebank 1.0

OLAC Record
oai:www.ldc.upenn.edu:LDC2001T10

Metadata

Title: Prague Dependency Treebank 1.0

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Hajič, Jan, et al. Prague Dependency Treebank 1.0 LDC2001T10. Web Download. Philadelphia: Linguistic Data Consortium, 2001

Contributor: Hajič, Jan

Hajičová, Eva

Pajas, Petr

Panevová, Jarmila

Sgall, Petr

Date (W3CDTF): 2001

Description: *Introduction*

The Prague Dependency Treebank Version 1.0:

* Morphologically and syntactically annotated Czech data, 1.8MW
* Czech-English parallel corpus, aligned, 0.9MW/1MW
* Czech raw texts (newspapers and journals), over 30MW
* Czech NLP tools (morphology, tagging)
* General annotation tools (tree editors, tree viewer)

(Abridged version of part of the paper: E. Hajicova. Dependency-Based Underlying-Structure Tagging of a Very Large Czech Corpus)

Since a group of Czech linguists (Institute of Formal and Applied Linguistics, Institute of Theoretical and Computational Linguistics) from Charles University in Prague and Masaryk University in Brno first formulated the Czech National Corpus, it has been quite clear to all of us that for the outcome of our project to have broader relevance and multifaceted usage, we cannot confine ourselves to a mere compilation of a very large corpus of Czech texts. We have been aware that in order to make the corpus really useful for future users -- be they linguists or developers of natural language processing systems of any kind -- we have to design annotation schemes and develop tools that will allow us to add as much linguistic information as possible. Having the advantage of a long and fruitful tradition of theoretical and computational linguistics, and inspired by the research resulting in the Penn Treebank, the project group decided to build the Prague Dependency Treebank (PDT).

*Data*

The following three points are characteristic of the theory underlying the PDT, fully visible at the highest, tectogrammatical level:

(i) Its theoretical background is a dependency-based syntax (handling the sentence structure as concentrated around the verb and its valency, but containing a further dimension, namely coordination). Among the reasons for the choice of a dependency-based syntax, we primarily stress its relative economy and its perspicuous, immediate correspondence to the empirical data.
(ii) The nodes of the dependency tree (more precisely, of a multidimensional network) are labeled by complex symbols consisting of lexical, morphological and syntactic parts. Thus, the label of every node contains symbols expressing all of the information that is contained in the grammatical position of the word and is relevant for a semantic (semantico-pragmatic) interpretation. This makes the output representations, that is, the trees of our treebank, useful not only for practical applications such as parsing, but also for inclusion in an integrated theoretical description encompassing all layers from the outer (phonetic or graphemic) shape of the sentence to its semantico-pragmatic representation, be it in the form of truth-conditionally based intensional semantics or in that of a framework paying more attention to the embedding of the sentence in context.

(iii) The dependency tree is understood as projective. Its relationships to the morphemic representation of the sentence (a string of symbols, the order of which corresponds to the surface word order) are handled by means of specific rules.

Prague Dependency Treebank as a project

The Prague Dependency Treebank (PDT) is a long-term project with two major phases. In the first phase (1996-2000), the morphological and analytic (syntactic) layers of annotation were completed and made available, together with a preview of the tectogrammatical layer annotation, as PDT 1.0. During the second phase (2000-2004, Center for Computational Linguistics), annotation of the tectogrammatical layer will proceed, and PDT 2.0 will be available upon completion.
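
To make point (iii) concrete, here is a minimal sketch of a projectivity check. It is not part of the PDT tools and does not read the FS or CSTS formats; the `heads` list representation and the function name `is_projective` are assumptions chosen only for illustration. A tree is treated as projective when, for every head-dependent pair, each word lying between the two in surface order is itself a descendant of that head.

```python
from typing import List


def is_projective(heads: List[int]) -> bool:
    """Return True if the dependency tree is projective.

    heads[i] is the 1-based position of the head of word i+1;
    0 marks the (artificial) root. The input is assumed to be a
    well-formed tree with no cycles (hypothetical representation,
    not a PDT file format).
    """
    n = len(heads)
    for dep in range(1, n + 1):
        head = heads[dep - 1]
        lo, hi = min(head, dep), max(head, dep)
        # Every word strictly between head and dependent must be
        # dominated by the head.
        for between in range(lo + 1, hi):
            node = between
            # Climb towards the root until we meet the head or run out.
            while node != 0 and node != head:
                node = heads[node - 1]
            if node != head:
                return False
    return True


# Word 2 is the root; words 1 and 3 depend on it.
print(is_projective([2, 0, 2]))
# Word 2 depends on word 4 across the root word 3.
print(is_projective([3, 4, 0, 3]))
```

The check is quadratic in the worst case, which is adequate for single sentences; a treebank-scale tool would more likely use a linear-time crossing-arc test.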

The PDT is an annotated corpus of Czech, a representative of inflectionally rich, free word-order languages, structured in three layers:

* Morphological layer (lowest) - Full morphological annotation
* Analytic layer (middle) - Superficial (surface) syntactic annotation using dependency trees, at a level conceptually close to the syntactic annotation used in the Penn Treebank
* Tectogrammatical layer (highest) - Level of linguistic meaning

Text Sources

The electronic text sources have been provided by the Institute of the Czech National Corpus. The text material contains samples from the following sources:

* Lidové Noviny (daily newspapers), 1991, 1994, 1995
* Mladá fronta Dnes (daily newspapers), 1992
* Ceskomoravský Profit (business weekly), 1994
* Vesmír (scientific magazine), Academia Publishers, 1992, 1993

There is also a parallel Czech-English corpus. Drawn from Readers Digest 1993-1996, it consists of 450 articles, 53,117 parallel sentences, 1,010,346 English tokens and 877,658 Czech tokens.

Inner format of PDT

There are two internal formats employed in PDT: FS and CSTS. The former is an older format, still heavily used by some treebank tools. The latter, a more general SGML-based encoding, is meant as the main PDT format (in the future, it will be followed by an XML version, probably already for PDT 2.0). See the description of the FS file format and the documentation of the CSTS document type definition (csts.dtd).

Prague Dependency Treebank Version 1.0

PDT 0.5 (halfway through) was released in 1998 and contains 456,705 tokens (words and punctuation) in 26,610 sentences. PDT 1.0 contains about three times more tokens and sentences than PDT 0.5. It is completely manually annotated on the morphological and analytical levels and also includes a preview of tectogrammatically annotated data.

Future

The Prague Dependency Treebank Version 2.0 will add the tectogrammatical layer of annotation to PDT 1.0.
It will be available with a reduced amount of data as preliminary Version 1.5 during 2002. The final data volume will be reached at the end of 2004.

Support

The PDT 1.0 has been supported by the following grants and projects:

* Grant Agency of the Czech Republic No. 405/96/0198 (Treebank Definition and Procedures Specification)
* Grant Agency of the Czech Republic No. 405/96/K214 (Tools and Morphological Layer Annotation)
* Ministry of Education of the Czech Republic No. VS96151 (Tools and Structural Annotation on the Analytical Layer)
* National Science Foundation No. IIS-9732388 (Version 0.5 Preparation for the Workshop 98)

The PDT 2.0 will be supported by the project:

* Ministry of Education of the Czech Republic No. LN00A063 (Center for Computational Linguistics)

*Updates*

There are no updates at this time.
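
The parallel-corpus figures quoted under Text Sources can be turned into rough averages, which may help when judging the size of the aligned data. The sketch below only does arithmetic on the four numbers stated in this record; the variable names and the `per_unit` helper are illustrative assumptions, and nothing here is read from the corpus itself.

```python
# Figures as stated in the Text Sources paragraph of this record.
ARTICLES = 450
SENTENCE_PAIRS = 53_117
ENGLISH_TOKENS = 1_010_346
CZECH_TOKENS = 877_658


def per_unit(total: int, units: int) -> float:
    """Average of `total` over `units`, rounded to two decimals."""
    return round(total / units, 2)


summary = {
    "sentence pairs per article": per_unit(SENTENCE_PAIRS, ARTICLES),
    "English tokens per sentence pair": per_unit(ENGLISH_TOKENS, SENTENCE_PAIRS),
    "Czech tokens per sentence pair": per_unit(CZECH_TOKENS, SENTENCE_PAIRS),
    "English tokens per Czech token": per_unit(ENGLISH_TOKENS, CZECH_TOKENS),
}

for label, value in summary.items():
    print(f"{label}: {value}")
```

These are averages over the whole parallel corpus and say nothing about the spread of sentence lengths within it.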

Identifier: LDC2001T10

https://catalog.ldc.upenn.edu/LDC2001T10

ISBN: 1-58563-212-0

ISLRN: 552-093-753-963-2

DOI: 10.35111/c3n5-8z64

Language: Czech

English

Language (ISO639): ces

eng

License: Prague Dependency Treebank 1.0: https://catalog.ldc.upenn.edu/license/prague-dependency-treebank-1.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2001T10

Rights Holder: Portions © 1993-1996 Readers Digest, © 1991, 1994, 1995 Lidové noviny daily newspapers, © 1992 Mladá fronta Dnes daily newspapers, © 1994 Ceskomoravský Profit business weekly, © 1992-1993 Vesmír scientific magazine, Academia Publishers, © 1996-2001 Institute of Formal and Applied Linguistics and Center for Computational Linguistics Faculty of Mathematics and Physics, Charles University, © 2001 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2001T10

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
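
The GetRecord entries in this record refer to OAI-PMH requests. As a minimal sketch of how such a request URL is assembled, the example below builds GetRecord URLs for the identifier listed under OaiIdentifier. The base URL `https://example.org/oai` is a placeholder, not the actual endpoint behind this record, and the `olac` metadata prefix is an assumption about the repository's configuration; `oai_dc` is the simple Dublin Core prefix defined by OAI-PMH.

```python
from urllib.parse import urlencode

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2001T10"

# Hypothetical base URL; substitute the repository's real OAI-PMH endpoint.
BASE_URL = "https://example.org/oai"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# "olac" is assumed to be the prefix for the OLAC format.
olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")
# "oai_dc" is the simple Dublin Core format.
dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")

print(olac_url)
print(dc_url)
```

The identifier is percent-encoded by `urlencode`, so the colons in it appear as `%3A` in the resulting URL.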

Search Info
Citation: Hajič, Jan; Hajičová, Eva; Pajas, Petr; Panevová, Jarmila; Sgall, Petr. 2001. Linguistic Data Consortium.
Terms: area_Europe country_CZ country_GB dcmi_Text iso639_ces iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2001T10
Up-to-date as of: Wed Oct 29 7:00:08 EDT 2025
