OLAC Record: Penn Discourse Treebank Version 2.0

OLAC Record
oai:www.ldc.upenn.edu:LDC2008T05

Metadata

Title: Penn Discourse Treebank Version 2.0

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Prasad, Rashmi, et al. Penn Discourse Treebank Version 2.0 LDC2008T05. Web Download. Philadelphia: Linguistic Data Consortium, 2008

Contributor: Prasad, Rashmi

Lee, Alan

Dinesh, Nikhil

Miltsakaki, Eleni

Campion, Geraud

Joshi, Aravind

Webber, Bonnie

Date (W3CDTF): 2008

Date Issued (W3CDTF): 2008-02-18

Description: *Introduction*
The Penn Discourse Treebank (PDTB) is an NSF-funded project at the University of Pennsylvania. The goal of the project is to annotate the 1-million-word Wall Street Journal corpus in Treebank-2 (LDC95T7) with discourse relations holding between the eventualities and propositions mentioned in the text, which serve as the arguments to the relation. Discourse relations are assumed to have exactly two arguments. PDTB version 2.0 is a continuation of PDTB version 1.0 (made freely available in 2006 but no longer available).
Following a lexically grounded approach to annotation, the PDTB annotates relations realized explicitly by Explicit connectives drawn from syntactically well-defined classes, as well as relations between adjacent sentences when no Explicit connective appears to relate the two. Arguments of relations are annotated in each case. For Explicit connectives, arguments are unconstrained in terms of their distance from the connective and can be found anywhere in the text. Between adjacent sentences where no Explicit connective appears, four scenarios hold:
(a) the sentences may be related by a discourse relation that has no realization in the second sentence, in which case a connective (called an Implicit connective) is provided to express the inferred relation;
(b) the sentences may be related by a discourse relation that is realized by some alternative non-connective expression, in which case these alternative lexicalizations are annotated as the carriers of the relation (labelled as AltLex);
(c) the sentences may be related not by a discourse relation, but merely by an entity-based coherence relation, in which case the presence of such a relation is labelled (as EntRel); and
(d) the sentences may not be related at all, in which case they are labelled as such (NoRel).
Note that LDC has also released Penn Discourse Treebank 3.0 (LDC2019T05).
In addition to the argument structure of relations, the PDTB provides (a) sense annotations for each discourse relation while also capturing the polysemy of connectives, and (b) attribution annotations of relations and each of their arguments, with each instance of attribution providing the corresponding text span along with four features to capture the semantic contribution of the attribution. Both sense and attribution annotations are provided for Explicit, Implicit, and AltLex relations, but not for EntRel and NoRel.
The lexically grounded approach in the PDTB exposes a clearly defined level of discourse structure which will support the extraction of a range of inferences associated with discourse connectives. To date, the PDTB group has carried out various experiments on the corpus, particularly examining the following issues:
* alignment between syntax and discourse, particularly with regard to attribution
* sense disambiguation of discourse connectives
* complexity of dependencies in discourse
The annotations in Penn Discourse Treebank Version 2.0 are linked to the Penn Treebank. The PDTB group will continue to explore these issues and to focus on more extended projects such as discourse parsing, automatic summarization, and natural language generation. Further work will also explore foundational issues in discourse.
PDTB version 2.0 annotates 40600 discourse relations, distributed into the following five types:
* 18459 Explicit Relations
* 16053 Implicit Relations
* 624 Alternative Lexicalizations
* 5210 Entity Relations
* 254 No Relations
*Samples*
For an example of the data in this corpus, please review the sample below:
________________________________________________________
____Explicit____
544..551
4,2
#### Text ####
however
##############
#### Features ####
Wr, Comm, Null, Null
however, Comparison.Contrast
____Sup1____
374..515
2;3
#### Text ####
Its index inched up to 47.6% in October from 46% in September.
Any reading below 50% suggests the manufacturing sector is generally declining
##############
____Arg1____
288..372
1,3,1,1,1,1
#### Text ####
that the manufacturing economy contracted in October for the sixth consecutive month
##############
#### Features ####
Ot, Comm, Null, Null
260..287
1,3,1,0;1,3,1,1,0;1,3,1,1,1,0
#### Text ####
its latest survey indicated
##############
____Arg2____
563..624
4,5,1
#### Text ####
that orders turned up in October after four months of decline
##############
#### Features ####
Ot, Comm, Null, Null
519..542;553..562
4,0;4,1;4,3;4,4;4,5,0;4,6
#### Text ####
The purchasing managers also said
##############
________________________________________________________
*Updates*
As of December 12, 2012, the developers of the Penn Discourse Treebank Version 2.0 (LDC2008T05) have updated this release to add metadata to the Wall Street Journal (WSJ) news stories in the corpus. The goal is to aid in understanding PDTB files as texts and to support distinguishing texts from different genres within the WSJ. The metadata includes the fields below. Consult the metadata documentation for more information.
* DD: the date the article appeared in the WSJ
* AN: unique identifier for the article
* HL: the column name (for regular features such as Who's News, Marketing & Media, Technology), its headline and by-line
* SO: the source of the article
* IN: manually-assigned codes or keywords for the article
* CO: manually-assigned codes for companies or other organizations
* DATELINE: normally the location where the article was filed, but sometimes has very unexpected contents
* GV: Branch of Government or Government Agency mentioned in the article
* SBREAKS: the byte position of section breaks present in the file
* ARTICLEBREAK: separates files that contain more than one article
This update may be of value to discourse researchers. The metadata can, for example, enable the texts to be distinguished by genre (news reports, editorials, etc.
[Webber, 2009]) or by topic [Petrenz and Webber, 2011]. These can then be used, for example, in text segmentation and text summarization, or in testing hypotheses about domain adaptation [Plank and van Noord, 2011]. The data, on the other hand, can allow researchers to distinguish separate texts within a single file (e.g. the four separate letters to the editor in file wsj_0105) and thereby avoid, for example, attempting to produce one summary for the entire file.
As of February 2017, 2,499 "raw" wsj files from Treebank-2 (LDC95T7) were added to make PDTB more useful. All downloads after this date will contain the complete, updated corpus.
*Recognizing Textual Entailment Data*
These data have been used to run the textual entailment experiments described in: Sara Tonelli and Elena Cabrio, "Hunting for Entailing Pairs in the Penn Discourse Treebank", in Proceedings of COLING 2012, Mumbai, India. The files contain Text-Hypothesis pairs in the standard RTE XML format (for more details, see http://www.nist.gov/tac/2011/RTE/), which have been manually annotated as entailing or not entailing. All sentence pairs have been extracted from the Penn Discourse Treebank and are therefore connected by a discourse relation label. For more information, consult the readme. The data are not included in the general release of Penn Discourse Treebank Version 2.0, but are freely available for download.
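The span fields (such as 544..551) and tree-address fields (such as 4,2) in the sample under Samples can be read with a few lines of Python. This is a minimal sketch, not the PDTB group's own tooling. It assumes that multiple entries in one field are separated by semicolons, that spans are character offsets into the raw article text, and that the end offset is exclusive. The last assumption is consistent with the sample, where "however" has 7 characters and its span is 544..551. The function names are hypothetical.

```python
from typing import List, Tuple


def parse_span_list(field: str) -> List[Tuple[int, int]]:
    """Parse a span field such as '544..551' into (start, end) pairs.

    Assumption: several spans in one field are separated by ';'.
    """
    spans = []
    for part in field.split(";"):
        part = part.strip()
        if not part:
            continue
        start, end = part.split("..")
        spans.append((int(start), int(end)))
    return spans


def parse_address_list(field: str) -> List[Tuple[int, ...]]:
    """Parse a tree-address field such as '4,2' into tuples of integers.

    Assumption: several addresses in one field are separated by ';'.
    """
    addresses = []
    for part in field.split(";"):
        part = part.strip()
        if part:
            addresses.append(tuple(int(n) for n in part.split(",")))
    return addresses


def extract_text(raw_text: str, spans: List[Tuple[int, int]]) -> str:
    """Join the raw-text slices selected by the spans (end offset exclusive)."""
    return " ".join(raw_text[start:end] for start, end in spans)
```

Given the raw text of an article as a string, extract_text(raw_text, parse_span_list("544..551")) would return the connective text for the sample relation, provided the offset assumptions above hold for the file in question.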

Extent: Corpus size: 33484 KB

Identifier: LDC2008T05

https://catalog.ldc.upenn.edu/LDC2008T05

ISBN: 1-58563-466-2

ISLRN: 488-589-036-315-2

DOI: 10.35111/nbvh-1n26

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2008T05

Rights Holder: Portions © 1989 Dow Jones & Company, Inc., © 2008, 2012 The Penn Discourse Treebank Group, © 2008, 2012 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2008T05

DateStamp: 2021-10-27

GetRecord: OAI-PMH request for simple DC format
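
The GetRecord entries in this record refer to OAI-PMH requests for the identifier given under OaiIdentifier. As a rough sketch of how such a request URL is formed, the snippet below builds one from the protocol's standard query parameters (verb, identifier, metadataPrefix). The base URL is a placeholder, not the archive's real endpoint, and the function name is hypothetical. The prefix oai_dc is the protocol's standard name for simple Dublin Core; using olac for the OLAC format is an assumption here.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def build_getrecord_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Placeholder endpoint (assumption): replace with the archive's actual base URL.
BASE_URL = "https://example.org/oai"

url = build_getrecord_url(BASE_URL, "oai:www.ldc.upenn.edu:LDC2008T05", "oai_dc")
```

Note that urlencode percent-encodes the colons in the identifier, which a conforming server decodes back to the original value.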

Search Info
Citation: Prasad, Rashmi; Lee, Alan; Dinesh, Nikhil; Miltsakaki, Eleni; Campion, Geraud; Joshi, Aravind; Webber, Bonnie. 2008. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2008T05
Up-to-date as of: Wed Oct 29 7:01:02 EDT 2025
