OLAC Record: Penn Discourse Treebank Version 3.0

OLAC Record
oai:www.ldc.upenn.edu:LDC2019T05

Metadata

Title: Penn Discourse Treebank Version 3.0

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Prasad, Rashmi, et al. Penn Discourse Treebank Version 3.0 LDC2019T05. Web Download. Philadelphia: Linguistic Data Consortium, 2019

Contributor: Prasad, Rashmi

Webber, Bonnie

Lee, Alan

Joshi, Aravind

Date (W3CDTF): 2019

Date Issued (W3CDTF): 2019-03-15

Description:

*Introduction*

Penn Discourse Treebank (PDTB) Version 3.0 is the third release in the Penn Discourse Treebank project, the goal of which is to annotate the Wall Street Journal (WSJ) section of Treebank-2 (LDC95T7) with discourse relations. Penn Discourse Treebank Version 2 (LDC2008T05) contains over 40,600 tokens of annotated relations. In Version 3, an additional 13,000 tokens were annotated, certain pairwise annotations were standardized, new senses were included and the corpus was subjected to a series of consistency checks. Details concerning the development of PDTB Version 3.0 can be found in the documentation accompanying this release.

Largely because the PDTB project was based on the idea that discourse relations are grounded in an identifiable set of explicit words or phrases (discourse connectives) or simply in the adjacency of two sentences, the PDTB has been used by many researchers in the natural language processing community and, more recently, by researchers in psycholinguistics. It has also stimulated the development of similar resources in other languages and domains.

*Data*

Annotations are provided in the form of separate text files (standoff annotation) that are byte-indexed into the raw WSJ text files in Treebank-2. The raw WSJ files are also included in this release. All text files are plain text, encoded in UTF-8.

This corpus contains two tools: (1) the Annotator, which is used for annotation and adjudication and can also be used for viewing the corpus; and (2) the Conversion Tool, which converts Version 2 annotation files into the Version 3 format.

The documentation directory contains a manual describing what is new in Version 3 and how Version 3 differs from Version 2; the methods and guidelines used in annotating PDTB Version 3; and a range of statistics on the tokens, including the frequency of each connective, its sense labels and its modifiers.

*Samples*

Samples of the annotation of different types of discourse relations, along with their visualization in the Annotator tool, are available for the following relation types:

* Explicit relations
* Implicit relations
* AltLex and AltLexC relations
* Entity relations
* Hypophora relations
* NoRel (annotated only between adjacent sentences within a paragraph that are not linked to each other by a discourse relation)

*Updates*

Experiments carried out in Fall 2019 on the intra-sentential discourse relations in the PDTB-3 revealed two problems with the corpus: (1) the final versions of two gold files of "to clause" annotation had not been loaded, and (2) several tokens were inadvertently omitted on the assumption that they were duplicates, when they were not. Repairing these errors, and correcting a mis-labelled token in file wsj_1026, has added another 45 implicit intra-sentential relations to the corpus. Counts in the Annotation Manual have been adjusted to take these additional tokens into account. Specific changes/additions are recorded in the file "pdtb3-revision-jan-2020.txt". Downloads after February 3, 2020 contain the updated corpus.

*Acknowledgment*

This work has been funded by the National Science Foundation, under grant NSF IIS 1422186 to the University of Pennsylvania and grant NSF IIS 1421067 to the University of Wisconsin, Milwaukee. The content of this publication does not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred.
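
As an illustration of the standoff scheme described under *Data*, the following is a minimal Python sketch of how a byte-indexed span might be resolved against a raw text file. It is not taken from the release: the span syntax (`start..end` pairs separated by `;`), the end-exclusive offset convention, and the function names `parse_spans` and `extract_text` are all assumptions made for illustration. The annotation manual in the documentation directory is the authority on the actual file format.

```python
from typing import List, Tuple


def parse_spans(span_list: str) -> List[Tuple[int, int]]:
    """Parse a span list such as "0..11;20..31" into (start, end) pairs.

    Assumption: spans are written as start..end and separated by ';'.
    """
    spans = []
    for part in span_list.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate empty fields
        start, end = part.split("..")
        spans.append((int(start), int(end)))
    return spans


def extract_text(raw: bytes, spans: List[Tuple[int, int]]) -> str:
    """Return the text covered by the spans, joined with single spaces.

    The offsets are applied to the raw bytes, not to decoded characters,
    because the annotations are byte-indexed. Assumption: end offsets
    are exclusive.
    """
    return " ".join(raw[start:end].decode("utf-8") for start, end in spans)


if __name__ == "__main__":
    # Stand-in for the contents of a raw WSJ file read in binary mode,
    # e.g. open(path, "rb").read(); the sentence here is invented.
    raw = "Prices rose because demand grew.".encode("utf-8")
    print(extract_text(raw, parse_spans("0..11;20..31")))
```

Slicing the bytes before decoding matters once a file contains any multi-byte UTF-8 character, since byte offsets and character offsets then diverge.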

Extent: Corpus size: 43056 KB

Identifier: LDC2019T05

https://catalog.ldc.upenn.edu/LDC2019T05

ISBN: 1-58563-877-3

ISLRN: 977-491-842-427-0

DOI: 10.35111/qebf-gk47

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2019T05

Rights Holder: Portions © 1987-1989 Dow Jones & Company, Inc., © 2008, 2012, 2019 The Penn Discourse Treebank Group, © 2008, 2012, 2019 Trustees of the University of Pennsylvania

Type (DCMI): Software

Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2019T05

DateStamp: 2025-01-31

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Prasad, Rashmi; Webber, Bonnie; Lee, Alan; Joshi, Aravind. 2019. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Software dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2019T05
Up-to-date as of: Wed Oct 29 7:01:51 EDT 2025
