OLAC Record: Treebank-2

OLAC Record
oai:www.ldc.upenn.edu:LDC95T7

Metadata

Title: Treebank-2

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. Treebank-2 LDC95T7. Web Download. Philadelphia: Linguistic Data Consortium, 1995

Contributor: Marcus, Mitchell P.

Santorini, Beatrice

Marcinkiewicz, Mary Ann

Date (W3CDTF): 1995

Description: Original release was included in LDC Catalog No. LDC93T1.

Original Treebank Release

This release contains over 1.6 million words of hand-parsed material from the Dow Jones News Service, plus an additional one million words tagged for part-of-speech. This material is a subset of the language model corpus for the DARPA CSR large-vocabulary speech recognition project. It also contains the first fully parsed version of the Brown Corpus, which has also been completely retagged using the Penn Treebank (PTB) tag set. Also included are tagged and parsed data from Department of Energy abstracts, IBM computer manuals, MUC-3 and ATIS.

In addition, the release includes source code for programs that were used by the PTB project in creating portions of the data. Source code is also included for "tgrep," a program that permits the user to search for specific constituents in tree structures. All software is provided "as is." (We have learned since publication that the tgrep source code provided on the CD-ROM is not readily portable, and compiling tgrep requires modification of the source files. Also included is a pre-compiled program file for tgrep, built for use on Sun SPARC systems.)

Release 2

The PTB Project Release 2 features the new PTB-2 bracketing style, which is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied, along with a complete style manual explaining the bracketing and new versions of tools for searching and treating bracketed data. This release also contains all the annotated text material from the earlier Treebank Preliminary Release, including the Brown Corpus. While these materials have not all been converted to the newer bracketing style, they have been cleaned up to remove problems that had appeared in the earlier release.

The contents of Treebank Release 2 are as follows:

* One million words of 1989 Wall Street Journal material annotated in Treebank-2 style.
* A small sample of ATIS-3 material annotated in Treebank-2 style.
* 300-page style manual for Treebank-2 bracketing, as well as the part-of-speech tagging guidelines.
* The contents of the previous Treebank release (Version 0.5), with cleaner versions of the WSJ, Brown Corpus, and ATIS material (annotated in Treebank-1 style).
* Tools for processing Treebank data, including "tgrep," a tree-searching and manipulation package (note that usability of this release of tgrep is limited: users of Sun SPARC systems should have no problem, but others may find the software difficult or impossible to port).

In addition, the PTB Project has provided some updates, announcements and a discussion forum for users. A file of updates and further information is available via anonymous FTP from ftp.cis.upenn.edu, in pub/treebank/doc/update.cd2.

The PTB project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 and Treebank-3 both include the raw text for each story. Three "map" files, available in a compressed file (pennTB_tipster_wsj_map.tar.gz) as an additional download for users who have licensed Treebank-2, provide the relation between the 2,499 PTB filenames and the corresponding WSJ DOCNO strings in TIPSTER.

Extent: Corpus size: 198 KB

Identifier: LDC95T7

https://catalog.ldc.upenn.edu/LDC95T7

ISBN: 1-58563-054-3

ISLRN: 650-146-755-602-3

DOI: 10.35111/wf9p-g717

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC95T7

Rights Holder: Portions © 1987-1989 Dow Jones & Company, Inc., © 1993-1995 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC95T7

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
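
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch of how such a request URL is put together, the Python below builds one for the simple DC format (`oai_dc`) and one for the OLAC format (`olac`). The base URL is a placeholder assumption, since this record does not state the repository endpoint, and `get_record_url` is a hypothetical helper name used only for illustration.

```python
from urllib.parse import urlencode

# ASSUMPTION: placeholder endpoint. The record does not give the real
# OAI-PMH base URL, so substitute the repository's actual one.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_ID = "oai:www.ldc.upenn.edu:LDC95T7"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode({
        "verb": "GetRecord",                # OAI-PMH verb for a single record
        "identifier": identifier,           # OAI identifier of the record
        "metadataPrefix": metadata_prefix,  # requested metadata format
    })
    return f"{base_url}?{query}"


dc_url = get_record_url(OAI_ID, "oai_dc")   # simple Dublin Core
olac_url = get_record_url(OAI_ID, "olac")   # OLAC metadata format

print(dc_url)
print(olac_url)
```

Fetching either URL from a real endpoint should return an XML response that wraps the record in the requested format. The sketch only constructs the URLs and performs no network access.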

Search Info
Citation: Marcus, Mitchell P.; Santorini, Beatrice; Marcinkiewicz, Mary Ann. 1995. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC95T7
Up-to-date as of: Wed Oct 29 7:00:35 EDT 2025
