OLAC Record oai:www.ldc.upenn.edu:LDC99T42 |
Metadata | ||
Title: | Treebank-3 | |
Access Rights: | Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining | |
Bibliographic Citation: | Marcus, Mitchell P., et al. Treebank-3 LDC99T42. Web Download. Philadelphia: Linguistic Data Consortium, 1999 | |
Contributor: | Marcus, Mitchell P. | |
Santorini, Beatrice | ||
Marcinkiewicz, Mary Ann | ||
Taylor, Ann | ||
Date (W3CDTF): | 1999 | |
Description: | *Introduction* This release contains the following Treebank-2 material: * One million words of 1989 Wall Street Journal material annotated in Treebank II style. * A small sample of ATIS-3 material annotated in Treebank II style. * A fully tagged version of the Brown Corpus. and the following new material: * Switchboard tagged, dysfluency-annotated, and parsed text * Brown parsed text The Treebank bracketing style is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied. *Data* The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files, which provide the relation between the 2,499 PTB filenames and the corresponding WSJ DOCNO strings in TIPSTER, are available in a compressed file (pennTB_tipster_wsj_map.tar.gz) as an additional download for users who have licensed Treebank-2. *Samples* Please view the following samples: * Part-of-Speech Tags * Dysfluency Annotation * Dysfluency Annotation & Part-of-Speech Tags * Dysfluency Annotation, Part-of-Speech Tags & Turns Joined * Syntactic Annotation * Syntactic Annotation & Part-of-Speech Tags *Updates* After publication, it was discovered that not all of the PostScript (*.ps) files had been converted to PDF and that some of the converted PDFs contained errors. For PDF copies of the documentation files, please see the addenda for a list of the available files. As of October 5, 2016, 252 wsj files from Treebank-2 that were previously missing were added. As of February 2017, 2,499 "raw" wsj files were added from Treebank-2 (LDC95T7). Corpus downloads after these dates include these files. | |
Extent: | Corpus size: 264192 KB | |
Identifier: | LDC99T42 | |
https://catalog.ldc.upenn.edu/LDC99T42 | ||
ISBN: 1-58563-163-9 | ||
ISLRN: 141-282-691-413-2 | ||
DOI: 10.35111/gq1x-j780 | ||
Language: | English | |
Language (ISO639): | eng | |
License: | LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf | |
Medium: | Distribution: Web Download | |
Publisher: | Linguistic Data Consortium | |
Publisher (URI): | https://www.ldc.upenn.edu | |
Relation (URI): | https://catalog.ldc.upenn.edu/docs/LDC99T42 | |
Rights Holder: | Portions © 1987-1989 Dow Jones & Company, Inc., © 1993-1995, 1999 Trustees of the University of Pennsylvania | |
Type (DCMI): | Text | |
Type (OLAC): | primary_text | |
OLAC Info |
||
Archive: | The LDC Corpus Catalog | |
Description: | http://www.language-archives.org/archive/www.ldc.upenn.edu | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info |
||
OaiIdentifier: | oai:www.ldc.upenn.edu:LDC99T42 | |
DateStamp: | 2020-11-30 | |
GetRecord: | OAI-PMH request for simple DC format | |
Search Info | ||
Citation: | Marcus, Mitchell P.; Santorini, Beatrice; Marcinkiewicz, Mary Ann; Taylor, Ann. 1999. Treebank-3. Linguistic Data Consortium. | |
Terms: | area_Europe country_GB dcmi_Text iso639_eng olac_primary_text |