OLAC Record: BLLIP North American News Text, Complete

OLAC Record
oai:www.ldc.upenn.edu:LDC2008T13

Metadata

Title: BLLIP North American News Text, Complete

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: McClosky, David, Eugene Charniak, and Mark Johnson. BLLIP North American News Text, Complete LDC2008T13. Web Download. Philadelphia: Linguistic Data Consortium, 2008

Contributor: McClosky, David

Charniak, Eugene

Johnson, Mark

Date (W3CDTF): 2008

Date Issued (W3CDTF): 2008-08-19

Description: *Introduction* Brown Laboratory for Linguistic Information Processing (BLLIP) North American News Text, Complete contains a Penn Treebank-style parsing of approximately 24 million sentences from the North American News Text Corpus (LDC95T21). The North American News Text Corpus consists of English news text from the Los Angeles Times-Washington Post (1994-1997), the New York Times (1994-1996), Reuters News Service (1994-1996) and the Wall Street Journal (1994-1996).

BLLIP North American News Text is released in two versions: BLLIP North American News Text, Complete (LDC2008T13), a members-only corpus that contains sentences from all sources in The North American News Text Corpus; and BLLIP North American News Text, General Release (LDC2008T14), a corpus available to nonmembers that does not include the Wall Street Journal data from The North American News Text Corpus.

To complement the Complete and General Release versions of BLLIP North American News Text, LDC is re-releasing The North American News Text Corpus in two versions. North American News Text, Complete (LDC2008T15), the members-only original version, is now available as a 2008 Membership Year corpus. North American News Text, General Release (LDC2008T16), which does not include news text from the Wall Street Journal, is available to nonmembers for the first time. The directory structure of each of these publications has been reorganized to match the directory structure of the BLLIP releases.

*Methodology* A key problem in natural language processing is syntactic ambiguity resulting from uncertain relationships between words and their connections to sentence clauses. Sentences that can be constructed with correct syntax in more than one way are ambiguous, and such sentences generate multiple parse trees when they are separated into clauses by parts of speech.

Traditional parsing techniques, such as part-of-speech (POS) tagging, typically achieve a 90% accuracy rate because most sentences are not ambiguous. Resolving ambiguous sentences requires a probabilistic approach. Using the relative frequencies of grammar rules, statistical processing techniques assign a probability to each clause. These probabilities are then combined over each complete sentence parse to give a probability for that parse. In that way, the most likely parse can be determined.

The data in this release was parsed into Penn Treebank-style parse trees using a re-ranking parser developed by Eugene Charniak and Mark Johnson. The Charniak and Johnson parser is statistically based and uses a generative first stage followed by a discriminative second stage. Both stages were trained on the Wall Street Journal data in Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42). BLLIP 1987-1989 WSJ Corpus Release 1 (LDC2000T43) contains a complete Treebank-style parsing of that Wall Street Journal material.

In order to produce BLLIP North American News Text, the Charniak-Johnson parser used a simplified context-free grammar in the first stage to generate a set of n-best parses. Those parses were then pruned by eliminating the parses at the edges of the distribution. In the second stage, a maximum entropy-based parser using a complete grammar was applied. The output trees are ranked in order of probability.

*Data* The parses in BLLIP North American News Text include constituency and POS tagging information for each of the 50-best parses of each sentence. Each file contains a sequence of n-best lists. An n-best list is a list of the top n parses of each sentence with the corresponding parser probability and re-ranker score.
Following is an example of a simple n-best list:

50 reute9406_007.0356_13
4.9244 -147.337
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency) (, ,) (NN government) (CC and) (NN parliament))))))))))) (. .)))
3.56482 -151.575
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency)))) (, ,) (NP (NN government) (CC and) (NN parliament)))))))))) (. .)))
3.35952 -151.173
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (NP (DT the) (NN presidency)) (, ,) (NP (NN government) (CC and) (NN parliament)))))))))))) (. .)))
2.67662 -148.374
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (VP (ADVP (RB first)) (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency) (, ,) (NN government) (CC and) (NN parliament))))))))))) (. .)))

In the above example, the first number ("50") indicates the number of parses. The next token is the article id from the North American News Text Corpus ("reute9406_007.0356"), followed by an underscore, followed by the number of the sentence in the article ("13"). The parses follow; for brevity, only four parses out of the fifty are presented here. Each parse consists of a reranker score (4.9244 for the first parse) and a parser log probability (-147.337 for the first parse), a new line, and then the parse tree itself. Parse trees are given in Penn Treebank format. Note that the n-best list is sorted by decreasing reranker score.

Source material is as follows:

Source                                                  Dates        Approx. # Words (millions)
Los Angeles Times & Washington Post                     1994-1997    52
New York Times                                          1994-1996    173
Reuters (General and Financial)                         1994-1996    85
Wall Street Journal (Not included in General Release)   1994-1996    40

*Additional Licensing Instructions* This 'members-only' corpus is available to current LDC members, who can request the data at the listed reduced-license fee.
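The n-best list layout described above (a header with the parse count and sentence id, then a score line followed by a tree line for each parse) can be read with a few lines of Python. This is only a minimal sketch, not a tool shipped with the corpus: the names `Parse`, `NBestList`, `read_nbest_lists` and `SAMPLE` are invented for illustration, the trees in `SAMPLE` are shortened stand-ins rather than corpus data, and the sketch assumes one header, score pair, or tree per line as the description states.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Parse:
    reranker_score: float  # first number on the score line
    log_prob: float        # parser log probability, second number
    tree: str              # Penn Treebank-style bracketing, kept as text


@dataclass
class NBestList:
    declared_count: int    # parse count announced in the header
    article_id: str
    sentence_number: int
    parses: List[Parse]


def read_nbest_lists(lines: Iterable[str]) -> Iterator[NBestList]:
    """Yield one NBestList per header line (assumed layout, see lead-in)."""
    rows = [line.strip() for line in lines if line.strip()]
    i = 0
    while i < len(rows):
        # Header: "<count> <article id>_<sentence number>"
        count_text, sentence_id = rows[i].split()
        article_id, _, sentence_number = sentence_id.rpartition("_")
        i += 1
        parses = []
        # A parse is a score line immediately followed by a tree line.
        # The declared count is not trusted, so truncated lists still load.
        while i + 1 < len(rows) and rows[i + 1].startswith("("):
            score_text, log_prob_text = rows[i].split()
            parses.append(Parse(float(score_text), float(log_prob_text), rows[i + 1]))
            i += 2
        yield NBestList(int(count_text), article_id, int(sentence_number), parses)


# Hypothetical input: real header and scores from the example, toy trees.
SAMPLE = """\
50 reute9406_007.0356_13
4.9244 -147.337
(S1 (S (NP (PRP He)) (VP (VBD argued)) (. .)))
3.56482 -151.575
(S1 (S (NP (PRP He)) (VP (VBD argued) (. .))))
"""

for nbest in read_nbest_lists(SAMPLE.splitlines()):
    # Lists are sorted by decreasing reranker score, so max() matches parses[0].
    best = max(nbest.parses, key=lambda p: p.reranker_score)
    print(nbest.article_id, nbest.sentence_number, best.reranker_score)
```

The reader detects the end of a list by looking for the next header instead of counting parses, which keeps it usable on excerpts such as the one above where fewer than the declared 50 parses are present.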

Extent: Corpus size: 16777216 KB

Identifier: LDC2008T13

https://catalog.ldc.upenn.edu/LDC2008T13

ISBN: 1-58563-481-6

ISLRN: 377-937-743-934-7

DOI: 10.35111/0nk7-yp61

Language: English

Language (ISO639): eng

License: BLLIP North American News Text, Complete (LDC2008T13): https://catalog.ldc.upenn.edu/license/bllip-north-american-news-text-complete.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2008T13

Rights Holder: Portions © 1994-1996 Dow Jones & Company, Inc., © 1994-1997 Los Angeles Times-Washington Post News Service, Inc., © 1994-1996 New York Times, © 1994-1996 Reuters America, Inc., © 1995-1997, 2008 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2008T13

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
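The GetRecord entries above correspond to OAI-PMH requests, which are ordinary HTTP queries carrying a `verb`, an `identifier` and a `metadataPrefix`. The sketch below shows how such a request URL could be assembled for this record. The base URL is a placeholder, since the record does not state the repository endpoint, and `get_record_url` is a hypothetical helper name; `oai_dc` is the simple DC prefix defined by OAI-PMH and `olac` is the prefix used for OLAC-format records.

```python
from urllib.parse import urlencode


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL (the query values are percent-encoded)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Placeholder endpoint: an assumption, not the repository's real base URL.
BASE_URL = "https://example.org/oai"
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2008T13"  # from the OaiIdentifier field

simple_dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")
olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")
print(simple_dc_url)
print(olac_url)
```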

Search Info
Citation: McClosky, David; Charniak, Eugene; Johnson, Mark. 2008. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2008T13
Up-to-date as of: Thu Sep 18 0:59:24 EDT 2025
