OLAC Record: Annotated English Gigaword

OLAC Record
oai:www.ldc.upenn.edu:LDC2012T21

Metadata

Title: Annotated English Gigaword

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Napoles, Courtney, Matthew Gormley, and Benjamin Van Durme. Annotated English Gigaword LDC2012T21. Web Download. Philadelphia: Linguistic Data Consortium, 2012

Contributor: Napoles, Courtney

Gormley, Matthew R.

Van Durme, Benjamin

Date (W3CDTF): 2012

Date Issued (W3CDTF): 2012-11-15

Description: *Introduction* Annotated English Gigaword was developed by Johns Hopkins University's Human Language Technology Center of Excellence. It adds automatically generated syntactic and discourse structure annotation to English Gigaword Fifth Edition (LDC2011T07) and also contains an API and tools for reading the dataset's XML files. The goal of the annotation is to provide a standardized corpus for knowledge extraction and distributional semantics, enabling broader involvement by researchers in large-scale knowledge-acquisition efforts.

*Data* Annotated English Gigaword contains the nearly ten million documents (over four billion words) of the original English Gigaword Fifth Edition from seven news sources:

* Agence France-Presse, English Service (afp_eng)
* Associated Press Worldstream, English Service (apw_eng)
* Central News Agency of Taiwan, English Service (cna_eng)
* Los Angeles Times/Washington Post Newswire Service (ltw_eng)
* Washington Post/Bloomberg Newswire Service (wpb_eng)
* New York Times Newswire Service (nyt_eng)
* Xinhua News Agency, English Service (xin_eng)

The following layers of annotation were added:

* Tokenized and segmented sentences
* Treebank-style constituent parse trees
* Syntactic dependency trees
* Named entities
* In-document coreference chains

The annotation was performed in a three-step process: (1) the data was preprocessed and sentences were selected for annotation (sentences with more than 100 tokens were excluded); (2) syntactic parses were derived; and (3) the parsed output was post-processed to derive syntactic dependencies, named entities and coreference chains. Over 183 million sentences were parsed. The data is stored in a form similar to the Gigaword SGML format, with XML annotations containing the additional markup. The included API provides object representations for the contents of the XML files.

*Samples* Please see the link for a sample.

*Additional Licensing Information* Any 2011 member organization that licensed English Gigaword Fifth Edition (LDC2011T07) may request a no-cost copy of Annotated English Gigaword. Any non-member organization that licensed English Gigaword Fifth Edition may request a copy of Annotated English Gigaword for a $150 fee. Please contact ldc@ldc.upenn.edu for licensing or with any additional questions.

*Updates* None at this time.

Extent: Corpus size: 164392369 KB

Identifier: LDC2012T21

https://catalog.ldc.upenn.edu/LDC2012T21

ISBN: 1-58563-629-0

ISLRN: 335-916-789-872-0

DOI: 10.35111/mv9t-vv26

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2012T21

Rights Holder: Portions © 1994-2010 Agence France Presse, © 1994-2010 The Associated Press, © 1997-2010 Central News Agency (Taiwan), © 1994-1998, 2003-2009 Los Angeles Times-Washington Post News Service, Inc., © 1994-2010 New York Times, © 2010 The Washington Post News Service with Bloomberg News, © 1995-2010 Xinhua News Agency, © 2012 Matthew R. Gormley, © 2003, 2005, 2007, 2009, 2011, 2012 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2012T21

DateStamp: 2021-06-17

GetRecord: OAI-PMH request for simple DC format
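
The GetRecord entries in this record correspond to OAI-PMH requests for a single identifier in a given metadata format. As a minimal sketch, the query strings for such requests could be built as shown here. The base URL is a hypothetical placeholder (this record does not state the repository endpoint), and the helper name `get_record_url` is illustrative only. The `oai_dc` prefix is the simple Dublin Core format defined by OAI-PMH; `olac` is assumed to be the prefix the repository uses for the OLAC format.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the record does not give the repository's base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# OaiIdentifier taken from this record.
OAI_ID = "oai:www.ldc.upenn.edu:LDC2012T21"

olac_url = get_record_url(OAI_ID, "olac")    # OLAC format (assumed prefix)
dc_url = get_record_url(OAI_ID, "oai_dc")    # simple DC format
```

The colons in the identifier are percent-encoded by `urlencode`, which is what an OAI-PMH repository expects in a GET request.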

Search Info
Citation: Napoles, Courtney; Gormley, Matthew R.; Van Durme, Benjamin. 2012. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2012T21
Up-to-date as of: Wed Oct 29 7:01:22 EDT 2025
