OLAC Record: Concretely Annotated English Gigaword

OLAC Record
oai:www.ldc.upenn.edu:LDC2018T20

Metadata

Title: Concretely Annotated English Gigaword

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Ferraro, Francis, et al. Concretely Annotated English Gigaword LDC2018T20. Web Download. Philadelphia: Linguistic Data Consortium, 2018

Contributor: Ferraro, Francis

Thomas, Max

Gormley, Matthew R.

Wolfe, Travis

Harman, Craig

Van Durme, Benjamin

Date (W3CDTF): 2018

Date Issued (W3CDTF): 2018-10-15

Description: *Introduction* Concretely Annotated English Gigaword was developed by Johns Hopkins University's Human Language Technology Center of Excellence (JHU). It adds multiple kinds and instances of automatically generated syntactic, semantic and coreference annotations to English Gigaword Fifth Edition (LDC2011T07). Concrete is a schema for representing structured, hierarchical and overlapping linguistic annotations. Where several tools produce the same annotation type, this release provides each tool's output as a separate annotation theory under a shared tokenization.

The Linguistic Data Consortium (LDC) has also released Annotated English Gigaword (LDC2012T21), earlier work by JHU researchers to create a standardized corpus for knowledge extraction and distributional semantics by using then-state-of-the-art tools to add automatically generated syntactic and discourse structure annotation to English Gigaword Fifth Edition.

*Data* Concretely Annotated English Gigaword contains the nearly ten million documents (over four billion words) of the original English Gigaword Fifth Edition, which consists of newswire stories from seven sources collected by LDC between 1994 and 2010. The following layers of annotation were added under the Concrete schema:

* Segmented sentences and Penn Treebank-style tokenized words
* Treebank-style constituent parse trees
* Four different syntactic dependency trees
* Named entities
* Part of speech tags
* Lemmas
* In-document entity coreference chains
* Three different frame semantic parses

The data is stored in a binary form called Concrete, which is based upon Apache Thrift. Concrete can be read and written in many common programming languages, such as Java, Python, JavaScript and C++. Concrete also comes with a number of utilities for accessing and viewing the data in human-readable forms.

*Samples* Please view the following samples:

* Concrete File
* Treebanked
* CoNLL Style
* Quicklime View

*Reference* Users of this corpus must cite the following paper: Francis Ferraro, Max Thomas, Matthew Gormley, Travis Wolfe, Craig Harman, and Benjamin Van Durme. "Concretely Annotated Corpora." In The Proceedings of the NIPS Workshop on Automated Knowledge Base Construction (AKBC). NIPS Workshop 2014.

*Additional Licensing Instructions* Any organization that licensed English Gigaword Fifth Edition (LDC2011T07) or Annotated English Gigaword (LDC2012T21) may request a copy of Concretely Annotated English Gigaword (LDC2018T20) for a $150 fee. Contact ldc@ldc.upenn.edu for licensing.

Extent: Corpus size: 476490400 KB

Identifier: LDC2018T20

https://catalog.ldc.upenn.edu/LDC2018T20

ISBN: 1-58563-861-7

ISLRN: 309-427-947-277-9

DOI: 10.35111/a802-nz06

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2018T20

Rights Holder: Portions © 1994-2010 Agence France Presse, © 1994-2010 The Associated Press, © 1997-2010 Central News Agency (Taiwan), © 1994-1998, 2003-2009 Los Angeles Times-Washington Post News Service, Inc., © 1994-2010 New York Times, © 2010 The Washington Post News Service with Bloomberg News, © 1995-2010 Xinhua News Agency, © 2003, 2005, 2007, 2009, 2011, 2018 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2018T20

DateStamp: 2021-08-09

GetRecord: OAI-PMH request for simple DC format
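
The GetRecord entries above refer to OAI-PMH requests for this record. As a minimal sketch of how such a request URL could be assembled: the `verb`, `identifier` and `metadataPrefix` parameters are defined by OAI-PMH, but the base URL below is a hypothetical placeholder (this record does not state the endpoint), the helper name `get_record_url` is illustrative only, and the prefix values `olac` and `oai_dc` are assumed to be what the repository uses for the OLAC and simple DC formats.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the record does not give the OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2018T20"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL (illustrative helper)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Assumed prefixes: "olac" for the OLAC format, "oai_dc" for simple DC.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")

print(olac_url)
print(dc_url)
```

Note that `urlencode` percent-encodes the colons in the identifier, which is the form an OAI-PMH request expects for reserved characters in parameter values.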

Search Info
Citation: Ferraro, Francis; Thomas, Max; Gormley, Matthew R.; Wolfe, Travis; Harman, Craig; Van Durme, Benjamin. 2018. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2018T20
Up-to-date as of: Wed Oct 29 7:01:50 EDT 2025
