OLAC Record: Concretely Annotated New York Times

OLAC Record
oai:www.ldc.upenn.edu:LDC2018T12

Metadata

Title: Concretely Annotated New York Times

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Ferraro, Francis, et al. Concretely Annotated New York Times LDC2018T12. Web Download. Philadelphia: Linguistic Data Consortium, 2018

Contributor: Ferraro, Francis

Thomas, Max

Wolfe, Travis

Gormley, Matthew R.

Harman, Craig

Van Durme, Benjamin

Date (W3CDTF): 2018

Date Issued (W3CDTF): 2018-04-16

Description: *Introduction*

Concretely Annotated New York Times was developed by Johns Hopkins University's Human Language Technology Center of Excellence. It adds multiple kinds and instances of automatically generated syntactic, semantic and coreference annotations to The New York Times Annotated Corpus (LDC2008T19). Concrete is a schema for representing structured, hierarchical and overlapping linguistic annotations. For a given annotation type, this release provides the outputs of multiple tools, represented as different annotation theories under a shared tokenization.

*Data*

Concretely Annotated New York Times contains all of the 1.8 million articles in The New York Times Annotated Corpus. Those articles were written and published by the New York Times between January 1, 1987 and June 19, 2007; the 2008 corpus also includes metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com.

The following layers of annotation were added by processing the articles under the Concrete schema:

* Segmented sentences and Penn Treebank-style tokenized words
* Treebank-style constituent parse trees
* Four different syntactic dependency trees
* Named entities
* Part of speech tags
* Lemmas
* In-document entity coreference chains
* Three different frame semantic parses

See analytics.pdf for the list of tools used to create those annotations.

The data is stored in a binary form called Concrete, which is based on Apache Thrift. Concrete can be read and written in many common programming languages, such as Java, Python, JavaScript and C++. Concrete also includes a number of utilities to access and view the data in human-readable forms.

The original NITF (News Industry Text Format) document structure in The New York Times Annotated Corpus was preserved in this Concrete version.

*Samples*

Please view this concrete sample.

*Reference*

Users of this corpus must cite the following paper: Francis Ferraro, Max Thomas, Matthew Gormley, Travis Wolfe, Craig Harman, and Benjamin Van Durme. "Concretely Annotated Corpora." In The Proceedings of the NIPS Workshop on Automated Knowledge Base Construction (AKBC). NIPS Workshop 2014.

*Additional Licensing Instructions*

Any organization that licensed The New York Times Annotated Corpus (LDC2008T19) may request a copy of Concretely Annotated New York Times (LDC2018T12) for a $150 fee. Contact ldc@ldc.upenn.edu for licensing.

Extent: Corpus size: 139089160 KB

Identifier: LDC2018T12

https://catalog.ldc.upenn.edu/LDC2018T12

ISBN: 1-58563-840-4

ISLRN: 504-151-596-424-6

DOI: 10.35111/xgs8-5140

Language: English

Language (ISO639): eng

License: Concretely Annotated New York Times Agreement: https://catalog.ldc.upenn.edu/license/concretely-annotated-new-york-times-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2018T12

Rights Holder: Portions © 1987-2008 New York Times, © 2008, 2018 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2018T12

DateStamp: 2021-05-14

GetRecord: OAI-PMH request for simple DC format
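
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch, such a request is a URL carrying the verb, identifier and metadataPrefix arguments. The base URL below is a placeholder, since this record does not state the endpoint, and the helper name get_record_url is hypothetical. The prefix oai_dc is the standard OAI-PMH name for simple Dublin Core; olac is assumed here to be the prefix for the OLAC format.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the record does not give the OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode({
        "verb": "GetRecord",                # OAI-PMH verb for fetching one record
        "identifier": identifier,           # the OaiIdentifier of the record
        "metadataPrefix": metadata_prefix,  # requested metadata format
    })
    return f"{base_url}?{query}"


OAI_ID = "oai:www.ldc.upenn.edu:LDC2018T12"

# OLAC format (prefix "olac" is an assumption) and simple DC format.
olac_url = get_record_url(OAI_ID, "olac")
dc_url = get_record_url(OAI_ID, "oai_dc")

print(olac_url)
print(dc_url)
```

The colons in the identifier are percent-encoded by urlencode, which keeps the query string valid when the URL is passed to an HTTP client.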

Search Info
Citation: Ferraro, Francis; Thomas, Max; Wolfe, Travis; Gormley, Matthew R.; Harman, Craig; Van Durme, Benjamin. 2018. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2018T12
Up-to-date as of: Tue Jan 2 7:32:29 EST 2024
