OLAC Record: OntoNotes Release 1.0

OLAC Record
oai:www.ldc.upenn.edu:LDC2007T21

Metadata

Title: OntoNotes Release 1.0

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Weischedel, Ralph, et al. OntoNotes Release 1.0 LDC2007T21. Web Download. Philadelphia: Linguistic Data Consortium, 2007

Contributor: Weischedel, Ralph

Pradhan, Sameer

Ramshaw, Lance

Micciulla, Linnea

Palmer, Martha

Xue, Nianwen

Marcus, Mitchell

Taylor, Ann

Babko-Malaya, Olga

Hovy, Eduard

Belvin, Robert

Houston, Ann

Date (W3CDTF): 2007

Date Issued (W3CDTF): 2007-05-17

Description: *Introduction* Natural language applications like machine translation, question answering, and summarization are currently forced to depend on impoverished text models like bags of words or n-grams, while the decisions that they are making ought to be based on the meanings of those words in context. That lack of semantics causes problems throughout the applications. Misinterpreting the meaning of an ambiguous word results in failing to extract data, incorrect alignments for translation, and ambiguous language models. Incorrect coreference resolution results in missed information (because a connection is not made) or incorrectly conflated information (due to false connections). Some richer semantic representation is badly needed. The OntoNotes project is a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania, and the University of Southern California's Information Sciences Institute to produce such a resource. It aims to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, Usenet, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate-argument structure) and shallow semantics (word sense linked to an ontology and coreference). OntoNotes builds on two time-tested resources, following the Penn Treebank for syntax and the Penn PropBank for predicate-argument structure. Its semantic representation will include word sense disambiguation for nouns and verbs, with each word sense connected to an ontology, and coreference. The current goals call for annotation of over a million words each of English and Chinese, and half a million words of Arabic over five years. The authors wish to make this resource available to the natural language research community so that decoders for these phenomena can be trained to generate the same structure in new documents.
Lessons learned over the years have shown that the quality of annotation is crucial if it is going to be used for training machine learning algorithms. Taking this cue, we ensure that each layer of annotation in OntoNotes will have at least 90% inter-annotator agreement. Our pilot studies have shown that predicate structure, word sense, ontology linking, and coreference can all be annotated rapidly and with better than 90% consistency.
*Samples* The following screen captures provide examples of the data contained in this corpus.
* English tree.
* English sense predicate structure.
* Chinese tree and sense predicate structure.
*Sponsorship* This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-C-0022. The content of this publication does not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred.

Extent: Corpus size: 752640 KB

Identifier: LDC2007T21

https://catalog.ldc.upenn.edu/LDC2007T21

ISBN: 1-58563-440-9

ISLRN: 722-221-552-342-8

DOI: 10.35111/2qq3-xx06

Language: English

Mandarin Chinese

Language (ISO639): eng

cmn

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2007T21

Rights Holder: Portions © 1989 Dow Jones & Company, Inc., © 1996-2001 Sinorama Magazine, © 1994-1998 Xinhua News Agency, © 1995, 2005, 2006, 2007 Trustees of the University of Pennsylvania

Subject (OLAC): computational_linguistics

Type (DCMI): Text

Type (Discourse): narrative

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2007T21

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Weischedel, Ralph; Pradhan, Sameer; Ramshaw, Lance; Micciulla, Linnea; Palmer, Martha; Xue, Nianwen; Marcus, Mitchell; Taylor, Ann; Babko-Malaya, Olga; Hovy, Eduard; Belvin, Robert; Houston, Ann. 2007. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB dcmi_Text iso639_cmn iso639_eng olac_computational_linguistics olac_narrative olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2007T21
Up-to-date as of: Wed Oct 29 7:01:00 EDT 2025
