OLAC Record: ACE Time Normalization (TERN) 2004 English Training Data v 1.0

OLAC Record
oai:www.ldc.upenn.edu:LDC2005T07

Metadata

Title: ACE Time Normalization (TERN) 2004 English Training Data v 1.0

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Ferro, Lisa, et al. ACE Time Normalization (TERN) 2004 English Training Data v 1.0 LDC2005T07. Web Download. Philadelphia: Linguistic Data Consortium, 2005

Contributor: Ferro, Lisa

Gerber, Laurie

Hitzeman, Janet

Lima, Elizabeth

Sundheim, Beth

Date (W3CDTF): 2005

Date Issued (W3CDTF): 2005-02-15

Description:

*Introduction*

ACE Time Normalization (TERN) 2004 English Training Data v 1.0 was developed by the Linguistic Data Consortium (LDC) and the National Institute of Standards and Technology (NIST) with support from the Automatic Content Extraction (ACE) program. It contains 862 files totalling about 306,000 words of English news and treebank text.

This release contains the English training data prepared for the 2004 Time Expression Recognition and Normalization (TERN) Evaluation, sponsored by the ACE program. The evaluation was held in August 2004, followed by a workshop in September 2004. Evaluation participants received this data for training purposes, and it is now being released for general use. The annotation specifications for this corpus were developed under DARPA's Translingual Information Detection Extraction and Summarization (TIDES) program, with continuing support from ACE.

The purpose of this corpus and the TERN evaluation is to advance the state of the art in the automatic recognition and normalization of natural language temporal expressions. In most language contexts such expressions are indexical. For example, with "Monday," "last week," or "three months starting October 1," one must know the narrative reference time in order to pinpoint the time interval conveyed by the expression. In addition, for data exchange purposes, the identified interval must be rendered according to an established standard, i.e., normalized. Accurate identification and normalization of temporal expressions is in turn essential for the temporal reasoning demanded by advanced NLP applications such as question answering, information extraction, and summarization.

*Data*

The data in this corpus is divided into three data sets: ace_2002, ace_2003, and ace_2004.

The genres and sources included in this corpus are:

* bnews - Broadcast news data from TDT4 Multilingual Text and Annotations (LDC2005T16)
* nwire - Newswire data from TDT4 Multilingual Text and Annotations (LDC2005T16)
* npaper - Washington Post articles (ace_2002 only)
* arabic_treebank - Data from the Arabic Treebank 1 Corpus; English translations from the MT-2003 translation data set
* chinese_treebank - Data from the Chinese Treebank Version 4; English translations from the Chinese Treebank English Parallel Text Corpus

The details for the data sets are:

Data Set      Genre               Words      Documents
ace_2002      bnews               17,922     85
              npaper              14,682     17
              nwire               34,134     78
              Total               66,738     180
ace_2003      bnews               34,681     147
              nwire               58,592     102
              Total               93,273     249
ace_2004      bnews               61,621     222
              nwire               58,543     116
              arabic_treebank     13,466     58
              chinese_treebank    12,522     37
              Total               146,452    433
Grand Totals                      306,463    862

The data in this corpus includes the original source files in SGML format (.sgm) and the annotated files, also in SGML format (.tmx.sgml).

*Samples*

For an example of the data in this corpus, see the source sample (SGML) and annotation sample (SGML).

*Updates*

None at this time.
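To make the normalization task described in the Introduction concrete, the following is a minimal Python sketch of resolving a few indexical expressions against a narrative reference date. It is an illustration only and does not implement the TIDES annotation specification used for this corpus. The function name `normalize`, the handful of supported expressions, the ISO 8601 output strings, and the rule that a bare weekday name refers to the most recent such day on or before the reference date are all assumptions made for the example.

```python
from datetime import date, timedelta

# Weekday names indexed the same way as date.weekday() (Monday == 0).
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def normalize(expression: str, reference: date) -> str:
    """Resolve a small set of indexical expressions against a reference date.

    Returns an ISO 8601 calendar date (YYYY-MM-DD) or ISO week (YYYY-Www).
    Hypothetical helper for illustration; not the corpus annotation scheme.
    """
    text = expression.strip().lower()
    if text == "today":
        return reference.isoformat()
    if text == "yesterday":
        return (reference - timedelta(days=1)).isoformat()
    if text == "last week":
        # ISO week that contains the day seven days before the reference.
        year, week, _ = (reference - timedelta(weeks=1)).isocalendar()
        return f"{year}-W{week:02d}"
    if text in WEEKDAYS:
        # Assumption: a bare weekday name means the most recent such day
        # on or before the reference date.
        offset = (reference.weekday() - WEEKDAYS.index(text)) % 7
        return (reference - timedelta(days=offset)).isoformat()
    raise ValueError(f"unsupported expression: {expression!r}")


if __name__ == "__main__":
    ref = date(2004, 8, 18)  # example reference time, a Wednesday
    for expr in ("Monday", "last week", "yesterday"):
        print(expr, "->", normalize(expr, ref))
```

The same expression yields a different value for every reference date, which is why the narrative reference time has to be known before an expression can be normalized.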

Identifier: LDC2005T07

https://catalog.ldc.upenn.edu/LDC2005T07

ISBN: 1-58563-331-3

ISLRN: 357-991-519-054-6

DOI: 10.35111/9nye-wg76

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2005T07

Rights Holder: Portions © 1998 Los Angeles Times-Washington Post News Service, Inc., © 1998, 2000 American Broadcasting Corporation, © 1998, 2000 Cable News Network, LP, LLLP, © 1998, 2000 The Associated Press, © 1998, 2000 New York Times, © 1998, 2000 National Broadcasting Company, Inc., © 1998, 2000 Public Radio International, © 2005 Trustees of the University of Pennsylvania

"The World" is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2005T07

DateStamp: 2021-11-15

GetRecord: OAI-PMH request for simple DC format
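
The GetRecord entries above refer to OAI-PMH requests whose link targets are not preserved in this record. As a rough sketch, such a request is an HTTP query with the `verb`, `identifier`, and `metadataPrefix` arguments defined by the OAI-PMH protocol. In the Python example below, the base URL `https://example.org/oai` is a placeholder rather than the archive's actual endpoint, the helper name `get_record_url` is hypothetical, and `olac` as the metadata prefix for the OLAC format is an assumption (`oai_dc` is the standard prefix for simple Dublin Core).

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real base URL is not given in this record.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL (no network access)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


if __name__ == "__main__":
    oai_id = "oai:www.ldc.upenn.edu:LDC2005T07"
    print(get_record_url(oai_id, "olac"))    # OLAC format (assumed prefix)
    print(get_record_url(oai_id, "oai_dc"))  # simple DC format
```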

Search Info
Citation: Ferro, Lisa; Gerber, Laurie; Hitzeman, Janet; Lima, Elizabeth; Sundheim, Beth. 2005. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2005T07
Up-to-date as of: Wed Oct 29 7:00:25 EDT 2025
