OLAC Record: ACE-2 Version 1.0

OLAC Record
oai:www.ldc.upenn.edu:LDC2003T11

Metadata

Title: ACE-2 Version 1.0

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Mitchell, Alexis, et al. ACE-2 Version 1.0 LDC2003T11. Web Download. Philadelphia: Linguistic Data Consortium, 2003

Contributor: Mitchell, Alexis

Strassel, Stephanie

Przybocki, Mark

Davis, JK

Doddington, George R.

Grishman, Ralph

Meyers, Adam

Brunstein, Ada

Ferro, Lisa

Sundheim, Beth

Date (W3CDTF): 2003

Date Issued (W3CDTF): 2003-09-02

Description: *Introduction*

ACE-2 Version 1.0 was produced by the Linguistic Data Consortium (LDC) and contains 519 documents totaling approximately 179,000 words of English news text. This release contains Version 1.0 of the ACE-2 corpus, created and distributed by LDC to support the Automatic Content Extraction (ACE) program.

The objective of the ACE program is to develop extraction technology to support automatic processing of source language data (in the form of natural text, and as text derived from ASR and OCR). This includes classification, filtering, and selection based on the language content of the source data, i.e., based on the meaning conveyed by the data. Thus, the ACE program requires the development of technologies that automatically detect and characterize this meaning. The ACE research objectives are viewed as the detection and characterization of Entities, Relations, and Events. There are three main ACE tasks: Entity Detection and Tracking (EDT), Relation Detection and Characterization (RDC), and Event Detection and Characterization (EDC). Annotations for the ACE-2 corpus were produced by LDC to support EDT and RDC.

For information about ACE annotation, including annotation guidelines, task definitions, and other project documentation, please visit LDC's ACE Project page.

*Data*

This publication contains two sets of data: training and devtest. Each of these sets is further divided by source: broadcast news, newspaper, and newswire. The training set contains data originally developed as training material for the February 2002 evaluation and used again for the September 2002 evaluation. The devtest set contains data originally developed as test data for the February 2002 evaluation and later used as devtest data for the September 2002 evaluation.

The broadcast and newswire source data is drawn from a subset of TDT2 Multilanguage Text Version 4.0 (LDC2001T57); this has been supplemented with additional newspaper data from the Washington Post. A portion of the training broadcast data was drawn from 1997 English Broadcast News Transcripts (HUB4) (LDC98T28). All material comes from the first half of 1998. The sources for the broadcast, newswire, and newspaper data are listed below.

Newswire

* New York Times Newswire Service (NYT)
* Associated Press Worldstream Service (APW)

Broadcast News

* Cable News Network, "Headline News" (CNN for TDT2, ed for Hub-4)
* American Broadcasting Co., "World News Tonight" (ABC for TDT2, ea for Hub-4)
* Public Radio International, "The World" (PRI)
* Voice of America, English news programs (VOA)
* MSNBC, "The News With Brian Williams" (MNB)
* National Broadcasting Company, "Nightly News" (NBC)

Newspaper

* Washington Post (WAP)

This publication includes the source data files in .sgm format, the annotation files in ACE Pilot Format (APF), supporting documentation, and version 2.0.1 of the ACE DTD, which was used for the September 2002 ACE Evaluation.

There are 179,007 words of source data, or 519 files, broken down as follows:

Source  # Words train  # Words devtest  # Files train  # Files devtest
NYT            32,892            7,487             48                9
APW            29,144            7,037             82               20
CNN             2,290            2,653             69               11
ABC             1,588            2,687             24               10
PRI             1,272            5,284             43                9
VOA               594            2,611             24                7
MNB                 0            2,539              0                6
NBC                 0            2,633              0                8
WAP            60,247           15,070             76               17
ea              2,019                0             31                0
ed              1,094                0             25                0
Total         131,023           47,984            422               97

*Samples*

For an example of the data in this corpus, please view these samples: Source (SGM), Annotation (XML)

*Updates*

There are no updates available at this time.

"The World" is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.
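The file counts in the breakdown above can be tallied with a short script. This is only an illustrative sketch: the tuples are copied by hand from the two file-count columns, and the variable names are not part of the corpus or of any LDC tooling.

```python
# (source, train files, devtest files), copied from the file-count columns
# of the breakdown in the Description field. Names here are illustrative.
FILE_COUNTS = [
    ("NYT", 48, 9), ("APW", 82, 20), ("CNN", 69, 11), ("ABC", 24, 10),
    ("PRI", 43, 9), ("VOA", 24, 7), ("MNB", 0, 6), ("NBC", 0, 8),
    ("WAP", 76, 17), ("ea", 31, 0), ("ed", 25, 0),
]

# Sum each column, then combine them for the corpus-wide file count.
train_files = sum(train for _, train, _ in FILE_COUNTS)
devtest_files = sum(devtest for _, _, devtest in FILE_COUNTS)
total_files = train_files + devtest_files

print(train_files, devtest_files, total_files)  # 422 97 519
```

The sums reproduce the 422 training files, 97 devtest files, and 519 files overall stated in the record.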

Extent: Corpus size: 35840 KB

Identifier: LDC2003T11

https://catalog.ldc.upenn.edu/LDC2003T11

ISBN: 1-58563-270-8

ISLRN: 498-363-793-174-9

DOI: 10.35111/kcqk-v224

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2003T11

Rights Holder: Portions © 1998 Los Angeles Times-Washington Post News Service, Inc., © 1998 American Broadcasting Corporation, © 1998 Cable News Network, Inc., © 1998 Press Association, Inc., © 1998 New York Times, © 1998 National Broadcasting Company, Inc., © 1998 Public Radio International, © 2003 Trustees of the University of Pennsylvania

"The World" is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2003T11

DateStamp: 2024-09-10

GetRecord: OAI-PMH request for simple DC format
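The GetRecord entries in this record refer to OAI-PMH requests. A minimal sketch of how such request URLs are typically formed is given here, under stated assumptions: the base URL `https://example.org/oai` is a placeholder (the record does not state the actual endpoint), the helper name `get_record_url` is hypothetical, and `olac` and `oai_dc` are assumed to be the metadata prefixes for the OLAC and simple DC formats.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the record does not give the real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2003T11"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Assumed prefixes: "olac" for the OLAC format, "oai_dc" for simple DC.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")
print(olac_url)
print(dc_url)
```

The colons in the identifier are percent-encoded by `urlencode`; substituting the archive's real base URL for the placeholder is left to the reader.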

Search Info
Citation: Mitchell, Alexis; Strassel, Stephanie; Przybocki, Mark; Davis, JK; Doddington, George R.; Grishman, Ralph; Meyers, Adam; Brunstein, Ada; Ferro, Lisa; Sundheim, Beth. 2003. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2003T11
Up-to-date as of: Sat Jun 28 1:00:58 EDT 2025
