OLAC Record: ACE 2004 Multilingual Training Corpus

OLAC Record
oai:www.ldc.upenn.edu:LDC2005T09

Metadata

Title: ACE 2004 Multilingual Training Corpus

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Mitchell, Alexis, et al. ACE 2004 Multilingual Training Corpus LDC2005T09. Web Download. Philadelphia: Linguistic Data Consortium, 2005

Contributor: Mitchell, Alexis

Strassel, Stephanie

Huang, Shudong

Zakhary, Ramez

Date (W3CDTF): 2005

Date Issued (W3CDTF): 2005-03-15

Description: *Introduction* ACE 2004 Multilingual Training Corpus was developed by the Linguistic Data Consortium (LDC) and contains text from various genres in English (158,000 words), Chinese (307,000 characters, 154,000 words), and Arabic (151,000 words), annotated for entities and relations. This corpus represents the complete set of English, Arabic, and Chinese training data for the 2004 Automatic Content Extraction (ACE) technology evaluation. It was created by LDC with support from the ACE Program and additional assistance from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program. This data was previously distributed as an e-corpus (LDC2004E17) to participants in the 2004 ACE evaluation.

The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form. In September 2004, sites were evaluated on system performance in six areas: Entity Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR Co-reference, Relation Detection and Recognition (RDR), Relation Mention Detection (RMD), and RDR given reference entities. All tasks were evaluated in three languages: English, Chinese and Arabic. The current publication consists of the official training data for these evaluation tasks.

A seventh evaluation area, Timex Detection and Recognition, is supported by ACE Time Normalization (TERN) 2004 English Training Data v 1.0 (LDC2005T07). The TERN corpus source data largely overlaps with the English source data contained in the current release. For more information about linguistic resources for the ACE program, including annotation guidelines, task definitions, free annotation tools and other documentation, please visit LDC's ACE website.
*Data* Here is a breakdown of the data amounts by language:

                    English            Chinese                         Arabic
Genre               Files   Words      Files   Words     Characters   Files   Words
Broadcast News      220     60,291     314     67,702    135,405      304     63,238
Newswire            128     59,840     226     60,251    120,502      253     63,122
Chinese Treebank    37      12,337     106     25,749    51,499       -       -
Arabic Treebank     58      12,855     -       -         -            132     25,010
Fisher CTS          8       12,630     -       -         -            -       -
Totals              451     157,953    646     153,703   307,406      689     151,360

All files are annotated for entities and relations. Annotators tag all mentions of each entity within a document, whether named, nominal or pronominal. For every mention, the annotator identifies the maximal extent of the string that represents the entity, and labels the head of each mention. Annotators also identify relations between entities and their temporal attributes. Relations that are supported by explicit textual evidence are distinguished from those that depend on contextual inference on the part of the reader.

The files are stored in four separate formats:
* APF (.apf.xml) - The official ACE Program Format.
* ALF (.alf.xml) - The ACE LDC Format, an intermediate format similar to APF, designed to store all annotation content represented in the AG files.
* AG (-pp.ag.xml) - The LDC Annotation Graph Format (postprocessed), LDC's internal annotation file format for ACE. These files can be viewed with LDC's free ACE annotation tools.
* Source (.sgm) - Source text files with SGML tagging.

*Samples* The files listed below are samples from the English data. They should provide a good example of the material in this corpus.
* Chinese Treebank (XML)
* Fisher Transcripts (XML)
* Broadcast News (XML)

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.

*Updates* None at this time.
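The four file formats named in the description can be told apart by filename suffix alone. The following is a minimal Python sketch of that idea, not a tool shipped with the corpus: the function names `classify` and `count_formats` and the sample filenames are hypothetical, and only the four suffixes come from the record.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Iterable, Optional

# Suffixes exactly as listed in the corpus description.
SUFFIX_TO_FORMAT = {
    "-pp.ag.xml": "AG",
    ".apf.xml": "APF",
    ".alf.xml": "ALF",
    ".sgm": "Source",
}


def classify(filename: str) -> Optional[str]:
    """Return the format label for a filename, or None if no suffix matches."""
    name = PurePosixPath(filename).name  # ignore any directory part
    for suffix, label in SUFFIX_TO_FORMAT.items():
        if name.endswith(suffix):
            return label
    return None


def count_formats(filenames: Iterable[str]) -> Counter:
    """Count how many files fall under each recognized format."""
    labels = (classify(name) for name in filenames)
    return Counter(label for label in labels if label is not None)


if __name__ == "__main__":
    # Hypothetical filenames, for illustration only.
    sample = [
        "doc001.sgm",
        "doc001.apf.xml",
        "doc001.alf.xml",
        "doc001-pp.ag.xml",
        "README.txt",
    ]
    print(count_formats(sample))
```

Files that match none of the four suffixes are skipped rather than raising an error, since a distribution may also contain documentation files.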

Extent: Corpus size: 366008 KB

Identifier: LDC2005T09

https://catalog.ldc.upenn.edu/LDC2005T09

ISBN: 1-58563-334-8.

ISLRN: 789-870-824-708-5

DOI: 10.35111/8m4r-v312

Language: English

Standard Arabic

Mandarin Chinese

Language (ISO639): eng

arb

cmn

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2005T09

Rights Holder: Portions (c) 1994-1998, 2000 Xinhua News Agency (c) 1997 Department of Information Services, Hong Kong Special Administrative Region (c) 1996-1998, 2000-2001 Sinorama Magazine (c) 2000 Agence France-Presse, (c) 2000 New York Times, (c) 2000 Associated Press, (c) 2000 SPH AsiaOne, Ltd. (Zaobao), (c) 2000 An-Nahar, (c) 2000 Al-Hayat, (c) 2000 Nile TV, (c) 2000 Cable News Network, All Rights Reserved, (c) 2000 American Broadcasting Corporation, (c) 2000 National Broadcasting Company, Inc., (c) 2000 China National Radio, (c) 2000 China Television System, (c) 2000 China Central TV, (c) 2000 China Broadcasting System, (c) 2000 Public Radio International., (c) 2005 Trustees of the University of Pennsylvania

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2005T09

DateStamp: 2022-10-11

GetRecord: OAI-PMH request for simple DC format
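The two GetRecord requests mentioned in this record differ only in their `metadataPrefix` argument. The sketch below shows how such an OAI-PMH request URL is typically assembled. The base URL is a placeholder assumption (this record does not state the repository endpoint) and `get_record_url` is a hypothetical helper; the `verb`, `identifier` and `metadataPrefix` parameter names, and the `olac` and `oai_dc` prefixes, follow the OAI-PMH and OLAC conventions.

```python
from urllib.parse import urlencode

# Placeholder endpoint: an assumption, not the actual repository base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2005T09"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # "olac" requests the OLAC format, "oai_dc" the simple Dublin Core format.
    for prefix in ("olac", "oai_dc"):
        print(get_record_url(BASE_URL, OAI_IDENTIFIER, prefix))
```

Note that `urlencode` percent-encodes the colons in the identifier, which is the expected encoding for a query-string value.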

Search Info
Citation: Mitchell, Alexis; Strassel, Stephanie; Huang, Shudong; Zakhary, Ramez. 2005. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB country_SA dcmi_Text iso639_arb iso639_cmn iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2005T09
Up-to-date as of: Wed Oct 29 7:00:26 EDT 2025
