OLAC Record: GALE Phase 1 Distillation Training

OLAC Record
oai:www.ldc.upenn.edu:LDC2007T20

Metadata

Title: GALE Phase 1 Distillation Training

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Babko-Malaya, Olga, et al. GALE Phase 1 Distillation Training LDC2007T20. Web Download. Philadelphia: Linguistic Data Consortium, 2007

Contributor: Babko-Malaya, Olga

Chen, Song

Zakhary, Ramez

Medero, Julie

Maeda, Kazuaki

Strassel, Stephanie

Date (W3CDTF): 2007

Date Issued (W3CDTF): 2007-11-20

Description: *Introduction*

GALE Phase 1 Distillation Training, Linguistic Data Consortium (LDC) catalog number LDC2007T20 and ISBN 1-58563-452-2, constitutes the final release of training data created by LDC for the DARPA GALE Program Phase 1 Distillation technology evaluation. Distillation is one of three primary technology components for the DARPA GALE Program, along with Transcription and Translation. Distillation engines respond to queries from English-speaking users, delivering pertinent, consolidated information in easy-to-understand forms. The distillation engine processes English and foreign language material, both speech and text, from multiple sources and documents, removing redundancy and presenting an integrated response to the user.

This release consists of 248 English, Chinese and/or Arabic queries and their responses, created by LDC annotators. Queries conform to one of ten template types. Query responses may include document and snippet relevance judgments, nuggets, nugs and supernugs. 158 of the 248 queries have been annotated for all features, while the remainder are labeled for only some features. In addition, not all queries have been exhaustively annotated for a given feature, given resource constraints during corpus development.

The table below indicates the number of queries that have been labeled for each template in each source language.

              English    Chinese    Arabic
Template 1    15/28      9/17       12/16
Template 3    16/29      9/29       13/29
Template 4    15/23      7/18       11/18
Template 5    21/39      10/39      20/36
Template 6    15/20      7/19       7/20
Template 8    12/14      6/13       5/14
Template 9    14/23      7/21       10/21
Template 11   11/22      8/15       2/14
Template 15   12/21      8/11       5/11
Template 16   13/24      10/12      8/12
Total         144/243    81/194     93/191

*Annotation*

The annotation task involves responding to a series of user queries. For each query, annotators first find relevant documents and identify snippets (strings of contiguous text that answer the query) in the Arabic, Chinese or English source document. Annotators then create a nugget for each fact expressed in the snippet. Semantically equivalent nuggets are grouped into cross-language, cross-document "supernugs". Finally, judges at BAE Systems provide relevance weights for each supernug.

Queries in this release have been annotated for the following tasks:

* searching for relevant documents and providing yes/no judgments
* extracting snippets
* resolving pronouns and certain types of temporal and locative expressions contained in the snippets
* creating nuggets, i.e. atomic pieces of information that an annotator considers a valid answer to the query
* building nugs, i.e. clusters of semantically equivalent nuggets for each language
* building supernugs, i.e. clusters of semantically equivalent nugs across languages

*Samples*

For an example of the data contained in this corpus, please review this sample.

*Sponsorship*

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.
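The nugget, nug and supernug hierarchy described under Annotation can be pictured with a small sketch. This is an illustration only, not the corpus file format: the `Nugget` fields, the `meaning` key (a stand-in for an annotator's semantic-equivalence judgment) and both function names are hypothetical and do not come from the release documentation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Nugget:
    """One atomic fact that an annotator extracted from a snippet."""
    doc_id: str     # hypothetical source document identifier
    language: str   # ISO 639-3 code: "eng", "arb" or "cmn"
    text: str       # the fact as written by the annotator
    meaning: str    # hypothetical key standing in for a human equivalence judgment


def build_nugs(nuggets):
    """Group semantically equivalent nuggets within each language (nugs)."""
    nugs = defaultdict(list)
    for nugget in nuggets:
        nugs[(nugget.language, nugget.meaning)].append(nugget)
    return dict(nugs)


def build_supernugs(nugs):
    """Group semantically equivalent nugs across languages (supernugs)."""
    supernugs = defaultdict(dict)
    for (language, meaning), members in nugs.items():
        supernugs[meaning][language] = members
    return dict(supernugs)
```

In the corpus itself the equivalence decisions are made by human annotators; the `meaning` key only marks where such a decision would plug in. A supernug here is a mapping from language code to the nug for that language, which mirrors the description of nugs as per-language clusters and supernugs as cross-language clusters.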

Extent: Corpus size: 20480 KB

Identifier: LDC2007T20

https://catalog.ldc.upenn.edu/LDC2007T20

ISBN: 1-58563-452-2

ISLRN: 570-571-401-317-7

DOI: 10.35111/m1p6-wp87

Language: English

Standard Arabic

Mandarin Chinese

Language (ISO639): eng

arb

cmn

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2007T20

Rights Holder: Portions © 2003 Agence France Presse, © 2000, 2001 American Broadcasting Company, © 2000, 2001, 2003 The Associated Press, © 2000, 2001 Cable News Network, LP, LLLP, © 2003 Los Angeles Times-Washington Post News Service, Inc., © 2000 National Broadcasting Company, Inc., © 2000, 2001 New York Times, © 2000, 2001 Public Radio International, © 2000 SPH AsiaOne Ltd, © 2003 Ummah Press Service, © 2003 Xinhua News Agency, © 2006, 2007 Trustees of the University of Pennsylvania

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2007T20

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Babko-Malaya, Olga; Chen, Song; Zakhary, Ramez; Medero, Julie; Maeda, Kazuaki; Strassel, Stephanie. 2007. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB country_SA dcmi_Text iso639_arb iso639_cmn iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2007T20
Up-to-date as of: Wed Oct 29 7:00:59 EDT 2025
