OLAC Record: NIST Open MT 2008 Evaluation (MT08) Selected References and System Translations

OLAC Record
oai:www.ldc.upenn.edu:LDC2010T01

Metadata

Title: NIST Open MT 2008 Evaluation (MT08) Selected References and System Translations

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: NIST Multimodal Information Group. NIST Open MT 2008 Evaluation (MT08) Selected References and System Translations LDC2010T01. Web Download. Philadelphia: Linguistic Data Consortium, 2010

Contributor: NIST Multimodal Information Group

Date (W3CDTF): 2010

Date Issued (W3CDTF): 2010-01-20

Description: *Introduction*

This file contains documentation for NIST Open Machine Translation 2008 Evaluation (MT08) Selected Reference and System Translations, Linguistic Data Consortium (LDC) catalog number LDC2010T01 and ISBN 1-58563-533-2.

NIST Open MT is an evaluation series to support research in, and help advance the state of the art of, technologies that translate text between human languages. Participants submit machine translation output of source language data to NIST (National Institute of Standards and Technology); the output is then evaluated with automatic and manual measures of quality against high-quality human translations of the same source data.

This program supports the growing interest in system combination approaches that generate improved translations from output of several different machine translation (MT) systems. MT system combination approaches require data sets composed of high-quality human reference translations and a variety of machine translations of the same text. The NIST Open Machine Translation 2008 Evaluation (MT08) Selected Reference and System Translations set addresses this need.

The data in this release consists of the human reference translations and corresponding machine translations for the NIST Open MT08 test sets, which consist of newswire and web data in the four MT08 language pairs -- Arabic-to-English, Chinese-to-English, English-to-Chinese (newswire only) and Urdu-to-English. Two documents per language pair and genre were removed at random from the test sets for release. For the machine translations, only output from one submission (in most cases, the participant's primary submission) per training condition (Constrained and Unconstrained training, where available) per participant is included. See section 2 of the MT08 Evaluation Plan for a description of the training conditions.

The resulting data set has the following characteristics:

* Arabic-to-English: 120 documents with 1312 segments, output from 17 machine translation systems.
* Chinese-to-English: 105 documents with 1312 segments, output from 23 machine translation systems.
* English-to-Chinese: 127 documents with 1830 segments, output from 11 machine translation systems.
* Urdu-to-English: 128 documents with 1794 segments, output from 12 machine translation systems.

The data is organized and annotated in such a way that subsets for each language pair and/or data genre and/or training condition can be extracted and used separately, depending on the user's needs.

*Samples*

* Arabic to English output, reference.
* Arabic to English output, system

Extent: Corpus size: 19456 KB

Identifier: LDC2010T01

https://catalog.ldc.upenn.edu/LDC2010T01

ISBN: 1-58563-533-2

ISLRN: 000-078-785-720-8

DOI: 10.35111/tzpc-v131

Language: Urdu

Mandarin Chinese

Standard Arabic

English

Chinese

Arabic

Language (ISO639): urd

cmn

arb

eng

zho

ara

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2010T01

Rights Holder: Portions © 2007 Agence France Presse, Al-Ahram, Al Hayat, An Nahar, Al Quds - Al Arabi, Asharq Al-Awsat, Assabah, BBC, The Associated Press, China Military Online, Chinanews.com, Daily Jang, Guangming Daily, Los Angeles Times - Washington Post News Service, Inc., New York Times, PakTribune.com, People's Daily Online, Xinhua News Agency, © 2007, 2009, 2010 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2010T01

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
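
The GetRecord entries above refer to OAI-PMH requests for this record. A minimal sketch of how such a request URL could be assembled is shown here. The base URL is a hypothetical placeholder (this record does not state the endpoint), and the metadata prefixes `olac` and `oai_dc` are assumed to correspond to the OLAC and simple DC formats named above.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the actual OAI-PMH base URL is not given in this record.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2010T01"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",                # OAI-PMH verb for fetching a single record
        "identifier": identifier,           # the record's OAI identifier
        "metadataPrefix": metadata_prefix,  # requested metadata format
    })
    return f"{base_url}?{query}"


# Assumed prefixes: "olac" for the OLAC format, "oai_dc" for simple Dublin Core.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")
```

The query string is percent-encoded by `urlencode`, so the colons in the identifier are escaped in the resulting URL.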

Search Info
Citation: NIST Multimodal Information Group. 2010. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB country_PK country_SA dcmi_Text iso639_ara iso639_arb iso639_cmn iso639_eng iso639_urd iso639_zho olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2010T01
Up-to-date as of: Wed Oct 29 7:01:10 EDT 2025
