OLAC Record: NIST 2002 Open Machine Translation (OpenMT) Evaluation

OLAC Record
oai:www.ldc.upenn.edu:LDC2010T10

Metadata

Title: NIST 2002 Open Machine Translation (OpenMT) Evaluation

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: NIST Multimodal Information Group. NIST 2002 Open Machine Translation (OpenMT) Evaluation LDC2010T10. Web Download. Philadelphia: Linguistic Data Consortium, 2010

Contributor: NIST Multimodal Information Group

Date (W3CDTF): 2010

Date Issued (W3CDTF): 2010-05-14

Description: *Introduction* NIST 2002 Open Machine Translation (OpenMT) Evaluation is a package containing the source data, reference translations, and scoring software used in the NIST 2002 OpenMT evaluation. It is designed to help evaluate the effectiveness of machine translation systems. Researchers at NIST compiled the package and developed the scoring software, making use of newswire source data and reference translations collected and developed by LDC.

The objective of the NIST OpenMT evaluation series is to support research in, and help advance the state of the art of, machine translation (MT) technologies, that is, technologies that translate text between human languages. Input may include all forms of text. The goal is for the output to be an adequate and fluent translation of the original. The MT evaluation series started in 2001 as part of the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) program. Beginning with the 2006 evaluation, the evaluations have been driven and coordinated by NIST as NIST OpenMT. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities in MT. The OpenMT evaluations are intended to be of interest to all researchers working on the general problem of automatic translation between human languages. To this end, they are designed to be simple, to focus on core technology issues, and to be fully supported. The 2002 task was to evaluate translation from Chinese to English and from Arabic to English. Additional information about these evaluations may be found at the NIST Open Machine Translation (OpenMT) Evaluation web site.

*Scoring Tools* This evaluation kit includes a single Perl script (mteval-v09.pl) that may be used to produce a translation quality score for one or more MT systems. The script compares the system output translation with a set of expert reference translations of the same source text. The comparison is based on finding sequences of words in the reference translations that match word sequences in the system output translation. More information on the evaluation algorithm may be obtained from the paper detailing it: BLEU: a Method for Automatic Evaluation of Machine Translation (Papineni et al., 2002).

*Data* The Chinese-language source text included in this corpus is a reorganization of data that was initially released to the public as Multiple-Translation Chinese (MTC) Part 2 (LDC2003T17). The Chinese-language reference translations are a reorganized subset of data from the same MTC corpus. The Arabic-language data (source text and reference translations) is a reorganized subset of data that was initially released to the public as Multiple-Translation Arabic (MTA) Part 1 (LDC2003T18).

All source data for this corpus is newswire text. Chinese source text was drawn in March and April 2002 from Xinhua News Agency and in March 2002 from Zaobao News Service (sources are indicated in the docids). Arabic source text was drawn from the Xinhua News Agency's Arabic newswire feed (October 2001, docid range artb_500 - artb_565) and from Agence France-Presse (Feb. 1998 - Oct. 1999, docid range artb_001 - artb_069). The Arabic Agence France-Presse source text was also released as part of Arabic Newswire Part 1 (LDC2001T55). For details on the methodology of the source data collection and the production of reference translations, see the documentation for the above-mentioned corpora.

For each language, the test set consists of two files: a source file and a reference file. Each reference file contains four independent translations of the data set. The file name reflects the evaluation year, the source language, the test set (by default "evalset"), the version of the data, and whether the file is a source or a reference file (the latter indicated by "-ref"). DARPA TIDES MT and NIST OpenMT evaluations used SGML-formatted test data until 2008 and XML-formatted test data thereafter. The files in this package are provided in both formats.

*Updates* No updates have been issued at this time.
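The word-sequence comparison described under Scoring Tools can be illustrated with a minimal Python sketch of BLEU as defined by Papineni et al. (2002): clipped n-gram precision against multiple references, combined with a brevity penalty. This is not the code of mteval-v09.pl. The function names are invented for illustration, input is assumed to be already tokenized, scoring is per segment with no smoothing, and the NIST script may differ in tokenization, reference-length choice, and corpus-level aggregation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Return (clipped matches, total candidate n-grams) for one segment.

    Each candidate n-gram is credited at most as many times as it occurs
    in the single reference where it is most frequent.
    """
    cand_counts = ngrams(candidate, n)
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched, sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Segment-level BLEU sketch with uniform weights and no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        matched, total = modified_precision(candidate, references, n)
        if matched == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision zeroes the score
        log_precisions.append(math.log(matched / total))
    c = len(candidate)
    # Assumption: use the reference length closest to the candidate length,
    # breaking ties in favour of the shorter reference.
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    refs = ["the cat is on the mat".split(),
            "there is a cat on the mat".split()]
    print(modified_precision("the the the the the the the".split(), refs, 1))
    print(bleu("the cat is on the mat".split(), refs))
```

With four independent reference translations per test set, as in this package, the clipping step credits a system n-gram whenever any one of the references supports it.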

Extent: Corpus size: 3481 KB

Identifier: LDC2010T10

https://catalog.ldc.upenn.edu/LDC2010T10

ISBN: 1-58563-548-0

ISLRN: 907-893-472-321-4

DOI: 10.35111/63w9-a726

Language: English

Mandarin Chinese

Standard Arabic

Arabic

Language (ISO639): eng

cmn

arb

ara

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2010T10

Rights Holder: Portions © 1998, 1999 Agence France Presse, © 2002 SPH AsiaOne Ltd, © 2001, 2001 Xinhua News Agency, © 2001, 2003, 2010 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2010T10

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
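
The GetRecord entries in this record correspond to OAI-PMH requests that differ only in their metadata format. The Python sketch below shows how such a request URL is typically assembled from the OAI identifier. The base URL is a placeholder, not the actual LDC or OLAC endpoint; `oai_dc` is the simple Dublin Core prefix required by OAI-PMH, and `olac` is assumed here to be the prefix the archive uses for OLAC-format records.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoint (an assumption): substitute the archive's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


if __name__ == "__main__":
    oai_id = "oai:www.ldc.upenn.edu:LDC2010T10"
    for prefix in ("olac", "oai_dc"):
        url = get_record_url(oai_id, prefix)
        # Decode the query string again to show the parameters that were sent.
        print(url, parse_qs(urlsplit(url).query))
```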

Search Info
Citation: NIST Multimodal Information Group. 2010. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB country_SA dcmi_Text iso639_ara iso639_arb iso639_cmn iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2010T10
Up-to-date as of: Wed Oct 29 7:01:12 EDT 2025
