OLAC Record: NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source

OLAC Record
oai:www.ldc.upenn.edu:LDC2014T02

Metadata

Title: NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: NIST Multimodal Information Group. NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source LDC2014T02. Web Download. Philadelphia: Linguistic Data Consortium, 2014

Contributor: NIST Multimodal Information Group

Date (W3CDTF): 2014

Date Issued (W3CDTF): 2014-02-17

Description: *Introduction*
NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source was developed by the NIST Multimodal Information Group. This release contains the evaluation sets (source data and human reference translations), DTD, scoring software, and evaluation plan for the OpenMT 2012 test for Arabic, Chinese, Dari, Farsi, and Korean to English on a parallel data set. The set is based on a subset of the Arabic-to-English and Chinese-to-English progress tests from the OpenMT 2008, 2009 and 2012 evaluations, with new source data created by humans based on the English reference translation. The package was compiled, and the scoring software was developed, at NIST, making use of newswire and web data and reference translations developed by the Linguistic Data Consortium (LDC) and the Defense Language Institute Foreign Language Center.
The objective of the OpenMT evaluation series is to support research in, and help advance the state of the art of, machine translation (MT) technologies, that is, technologies that translate text between human languages. Input may include all forms of text. The goal is for the output to be an adequate and fluent translation of the original.
The MT evaluation series started in 2001 as part of the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) program. Beginning with the 2006 evaluation, the evaluations have been driven and coordinated by NIST as NIST OpenMT. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities in MT. The OpenMT evaluations are intended to be of interest to all researchers working on the general problem of automatic translation between human languages. To this end, they are designed to be simple, to focus on core technology issues and to be fully supported.
The 2012 task included the evaluation of five language pairs: Arabic-to-English, Chinese-to-English, Dari-to-English, Farsi-to-English and Korean-to-English, in two source data styles. For general information about the NIST OpenMT evaluations, refer to the NIST OpenMT website.
This evaluation kit includes a single Perl script (mteval-v13a.pl) that may be used to produce a translation quality score for one or more MT systems. The script works by comparing the system output translation with a set of expert reference translations of the same source text. Comparison is based on finding sequences of words in the reference translations that match word sequences in the system output translation.
LDC has also released the following related corpora:
* NIST 2012 Open Machine Translation (OpenMT) Evaluation (LDC2013T03): material from the Chinese-to-English pair track, including restricted domain data.
* NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets (LDC2013T07): Arabic, Chinese and English test data.
*Data*
This release consists of 20 files, four for each of the five languages, presented in XML with an included DTD. The four files are source and reference data, derived from the same source data, in the following two styles:
* English-true: an English-oriented translation. This requires that the text read well and not use any idiomatic expressions in the foreign language to convey meaning, unless absolutely necessary.
* Foreign-true: a translation as close as possible to the foreign language, as if the text had originated in that language.
*Samples*
Please view these samples for Arabic:
* Reference Foreign
* Reference English
* Source Foreign
* Source English
*Updates*
None at this time.
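The word-sequence comparison that the description attributes to mteval-v13a.pl can be illustrated with a minimal Python sketch. It counts how many word n-grams in a system output also occur in at least one reference translation, capping each n-gram's count at the highest count seen in any single reference. This is an illustration only, not the mteval-v13a.pl implementation: the function names `ngrams` and `clipped_matches` are hypothetical, whitespace tokenization is an assumption, and the actual script applies its own tokenization and scoring formulas.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous word sequence of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_matches(hypothesis, references, n):
    """Return (matched, total) n-gram counts for one system output.

    hypothesis: system output sentence (assumed whitespace-tokenizable).
    references: list of reference translations of the same source text.
    """
    hyp_counts = ngrams(hypothesis.split(), n)

    # For each n-gram, record the highest count found in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)

    # A hypothesis n-gram is credited at most as often as a reference uses it.
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched, total


if __name__ == "__main__":
    hyp = "the cat sat on the mat"
    refs = ["the cat is on the mat", "there is a cat on the mat"]
    for n in (1, 2):
        matched, total = clipped_matches(hyp, refs, n)
        print(f"{n}-grams: {matched} of {total} matched")
```

Dividing `matched` by `total` gives an n-gram precision for that order; how such counts are combined into a final score is left to the scoring software itself.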

Extent: Corpus size: 17408 KB

Identifier: LDC2014T02

https://catalog.ldc.upenn.edu/LDC2014T02

ISBN: 1-58563-668-1

ISLRN: 847-333-922-514-6

DOI: 10.35111/ewaw-bc22

Language: Dari

Korean

Persian

English

Mandarin Chinese

Arabic

Iranian Persian

Chinese

Language (ISO639): prs

kor

fas

eng

cmn

ara

pes

zho

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2014T02

Rights Holder: Portions © 2007 Asharq Al-Awsat, © 2007, 2013, 2014 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2014T02

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: NIST Multimodal Information Group. 2014. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_AF country_CN country_GB country_IR country_KR dcmi_Text iso639_ara iso639_cmn iso639_eng iso639_fas iso639_kor iso639_pes iso639_prs iso639_zho olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2014T02
Up-to-date as of: Wed Oct 29 7:01:25 EDT 2025
