OLAC Record

Title:NIST 2012 Open Machine Translation (OpenMT) Evaluation
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:NIST Multimodal Information Group. NIST 2012 Open Machine Translation (OpenMT) Evaluation LDC2013T03. Web Download. Philadelphia: Linguistic Data Consortium, 2013.
Contributor:NIST Multimodal Information Group
Date (W3CDTF):2013
Date Issued (W3CDTF):2013-02-15
Description:*Introduction* NIST 2012 Open Machine Translation (OpenMT) Evaluation was developed by the NIST Multimodal Information Group. This release contains source data, reference translations and scoring software used in the NIST 2012 OpenMT evaluation, specifically for the Chinese-to-English language pair track. The package was compiled and the scoring software was developed at NIST, making use of Chinese newswire and web data and reference translations collected and developed by LDC. The objective of the OpenMT evaluation series is to support research in, and help advance the state of the art of, machine translation (MT) technologies -- technologies that translate text between human languages. Input may include all forms of text. The goal is for the output to be an adequate and fluent translation of the original. The MT evaluation series started in 2001 as part of the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) program. Beginning with the 2006 evaluation, the evaluations have been driven and coordinated by NIST as NIST OpenMT. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities in MT. The OpenMT evaluations are intended to be of interest to all researchers working on the general problem of automatic translation between human languages. To this end, they are designed to be simple, to focus on core technology issues and to be fully supported. The 2012 task was to evaluate five language pairs: Arabic-to-English, Chinese-to-English, Dari-to-English, Farsi-to-English and Korean-to-English. This release consists of the material used in the Chinese-to-English language pair track. For more general information about the NIST OpenMT evaluations, please refer to the NIST OpenMT website. This evaluation kit includes a single Perl script (mteval-v13a.pl) that may be used to produce a translation quality score for one or more MT systems.
The script works by comparing the system output translation with a set of (expert) reference translations of the same source text. Comparison is based on finding sequences of words in the reference translations that match word sequences in the system output translation. *Data* This release contains 222 documents with corresponding source and reference files; the reference files contain four independent human reference translations of the source data. The source data consists of Chinese newswire and web data collected by LDC in 2011. A portion of the web data concerned the topic of food and was treated as a restricted domain. The table below displays statistics by source, genre, documents, segments and source tokens.

Source                     Genre     Documents  Segments  Source Tokens
Chinese General            Newswire         45       400          18184
Chinese General            Web Data         28       420          15181
Chinese Restricted Domain  Web Data        149      2184          48422

The token counts for Chinese data are character counts, which were obtained by counting tokens matching the Unicode-based regular expression \w. The Python re module was used to obtain those counts. The data in this package are in XML format compliant with the included DTD. *Samples* Please view these Chinese and English samples. *Updates* None at this time.
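The character-count method described above can be sketched in Python. This is a minimal illustration, not the script used to build the corpus: it assumes the pattern is `\w` applied with Python 3's default Unicode matching, and the function name `count_source_tokens` and the sample string are hypothetical.

```python
import re

# Assumption: the pattern is \w with Unicode matching, which is the
# default for str patterns in Python 3.
WORD_CHAR = re.compile(r"\w")


def count_source_tokens(text: str) -> int:
    """Count every character matched by the word-character class.

    Each Chinese ideograph counts as one token, as does each Latin
    letter, digit, or underscore. Whitespace and punctuation are skipped.
    """
    return len(WORD_CHAR.findall(text))


if __name__ == "__main__":
    # Hypothetical sample sentence, not taken from the corpus.
    sample = "今天天气很好。"
    print(count_source_tokens(sample))
```

Because the pattern matches single characters, any Latin letters or digits embedded in the Chinese text also add to the count one character at a time, so the totals are character counts rather than word counts.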
Extent:Corpus size: 3012 KB
ISBN: 1-58563-635-5
ISLRN: 896-999-017-833-1
DOI: 10.35111/ekv5-3297
Language:Mandarin Chinese
Language (ISO639):eng
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2013T03
Rights Holder:Portions © 2011 Agence France Presse, Chinanews.com, Xinhua News Agency, © 2011, 2013 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text


Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2013T03
DateStamp:  2020-11-30
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: NIST Multimodal Information Group. 2013. NIST 2012 Open Machine Translation (OpenMT) Evaluation. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB dcmi_Text iso639_cmn iso639_eng iso639_zho olac_primary_text

Up-to-date as of: Sun Jun 16 7:34:25 EDT 2024