OLAC Record oai:www.ldc.upenn.edu:LDC2006T04 |
Metadata | ||
Title: | Multiple-Translation Chinese (MTC) Part 4 | |
Access Rights: | Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining | |
Bibliographic Citation: | Ma, Xiaoyi. Multiple-Translation Chinese (MTC) Part 4 LDC2006T04. Web Download. Philadelphia: Linguistic Data Consortium, 2006 | |
Contributor: | Ma, Xiaoyi | |
Date (W3CDTF): | 2006 | |
Date Issued (W3CDTF): | 2006-01-15 | |
Description: | *Introduction*
Multiple-Translation Chinese (MTC) Part 4 was developed by the Linguistic Data Consortium (LDC). It contains 100 Chinese newswire source files and their translations by four human translator teams and 11 Machine Translation (MT) systems, totalling 1,500 translation files, together with assessments for more than 11,000 segments of the MT output. Of the MT systems, five were commercial off-the-shelf (COTS) systems and six were participants in the TIDES 2003 MT Evaluation. Of the COTS systems, two were free web-based services and three were commercial software.
To determine whether automatic evaluation systems, such as BLEU, track human assessment, LDC performed human assessments on the output of the six TIDES research systems and one of the five COTS systems. The corpus includes these assessment results and the specifications used for conducting the assessments.
*Data*
The table below has a breakdown of the text files by source:
Source | Stories | Words
Xinhua News Agency | 50 | 19,650
Agence France Presse | 50 | 22,450
Total | 100 | 42,100
The Chinese data comprise approximately 21 K-words, while the English translations total 396 K-words and 16 K unique words.
The original source files used GB-2312 encoding for the Chinese characters, and SGML tags for marking sentence and paragraph boundaries and other information about each story. The character encoding is unaltered. To facilitate translation, nearly all SGML tags were removed or replaced by "plain text" markers. The markers were intended to ensure that the resulting translations would be easily alignable to the source texts, so extra care was taken to keep them intact and properly oriented. Some normalization was performed on all files to conform to this format, including splitting long segments into smaller chunks and adding segment markers. As a last step, all files were converted from UNIX-style line termination (new-line only) to MS-DOS-style (carriage-return plus line-feed) on the assumption that most (possibly all) translators would use MS-Windows-based editors.
Human Translation: The human translation teams were required to submit an initial set of five stories for quality evaluation and, after receiving initial feedback, continued with the rest of the assigned stories. Their translations of the remaining stories were continuously monitored for quality and adherence to the guidelines.
Machine Translation: Starting from the original SGML text format, special alterations were made to the files on an as-needed basis, so that they would be accepted and handled correctly by the various systems. The systems also differed in the input and retrieval methods required to submit the source data for translation and to save the translated text in alignable form.
Human Assessment: The goal of this effort was to evaluate the quality of the TIDES research systems, the human translation teams, and the COTS systems. Translations were evaluated on the basis of adequacy and fluency. Adequacy refers to the degree to which the translation communicates information present in the original source language text. Fluency refers to the degree to which the translation is well-formed according to the grammar of the target language.
*Samples*
For an example of the data provided in this corpus, please review the following samples:
* Chinese source (TXT)
* English translation (TXT)
*Updates*
None at this time. | |
Extent: | Corpus size: 6041 KB | |
Identifier: | LDC2006T04 | |
https://catalog.ldc.upenn.edu/LDC2006T04 | ||
ISBN: 1-58563-375-5 | ||
ISLRN: 018-899-448-641-7 | ||
DOI: 10.35111/17a2-xh75 | ||
Language: | English | |
Mandarin Chinese | ||
Language (ISO639): | eng | |
cmn | ||
License: | LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf | |
Medium: | Distribution: Web Download | |
Publisher: | Linguistic Data Consortium | |
Publisher (URI): | https://www.ldc.upenn.edu | |
Relation (URI): | https://catalog.ldc.upenn.edu/docs/LDC2006T04 | |
Rights Holder: | Portions © 2003 Xinhua News Agency, © 2003 Agence France Presse, © 2005-2006 Trustees of the University of Pennsylvania | |
Type (DCMI): | Text | |
Type (OLAC): | primary_text | |
OLAC Info |
||
Archive: | The LDC Corpus Catalog | |
Description: | http://www.language-archives.org/archive/www.ldc.upenn.edu | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info |
||
OaiIdentifier: | oai:www.ldc.upenn.edu:LDC2006T04 | |
DateStamp: | 2021-08-13 | |
GetRecord: | OAI-PMH request for simple DC format | |
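The GetRecord entries above correspond to OAI-PMH requests, which take a `verb`, an `identifier`, and a `metadataPrefix` as query arguments. The following is a minimal sketch of how such a request URL might be assembled. The base URL is a placeholder, since this record does not state the repository endpoint, and the helper name `get_record_url` is hypothetical. The prefixes `olac` and `oai_dc` are assumed to match the OLAC and simple DC formats named in this record.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the actual repository base URL is not given in this record.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return base_url + "?" + query


# Assumed prefixes: "olac" for the OLAC format, "oai_dc" for simple Dublin Core.
olac_url = get_record_url("oai:www.ldc.upenn.edu:LDC2006T04", "olac")
dc_url = get_record_url("oai:www.ldc.upenn.edu:LDC2006T04", "oai_dc")
```

Fetching either URL from a real endpoint would return an XML response. Parsing that response is left out here because its handling depends on the client.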
Search Info | ||
Citation: | Ma, Xiaoyi. 2006. Linguistic Data Consortium. | |
Terms: | area_Asia area_Europe country_CN country_GB dcmi_Text iso639_cmn iso639_eng olac_primary_text |
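The Description above notes that the Chinese source files keep their GB-2312 encoding and use MS-DOS-style line termination. As a rough sketch of what reading such a file could involve, the snippet below decodes raw bytes and splits them into lines. The function name `decode_source` and the sample text are illustrative assumptions, not part of the corpus documentation, and the corpus's own segment markers are not handled.

```python
from typing import List


def decode_source(raw: bytes) -> List[str]:
    """Decode GB-2312 bytes and split them into lines (hypothetical helper)."""
    # "gb2312" is the Python codec name for the GB-2312 character set.
    text = raw.decode("gb2312")
    # str.splitlines() treats "\r\n" as a single line boundary and drops it,
    # so CRLF-terminated files yield clean lines without trailing "\r".
    return text.splitlines()


# Illustrative sample only: two short CRLF-terminated lines, not corpus data.
sample = "新华社\r\n北京\r\n".encode("gb2312")
lines = decode_source(sample)
```

For files on disk, the same effect can be had by reading in binary mode and passing the bytes to this function, which avoids depending on the platform's default encoding.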