OLAC Record
oai:www.ldc.upenn.edu:LDC2006T04

Metadata
Title:Multiple-Translation Chinese (MTC) Part 4
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Ma, Xiaoyi. Multiple-Translation Chinese (MTC) Part 4 LDC2006T04. Web Download. Philadelphia: Linguistic Data Consortium, 2006
Contributor:Ma, Xiaoyi
Date (W3CDTF):2006
Date Issued (W3CDTF):2006-01-15
Description:*Introduction*
Multiple-Translation Chinese (MTC) Part 4, Linguistic Data Consortium (LDC) catalog number LDC2006T04 and ISBN 1-58563-375-5, was developed by LDC. To support the development of automatic means for evaluating translation quality, LDC was sponsored to solicit four sets of human translations for a single set of Chinese source materials. LDC was also asked to produce translations from various commercial off-the-shelf (COTS) systems, including commercial machine translation (MT) systems as well as MT systems available on the Internet. There are a total of five sets of COTS outputs and six output sets from TIDES 2003 MT Evaluation participants. To determine whether automatic evaluation systems, such as BLEU, track human assessment, LDC also performed human assessments on one COTS output and the six TIDES research systems. The corpus includes the assessment results for one of the five COTS systems, the assessment results for the six TIDES research systems, and the specifications used for conducting the assessments.
*Data*
Source Data Selection
Two sources of journalistic Chinese text were selected to provide the Chinese material:
- Xinhua News Agency (Xinhua): 50 news stories
- Agence France Presse (AFP): 50 news stories
(total: 100 stories)
There are 100 source files and 1,100 translation files. All source data were drawn from LDC's January and February 2003 collection of Xinhua Chinese data and AFP Chinese data. The story selection from the two newswire collections was controlled by story length: all selected stories contain between 280 and 605 Chinese characters. The overall count of Chinese words (excluding markup), by source, is shown in the following table:
AFP      22,450
Xinhua   19,650
Total    42,100
For the Chinese data, there are approximately 21K words, while for the English translations, there are 396K words in total and 16K unique words.
Source Data Preparation for Human Translation
The original source files used GB-2312 encoding for the Chinese characters, and SGML tags for marking sentence and paragraph boundaries and other information about each story. The character encoding is unaltered. To facilitate translation, nearly all SGML tags were removed or replaced by "plain text" markers. Specifically, each story was presented to the human translators in the following format:
--Segment 1--
{Chinese text to be translated}
--Segment 2--
{Chinese text to be translated}
--Segment 3--
{Chinese text to be translated}
...
Each --Segment-- corresponds to a Chinese sentence. The rationale for using the term "segment" instead of "sentence" was to discourage the translators from inserting additional "-Sentence-" markers if a Chinese sentence was translated into two or more English sentences. The markers were intended to ensure that the resulting translations would be easily alignable to the source texts, so extra care was taken to keep them intact and properly oriented. Some normalization was performed on all files to conform to the above format, including splitting long segments into smaller chunks and adding segment markers. As a last step, all files were converted from UNIX-style line termination (new-line only) to MS-DOS-style (carriage-return plus line-feed) on the assumption that most (possibly all) translators would use MS-Windows-based editors.
Human Translation Procedure and Quality Assessment
Each initially selected translation team received the translation guidelines and a sample pair of source and translation (excluded from the final release) for review. After the team indicated that they understood the task requirements and would be willing to participate in the project, 100 news stories were sent to them.
Each translation team returned the first five AFP stories for quality checking to ensure that the team was following the guidelines and that the translation quality was acceptable. LDC returned translations to the translation team for any deviations from the guidelines or for quality issues detected. Subsequent translation submissions were continuously monitored for conformance and quality. Once the full set of translations was complete, a final pass of reformatting and validation was carried out to ensure alignability of segments and to convert the translated texts into SGML format. Each translation team was also asked to complete and return a questionnaire describing their procedures and professional background.
Machine Translation Procedure
Complete sets of automatic MT translations were also produced by submitting the 100 stories to each of the five publicly available MT systems. Starting from the original SGML text format, special alterations were made to the files on an as-needed basis, so that they would be accepted and handled correctly by the various systems. The systems also differed in the input and retrieval methods required to submit the source data for translation and to save the translated text in alignable form.
Human Assessment Procedure
The goal of this effort was to evaluate the quality of TIDES research systems, human translation teams, and commercial off-the-shelf (COTS) systems. Translations were evaluated on the basis of adequacy and fluency. Adequacy refers to the degree to which the translation communicates information present in the original source language text. Fluency refers to the degree to which the translation is well-formed according to the grammar of the target language.
Final Data Format and Validation
For the present release, the corpus content is organized into source and translation directories.
Within translation there is a separate subdirectory for each translation service or system, identified as follows:
Human translators: E01 E02 E03 E04
COTS systems: E05 E06 E07 E08 E09
Research systems: E11 E12 E14 E15 E17 E22
The source directory and each of the human and COTS translation subdirectories contain 100 files with one news story per file. Corresponding file names are identical across all directories, consisting of "docid.sgm". Within each source file, the content is formatted in SGML as follows:
[Chinese text in GB-2312 character encoding]
[Chinese text in GB-2312 character encoding]
...
Ranking of Manual Translations
Ranking of manual translations was performed by two LDC staff members, one a Chinese-dominant bilingual and the other an English native monolingual. There was overall agreement on the ranking between the two, and minor discrepancies were resolved through discussion and comparison of additional files. The ranking for the manual translations is:
best-----------------------------worst
E01 > E02 > E03 > E04
The ranking method was unstructured and somewhat casual; it is not intended to be definitive, or even accountable.
*Samples*
For an example of the data provided in this corpus, please review the following samples:
* newswire source
* newswire translation
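The plain-text segment format and the line-termination conversion described under Source Data Preparation for Human Translation can be illustrated with a small sketch. This is not LDC's tooling: the function names parse_segments and to_crlf are hypothetical, and the exact marker spelling is assumed to be "--Segment N--" on a line of its own, as in the format shown in the description.

```python
import re

# Assumed marker shape: "--Segment N--" alone on a line (taken from the description).
SEGMENT_MARKER = re.compile(r"^--Segment (\d+)--$")


def parse_segments(text):
    """Map each segment number to the text that follows its marker.

    Hypothetical helper: collects non-empty lines after each marker
    until the next marker or the end of the input.
    """
    segments = {}
    current = None
    for raw_line in text.splitlines():  # splitlines() accepts both LF and CRLF
        line = raw_line.strip()
        match = SEGMENT_MARKER.match(line)
        if match:
            current = int(match.group(1))
            segments[current] = []
        elif current is not None and line:
            segments[current].append(line)
    return {number: "\n".join(lines) for number, lines in segments.items()}


def to_crlf(text):
    """Convert UNIX-style (LF) line endings to MS-DOS-style (CRLF).

    Existing CRLF pairs are normalized first so the conversion is idempotent.
    """
    return text.replace("\r\n", "\n").replace("\n", "\r\n")


if __name__ == "__main__":
    sample = "--Segment 1--\n第一句。\n--Segment 2--\n第二句。\n"
    print(parse_segments(to_crlf(sample)))
```

Keying the result by segment number makes it straightforward to pair a source segment with the same-numbered segment in a translation, which is the alignability property the markers were meant to preserve.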
Extent:Corpus size: 6041 KB
Identifier:LDC2006T04
https://catalog.ldc.upenn.edu/LDC2006T04
ISBN: 1-58563-375-5
ISLRN: 018-899-448-641-7
Language:English
Mandarin Chinese
Language (ISO639):eng
cmn
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/LDC%20User%20Agreement%20for%20Non-Members.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2006T04
Rights Holder:Portions © 2003 Xinhua News Agency, © 2003 Agence France Presse, © 2005-2006 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2006T04
DateStamp:  2019-12-12
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Ma, Xiaoyi. 2006. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB dcmi_Text iso639_cmn iso639_eng olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2006T04
Up-to-date as of: Sat Jan 18 13:56:34 EST 2020