OLAC Record: Multiple-Translation Chinese (MTC) Part 2

OLAC Record
oai:www.ldc.upenn.edu:LDC2003T17

Metadata

Title: Multiple-Translation Chinese (MTC) Part 2

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Huang, Shudong, et al. Multiple-Translation Chinese (MTC) Part 2 LDC2003T17. Web Download. Philadelphia: Linguistic Data Consortium, 2003

Contributor: Huang, Shudong

Graff, David

Walker, Kevin

Miller, David

Ma, Xiaoyi

Cieri, Christopher

Doddington, George R.

Date (W3CDTF): 2003

Date Issued (W3CDTF): 2003-10-02

Description: *Introduction*

Multiple-Translation Chinese (MTC) Part 2 was produced by the Linguistic Data Consortium (LDC) and contains approximately 40,000 characters of Mandarin newswire text, 11 translations of that text (four sets of human translations and seven sets of machine translations), and assessments of three of the machine translation systems.

To support the development of automatic means for evaluating translation quality, LDC was sponsored to solicit four sets of human translations for a single set of Mandarin Chinese source materials. LDC was also asked to produce translations from various commercial off-the-shelf (COTS) systems, including commercial machine translation (MT) systems as well as MT systems available on the Internet. There are six sets of COTS outputs in total, plus one set of outputs from a TIDES MT Evaluation participant, which is representative of state-of-the-art research systems. To see whether automatic evaluation metrics such as BLEU track human assessment, LDC also performed human assessment on two of the six COTS outputs and on the TIDES research system. The corpus includes the assessment results for these two COTS systems, the assessment result for the TIDES research system, and the specifications used for conducting the assessments.

A similar corpus, Multiple-Translation Chinese Corpus (LDC2002T01), was published in 2002. Both the 2002 corpus and the present one use Chinese news articles from the Xinhua and Zaobao News Services and provide human and COTS translations. However, Part 2 also offers translations from a TIDES research system and provides human assessment of some of the automatic translations.

*Data*

Two sources of journalistic Mandarin Chinese text were selected to provide the Chinese material: Xinhua News Service and Zaobao News Service. Story selection from the two newswire collections was controlled by story length: all selected stories contain between about 212 and 707 Chinese characters.

The Xinhua data were drawn from the March and April 2002 collection of Xinhua news. The Zaobao data were drawn from the March 2002 collection of Zaobao's online news service. Zaobao is a news portal from Singapore, and many of its news stories are translations of other news agencies' releases.

Here is a breakdown of the source data:

Source    Stories    Characters
Xinhua    70         25,247
Zaobao    30         14,009
Total     100        39,256

The Chinese data contain approximately 20K words, while the English translations contain approximately 258K words in total and 13K unique words.

The original source files used GB-2312 encoding for the Chinese characters, and SGML tags for marking sentence and paragraph boundaries and other information about each story. The character encoding has been left unaltered. To make things easier for translators, nearly all SGML tags were removed or replaced by "plain text" markers.

Human Translation Procedure and Quality Assessment

The four best translation teams were chosen from the 11 that had participated in the translation of Multiple-Translation Chinese Corpus Part 1 (LDC2002T01) to take part in the project.

In accordance with the guidelines, each translation team was asked to return the first 10 Xinhua stories for quality checking. This was to ensure that the team had understood and was following the guidelines and that the translation quality was acceptable. LDC returned translations to the team whenever deviations from the guidelines or quality issues were detected. Subsequent translation submissions were continuously monitored for conformance and quality.

Once the full set of translations was complete, a final pass of reformatting and validation was carried out to ensure that segments could be aligned and to convert the translated texts into SGML format.

Machine Translation Procedure

Complete sets of automatic MT translations were also produced by submitting the 100 stories to each of six publicly available MT systems. Four of these were commercial MT software packages (off-the-shelf products), and two were free web-based services.

Starting from the original SGML text format, special alterations were made to the files on an as-needed basis so that they would be accepted and handled correctly by the various systems. The systems also differed in the input and retrieval methods required to submit the source data for translation and to save the translated text in alignable form.

Human Assessment Procedure

The goal of this effort is to evaluate the translation quality of the TIDES research system, the human translation teams, and the COTS systems. Translations are evaluated on the basis of adequacy and fluency. Adequacy refers to the degree to which the translation communicates the information present in the original source-language text. Fluency refers to the degree to which the translation is well-formed according to the grammar of the target language.

*Samples*

For an example of the data in this corpus, please view these source (SGML) and translation (SGML) samples.

*Updates*

There are no updates available at this time.
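Because the description notes that the source files keep their original GB-2312 encoding and carry SGML markup, a reader checking the character counts would need to decode the bytes and skip the tags first. The following is only a minimal sketch under stated assumptions: the sample string and the function name `count_chinese_chars` are hypothetical, the tag-stripping regular expression is a rough stand-in for real SGML handling, and the counting rule (CJK Unified Ideographs only) is an assumption, not LDC's documented counting method.

```python
import re

# Hypothetical sample standing in for one source story. A real corpus file
# would instead be opened with open(path, encoding="gb2312").
sample_bytes = "<p>新华社北京三月一日电</p>".encode("gb2312")


def count_chinese_chars(raw: bytes, encoding: str = "gb2312") -> int:
    """Decode GB-2312 bytes, drop SGML-style tags, count CJK ideographs."""
    # errors="replace" keeps going if a byte sequence is not valid GB-2312.
    text = raw.decode(encoding, errors="replace")
    # Rough tag removal; assumes tags contain no '>' inside attribute values.
    text = re.sub(r"<[^>]+>", "", text)
    # Count only characters in the CJK Unified Ideographs block.
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


print(count_chinese_chars(sample_bytes))
```

If a file contains characters outside GB-2312, passing `encoding="gbk"` or `encoding="gb18030"` (both supersets of GB-2312) may decode it more completely.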

Extent: Corpus size: 6451 KB

Identifier: LDC2003T17

https://catalog.ldc.upenn.edu/LDC2003T17

ISBN: 1-58563-275-9

ISLRN: 484-381-943-904-5

DOI: 10.35111/ysj7-3f12

Language: English

Mandarin Chinese

Language (ISO639): eng

cmn

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2003T17

Rights Holder: Portions © 2003, Trustees of the University of Pennsylvania, © 2002 Xinhua News Agency, © 2002 SPH AsiaOne Ltd.

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2003T17

DateStamp: 2024-09-04

GetRecord: OAI-PMH request for simple DC format
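The GetRecord entries in this record refer to OAI-PMH requests whose link targets are not preserved here. As a rough illustration of how such a request is formed under the OAI-PMH protocol, the sketch below builds a GetRecord URL for this record's identifier. The base URL is a hypothetical placeholder, not the archive's real endpoint; `oai_dc` is the simple Dublin Core prefix defined by OAI-PMH, while `olac` is assumed here to be the prefix the repository uses for OLAC format.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the real repository base URL is not given here.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",                # OAI-PMH verb for a single record
        "identifier": identifier,           # the record's OAI identifier
        "metadataPrefix": metadata_prefix,  # requested metadata format
    })
    return f"{base_url}?{query}"


oai_id = "oai:www.ldc.upenn.edu:LDC2003T17"
print(get_record_url(oai_id, "oai_dc"))  # simple DC format
print(get_record_url(oai_id, "olac"))    # OLAC format (assumed prefix)
```

Fetching either URL from a real endpoint should return an XML response wrapping the record in the requested format.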

Search Info
Citation: Huang, Shudong; Graff, David; Walker, Kevin; Miller, David; Ma, Xiaoyi; Cieri, Christopher; Doddington, George R. 2003. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB dcmi_Text iso639_cmn iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2003T17
Up-to-date as of: Wed Oct 29 7:00:17 EDT 2025
