OLAC Record: Multiple-Translation Chinese (MTC) Part 4

OLAC Record
oai:www.ldc.upenn.edu:LDC2006T04

Metadata

Title: Multiple-Translation Chinese (MTC) Part 4

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Ma, Xiaoyi. Multiple-Translation Chinese (MTC) Part 4 LDC2006T04. Web Download. Philadelphia: Linguistic Data Consortium, 2006

Contributor: Ma, Xiaoyi

Date (W3CDTF): 2006

Date Issued (W3CDTF): 2006-01-15

Description:

*Introduction*

Multiple-Translation Chinese (MTC) Part 4 was developed by the Linguistic Data Consortium (LDC). It contains 100 Chinese newswire source files and their translations by four human translator teams and 11 Machine Translation (MT) systems, totalling 1,500 translation files (100 source files × 15 translators), as well as assessments for more than 11,000 segments of the MT output. Of the MT systems, five were commercial-off-the-shelf (COTS) systems and six were participants in the TIDES 2003 MT Evaluation. Of the COTS systems, two were free web-based services and three were commercial software.

To determine whether automatic evaluation metrics such as BLEU track human assessment, LDC performed human assessments on the output of the six TIDES research systems and one of the five COTS systems. The corpus includes the assessment results for those seven systems and the specifications used for conducting the assessments.

*Data*

The table below has a breakdown of the text files by source:

Source                 Stories   Words
Xinhua News Agency     50        19,650
Agence France Presse   50        22,450
Total                  100       42,100

For the Chinese data, there are approximately 21K words, while the English translations total 396K words and 16K unique words.

The original source files used GB-2312 encoding for the Chinese characters, and SGML tags for marking sentence and paragraph boundaries and other information about each story. The character encoding is unaltered. To facilitate translation, nearly all SGML tags were removed or replaced by "plain text" markers. The markers were intended to ensure that the resulting translations would be easily alignable to the source texts, so extra care was taken to keep them intact and properly oriented.

Some normalization was performed on all files to conform to this format, including splitting long segments into smaller chunks and adding segment markers. As a last step, all files were converted from UNIX-style line termination (new-line only) to MS-DOS-style (carriage-return plus line-feed) on the assumption that most (possibly all) translators would use MS-Windows-based editors.

Human Translation: The human translation teams were required to submit an initial set of five stories for quality evaluation and, after receiving feedback, continued with the rest of their assigned stories. Translations of the remaining stories were continuously monitored for quality and adherence to the guidelines.

Machine Translation: Starting from the original SGML text format, special alterations were made to the files on an as-needed basis so that they would be accepted and handled correctly by the various systems. The systems also differed in the input and retrieval methods required to submit the source data for translation and to save the translated text in alignable form.

Human Assessment: The goal of this effort was to evaluate the quality of the TIDES research systems, the human translation teams, and the COTS systems. Translations were evaluated on the basis of adequacy and fluency. Adequacy refers to the degree to which the translation communicates information present in the original source language text. Fluency refers to the degree to which the translation is well-formed according to the grammar of the target language.

*Samples*

For an example of the data provided in this corpus, please review the following samples:

* Chinese source (TXT)
* English translation (TXT)

*Updates*

None at this time.
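The encoding and line-termination details in the description can be illustrated with a short sketch. It assumes a file's contents have already been read as raw bytes; the function name `decode_mtc_text` and the sample string are hypothetical and are not taken from the corpus or its documentation.

```python
def decode_mtc_text(raw: bytes) -> str:
    """Decode GB-2312 bytes and normalize CRLF line endings to LF."""
    # The record states that the Chinese source files use GB-2312 encoding.
    text = raw.decode("gb2312")
    # Files were converted to MS-DOS-style CR+LF terminators before release;
    # convert them back to UNIX-style LF for downstream tools.
    return text.replace("\r\n", "\n")


# Illustrative input only: this string is made up, not taken from the corpus.
sample = "新华社北京电\r\n第二行\r\n".encode("gb2312")
print(decode_mtc_text(sample).split("\n"))
```

Decoding explicitly rather than relying on the platform default avoids mis-reading the Chinese text on systems whose default encoding is UTF-8.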

Extent: Corpus size: 6041 KB

Identifier: LDC2006T04

Identifier: https://catalog.ldc.upenn.edu/LDC2006T04

ISBN: 1-58563-375-5

ISLRN: 018-899-448-641-7

DOI: 10.35111/17a2-xh75

Language: English

Language: Mandarin Chinese

Language (ISO639): eng

Language (ISO639): cmn

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2006T04

Rights Holder: Portions © 2003 Xinhua News Agency, © 2003 Agence France Presse, © 2005-2006 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2006T04

DateStamp: 2021-08-13

GetRecord: OAI-PMH request for simple DC format
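The GetRecord entries above refer to OAI-PMH requests. As a rough sketch, such a request is an HTTP GET with `verb`, `identifier`, and `metadataPrefix` query parameters. The base URL below is a placeholder, since this record does not state the repository endpoint; `oai_dc` is the simple Dublin Core prefix defined by OAI-PMH, while `olac` as the prefix for the OLAC format is an assumption.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the record does not give the repository base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Simple DC request for this record; "olac" could be passed instead (assumed prefix).
print(get_record_url("oai:www.ldc.upenn.edu:LDC2006T04", "oai_dc"))
```

The identifier is percent-encoded by `urlencode`, so the colons in the OAI identifier appear as `%3A` in the resulting URL.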

Search Info
Citation: Ma, Xiaoyi. 2006. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB dcmi_Text iso639_cmn iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2006T04
Up-to-date as of: Wed Oct 29 7:00:54 EDT 2025
