OLAC Record: Multiple-Translation Chinese (MTC) Part 3

OLAC Record
oai:www.ldc.upenn.edu:LDC2004T07

Metadata

Title: Multiple-Translation Chinese (MTC) Part 3

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Ma, Xiaoyi. Multiple-Translation Chinese (MTC) Part 3 LDC2004T07. Web Download. Philadelphia: Linguistic Data Consortium, 2004

Contributor: Ma, Xiaoyi

Date (W3CDTF): 2004

Date Issued (W3CDTF): 2004-07-12

Description:

*Introduction*

Multiple-Translation Chinese (MTC) Part 3 was developed by the Linguistic Data Consortium (LDC) and contains approximately 21,000 words of Chinese newswire text with translations by four different translation teams, totaling approximately 100,000 English words. This corpus is the third in a series of corpora created to support the development of automatic means for evaluating translation quality. The other corpora in this collection are:

* Multiple-Translation Chinese Corpus (LDC2002T01)
* Multiple-Translation Chinese (MTC) Part 2 (LDC2003T17)
* Multiple-Translation Chinese (MTC) Part 4 (LDC2006T04)

All four parts contain unique source texts. The first part contains multiple human translations and machine translations (MT) of the source text, and Parts 2 and 4 contain multiple human and machine translations along with MT assessment. This corpus, Part 3, contains only source text and four sets of human translations. For the first part, 11 translation teams were selected to create the human translations; for the remaining parts, the four best teams from the original 11 were selected.

To support the development of automatic means for evaluating translation quality, LDC was sponsored to solicit four sets of human translations for a single set of Mandarin Chinese source materials.

*Data*

The data was drawn from two sources of journalistic Mandarin Chinese text, AFP News Service and Xinhua News Service. The text was drawn from the May and June 2002 collections of both sources. Story selection from the two newswire collections was controlled by story length: all selected stories contain between about 230 and 564 Chinese characters.

The overall count of Chinese characters by source is shown in the following table:

Source    Stories    Chinese Characters
AFP       50         22,135
Xinhua    50         20,321
Total     100        42,456

The Chinese data contains approximately 21K words. The English translations contain approximately 100K words in total, with 12K unique words.

In accordance with the guidelines, each translation team was asked to return the first 10 Xinhua stories for quality checking. This was to ensure that each team had understood and was following the guidelines, and that the translation quality was acceptable. LDC returned translations to a team whenever it detected deviations from the guidelines or quality issues. Subsequent translation submissions were continuously monitored for conformance and quality. Once the full set of translations was complete, a final pass of reformatting and validation was carried out to ensure that segments could be aligned and to convert the translated texts into SGML format. Each translation team was also asked to fill out and return a questionnaire describing its procedures and professional background.

*Samples*

For examples of the data in this corpus, please view the Chinese (CHN) and English (ENG) samples.

*Updates*

There are no updates available at this time.

Extent: Corpus size: 1945 KB

Identifier: LDC2004T07

Identifier: https://catalog.ldc.upenn.edu/LDC2004T07

ISBN: 1-58563-289-9

ISLRN: 026-006-085-012-3

DOI: 10.35111/9nxq-9e06

Language: English

Language: Mandarin Chinese

Language (ISO639): eng

Language (ISO639): cmn

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2004T07

Rights Holder: Portions © 2002 Xinhua News Agency, © 2002 Agence France-Presse, © 2004 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2004T07

DateStamp: 2024-03-08

GetRecord: OAI-PMH request for simple DC format
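
The GetRecord entries in this record refer to OAI-PMH requests whose link targets are not reproduced here. As a rough illustration only, the sketch below builds such a request URL from the record's OaiIdentifier using the standard OAI-PMH query parameters (`verb`, `identifier`, `metadataPrefix`). The base URL is a placeholder, not the archive's actual endpoint, and the function name `get_record_url` is hypothetical; the prefixes `olac` and `oai_dc` are the usual names for the OLAC and simple DC formats.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): this record does not state the
# repository's OAI-PMH base URL, so substitute the real one before use.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for a single record."""
    query = urlencode({
        "verb": "GetRecord",                # OAI-PMH verb for one record
        "identifier": identifier,           # the record's OAI identifier
        "metadataPrefix": metadata_prefix,  # e.g. "olac" or "oai_dc"
    })
    return f"{base_url}?{query}"


if __name__ == "__main__":
    oai_id = "oai:www.ldc.upenn.edu:LDC2004T07"
    print(get_record_url(oai_id, "olac"))    # OLAC format
    print(get_record_url(oai_id, "oai_dc"))  # simple DC format
```

The colons in the identifier are percent-encoded by `urlencode`, which keeps the query string valid regardless of which endpoint is substituted for the placeholder.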

Search Info
Citation: Ma, Xiaoyi. 2004. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB dcmi_Text iso639_cmn iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2004T07
Up-to-date as of: Wed Oct 29 7:00:21 EDT 2025
