OLAC Record: Multiple-Translation Arabic (MTA) Part 2

OLAC Record
oai:www.ldc.upenn.edu:LDC2005T05

Metadata

Title: Multiple-Translation Arabic (MTA) Part 2

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Ma, Xiaoyi. Multiple-Translation Arabic (MTA) Part 2 LDC2005T05. Web Download. Philadelphia: Linguistic Data Consortium, 2005

Contributor: Ma, Xiaoyi

Date (W3CDTF): 2005

Date Issued (W3CDTF): 2005-02-15

Description:

*Introduction*

Multiple-Translation Arabic (MTA) Part 2 was developed by the Linguistic Data Consortium (LDC) and contains approximately 15,000 Arabic words of source news text along with seven English translation sets, four by humans and three by machine translation (MT) systems, and assessments of the MT.

To support the development of automatic means for evaluating translation quality, LDC was sponsored to solicit four sets of human translations for a single set of Arabic source materials. LDC was also asked to produce translations from various commercial off-the-shelf (COTS) systems, including commercial MT systems and ones available on the Internet. This corpus contains two sets of COTS outputs and one output set from a TIDES 2003 MT Evaluation participant, which is representative of state-of-the-art research systems. The goal of this effort is to evaluate the quality of TIDES research, human translation teams, and COTS systems.

To determine if automatic evaluation systems such as BLEU track human assessment, LDC also performed human assessment on the two COTS outputs and the TIDES research system. The corpus includes the assessment results for one of the two COTS systems, the assessment results for the TIDES research system, and the specifications used for conducting the assessments.

This corpus represents the second part of a collection of multiple-translation Arabic. The first part is available from LDC as Multiple-Translation Arabic (MTA) Part 1 (LDC2003T18).

*Data*

All source data dates from January and February 2003. The data amounts by source contained in this corpus are:

Source                 Abbreviation  Stories  Words
Xinhua News Service    Xinhua        50       7,551
Agence France Presse   AFP           50       7,528
Totals                               100      15,079

There are 100 source files and 700 translation files. The story selection from the two newswire collections was controlled by story length: all selected stories contain between 700 and 1,500 Arabic characters.

The MT outputs were evaluated on the basis of adequacy and fluency, using the human translations as the gold standard. Adequacy refers to the degree to which the translation communicates information present in the original source language text. Fluency refers to the degree to which the translation is well-formed according to the grammar of the target language.

The human translation teams initially submitted five stories, which were returned with feedback before the teams were assigned the rest of the material. Further submissions were continually monitored for quality. Ranking of manual translations was performed by two LDC staff members, one an Arabic-dominant bilingual and the other an English native monolingual. There was overall agreement between the two, and minor discrepancies were resolved through discussion and comparison of additional files. The ranking method was unstructured and somewhat casual; it is not intended to be definitive, or even accountable.

The source and translation data are presented in SGML formatting, and the assessment is presented in a .txt file with comma-separated fields containing judgements and identification info.

*Samples*

For examples of the data in this corpus, please view this Arabic source file (SGML) and its translation (SGML).

*Updates*

None at this time.
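
The counts quoted in the Data section can be cross-checked with a few lines of Python. This is only a sketch: the per-source figures are copied from the description, while the dictionary layout, the function name `corpus_totals`, and the assumption of one translation file per story per translation set are illustrative and not part of the corpus documentation.

```python
# Figures copied from the Data section of this record.
SOURCES = {
    "Xinhua": {"stories": 50, "words": 7551},
    "AFP": {"stories": 50, "words": 7528},
}
HUMAN_SETS = 4  # human translation sets
MT_SETS = 3     # two COTS systems plus one TIDES research system


def corpus_totals(sources, human_sets, mt_sets):
    """Return (stories, words, translation_files) implied by the per-source figures."""
    stories = sum(s["stories"] for s in sources.values())
    words = sum(s["words"] for s in sources.values())
    # Assumption: one translation file per source story per translation set.
    translation_files = stories * (human_sets + mt_sets)
    return stories, words, translation_files


print(corpus_totals(SOURCES, HUMAN_SETS, MT_SETS))  # (100, 15079, 700)
```

Under that assumption the result agrees with the stated totals of 100 source files, 15,079 words, and 700 translation files.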

Identifier: LDC2005T05

https://catalog.ldc.upenn.edu/LDC2005T05

ISBN: 1-58563-328-3

ISLRN: 136-463-995-609-6

DOI: 10.35111/6a17-c826

Language: English

Standard Arabic

Language (ISO639): eng

arb

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Rights Holder: Portions © 2003 Xinhua News Agency, © 2003 Agence France Presse, © 2004-2005 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2005T05

DateStamp: 2021-11-29

GetRecord: OAI-PMH request for simple DC format
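
The GetRecord entries above refer to OAI-PMH requests for this record in two metadata formats. A minimal sketch of how such a request URL could be assembled follows. The `verb`, `identifier`, and `metadataPrefix` parameters come from the OAI-PMH protocol; the base URL is a placeholder because the record does not state the repository endpoint, the helper name `get_record_url` is hypothetical, and the prefixes `olac` and `oai_dc` are assumed to be the ones this repository uses for the OLAC and simple DC formats.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the actual repository base URL is not given in this record.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_ID = "oai:www.ldc.upenn.edu:LDC2005T05"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record and one format."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


olac_url = get_record_url(OAI_ID, "olac")   # assumed prefix for the OLAC format
dc_url = get_record_url(OAI_ID, "oai_dc")   # assumed prefix for simple Dublin Core
print(olac_url)
print(dc_url)
```

`urlencode` percent-encodes the colons in the identifier, which keeps the query string valid regardless of the characters the identifier contains.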

Search Info
Citation: Ma, Xiaoyi. 2005. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_GB country_SA dcmi_Text iso639_arb iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2005T05
Up-to-date as of: Wed Oct 29 7:00:25 EDT 2025
