OLAC Record: Arabic News Translation Text Part 1

OLAC Record
oai:www.ldc.upenn.edu:LDC2004T17

Metadata

Title: Arabic News Translation Text Part 1

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Ma, Xiaoyi, Dalal Zakhary, and Moussa Bamba. Arabic News Translation Text Part 1 LDC2004T17. Web Download. Philadelphia: Linguistic Data Consortium, 2004

Contributor: Ma, Xiaoyi

Zakhary, Dalal

Bamba, Moussa

Date (W3CDTF): 2004

Date Issued (W3CDTF): 2004-09-23

Description: *Introduction*

Arabic News Translation Text Part 1 was produced by the Linguistic Data Consortium (LDC) and contains 1,526 Arabic news stories and their English translations, totaling approximately 441,000 Arabic words and 581,000 English words.

To support the development of automatic machine translation systems, LDC was sponsored to solicit English translations for a single set of Arabic source materials. The source Arabic text was selected and translated in different LDC projects from November 2002 to February 2004. Arabic news stories were selected from three sources (Xinhua, AFP, and An Nahar), and translation services were provided by eight translation agencies, with each Arabic news story translated once. The Xinhua and An Nahar stories and their translations were created for TIDES Machine Translation, while the AFP stories and their English translations were created for TIDES TDT. All of these translations were developed under roughly the same guidelines and procedures.

*Data*

Here is a breakdown of the Arabic material by source:

Source                 News Stories   Arabic Words   Collection Span
AFP News Service                250         44,193   October 1998 - December 1998
Xinhua News Service             670         99,514   November 2001 - March 2002
An Nahar                        606        297,533   October 2001 - December 2002
Total                         1,526        441,240

The Arabic data contains 441 K-words (thousands of words), while the English translation contains approximately 581 K-words in total, with 25K unique words.

Each translation team was provided with translation guidelines. In accordance with the guidelines, each team was asked to return the first five stories of each project for quality checking. This ensured that the team had understood and was following the guidelines, and that the translation quality was acceptable. LDC returned translations to the translation team whenever it detected deviations from the guidelines or quality issues. Subsequent translation submissions were continuously monitored for conformance and quality.

Once the full set of translations was complete, a final pass of reformatting and validation was carried out to ensure that segments could be aligned and to convert the translated texts into SGML format. An Arabic-English bilingual LDC employee went through all the source data and English translations and fixed any problems found.

For the present release, the corpus content is organized into source and translation directories, containing 1,526 files in source and 1,526 files in translation, one news story per file.

*Samples*

For an example of the data in this corpus, please view this Arabic sample (SGM) and its English translation (SGM).

*Updates*

None at this time.
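The per-source figures in the Data breakdown can be cross-checked against the stated totals. The following is a minimal sketch, not part of the LDC release: the dictionary name and layout are illustrative assumptions, and the numbers are copied from the breakdown in this record.

```python
# Per-source figures copied from the Data breakdown in this record:
# source name -> (news stories, Arabic words).
# The variable names are illustrative, not part of the corpus.
breakdown = {
    "AFP News Service": (250, 44_193),
    "Xinhua News Service": (670, 99_514),
    "An Nahar": (606, 297_533),
}

# Sum each column and compare with the "Total" row (1,526 and 441,240).
total_stories = sum(stories for stories, _ in breakdown.values())
total_words = sum(words for _, words in breakdown.values())

print(total_stories, total_words)  # prints: 1526 441240
```

Both sums match the Total row, so the story count also agrees with the 1,526 files reported for each of the source and translation directories.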

Identifier: LDC2004T17

https://catalog.ldc.upenn.edu/LDC2004T17

ISBN: 1-58563-307-0

ISLRN: 443-183-109-992-5

DOI: 10.35111/qhv1-1z67

Language: English

Standard Arabic

Language (ISO639): eng

arb

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Rights Holder: Portions © 2001-2002 An Nahar, © 2001-2002 Xinhua News Agency, © 1998 Agence France-Presse, © 2004 Trustees of the University of Pennsylvania.

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2004T17

DateStamp: 2022-04-01

GetRecord: OAI-PMH request for simple DC format
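The GetRecord entries above refer to OAI-PMH requests for this record. As a hedged sketch of how such a request URL is formed: OAI-PMH defines the `verb`, `identifier`, and `metadataPrefix` query arguments, with `oai_dc` as the prefix for simple Dublin Core and `olac` as the prefix used for OLAC metadata. The base URL below is a placeholder assumption, since this record does not state the repository endpoint, and the function name is hypothetical.

```python
from urllib.parse import urlencode

# ASSUMPTION: placeholder endpoint; the real OAI-PMH base URL is not
# given in this record and must be substituted.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


oai_id = "oai:www.ldc.upenn.edu:LDC2004T17"
olac_url = get_record_url(oai_id, "olac")    # OLAC format
dc_url = get_record_url(oai_id, "oai_dc")    # simple DC format
print(olac_url)
print(dc_url)
```

Note that `urlencode` percent-encodes the colons in the identifier, which is the expected form for a query string value.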

Search Info
Citation: Ma, Xiaoyi; Zakhary, Dalal; Bamba, Moussa. 2004. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_GB country_SA dcmi_Text iso639_arb iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2004T17
Up-to-date as of: Wed Oct 29 7:00:23 EDT 2025
