OLAC Record
oai:www.ldc.upenn.edu:LDC2005T06

Metadata
Title:Chinese News Translation Text Part 1
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Ma, Xiaoyi. Chinese News Translation Text Part 1 LDC2005T06. Web Download. Philadelphia: Linguistic Data Consortium, 2005
Contributor:Ma, Xiaoyi
Date (W3CDTF):2005
Date Issued (W3CDTF):2005-03-15
Description:*Introduction*
Chinese News Translation Text Part 1 was developed by the Linguistic Data Consortium (LDC) and contains approximately 474,000 characters of Chinese text and corresponding English translations totalling approximately 285,000 words. All stories in this corpus were collected, and all translations made, as Machine Translation (MT) training data for DARPA's Translingual Information Detection Extraction and Summarization (TIDES) program. They were selected and translated in different LDC projects between February 2003 and January 2005. Translation services were provided by seven translation agencies following roughly the same guidelines and procedures, and each Chinese news story was translated once.
*Data*
Two sources of journalistic Chinese text provided the Chinese material, collected from July 2002 to September 2002 and from April 2004 to August 2004:
* Agence France-Presse News Service: 580 news stories
* Xinhua News Service: 421 news stories
* Total: 1001 stories
The original source files used GB encoding for the Chinese characters. They also used SGML tags to mark sentence and paragraph boundaries and other information about each story. To make things easier for translators, nearly all SGML tags were removed or replaced by "plain text" markers.
Each translation team was provided with translation guidelines, which were modified several times during the development of these data. Each team began with five stories, which were checked for quality before the team took on larger amounts of data. Subsequent translation submissions were continuously monitored for conformance and quality.
For the present release, the corpus content is organized into "source" and "translation" directories. The source directory and each of the human translation subdirectories contain 1,001 files, one news story per file. Each translation file has the same name as its corresponding source file. The source and translation files are offered in SGML format.
*Samples*
For an example of the data in this corpus, please examine this translation sample (TXT).
*Updates*
None at this time.
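The directory layout described above (a "source" directory and a "translation" directory whose files share names) suggests a simple way to line up each Chinese story with its English translation. The following is a minimal sketch, not part of the LDC release: the function name `pair_stories`, the subdirectory name passed in, and both encodings are assumptions. The description says only that the original source files used GB encoding, so `gb18030` (a superset of the older GB encodings) is a guess for the released files, as is UTF-8 for the English side.

```python
from pathlib import Path


def pair_stories(root, translation_subdir,
                 source_encoding="gb18030",      # assumption: GB-family encoding
                 translation_encoding="utf-8"):  # assumption: not stated in the record
    """Yield (file_name, chinese_text, english_text) for matching file names.

    Assumes <root>/source/<name> and <root>/translation/<translation_subdir>/<name>;
    the subdirectory name is hypothetical and must be supplied by the caller.
    """
    source_dir = Path(root) / "source"
    target_dir = Path(root) / "translation" / translation_subdir
    for src in sorted(source_dir.iterdir()):
        tgt = target_dir / src.name
        # Skip anything that does not have a same-named counterpart.
        if not (src.is_file() and tgt.is_file()):
            continue
        chinese = src.read_text(encoding=source_encoding, errors="replace")
        english = tgt.read_text(encoding=translation_encoding, errors="replace")
        yield src.name, chinese, english
```

Both files still contain whatever SGML markup the release keeps, so a real pipeline would strip or parse the tags after pairing.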
Identifier:LDC2005T06
https://catalog.ldc.upenn.edu/LDC2005T06
ISBN: 1-58563-329-1
ISLRN: 008-710-816-829-0
DOI: 10.35111/9n1n-0q43
Language:English
Mandarin Chinese
Language (ISO639):eng
cmn
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Rights Holder:Portions © 2002-2004 Xinhua News Agency, 2002-2004 Agence France-Presse, © 2005 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2005T06
DateStamp:  2021-11-29
GetRecord:  OAI-PMH request for simple DC format
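The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch of how such a request URL is assembled, the snippet below uses the standard OAI-PMH query parameters (`verb`, `identifier`, `metadataPrefix`). The base URL is a placeholder, not the actual endpoint of this archive, and the helper name `get_record_url` is hypothetical. The prefix `oai_dc` is the simple Dublin Core format defined by OAI-PMH; `olac` is assumed here to be the prefix for the OLAC format.

```python
from urllib.parse import urlencode


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Placeholder endpoint: substitute the repository's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

dc_url = get_record_url(BASE_URL, "oai:www.ldc.upenn.edu:LDC2005T06", "oai_dc")
olac_url = get_record_url(BASE_URL, "oai:www.ldc.upenn.edu:LDC2005T06", "olac")
```

The colons in the identifier are percent-encoded by `urlencode`, which OAI-PMH servers are expected to decode.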

Search Info

Citation: Ma, Xiaoyi. 2005. Chinese News Translation Text Part 1. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB dcmi_Text iso639_cmn iso639_eng olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2005T06
Up-to-date as of: Mon Mar 25 7:19:46 EDT 2024