OLAC Record oai:www.ldc.upenn.edu:LDC2005T06 |
Metadata | ||
Title: | Chinese News Translation Text Part 1 | |
Access Rights: | Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining | |
Bibliographic Citation: | Ma, Xiaoyi. Chinese News Translation Text Part 1 LDC2005T06. Web Download. Philadelphia: Linguistic Data Consortium, 2005 | |
Contributor: | Ma, Xiaoyi | |
Date (W3CDTF): | 2005 | |
Date Issued (W3CDTF): | 2005-03-15 | |
Description: | *Introduction* Chinese News Translation Text Part 1 was developed by the Linguistic Data Consortium (LDC) and contains approximately 474,000 characters of Chinese text and corresponding English translations totalling approximately 285,000 words. All stories in this corpus were collected, and all translations made, as Machine Translation (MT) training data for DARPA's Translingual Information Detection, Extraction and Summarization (TIDES) program. They were selected and translated in different LDC projects between February 2003 and January 2005. Translation services were provided by seven translation agencies following roughly the same guidelines and procedures, and each Chinese news story was translated once.
*Data* Two sources of journalistic Chinese text provided the Chinese material, collected from July 2002 to September 2002 and from April 2004 to August 2004:
* Agence France-Presse News Service: 580 news stories
* Xinhua News Service: 421 news stories
* Total: 1,001 stories
The original source files used GB encoding for the Chinese characters. They also used SGML tags to mark sentence and paragraph boundaries and other information about each story. To make things easier for translators, nearly all SGML tags were removed or replaced by "plain text" markers. Each translation team was provided with translation guidelines, which were modified several times during the development of these data. Each team began with five stories, which were checked for quality before the team took on larger amounts of data. Subsequent translation submissions were continuously monitored for conformance and quality. For the present release, the corpus content is organized into "source" and "translation" directories. The source directory and each of the human translation subdirectories contain 1,001 files, one news story per file. Corresponding source and translation files have identical file names. The source and translation files are offered in SGML format.
*Samples* For an example of the data in this corpus, please examine this translation sample (TXT).
*Updates* None at this time. | |
Identifier: | LDC2005T06 | |
https://catalog.ldc.upenn.edu/LDC2005T06 | ||
ISBN: 1-58563-329-1 | ||
ISLRN: 008-710-816-829-0 | ||
DOI: 10.35111/9n1n-0q43 | ||
Language: | English | |
Mandarin Chinese | ||
Language (ISO639): | eng | |
cmn | ||
License: | LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf | |
Medium: | Distribution: Web Download | |
Publisher: | Linguistic Data Consortium | |
Publisher (URI): | https://www.ldc.upenn.edu | |
Rights Holder: | Portions © 2002-2004 Xinhua News Agency, © 2002-2004 Agence France-Presse, © 2005 Trustees of the University of Pennsylvania | |
Type (DCMI): | Text | |
Type (OLAC): | primary_text | |
OLAC Info |
||
Archive: | The LDC Corpus Catalog | |
Description: | http://www.language-archives.org/archive/www.ldc.upenn.edu | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info |
||
OaiIdentifier: | oai:www.ldc.upenn.edu:LDC2005T06 | |
DateStamp: | 2021-11-29 | |
GetRecord: | OAI-PMH request for simple DC format | |
Search Info | ||
Citation: | Ma, Xiaoyi. 2005. Linguistic Data Consortium. | |
Terms: | area_Asia area_Europe country_CN country_GB dcmi_Text iso639_cmn iso639_eng olac_primary_text |
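
The Description above says the release is split into "source" and "translation" directories and that corresponding files have identical names. The following is a minimal sketch of how such a layout might be paired up for MT training. It is not part of the LDC release. The function name `pair_stories`, the exact directory names on disk, and the `gb18030` default encoding are assumptions: the record states only that the original source files used GB encoding, and does not state the encoding of the released files.

```python
from collections import defaultdict
from pathlib import Path


def pair_stories(corpus_root, encoding="gb18030"):
    """Yield (file_name, chinese_text, english_texts) for each paired story.

    Assumptions (not confirmed by the catalog record):
      - the corpus root contains directories named "source" and "translation";
      - translations may sit in subdirectories of "translation";
      - files decode with `encoding` (gb18030 is a superset of GB2312 and
        is ASCII-compatible, so it also reads plain English text).
    """
    root = Path(corpus_root)
    source_dir = root / "source"
    translation_dir = root / "translation"

    # Index every translation file by its bare file name.
    translations = defaultdict(list)
    for path in sorted(translation_dir.rglob("*")):
        if path.is_file():
            translations[path.name].append(path)

    # Walk the source files and look up translations with the same name.
    for source_file in sorted(source_dir.iterdir()):
        if not source_file.is_file() or source_file.name not in translations:
            continue  # skip directories and stories without a translation
        chinese = source_file.read_text(encoding=encoding, errors="replace")
        english = [
            p.read_text(encoding=encoding, errors="replace")
            for p in translations[source_file.name]
        ]
        yield source_file.name, chinese, english
```

Matching on the file name alone relies on the record's statement that corresponding file names are identical. If the released files turn out to use a different encoding, pass it through the `encoding` parameter.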