OLAC Record
oai:www.ldc.upenn.edu:LDC2005T10

Metadata
Title:Chinese English News Magazine Parallel Text
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Ma, Xiaoyi. Chinese English News Magazine Parallel Text LDC2005T10. CD. Philadelphia: Linguistic Data Consortium, 2005
Contributor:Ma, Xiaoyi
Date (W3CDTF):2005
Date Issued (W3CDTF):2005-06-15
Description:*Introduction* This file contains documentation on the Chinese English News Magazine Parallel Text, Linguistic Data Consortium (LDC) catalog number LDC2005T10 and ISBN 1-58563-333-X. This corpus contains Chinese news stories and their English translations, which LDC collected from Sinorama Magazine, Taiwan, covering 1976 to 2004. It totals 6,366 story pairs, 365,568 sentence pairs, 20M Chinese characters and 9M English words. The corpus is aligned at the sentence level.
*Data* Sinorama Magazine is published monthly in several languages, including Chinese, English, and Japanese. LDC received its 1976 to 2000 publications on a single CD, and its 2001 to 2004 publications via Sinorama's website. The Sinorama Chinese text was encoded in Big5. The data came story-aligned but lacked sentence-level alignment. The sentence alignment was done at the LDC using Champollion v 1.1. The final data is in the data directory, which contains subdirectories for Chinese documents, English documents, and the sentence-level alignment, named "Chinese," "English," and "alignment."
The English and Chinese files may contain one or more documents, with each document formatted in SGML as follows: [English or Chinese text] [English or Chinese text] [English or Chinese text] ...
Notes:
* the and tags are always assigned sequential numeric IDs, starting at one.
* the tags are always placed on the same line as their contents, and are always separated from the contents by a space.
* if an English file and a Chinese file share the same file name, they contain the same documents.
* all Chinese text is encoded in Big5.
Each alignment file contains the sentence-level alignment of multiple documents, each formatted in SGML as follows: ...
Notes:
* the docid is the same in an English file, in its Chinese translation, and in the ALIGNMENT.
* EnglishSegId and ChineseSegId may each contain zero, one, or more than one segment ID.
*Samples* The following files provide an example of this corpus:
* Chinese
* English
* Alignment
Portions © 2005 Trustees of the University of Pennsylvania
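Because the description states that all Chinese text is encoded in Big5, the Chinese files will probably need to be decoded before use with UTF-8 tooling. The following is a minimal Python sketch of that step, not part of the LDC distribution. The function names `decode_big5` and `convert_file` are hypothetical, and the choice of the standard `big5` codec is an assumption: if some characters fail to decode, Python's `cp950` or `big5hkscs` codecs may be worth trying instead.

```python
from pathlib import Path


def decode_big5(raw: bytes, errors: str = "replace") -> str:
    """Decode Big5 bytes into a Python string.

    With errors="replace", undecodable bytes become U+FFFD rather than
    raising, so one bad sequence does not abort a whole file.
    """
    return raw.decode("big5", errors=errors)


def convert_file(src: Path, dst: Path) -> None:
    """Read a Big5-encoded file (hypothetical path) and write it as UTF-8."""
    text = decode_big5(src.read_bytes())
    dst.write_text(text, encoding="utf-8")


if __name__ == "__main__":
    # Round-trip a short sample string instead of a real corpus file.
    sample = "新聞".encode("big5")
    print(decode_big5(sample))
```

Decoding before any SGML processing keeps multi-byte Big5 sequences from being split or misread as markup characters by tools that assume ASCII or UTF-8 input.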
Identifier:LDC2005T10
https://catalog.ldc.upenn.edu/LDC2005T10
ISBN: 1-58563-333-X
ISLRN: 629-451-208-314-7
Language:Mandarin Chinese
Language (ISO639):cmn
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2005T10
Rights Holder:Portions © 1976-2004 Sinorama Magazine
Portions © 2005 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2005T10
DateStamp:  2019-12-12
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Ma, Xiaoyi. 2005. Linguistic Data Consortium.
Terms: area_Asia country_CN dcmi_Text iso639_cmn olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2005T10
Up-to-date as of: Sun Aug 2 15:58:55 EDT 2020