OLAC Record: Chinese English News Magazine Parallel Text

OLAC Record
oai:www.ldc.upenn.edu:LDC2005T10

Metadata

Title: Chinese English News Magazine Parallel Text

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Ma, Xiaoyi. Chinese English News Magazine Parallel Text LDC2005T10. Web Download. Philadelphia: Linguistic Data Consortium, 2005

Contributor: Ma, Xiaoyi

Date (W3CDTF): 2005

Date Issued (W3CDTF): 2005-06-15

Description:

*Introduction*

Chinese English News Magazine Parallel Text was developed by the Linguistic Data Consortium (LDC) and contains Chinese news stories (20 million characters) and their English translations (9 million words), aligned at the sentence level. The data consists of content from Sinorama Magazine, Taiwan, published from 1976 to 2004 and collected by LDC. It totals 6,366 story pairs and 365,568 sentence pairs.

*Data*

Sinorama Magazine is published monthly in several languages, including Chinese, English, and Japanese. LDC received its 1976 to 2000 publications on a single CD, and its 2001 to 2004 publications via Sinorama's website. The Sinorama Chinese text was encoded in Big5. The data came aligned by story but lacked sentence-level alignment, which was done at LDC using Champollion v 1.1.

The data directory contains subdirectories for Chinese documents, English documents, and the sentence-level alignment. The English and Chinese files may contain one or more documents, with each document formatted in SGML. The documents are tagged with DOCIDs, and each segment (generally a sentence) in a document is given a numerical SEG ID, starting at one for each document.

The alignment files contain SGML-formatted lines that map the English translations to their Chinese counterparts by specifying segment IDs in the form of EnglishSegId= "#" ChineseSegId= "#". The EnglishSegId and ChineseSegId fields may each have none, one, or more than one segment ID.

*Samples*

The following files provide an example of this corpus:

* Chinese (TXT)
* English (TXT)
* Alignment (TXT)

*Updates*

None at this time.
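
The alignment format described above can be read with a few lines of Python. The following is a minimal sketch rather than LDC's own tooling: the record only gives the attribute names EnglishSegId and ChineseSegId, so the enclosing tag name (`alignment`), the separator between multiple IDs, and the function name `parse_alignment_line` are all assumptions made for illustration. To stay independent of the separator, the sketch simply collects every run of digits inside each attribute value.

```python
import re
from typing import List, NamedTuple, Optional


class Alignment(NamedTuple):
    english: List[int]  # English SEG IDs (may be empty)
    chinese: List[int]  # Chinese SEG IDs (may be empty)


# Matches EnglishSegId="..." or ChineseSegId= "...", tolerating spaces around "=".
_ATTR = re.compile(r'(EnglishSegId|ChineseSegId)\s*=\s*"([^"]*)"')


def parse_alignment_line(line: str) -> Optional[Alignment]:
    """Return the segment IDs on one alignment line, or None if the line
    does not carry both attributes (hypothetical helper, not part of the corpus)."""
    attrs = dict(_ATTR.findall(line))
    if "EnglishSegId" not in attrs or "ChineseSegId" not in attrs:
        return None
    # Assumption: IDs are decimal integers; any separator between them is ignored.
    english = [int(tok) for tok in re.findall(r"\d+", attrs["EnglishSegId"])]
    chinese = [int(tok) for tok in re.findall(r"\d+", attrs["ChineseSegId"])]
    return Alignment(english, chinese)


if __name__ == "__main__":
    # Hypothetical sample line; the real tag name and separator may differ.
    print(parse_alignment_line('<alignment EnglishSegId= "1" ChineseSegId= "1,2">'))
```

An empty list on either side corresponds to the "none" case mentioned in the description, that is, a segment with no counterpart in the other language.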

Identifier: LDC2005T10

https://catalog.ldc.upenn.edu/LDC2005T10

ISBN: 1-58563-333-X

ISLRN: 629-451-208-314-7

DOI: 10.35111/28bx-hc14

Language: Mandarin Chinese

Language (ISO639): cmn

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Rights Holder: Portions © 1976-2004 Sinorama Magazine

Portions © 2005 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2005T10

DateStamp: 2021-11-12

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Ma, Xiaoyi. 2005. Linguistic Data Consortium.
Terms: area_Asia country_CN dcmi_Text iso639_cmn olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2005T10
Up-to-date as of: Wed Oct 29 7:00:26 EDT 2025
