OLAC Record
oai:www.ldc.upenn.edu:LDC2009T27

Metadata
Title:Chinese Gigaword Fourth Edition
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Parker, Robert, et al. Chinese Gigaword Fourth Edition LDC2009T27. Web Download. Philadelphia: Linguistic Data Consortium, 2009
Contributor:Parker, Robert
Graff, David
Chen, Ke
Kong, Junbo
Maeda, Kazuaki
Date (W3CDTF):2009
Date Issued (W3CDTF):2009-09-15
Description:*Introduction* Chinese Gigaword Fourth Edition, Linguistic Data Consortium (LDC) catalog number LDC2009T27 and ISBN 1-58563-527-8, is a comprehensive archive of newswire text data that the LDC has acquired over several years. This edition includes all of the contents of Chinese Gigaword Third Edition (LDC2007T38) as well as newly collected data. In addition, four entirely new sources have been added in the fourth edition: Central News Service, Guangming Daily, People's Liberation Army Daily, and People's Daily. The eight distinct international sources of Chinese newswire included in this edition are the following:
* Agence France Presse (afp_cmn)
* Central News Agency, Taiwan (cna_cmn)
* Central News Service (cns_cmn)
* Guangming Daily (gmw_cmn)
* People's Daily (pda_cmn)
* People's Liberation Army Daily (pla_cmn)
* Xinhua News Agency (xin_cmn)
* Zaobao Newspaper (zbn_cmn)
The seven-character codes in parentheses above are used for the directory names and data files for each source, and are also used (in ALL_CAPS) as part of the unique DOC id string assigned to each news article.
*Data* The original data received by the LDC from AFP, People's Liberation Army Daily, Xinhua, and Zaobao were encoded in GB-2312, those from CNA were in Big-5, and those from GMW, CNS, and People's Daily were in a combination of GB-2312 and GB-18030. To avoid the problems and confusion that could result from differences in character-set specifications, all text files in this corpus have been converted to UTF-8 character encoding.
*New in the Fourth Edition*
* Two years' worth of new articles (January 2007 through December 2008) have been added to the Xinhua, Agence France Presse, and CNA data sets.
* Four new data sources have been added: Guangming Daily, Central News Service, People's Daily, and People's Liberation Army Daily, covering a timespan from November 2006 through December 2008.
*Sponsorship* This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
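The source codes and original encodings listed in the description can be sketched as a small lookup table. The following is a minimal illustration only, not the LDC's conversion procedure: the function names and the sample DOC id are hypothetical, the record does not state the full DOC id format (only that the upper-cased code is part of it), and decoding the mixed GB-2312/GB-18030 sources as GB-18030 is an assumption that relies on GB-18030 being a superset of GB-2312.

```python
# Source code -> (source name, original encoding), taken from the description.
# ASSUMPTION: sources described as a mix of GB-2312 and GB-18030 are decoded
# as GB-18030, which is a superset of GB-2312.
SOURCES = {
    "afp_cmn": ("Agence France Presse", "gb2312"),
    "cna_cmn": ("Central News Agency, Taiwan", "big5"),
    "cns_cmn": ("Central News Service", "gb18030"),
    "gmw_cmn": ("Guangming Daily", "gb18030"),
    "pda_cmn": ("People's Daily", "gb18030"),
    "pla_cmn": ("People's Liberation Army Daily", "gb2312"),
    "xin_cmn": ("Xinhua News Agency", "gb2312"),
    "zbn_cmn": ("Zaobao Newspaper", "gb2312"),
}


def source_code_of(doc_id: str) -> str:
    """Return the lower-case source code whose ALL_CAPS form occurs in doc_id."""
    for code in SOURCES:
        if code.upper() in doc_id:
            return code
    raise ValueError(f"no known source code in DOC id: {doc_id!r}")


def to_utf8(raw: bytes, code: str) -> bytes:
    """Re-encode bytes from a source's original encoding to UTF-8."""
    _name, encoding = SOURCES[code]
    return raw.decode(encoding).encode("utf-8")


# Hypothetical DOC id, used only to exercise the lookup.
example_id = "XIN_CMN_EXAMPLE_0001"
code = source_code_of(example_id)
utf8_bytes = to_utf8("新华社".encode("gb2312"), code)
```

Because the distributed corpus is already UTF-8, `to_utf8` only illustrates the kind of conversion the description mentions; it is not needed to read the released files.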
Extent:Corpus size: 3040870 KB
Identifier:LDC2009T27
https://catalog.ldc.upenn.edu/LDC2009T27
ISBN: 1-58563-527-8
ISLRN: 261-416-300-929-8
DOI: 10.35111/abt0-qy36
Language:Mandarin Chinese
Language (ISO639):cmn
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2009T27
Rights Holder:Portions © 2000-2008 Agence France Presse, © 1991-2008 Central News Agency (Taiwan), © 2006-2008 China Military Online, © 2006-2008 Chinanews.com, © 2006-2008 Guangming Daily, © 2006-2008 People's Daily, © 1998, 2000-2003 SPH AsiaOne, Ltd., © 1990-2008 Xinhua News Agency, © 2003, 2005, 2007, 2009 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2009T27
DateStamp:  2020-11-30
GetRecord:  OAI-PMH request for simple DC format
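
The GetRecord entries above correspond to OAI-PMH requests for this record's identifier. As a rough sketch of how such a request URL is formed, the snippet below assembles the standard `verb`, `identifier`, and `metadataPrefix` query parameters. The base URL is a hypothetical placeholder (the record does not give the repository endpoint), and `olac` as the metadata prefix for the OLAC format is an assumption; `oai_dc` is the prefix OAI-PMH defines for simple Dublin Core.

```python
from urllib.parse import urlencode

# ASSUMPTION: placeholder endpoint; the real repository base URL is not
# given in this record.
BASE_URL = "https://example.org/oai"

OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2009T27"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL (no network access)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# "olac" is assumed to be the prefix for the OLAC format.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
# "oai_dc" is the simple Dublin Core prefix defined by OAI-PMH.
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")
```

The function only builds the URL string; fetching and parsing the XML response is left out.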

Search Info

Citation: Parker, Robert; Graff, David; Chen, Ke; Kong, Junbo; Maeda, Kazuaki. 2009. Chinese Gigaword Fourth Edition. Linguistic Data Consortium.
Terms: area_Asia country_CN dcmi_Text iso639_cmn olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2009T27
Up-to-date as of: Thu Oct 24 7:30:26 EDT 2024