OLAC Record

Title:Tagged Chinese Gigaword
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Huang, Chu-Ren. Tagged Chinese Gigaword LDC2007T03. Web Download. Philadelphia: Linguistic Data Consortium, 2007.
Contributor:Huang, Chu-Ren
Date (W3CDTF):2007
Date Issued (W3CDTF):2007-06-20
Description:*Introduction*

Tagged Chinese Gigaword, created by scholars at Academia Sinica, Taipei, Taiwan, is the part-of-speech tagged version of the LDC's Chinese Gigaword Second Edition LDC2005T14. It contains all of the data in Chinese Gigaword Second Edition -- from Central News Agency (Taiwan), Xinhua News Agency and Lianhe Zaobao -- annotated with full part-of-speech tags.

To avoid problems or confusion that could result from differences in character-set specifications in the source data, all text files in this corpus have been converted to UTF-8 character encoding. With some exceptions described in the readme file, all characters in the text are either single-byte ASCII or multi-byte Chinese.

The documents (DOCs) from all sources have been categorized into four distinct "types":

* story: This type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences.
* multi: This type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event; examples include "summaries of today's news," "news briefs in ..." (some general area like finance or sports), and so on.
* advis: These are DOCs which the news service addresses to news editors; they are not intended for publication to the "end users."
* other: These DOCs clearly do not fall into any of the above types; they include items such as lists of sports scores, stock prices, temperatures around the world, and so on.

*Data*

The table below lists, for each source, the number of files, their compressed and uncompressed sizes, the number of words and the number of documents.

#Files = number of files
Rzip-MB = compressed size in megabytes
Totl-MB = uncompressed size in megabytes
K-wrds = number of words in thousands
#DOCs = number of documents

Source    #Files  Rzip-MB  Totl-MB   K-wrds    #DOCs
CNA_CMN      168      994     7363   792195  1769953
XIN_CMN      168      615     4535   471110   992261
ZBN_CMN       10       40      223    28066    41418
TOTAL        346     1648    12121  1291371  2803632

The following tables present the quantity of "#DOCs" and "K-wrds", divided by source and DOC type:

                   #DOCs   K-wrds
type="advis":
  CNA_CMN           8160      751
  XIN_CMN           6553      711
  ZBN_CMN              0        0
  TOTAL            14713     1462
type="multi":
  CNA_CMN          30552    23429
  XIN_CMN          11329     7516
  ZBN_CMN             55       41
  TOTAL            41936    30986
type="other":
  CNA_CMN         100758    40258
  XIN_CMN          31255     9999
  ZBN_CMN            279      130
  TOTAL           132292    50387
type="story":
  CNA_CMN        1630483   727748
  XIN_CMN         943132   452878
  ZBN_CMN          41084    27898
  TOTAL          2614691  1208524

The performance of the CKIP segmentation and POS tagging system was tested in Bakeoff 2005 and Bakeoff 2006. The results are shown below:

              Doc#  RefWord#  TestWord#  MatchWord#  Recall (%)  Precision (%)  F-Score (%)
Bakeoff 2005   190    116509     116443      112091        96.2           96.3         96.2
Bakeoff 2006   148     90405      90327       87332        96.6           96.7         96.6

Note:
Recall = MatchWord# / RefWord#
Precision = MatchWord# / TestWord#
F-Score = 2 * Recall * Precision / (Recall + Precision)

*Samples*

For an example of the data contained in this corpus, please view this screen capture (jpg) of the annotated text.
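As a quick check of the recall, precision and F-score formulas in the note above, the following minimal Python sketch recomputes the percentages from the reported word counts. The function names `prf` and `as_percentages` and the `BAKEOFF` dictionary are illustrative choices, not part of the corpus or of the CKIP system; only the counts are taken from the record.

```python
def prf(ref_words, test_words, match_words):
    """Return (recall, precision, f_score) as fractions, per the note above."""
    recall = match_words / ref_words
    precision = match_words / test_words
    f_score = 2 * recall * precision / (recall + precision)
    return recall, precision, f_score


def as_percentages(ref_words, test_words, match_words):
    """Same three scores, expressed in percent and rounded to one decimal."""
    return tuple(round(100 * v, 1) for v in prf(ref_words, test_words, match_words))


# (RefWord#, TestWord#, MatchWord#) as listed in the results table.
BAKEOFF = {
    "Bakeoff 2005": (116509, 116443, 112091),
    "Bakeoff 2006": (90405, 90327, 87332),
}

for name, counts in BAKEOFF.items():
    recall, precision, f_score = as_percentages(*counts)
    print(f"{name}: recall={recall} precision={precision} f-score={f_score}")
```

Under these formulas the recomputed values agree with the rounded percentages reported for both evaluations.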
Extent:Corpus size: 2527068 KB
ISBN: 1-58563-409-3
ISLRN: 614-675-002-053-4
DOI: 10.35111/ckna-1h68
Language:Mandarin Chinese
Language (ISO639):cmn
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2007T03
Rights Holder:Portions © 2005-2007 Academia Sinica, © 1991-2004 Central News Agency (Taiwan), © 2000-2003 SPH AsiaOne, Ltd., © 1990-2004 Xinhua News Agency, © 2005, 2007 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text


Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2007T03
DateStamp:  2020-11-30
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Huang, Chu-Ren. 2007. Linguistic Data Consortium.
Terms: area_Asia country_CN dcmi_Text iso639_cmn olac_primary_text

Up-to-date as of: Tue May 7 7:24:49 EDT 2024