OLAC Record
oai:www.ldc.upenn.edu:LDC2004T08

Metadata
Title:Hong Kong Parallel Text
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Ma, Xiaoyi. Hong Kong Parallel Text LDC2004T08. Web Download. Philadelphia: Linguistic Data Consortium, 2004
Contributor:Ma, Xiaoyi
Date (W3CDTF):2004
Date Issued (W3CDTF):2004-09-01
Description:*Introduction*
Hong Kong Parallel Text was developed by the Linguistic Data Consortium (LDC) and contains approximately 49 million words of Chinese text and 59 million words of English translation aligned at the sentence level. The data comes from three sub-corpora, namely Hong Kong Hansards Parallel Text (LDC2000T50), Hong Kong Laws Parallel Text (LDC2000T47) and Hong Kong News Parallel Text (LDC2000T46). The original corpora were published in 2000.
The 2000 versions of Hong Kong Hansards Parallel Text and Hong Kong News Parallel Text are aligned at the document level, while the 2004 versions are aligned at the sentence level. The 2000 and 2004 versions of Hong Kong News Parallel Text were aligned using different sentence alignment algorithms. As a result, the 2004 version has better sentence alignment and also has slightly more data than the 2000 version.
Chinese text is presented in the traditional script and encoded as BIG5.
*Data*
Hong Kong Hansards
Hong Kong Hansards contains excerpts from the Official Record of Proceedings (hansards) of the Legislative Council of the HKSAR from October 1985 to April 2003. LDC downloaded the hansards in Chinese and English from the official website of HKSAR.
Hong Kong Laws
Hong Kong Laws contains statute laws of Hong Kong, downloaded from the Bilingual Laws Information System (BLIS), a searchable electronic database of the statute laws of Hong Kong, established and updated by the Department of Justice of the HKSAR, in 2000. The original BLIS database contains statute laws of Hong Kong in English and Chinese, constitutional instruments, national laws and other relevant instruments, collections of terms and expressions used in the laws of Hong Kong and subject indices of Ordinances. This corpus contains only statute laws of Hong Kong in English and Chinese, constitutional instruments, national laws, and other relevant instruments published up to year 2000.
Hong Kong News
Hong Kong News contains press releases from July 1997 to October 2003 from the government of HKSAR. The HKSAR publishes press releases in both Chinese and English on a daily basis. Most press releases are available in both languages; some were translated from English to Chinese, and others from Chinese to English.
Final Data Format and Validation
The original hansards files were in PDF format, and the laws and news files were in HTML format. Using automatic conversion and alignment software, LDC converted all files to plain text and aligned them at the sentence level. Sentence alignment was performed on all data using Champollion, a parallel text sentence alignment tool developed at LDC. See http://champollion.sourceforge.net for more information about Champollion.
For the English translation, there are approximately 466K unique words. The following table shows the number of documents, paragraphs, segments, words, and characters for each source:
Source              Documents  Paragraphs (English)  Paragraphs (Chinese)  Segments (English)  Segments (Chinese)  English Words  Chinese Characters
Hong Kong Hansards        714               642,008               632,173           1,688,278           1,414,573     36,140,737          56,618,181
Hong Kong Laws         42,255               423,192               462,283             451,884             491,719      8,396,243          14,868,621
Hong Kong News         44,621               605,183               603,118             811,638             775,019     14,798,671          26,677,514
Total                  87,590             1,670,383             1,697,574           2,951,800           2,681,311     59,335,651          98,164,316
*Samples*
Please view the following samples:
* Chinese
* English
* Alignment
*Updates*
There are no updates available at this time.
*Copying and Distribution*
Permission is granted to the Linguistic Data Consortium to make and distribute copies of the laws, press releases and news of Hong Kong Special Administrative Region provided this copyright notice and permission notices are distributed with all copies.
Permission has been given to the Linguistic Data Consortium to reproduce the laws, press releases, and/or news articles from the Hong Kong Special Administrative Region Government website for research, education, and technology development.
*Additional Licensing Instructions*
This 'members-only' corpus is available to current members, who can request the data at the listed reduced-license fee. Contact ldc@ldc.upenn.edu for information about becoming a member.
Extent:Corpus size: 887808 KB
Identifier:LDC2004T08
https://catalog.ldc.upenn.edu/LDC2004T08
ISBN: 1-58563-290-2
ISLRN: 619-530-254-208-2
DOI: 10.35111/byvg-fv73
Language:English
Chinese
Language (ISO639):eng
zho
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2004T08
Rights Holder:Portions © 1985-2003 The Government of the Hong Kong Special Administrative Region, © 2004 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2004T08
DateStamp:  2024-03-08
GetRecord:  OAI-PMH request for simple DC format
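
The GetRecord entries in this record (OLAC format and simple DC format) correspond to OAI-PMH requests, which combine a verb, a record identifier, and a metadataPrefix as query parameters. The following is a minimal sketch of how such request URLs could be built in Python. The base URL is a placeholder, since this record does not state the endpoint, and the `olac` prefix is an assumption about how the archive names its OLAC format; `oai_dc` is the standard OAI-PMH prefix for simple Dublin Core. The helper name `get_record_url` is hypothetical.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the record does not give the OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# Identifier copied from the OaiIdentifier field of this record.
OAI_ID = "oai:www.ldc.upenn.edu:LDC2004T08"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# "olac" is an assumed prefix name; "oai_dc" is simple Dublin Core.
olac_url = get_record_url(OAI_ID, "olac")
dc_url = get_record_url(OAI_ID, "oai_dc")
print(olac_url)
print(dc_url)
```

Replacing the placeholder with the archive's actual base URL would yield requests that can be fetched with any HTTP client; the response is an XML document containing the record.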

Search Info

Citation: Ma, Xiaoyi. 2004. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng iso639_zho olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2004T08
Up-to-date as of: Mon Mar 25 7:19:43 EDT 2024