OLAC Record: Arabic Gigaword Third Edition

OLAC Record
oai:www.ldc.upenn.edu:LDC2007T40

Metadata

Title: Arabic Gigaword Third Edition

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Graff, David. Arabic Gigaword Third Edition LDC2007T40. Web Download. Philadelphia: Linguistic Data Consortium, 2007

Contributor: Graff, David

Date (W3CDTF): 2007

Date Issued (W3CDTF): 2007-11-20

Description: *Introduction*

Arabic Gigaword Third Edition is a comprehensive archive of newswire text data acquired from Arabic news sources by the Linguistic Data Consortium (LDC) at the University of Pennsylvania. It includes all of the content of Arabic Gigaword Second Edition (LDC2006T02) as well as new data collected after the publication of that edition. An archive from a new newswire source, Assabah, has also been included in the third edition.

The six distinct sources of Arabic newswire represented in the third edition are:

* Agence France Presse (afp_arb)
* Assabah (asb_arb)
* Al Hayat (hyt_arb)
* An Nahar (nhr_arb)
* Ummah Press (umh_arb)
* Xinhua News Agency (xin_arb)

The seven-character codes in the parentheses above consist of a three-character source ID and the three-character language code ("arb"), separated by an underscore ("_") character.

The epochs and document counts for the data in the third edition are set forth below:

Newly Added Data

Source                 Date Span          Document Count
Agence France Presse   2005.01 - 2006.12  137815
Assabah News Agency    2004.09 - 2006.12  15410 (new source)
Al Hayat News Agency   2005.01 - 2006.1   8799 (no data for 2004)
An Nahar News Agency   2005.01 - 2006.12  104950 (no data for 2004)
Xinhua News Agency     2005.01 - 2006.12  135472

*Data*

This release contains 547 files, totalling approximately 1.8 GB in compressed form (6,673 MB uncompressed), with 576,799 K-words in 1,994,735 documents. The table below shows data quantity by source under the following categories: data source (Source); the number of files per source (#Files); compressed file size (Gzip-MB); uncompressed file size (Totl-MB); the number of space-separated word tokens in the text (K-words); and the number of documents per source (#DOCs).

Data Sources and Quantities

Source   #Files  Gzip-MB  Totl-MB  K-words    #DOCs
afp_arb     152      441     1806   147612   798436
asb_arb      28       23       77     6587    15410
hyt_arb     142      559     1932   171502   378353
nhr_arb     134      612     2172   193732   449340
umh_arb      24        4       14     1201     4645
xin_arb      67      171      672    56165   348551
TOTAL       547     1810     6673   576799  1994735

All text files in this corpus have been converted to UTF-8 character encoding. Certain data and formatting issues observed in previous releases of Arabic Gigaword have been normalized in the third edition:

* Approximately 15,000 stories from older AFP files (1994 - 2002) were very brief documents whose text content was not recognized as such; in those cases, the TEXT element appeared empty while the HEADLINE element contained anywhere from three to several lines of text. The content of these documents has been rearranged: the first line remains as the headline and the rest of the lines have been moved into the text segment. All stories of this sort had originally been classified as "other", and that classification has not been changed in this edition.
* Al Hayat data from 2002 and 2003 contained some Arabic-Indic digits, despite the intention to convert all digit strings to ASCII digit characters for consistency. The digits have now been converted to the ASCII range. For more details about the encoding challenges presented by this data, see the readme file accompanying this corpus.
* Some Al Hayat data had stray angle-bracket characters ("<" and ">"), which have been rendered as "&lt;" and "&gt;". There were also some defective "Doc-ID" strings (the 'id' attribute in the "<DOC>" tag that begins each news story) in the January 2001 data.
* Some An Nahar data had "bare" ampersand characters ("&"), which have been rendered as "&amp;".
* Some Xinhua documents included empty sub-elements (HEADLINE, DATELINE and/or TEXT sections containing no data). When HEADLINE or DATELINE was empty, the tags were removed. When the TEXT segment was empty, the document as a whole was removed.
* In several Xinhua stories, the Doc-ID string, which is supposed to provide the year, month, date and sequence number for the story, had become garbled, yielding an incorrect or impossible date string. A separate data file in the "docs" directory, called "docid_changes.txt", lists the changes in document inventory and Doc-ID strings.
* Xinhua stories typically end with a formulaic Arabic string (meaning "end-of-story"), which should not have been included as part of the final paragraph in each story.
* In general, consistent line-wrapping was applied to make the overall text presentation consistent across all sources and with Gigaword releases in other languages. The markup pattern was also applied consistently for all sources without exception.

*Samples*

For an example of the data contained in this corpus, please view this image of sample text.

*Sponsorship*

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
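Two of the normalizations described under *Data* (Arabic-Indic digits converted to ASCII, and bare ampersands escaped) can be illustrated with a minimal Python sketch. This is not the LDC's actual tooling, which this record does not document. The function names are hypothetical, and the rule used to decide that an ampersand is "bare" (it does not begin a named or numeric character reference) is an assumption.

```python
import re

# Arabic-Indic digits occupy U+0660..U+0669; map each one to ASCII '0'..'9'.
ARABIC_INDIC_TO_ASCII = {0x0660 + i: str(i) for i in range(10)}

# Assumption: an ampersand is "bare" unless it starts a character reference
# such as &amp;, &#1605; or &#x645;.
BARE_AMPERSAND = re.compile(
    r"&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)"
)


def normalize_digits(text: str) -> str:
    """Replace Arabic-Indic digits with their ASCII equivalents."""
    return text.translate(ARABIC_INDIC_TO_ASCII)


def escape_bare_ampersands(text: str) -> str:
    """Rewrite bare '&' as '&amp;', leaving existing references untouched."""
    return BARE_AMPERSAND.sub("&amp;", text)


if __name__ == "__main__":
    print(normalize_digits("\u0662\u0660\u0660\u0663"))
    print(escape_bare_ampersands("A & B &amp; C"))
```

Running the escaping step a second time leaves the text unchanged, because every ampersand it produces already begins a character reference.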

Extent: Corpus size: 1887436 KB

Identifier: LDC2007T40

https://catalog.ldc.upenn.edu/LDC2007T40

ISBN: 1-58563-460-3

ISLRN: 769-869-926-619-3

DOI: 10.35111/rmss-cp38

Language: Standard Arabic

Language (ISO639): arb

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2007T40

Rights Holder: Portions © 1994-2006 Agence France Presse, © 2004-2006 Assabah, © 1994-2003, 2005-2006 Al Hayat, © 1995-2006 An Nahar, © 2003-2004 Ummah Press Service, © 2001-2006 Xinhua News Agency, © 2003, 2005-2007 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2007T40

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Graff, David. 2007. Linguistic Data Consortium.
Terms: area_Asia country_SA dcmi_Text iso639_arb olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2007T40
Up-to-date as of: Fri Aug 8 0:27:44 EDT 2025
