OLAC Record: Arabic Gigaword

OLAC Record
oai:www.ldc.upenn.edu:LDC2003T12

Metadata

Title: Arabic Gigaword

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Graff, David. Arabic Gigaword LDC2003T12. Web Download. Philadelphia: Linguistic Data Consortium, 2003

Contributor: Graff, David

Date (W3CDTF): 2003

Date Issued (W3CDTF): 2003-07-22

Description:

*Introduction*

Arabic Gigaword was produced by the Linguistic Data Consortium (LDC) and contains approximately 1 million news documents totaling 400 million words of Arabic text. It is a comprehensive archive of newswire text data acquired from Arabic news sources by LDC at the University of Pennsylvania. Four distinct sources of Arabic newswire are represented here:

* Agence France Presse (AFA)
* Al Hayat News Agency (ALH)
* An Nahar News Agency (ANN)
* Xinhua News Agency (XIN)

Much of the AFP content in this collection was published previously by LDC in Arabic Newswire Part 1 (LDC2001T55), and some of it was also included in an Arabic supplement to TDT3 (Topic Detection and Tracking) and as the Arabic component of TDT4. TDT4 also included a four-month sample from Al Hayat and An Nahar (October 2000 - January 2001). Apart from that, all of the Al Hayat, An Nahar, and Xinhua Arabic content, as well as AFP content for 2001-2002, is being released here for the first time.

*Data*

There are 319 files, totaling approximately 1.1 GB in compressed form, 4.3 GB uncompressed, and 391,619 K-words (thousands of words). The table below lists, for each source, the number of files, the number of K-words (space-separated tokens in the text, excluding SGML tags, in thousands), and the number of documents.

Source   #Files   K-words   #DOCs
AFA      104      94,484    516,855
ALH      95       139,501   305,250
ANN      96       140,247   327,768
XIA      24       17,387    106,846
TOTAL    319      391,619   1,256,719

All text files in this corpus have been converted to UTF-8 character encoding. Owing to the use of UTF-8, the SGML tagging within each file shows up as lines of single-byte-per-character (ASCII) text, whereas lines of actual text data, including article headlines and datelines, contain a mixture of single-byte and multi-byte characters. In general, single-byte characters in the text data consist of digits and punctuation marks (where the original source relied on ASCII punctuation codes rather than Arabic-specific punctuation), whereas multi-byte characters consist of Arabic letters and a small number of special punctuation marks or other symbols. This variable-width character encoding is intrinsic to UTF-8, and all UTF-8 capable processes will handle the data appropriately.

Each file contains all the usable data received by LDC for the given month from the given news source. All text data are presented in SGML form, using a very simple, minimal markup structure. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using the DTD file provided in the publication. Unlike older corpora, the present corpus uses only the information structure that is common to all sources and serves a clear function: headline, dateline, and core news content (usually containing paragraphs).

All sources have received uniform treatment in terms of quality control, and their DOCs have been categorized into three distinct "types":

* story: This type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences.
* multi: This type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event: "summaries of today's news," or "news briefs in ... (some general area like finance or sports)," and so on.
* other: These DOCs clearly do not fall into either of the above types; these are things like lists of sports scores, stock prices, temperatures around the world, and so on.

The general strategy for categorizing DOCs into these three classes was, for each source, to discover the most common and frequent clues in the text stream that correlated with the "non-story" types. When none of the known clues was in evidence, the DOC was classified as a "story."

Previous "Gigaword" corpora (in English and Chinese) had a fourth category, "advis" (for "advisory"), which applied to DOCs that contain text intended solely for news service editors, not the news-reading public. In preparing the Arabic data, the task of determining patterns for assigning "non-story" type labels was carried out by a native speaker of Arabic. For whatever reason, this person did not find the "advis" category to be applicable to any of the data.

*Samples*

For an example of the data in this corpus, please view this sample (TXT).

*Updates*

This edition of Arabic Gigaword has been superseded by a new edition, LDC2006T02.
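
The K-words figures in the Data section are defined as counts of space-separated tokens excluding SGML tags, and each DOC carries one of the three types described there. The following Python sketch shows one way such counts could be reproduced for a single file's contents. It is an illustration under stated assumptions, not LDC's own tooling: the tag names and the type attribute in the sample (DOC, HEADLINE, TEXT, P, type="...") and the identifier EXAMPLE0001 are assumed for the example and are not specified in this record; the DTD shipped with the corpus is the authority on the real markup.

```python
import re
from collections import Counter

# Assumption: each document opens with a tag like <DOC id="..." type="story">.
# The actual element and attribute names are defined by the corpus DTD.
DOC_OPEN = re.compile(r'<DOC\b[^>]*\btype="([^"]+)"[^>]*>')
TAG = re.compile(r"<[^>]+>")


def summarize(sgml_text):
    """Return (token_count, Counter of DOC types) for one file's contents."""
    doc_types = Counter(DOC_OPEN.findall(sgml_text))
    # Replace every SGML tag with a space, then count whitespace-separated
    # tokens. str.split() splits on any whitespace, which approximates the
    # "space-separated tokens" definition used in the record.
    tokens = TAG.sub(" ", sgml_text).split()
    return len(tokens), doc_types


# Hypothetical miniature document, already decoded from UTF-8.
sample = (
    '<DOC id="EXAMPLE0001" type="story">\n'
    "<HEADLINE>\nعنوان الخبر\n</HEADLINE>\n"
    "<TEXT>\n<P>\nهذا نص تجريبي قصير .\n</P>\n</TEXT>\n"
    "</DOC>\n"
)

n_tokens, doc_types = summarize(sample)
print(n_tokens, dict(doc_types))
```

For a real file, the text would first be read with encoding="utf-8" (after decompression), and dividing the token count by 1,000 gives a figure comparable to the K-words column.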

Extent: Corpus size: 1153433 KB

Identifier: LDC2003T12

https://catalog.ldc.upenn.edu/LDC2003T12

ISBN: 1-58563-271-6

ISLRN: 537-362-711-928-4

DOI: 10.35111/ep1n-de95

Language: Standard Arabic

Language (ISO639): arb

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2003T12

Rights Holder: Portions © 1994-2002 Agence France Presse, © 1994-2001 Al Hayat News Agency, © 1995-2002 An Nahar News Agency, © 2001-2003 Xinhua News Agency, © 2003 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2003T12

DateStamp: 2024-09-10

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Graff, David. 2003. Linguistic Data Consortium.
Terms: area_Asia country_SA dcmi_Text iso639_arb olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2003T12
Up-to-date as of: Wed Oct 29 7:00:16 EDT 2025
