OLAC Record: Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis)

OLAC Record
oai:www.ldc.upenn.edu:LDC2005T20

Metadata

Title: Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis)

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Maamouri, Mohamed, et al. Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis) LDC2005T20. Web Download. Philadelphia: Linguistic Data Consortium, 2005

Contributor: Maamouri, Mohamed

Bies, Ann

Buckwalter, Tim

Jin, Hubert

Mekki, Wigdan

Date (W3CDTF): 2005

Date Issued (W3CDTF): 2005-06-15

Description:

*Introduction*

Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis) was developed by the Linguistic Data Consortium (LDC) and contains approximately 300,000 Arabic word tokens with both syntactic treebank annotation and annotation on part of speech (POS), gloss, and word segmentation.

The goal of the Arabic Treebank project is to support the development of data-driven approaches to natural language processing (NLP), human language technologies, automatic content extraction (topic extraction and/or grammar extraction), cross-lingual information retrieval, information detection, and other forms of linguistic research on Modern Standard Arabic in general. LDC was sponsored to develop an Arabic POS and Treebank of 1 million words. This corpus is part three of that project.

Treebanks are language resources that annotate natural languages at various levels of structure: the word level, the phrase level, and the sentence level. They have become crucially important for both data-driven and general linguistic research. This corpus is designed for those who study and use languages either professionally or academically, and who need text corpora in their work. The Penn Arabic Treebank is particularly suitable for language developers, computational linguists, and computer scientists who are interested in various aspects of NLP. It started in November 2001 as part of the DARPA TIDES project, with the objective of annotating a large Arabic machine-readable text corpus manually and automatically.

As in previous Penn Treebanks, two different kinds of information need to be produced by two different (human and computer) processes. The Arabic Treebank project therefore consists of two distinct phases:

* Part-of-Speech (POS) tagging, which includes inflectional features and gloss information not traditionally included with POS annotation
* Arabic Treebanking (ArabicTB), which characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces, etc.

*Data*

The following table gives a breakdown of the data contained in the entire Arabic Treebank project, with discrepancies between versions for Parts 1 and 3. The fields are source, number of stories, total number of tokens, number of tokens after clitic separation, and number of Arabic word tokens after punctuation, numbers, and Latin strings have been removed. Where versions differ, the totals in the last row are calculated from the latest version. The totals do not include tokens after clitic separation, since that number is missing for Part 4.

Part                 Source                 Stories   Total Tokens   Tokens After Clitic Separation   Arabic Word Tokens
1 (V 2.0)            Agence France Presse   734       140,265        168,123                          N/A
1 (V 3.0 and 4.1)    Agence France Presse   734       145,386        166,068                          123,795
2                    Ummah Press            501       144,199        169,319                          125,709
3 (V 1.0 and 2.0)    An Nahar News Agency   600       340,281        400,213                          293,035
3 (V 3.2)            An Nahar News Agency   599       339,710        402,291                          292,554
4                    Assabah                397       161,915        N/A                              146,491
Totals                                      2,231     791,210                                         688,549

For this corpus, the An Nahar News Agency stories were taken from Arabic Gigaword (LDC2003T12). This corpus is also referred to as ANNAHAR.

The new features include complete vocalization of all Imperfect Verb mood endings: Indicative, Subjunctive, and Jussive.

Tim Buckwalter's lexicon and morphological analyzer were used to generate a candidate list of POS tags for each word. (Please note that some words do not exist in this lexicon.) The POS annotation task then consists of selecting the correct tag from that list.

This corpus has both a previous and a subsequent version. They are, respectively:

* Arabic Treebank: Part 3 v 1.0 (LDC2004T11) - POS annotation only
* Arabic Treebank: Part 3 v 3.2 (LDC2010T08) - Contains significant revisions

*Samples*

Examples of the data contained in this corpus are available as a POS sample (XML) and a Treebank sample (XML).

*Updates*

None at this time.
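
The Totals row in the Data table can be checked with a short calculation. The sketch below is illustrative only: the figures are copied from the table, the dictionary keys and the helper name `column_totals` are hypothetical, and the selection of rows follows the stated rule that only the latest version of Parts 1 and 3 is counted.

```python
# Figures copied from the Data table. For Parts 1 and 3 only the latest
# listed version is included, as the description states.
# The field names used as keys are illustrative, not part of the corpus.
LATEST_VERSIONS = {
    "Part 1 (V 3.0 and 4.1)": {"stories": 734, "total_tokens": 145_386, "arabic_word_tokens": 123_795},
    "Part 2": {"stories": 501, "total_tokens": 144_199, "arabic_word_tokens": 125_709},
    "Part 3 (V 3.2)": {"stories": 599, "total_tokens": 339_710, "arabic_word_tokens": 292_554},
    "Part 4": {"stories": 397, "total_tokens": 161_915, "arabic_word_tokens": 146_491},
}


def column_totals(rows):
    """Sum each numeric field across all rows."""
    totals = {}
    for row in rows.values():
        for field, value in row.items():
            totals[field] = totals.get(field, 0) + value
    return totals


TOTALS = column_totals(LATEST_VERSIONS)
print(TOTALS)
```

The clitic-separated token count is left out on purpose, because the table gives no value for Part 4.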

Identifier: LDC2005T20

https://catalog.ldc.upenn.edu/LDC2005T20

ISBN: 1-58563-341-0

ISLRN: 661-115-390-052-2

DOI: 10.35111/ghrm-vt27

Language: Standard Arabic

Language (ISO639): arb

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2005T20

Rights Holder: Portions © 2002 An Nahar, © 2003, 2004, 2005 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2005T20

DateStamp: 2021-09-27

GetRecord: OAI-PMH request for simple DC format
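
The two GetRecord entries correspond to OAI-PMH requests that differ only in their metadata format. As a rough sketch, such a request URL could be assembled as below. The base URL is a placeholder, since this record does not state the repository endpoint, and the helper name `get_record_url` is hypothetical. The prefix `oai_dc` is the standard OAI-PMH name for simple Dublin Core; `olac` is assumed here to be the prefix for the OLAC format.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the actual OAI-PMH base URL is not given in this record.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2005T20"


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# "olac" is an assumed prefix for the OLAC format; "oai_dc" is simple Dublin Core.
OLAC_URL = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")
DC_URL = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")
print(OLAC_URL)
print(DC_URL)
```

`urlencode` percent-encodes the colons in the identifier, which keeps the query string valid regardless of the characters an identifier contains.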

Search Info
Citation: Maamouri, Mohamed; Bies, Ann; Buckwalter, Tim; Jin, Hubert; Mekki, Wigdan. 2005. Linguistic Data Consortium.
Terms: area_Asia country_SA dcmi_Text iso639_arb olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2005T20
Up-to-date as of: Wed Oct 29 7:00:51 EDT 2025
