OLAC Record oai:www.ldc.upenn.edu:LDC2003T06 |
Metadata | ||
Title: | Arabic Treebank: Part 1 v 2.0 | |
Access Rights: | Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining | |
Bibliographic Citation: | Maamouri, Mohamed, et al. Arabic Treebank: Part 1 v 2.0 LDC2003T06. Web Download. Philadelphia: Linguistic Data Consortium, 2003 | |
Contributor: | Maamouri, Mohamed | |
Bies, Ann | ||
Jin, Hubert | ||
Buckwalter, Tim | ||
Date (W3CDTF): | 2003 | |
Date Issued (W3CDTF): | 2003-02-03 | |
Description: | *Introduction*
Arabic Treebank: Part 1 v 2.0 was developed by the Linguistic Data Consortium (LDC) and contains approximately 140,000 tokens of Arabic text with part-of-speech (POS) and treebank annotation.
The goal of the Arabic Treebank project is to support the development of data-driven approaches to natural language processing (NLP), human language technologies, automatic content extraction (topic extraction and/or grammar extraction), cross-lingual information retrieval, information detection, and general linguistic research on Modern Standard Arabic.
LDC was sponsored to develop an Arabic POS and Treebank of 1 million words. The Penn Arabic Treebank started in November 2001 as part of the DARPA TIDES project, with the objective of annotating a large machine-readable Arabic text corpus both manually and automatically. It is particularly suitable for language developers, computational linguists, and computer scientists who are interested in various aspects of NLP. This corpus is the release of Part 1 of that project.
The subsequent versions of this corpus are:
* Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis) (LDC2005T02)
* Arabic Treebank: Part 1 v 4.1 (LDC2010T13)
*Data*
The following table gives a breakdown of the data contained in the entire Arabic Treebank project and shows where versions of Parts 1, 2, and 3 differ. The fields are: source, number of stories, total number of tokens, number of tokens after clitic separation, and number of Arabic word tokens after punctuation, numbers, and Latin strings have been removed. Where versions differ, the totals at the bottom are calculated from the latest version. The totals omit tokens after clitic separation because that figure is not available for Part 4.
| Part | Source | Stories | Total Tokens | Tokens After Clitic Separation | Arabic Word Tokens |
|---|---|---|---|---|---|
| 1 (V 2.0) | Agence France Presse | 734 | 140,265 | 168,123 | N/A |
| 1 (V 3.0 and 4.1) | Agence France Presse | 734 | 145,386 | 166,068 | 123,795 |
| 2 (V 2.0) | Ummah Press | 501 | 144,199 | 168,297 | 125,698 |
| 2 (V 3.1) | Ummah Press | 501 | 144,199 | 169,319 | 125,709 |
| 3 (V 1.0 and 2.0) | An Nahar News Agency | 600 | 340,281 | 400,213 | 293,035 |
| 3 (V 3.2) | An Nahar News Agency | 599 | 339,710 | 402,291 | 292,554 |
| 4 | Assabah | 397 | 161,915 | N/A | 146,491 |
| Totals | | 2,231 | 791,210 | | 688,549 |

This corpus uses Modern Standard Arabic text from the Agence France Presse (AFP) newswire archives for July - November 2000 later released in Arabic Gigaword (LDC2003T12). For this work, annotators must be native speakers of Arabic, and they must understand enough linguistics to check morphosyntactic analysis and build syntactic structures.
*Samples*
For examples of the data in this corpus, please view these samples:
* Treebank Sample (TXT)
* POS Sample (TXT)
* SGML Sample
* XML Sample
*Updates*
None at this time. | |
Extent: | Corpus size: 271360 KB | |
Identifier: | LDC2003T06 | |
https://catalog.ldc.upenn.edu/LDC2003T06 | ||
ISBN: 1-58563-261-9 | ||
ISLRN: 333-321-196-670-5 | ||
DOI: 10.35111/vfdx-p575 | ||
Language: | Standard Arabic | |
Language (ISO639): | arb | |
License: | LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf | |
Medium: | Distribution: Web Download | |
Publisher: | Linguistic Data Consortium | |
Publisher (URI): | https://www.ldc.upenn.edu | |
Relation (URI): | https://catalog.ldc.upenn.edu/docs/LDC2003T06 | |
Rights Holder: | Portions © 2000 Agence France-Presse, © 2002 Trustees of the University of Pennsylvania | |
Type (DCMI): | Text | |
Type (OLAC): | primary_text | |
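The Description above states that the totals in the data table are calculated from the latest version of each part. As a minimal sketch of that arithmetic, the snippet below sums the latest-version rows copied from the table. The dictionary layout and field names are illustrative assumptions, not part of the LDC record.

```python
# Latest-version rows copied from the data table in the Description.
# The keys and field names are illustrative, not an LDC schema.
latest_rows = {
    "Part 1 (V 3.0 and 4.1)": {"stories": 734, "total_tokens": 145_386, "arabic_word_tokens": 123_795},
    "Part 2 (V 3.1)": {"stories": 501, "total_tokens": 144_199, "arabic_word_tokens": 125_709},
    "Part 3 (V 3.2)": {"stories": 599, "total_tokens": 339_710, "arabic_word_tokens": 292_554},
    "Part 4": {"stories": 397, "total_tokens": 161_915, "arabic_word_tokens": 146_491},
}

# Tokens after clitic separation are left out, as in the record,
# because that figure is not available for Part 4.
fields = ("stories", "total_tokens", "arabic_word_tokens")
totals = {field: sum(row[field] for row in latest_rows.values()) for field in fields}

for field in fields:
    print(f"{field}: {totals[field]:,}")
```

The sums reproduce the Totals row of the table: 2,231 stories, 791,210 total tokens, and 688,549 Arabic word tokens.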
OLAC Info |
||
Archive: | The LDC Corpus Catalog | |
Description: | http://www.language-archives.org/archive/www.ldc.upenn.edu | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info |
||
OaiIdentifier: | oai:www.ldc.upenn.edu:LDC2003T06 | |
DateStamp: | 2021-08-06 | |
GetRecord: | OAI-PMH request for simple DC format | |
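The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch, such a request is an HTTP query with a `verb`, an `identifier`, and a `metadataPrefix` (`oai_dc` is the prefix OAI-PMH defines for simple Dublin Core). The record does not state the repository's base URL, so `BASE_URL` below is a hypothetical placeholder, and `get_record_url` is an illustrative helper name.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the record does not give the OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL (illustrative helper)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Identifier taken from the OaiIdentifier field of this record.
url = get_record_url(BASE_URL, "oai:www.ldc.upenn.edu:LDC2003T06", "oai_dc")
print(url)
```

Requesting the OLAC format instead would use a different `metadataPrefix` value, which depends on what the repository advertises.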
Search Info | ||
Citation: | Maamouri, Mohamed; Bies, Ann; Jin, Hubert; Buckwalter, Tim. 2003. Linguistic Data Consortium. | |
Terms: | area_Asia country_SA dcmi_Text iso639_arb olac_primary_text |