OLAC Record
oai:www.ldc.upenn.edu:LDC2005T30

Metadata
Title:Arabic Treebank: Part 4 v 1.0 (MPG Annotation)
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Maamouri, Mohamed, et al. Arabic Treebank: Part 4 v 1.0 (MPG Annotation) LDC2005T30. Web Download. Philadelphia: Linguistic Data Consortium, 2005
Contributor:Maamouri, Mohamed
Bies, Ann
Buckwalter, Tim
Jin, Hubert
Mekki, Wigdan
Date (W3CDTF):2005
Date Issued (W3CDTF):2005-10-15
Description:*Introduction* Arabic Treebank: Part 4 v 1.0 (MPG Annotation) was developed by the Linguistic Data Consortium (LDC) and contains approximately 150,000 Arabic word tokens annotated for part of speech (POS), gloss, and word segmentation. The goal of the Arabic Treebank project is to support the development of data-driven approaches to natural language processing (NLP), human language technologies, automatic content extraction (topic extraction and/or grammar extraction), cross-lingual information retrieval, information detection, and general linguistic research on Modern Standard Arabic. LDC was sponsored to develop an Arabic POS and Treebank of 1 million words; this corpus is the fourth part of that project. The Penn Arabic Treebank, which started in November 2001 as part of the DARPA TIDES project, is particularly suitable for language developers, computational linguists, and computer scientists who are interested in various aspects of NLP. Its objective from the start has been to annotate a large machine-readable Arabic text corpus both manually and automatically.
*Data* The following table gives a breakdown of the data contained in the entire Arabic Treebank project, including the discrepancies between versions of Parts 1 and 3. The fields are source, number of stories, total number of tokens, number of tokens after clitic separation, and number of Arabic word tokens after punctuation, numbers, and Latin strings have been removed. The totals at the bottom are calculated from the latest versions where discrepancies exist, and they omit tokens after clitic separation because that number is missing for Part 4.
Part                Source                 Stories   Total Tokens   Tokens After Clitic Separation   Arabic Word Tokens
1 (V 2.0)           Agence France Presse   734       140,265        168,123                          N/A
1 (V 3.0 and 4.1)   Agence France Presse   734       145,386        166,068                          123,795
2                   Ummah Press            501       144,199        169,319                          125,709
3 (V 1.0 and 2.0)   An Nahar News Agency   600       340,281        400,213                          293,035
3 (V 3.2)           An Nahar News Agency   599       339,710        402,291                          292,554
4                   Assabah                397       161,915        N/A                              146,491
Totals                                     2,231     791,210                                         688,549
For this corpus we selected text from Assabah, a Modern Standard Arabic newspaper published in Tunis, Tunisia. There are 397 stories (specified by the DOC ID) in this corpus, dated from September to November 2004. The average number of words per story is slightly above 400. Files relating to sports, financial data, and other domains such as horoscopes were not kept in the corpus. The data was annotated with stand-off markup. The .sgm files are read-only after collection and processing. As in Arabic Treebank: Part 3 v 1.0 (LDC2004T11), headlines are also annotated in this corpus. Tim Buckwalter's lexicon and morphological analyzer were used to generate a candidate list of POS tags for each word. (Please note that some words do not exist in this lexicon.) The POS annotation task is then simply to select the correct POS tag from that list. In the data/ directory, you will find the following:
* sgm - Processed source files in SGML format. Please note that a parallel text corpus is being developed at LDC for these same 397 source files.
* xml - The AG xml files containing the POS annotation. The dtd files for the AG format are also included there. The xml files are compressed.
* pos - POS annotation output in plain text.
*Samples* To see an example of the data in this corpus, please view this sample POS file (TXT).
*Updates* None at this time.
Identifier:LDC2005T30
https://catalog.ldc.upenn.edu/LDC2005T30
ISBN: 1-58563-343-7
ISLRN: 165-794-218-631-9
DOI: 10.35111/cmrf-mr79
Language:Standard Arabic
Language (ISO639):arb
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2005T30
Rights Holder:Portions © 2004 Assabah Press Group, © 2005 Trustees of the University of Pennsylvania.
Type (DCMI):Text
Type (OLAC):primary_text
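
The totals quoted in the Description's data table can be rechecked with a few lines of Python. This is only an illustrative sketch: the figures are copied from the table (using the latest version of each part, as the Description states), while the names `LATEST_PARTS` and `column_totals` are invented here. Reading "words per story" as total tokens divided by stories is also an assumption.

```python
# Figures copied from the Description's data table, latest version of each part.
# Tuple layout (an illustrative choice, not an LDC format):
#   (part, source, stories, total_tokens, arabic_word_tokens)
LATEST_PARTS = [
    ("1 (V 3.0 and 4.1)", "Agence France Presse", 734, 145_386, 123_795),
    ("2", "Ummah Press", 501, 144_199, 125_709),
    ("3 (V 3.2)", "An Nahar News Agency", 599, 339_710, 292_554),
    ("4", "Assabah", 397, 161_915, 146_491),
]


def column_totals(rows):
    """Sum stories, total tokens, and Arabic word tokens across all rows."""
    stories = sum(row[2] for row in rows)
    total_tokens = sum(row[3] for row in rows)
    arabic_word_tokens = sum(row[4] for row in rows)
    return stories, total_tokens, arabic_word_tokens


totals = column_totals(LATEST_PARTS)
print(totals)  # (2231, 791210, 688549), matching the table's Totals row

# Part 4 only. Assumption: "words per story" means total tokens / stories.
_, _, part4_stories, part4_tokens, _ = LATEST_PARTS[-1]
avg_tokens_per_story = part4_tokens / part4_stories
print(round(avg_tokens_per_story, 1))  # 407.8, i.e. "slightly above 400"
```

Tokens after clitic separation are left out of the sketch for the same reason the table omits that total: the figure is not available for Part 4.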

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2005T30
DateStamp:  2021-07-19
GetRecord:  OAI-PMH request for simple DC format
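
As a rough illustration of what the GetRecord entries above refer to, an OAI-PMH GetRecord request is an HTTP query carrying the `verb`, `identifier`, and `metadataPrefix` arguments. The sketch below only builds such URLs. `BASE_URL` is a hypothetical placeholder, since this record does not state the repository's base URL, and `get_record_url` is a name invented for the example. The prefixes `olac` and `oai_dc` are assumed to correspond to the "OLAC format" and "simple DC format" links.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the record does not give the repository base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_ID = "oai:www.ldc.upenn.edu:LDC2005T30"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Assumed prefixes: "olac" for OLAC metadata, "oai_dc" for simple Dublin Core.
olac_url = get_record_url(OAI_ID, "olac")
dc_url = get_record_url(OAI_ID, "oai_dc")
print(olac_url)
print(dc_url)
```

Note that `urlencode` percent-encodes the colons in the identifier, so it appears as `oai%3Awww.ldc.upenn.edu%3ALDC2005T30` in the query string.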

Search Info

Citation: Maamouri, Mohamed; Bies, Ann; Buckwalter, Tim; Jin, Hubert; Mekki, Wigdan. 2005. Linguistic Data Consortium.
Terms: area_Asia country_SA dcmi_Text iso639_arb olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2005T30
Up-to-date as of: Thu Oct 24 7:30:11 EDT 2024