OLAC Record

Title:NEMLAR Written Corpus
Abstract:The NEMLAR Written Corpus consists of about 500,000 words of Arabic text from 13 different categories. The corpus is provided in 4 different versions: raw text, fully vowelized text, text with Arabic lexical analysis, text with Arabic POS-tags.
Access Rights:Rights available for: Research Use, Commercial Use
Date Available (W3CDTF):2006-08-11
Date Issued (W3CDTF):2006-08-11
Date Modified (W3CDTF):2007-02-22
Description:Written Corpora
This corpus was produced within the NEMLAR project (http://www.nemlar.org). Two other resources produced within the same project are also available: the NEMLAR Broadcast News Speech Corpus (ELRA-S0219) and the NEMLAR Speech Synthesis Corpus (ELRA-S0220).

The NEMLAR Written Corpus consists of about 500,000 words of Arabic text from 13 different categories. It aims to be a well-balanced corpus that represents the variety of syntactic, semantic and pragmatic features of the modern Arabic language. The categories are:
- Political news: 48,000 words
- Political debate: 30,000 words
- Islamic text (Preaching and others): 29,000 words
- Phrases of common words: 8,500 words
- Text from broadcast news: 5,500 words
- Business: 20,000 words
- Arabic literature: 30,000 words
- General news: 100,000 words
- Interviews: 56,000 words
- Scientific press: 50,000 words
- Sports press: 50,000 words
- Dictionary entries explanation: 52,000 words
- Legal domain text: 21,000 words

The data span the period from the late 1990s to 2005.

The corpus is provided in 4 different versions:
- Raw text
- Fully vowelized text
- Text with Arabic lexical analysis
- Text with Arabic POS-tags

Diacritics, lexical analysis and POS-tags were generated by RDI's tool Fassieh. The accuracy of the automatic analysis is around 95%. To reach the accuracy rate of about 99% defined for this corpus, linguists used the visual revision mode of Fassieh, in which the linguist either approves the most likely analysis (most of the time) or manually selects another one (in the minority of cases, about 4%).

The database is distributed on 1 ISO 9660 CD-ROM volume. It has been validated by an external partner and a validation report is provided.
Language (ISO639):ara
Publisher:ELRA (European Language Resources Association)
Type (DCMI):Text
Type (OLAC):primary_text


Archive:  ELRA Catalogue of Language Resources
Description:  http://www.language-archives.org/archive/catalogue.elra.info
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:catalogue.elra.info:ELRA-W0042
DateStamp:  2006-08-11
GetRecord:  OAI-PMH request for simple DC format
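
The GetRecord entries above refer to OAI-PMH requests. As a minimal sketch of how such a request URL could be assembled for this record's identifier: the base URL below is a hypothetical placeholder (the record does not state the endpoint), and the helper name get_record_url is illustrative only. The metadata prefix oai_dc is the standard OAI-PMH prefix for simple DC; using olac for the OLAC format is an assumption.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the record does not state the OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str,
                   metadata_prefix: str = "oai_dc",
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL (no network access)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,            # colons are percent-encoded
        "metadataPrefix": metadata_prefix,   # "oai_dc" = simple DC
    })
    return f"{base_url}?{query}"


# Identifier taken from the OaiIdentifier field of this record.
url = get_record_url("oai:catalogue.elra.info:ELRA-W0042")
print(url)
```

Passing "olac" as the second argument would request the OLAC format instead, assuming the repository advertises that prefix.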

Search Info

Citation: n.a. 2006. ELRA (European Language Resources Association).
Terms: dcmi_Text iso639_ara olac_primary_text

Up-to-date as of: Wed Oct 2 8:21:50 EDT 2019