OLAC Record: LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1

OLAC Record
oai:www.ldc.upenn.edu:LDC2010L01

Metadata

Title: LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Maamouri, Mohamed, et al. LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 LDC2010L01. Web Download. Philadelphia: Linguistic Data Consortium, 2010

Contributor: Maamouri, Mohamed

Graff, David

Bouziri, Basma

Krouna, Sondos

Bies, Ann

Kulick, Seth

Date (W3CDTF): 2010

Date Issued (W3CDTF): 2010-07-19

Description: *Introduction*

The LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 was developed by researchers at LDC. SAMA 3.1 is based on, and updates, the Buckwalter Arabic Morphological Analyzer (BAMA) 2.0 (LDC2004L02), which was developed by Tim Buckwalter. Although this is the first public release of SAMA, its version number continues the BAMA sequence to reflect the continuity between this release and previous BAMA releases.

SAMA 3.1 is a software tool for the morphological analysis of Standard Arabic. It considers each Arabic word token in all possible prefix-stem-suffix segmentations and lists all known or possible annotation solutions, with assignment of all diacritic marks, morpheme boundaries (separating clitics and inflectional morphemes from stems), and all Part-of-Speech (POS) labels and glosses for each morpheme segment. Users may then review the generated output and select the most appropriate annotation from among the choices.

The software layer of SAMA 3.1 relies on a data layer that consists primarily of three Arabic-English lexicon files: prefixes (1328 entries), suffixes (945 entries), and stems (79318 entries representing 40654 lemmas). The lexicons are supplemented by three morphological compatibility tables that control prefix-stem combinations (2497 entries), stem-suffix combinations (1632 entries), and prefix-suffix combinations (1180 entries).

*Differences since BAMA 2.0*

The input format, output format, and data layer of SAMA 3.1 were designed to be backward compatible with BAMA. Incremental changes to the data layer in SAMA have resulted in:

* increased lexicon coverage in the dictionary files
* important changes and additions to the inventory of POS tags
* more possible solutions generated for numerous word forms

Data-layer changes are summarized in more detail in the table_updates*.txt documentation files included in the corpus documentation.
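The interplay between the three lexicons and the three compatibility tables described above can be illustrated with a minimal Python sketch. This is not the SAMA.pm implementation (which is written in Perl): the lexicon entries, the category names, and the length limits `max_prefix` and `max_suffix` are all hypothetical values chosen for illustration. Words are written in Buckwalter transliteration.

```python
# Toy data layer. All entries and category names are hypothetical
# stand-ins, not taken from the SAMA 3.1 lexicons or tables.
PREFIXES = {"": ["Pref-0"], "w": ["Pref-Wa"], "Al": ["NPref-Al"]}
STEMS = {"ktAb": ["N"], "ktb": ["PV"]}
SUFFIXES = {"": ["Suff-0"], "An": ["NSuff-An"]}

# Toy compatibility tables: sets of category pairs allowed to co-occur.
PREFIX_STEM = {("Pref-0", "N"), ("Pref-0", "PV"),
               ("Pref-Wa", "N"), ("Pref-Wa", "PV"),
               ("NPref-Al", "N")}
STEM_SUFFIX = {("N", "Suff-0"), ("N", "NSuff-An"), ("PV", "Suff-0")}
PREFIX_SUFFIX = {("Pref-0", "Suff-0"), ("Pref-0", "NSuff-An"),
                 ("Pref-Wa", "Suff-0"), ("Pref-Wa", "NSuff-An"),
                 ("NPref-Al", "Suff-0"), ("NPref-Al", "NSuff-An")}


def segmentations(word, max_prefix=4, max_suffix=6):
    """Yield every (prefix, stem, suffix) split that keeps the stem non-empty.

    The length limits are assumed values for this sketch.
    """
    n = len(word)
    for p in range(0, min(max_prefix, n - 1) + 1):
        for s in range(0, min(max_suffix, n - p - 1) + 1):
            yield word[:p], word[p:n - s], word[n - s:]


def analyze(word):
    """Return the splits whose parts are all in the lexicons and whose
    categories pass all three pairwise compatibility checks."""
    solutions = []
    for prefix, stem, suffix in segmentations(word):
        for pcat in PREFIXES.get(prefix, []):
            for scat in STEMS.get(stem, []):
                for xcat in SUFFIXES.get(suffix, []):
                    if ((pcat, scat) in PREFIX_STEM
                            and (scat, xcat) in STEM_SUFFIX
                            and (pcat, xcat) in PREFIX_SUFFIX):
                        solutions.append((prefix, stem, suffix))
    return solutions


if __name__ == "__main__":
    for token in ("wktAb", "ktAbAn", "Alktb"):
        print(token, analyze(token))
```

In this toy setup, "Alktb" splits cleanly into a known prefix and a known stem, yet yields no solution because the pair (NPref-Al, PV) is absent from the prefix-stem table. This is the kind of filtering the compatibility tables are described as providing.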
The software implementation has been updated to allow more input/output options, installation and configuration options, and smoother incorporation in other Perl tools/services. The structure of the dictionary and morphotactic tables has remained the same (the tables provided with SAMA 3.1 differ from the BAMA 2.0 tables only in size and content, not in format). Logical separation between the software layer and data layer allows the new software tools to be used with previous versions of the tables (instructions are provided with software documentation).

The basic logic that implements the segmentation and analysis look-up for Arabic words is essentially unchanged since BAMA 2.0. The perldoc documentation for the SAMA.pm Perl module gives a full account of the tokenization logic. The data layer is now accessed through Berkeley DB, with result-caching enabled by default, leading to improved performance. Various utility scripts have also been added to the software package to facilitate more flexible interaction with tools and data.

UTF-8 is now the default input/output and internal character encoding, with automatic conversion of different input encodings (cp1256, iso-8859-6, and Buckwalter transliteration are also accepted). With this change, the use of UTF-8 as input is now fully supported, eliminating a range of problems that would result from having to convert to cp1256 for analysis. Full details about input/output options are provided in the SAMA.pm documentation. Further details on changes in software options and implementation may be found in the perldoc software tool documentation, and in the Changes*.txt documentation files.

*Dependencies*

There are two dependencies for installing and using SAMA 3.1: the DB_File.pm module (available from CPAN), and Encode::Buckwalter (included with the SAMA 3.1 distribution). The DB_File module in turn requires that the Berkeley DB libraries be present.
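The input-encoding handling mentioned above (UTF-8 by default, with cp1256, iso-8859-6, and Buckwalter transliteration also accepted) can be pictured with a short Python sketch. It is a stand-in for illustration only and does not use or reproduce Encode::Buckwalter or SAMA.pm. The function name `to_unicode` and the encoding label "buckwalter" are hypothetical, and the character table is the commonly published Buckwalter scheme, which may differ in detail from the module shipped with SAMA 3.1.

```python
# Buckwalter transliteration table, built from contiguous Unicode ranges.
# Assumption: this follows the commonly published scheme; the mapping in
# SAMA's own Encode::Buckwalter module may differ in detail.
BUCKWALTER = {}
for start, chars in (
    (0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"),  # hamza .. ghain
    (0x0640, "_fqklmnhwYyFNKaui~o"),         # tatweel .. sukun
    (0x0670, "`{"),                          # dagger alif, alif wasla
):
    for offset, ch in enumerate(chars):
        BUCKWALTER[ch] = chr(start + offset)


def to_unicode(raw, encoding="utf-8"):
    """Decode input bytes to a Unicode string.

    'buckwalter' is handled by table lookup (unmapped characters pass
    through unchanged); any other label goes to Python's built-in codecs.
    """
    if encoding == "buckwalter":
        return "".join(BUCKWALTER.get(ch, ch) for ch in raw.decode("ascii"))
    return raw.decode(encoding)


if __name__ == "__main__":
    kitab = "\u0643\u062a\u0627\u0628"  # the word transliterated as "ktAb"
    inputs = (
        (b"ktAb", "buckwalter"),
        (kitab.encode("cp1256"), "cp1256"),
        (kitab.encode("iso-8859-6"), "iso-8859-6"),
        (kitab.encode("utf-8"), "utf-8"),
    )
    for raw, enc in inputs:
        # Every input form normalizes to the same Unicode string.
        print(enc, to_unicode(raw, enc) == kitab)
```

Normalizing every accepted input to one internal representation before analysis is what lets a single lookup path serve all four input forms.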
*Samples*

* Input
* Output XML
* Output HTML

*Sponsorship*

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

*Updates*

There are no updates available at this time.

*Additional Licensing Instructions*

This 'members-only' corpus is available to current members who can request the data at the listed reduced-license fee. Contact ldc@ldc.upenn.edu for information about becoming a member.

Extent: Corpus size: 4505 KB

Identifier: LDC2010L01

https://catalog.ldc.upenn.edu/LDC2010L01

ISBN: 1-58563-555-3

ISLRN: 898-935-705-624-6

DOI: 10.35111/wgjk-zy44

Language: Standard Arabic

Arabic

Language (ISO639): arb

ara

License: LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 Agreement: https://catalog.ldc.upenn.edu/license/ldc-standard-arabic-morphological-analyzer-sama-version-3-dot-1-ldc2010l01.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2010L01

Rights Holder: Portions © 2002-2004 QAMUS LLC, © 2002-2010 Trustees of the University of Pennsylvania

Subject: Arabic language

Standard Arabic language

Subject (ISO639): ara

arb

Type (DCMI): Text

Type (OLAC): lexicon

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2010L01

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Maamouri, Mohamed; Graff, David; Bouziri, Basma; Krouna, Sondos; Bies, Ann; Kulick, Seth. 2010. Linguistic Data Consortium.
Terms: area_Asia country_SA dcmi_Text iso639_ara iso639_arb olac_lexicon

Inferred Metadata
Country: Saudi Arabia
Area: Asia

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2010L01
Up-to-date as of: Wed Oct 29 7:01:12 EDT 2025
