OLAC Record
oai:catalogue.elra.info:ELRA-S0384

Metadata
Title:Arabic Speech Corpus
Abstract:This speech corpus was recorded through a Neumann TLM 103 Studio Microphone by one male speaker in South Levantine Arabic (Damascene accent) in a professional studio. Speech synthesized from this corpus has produced a high-quality, natural voice. The corpus consists of 1813 utterances for a total of 3.7 hours, with orthographic and phonetic transcriptions. An extra set of 18 minutes of fully annotated corpus, used to evaluate the corpus, is also provided.
Access Rights:Rights available for: Research Use, Commercial Use
Date Available (W3CDTF):2016-08-19
Date Issued (W3CDTF):2016-08-19
Date Modified (W3CDTF):2017-07-05
Description:Desktop/Microphone
This speech corpus was developed as part of PhD work carried out by Nawar Halabi at the University of Southampton. The corpus was recorded through a Neumann TLM 103 Studio Microphone by one male speaker in South Levantine Arabic (Damascene accent) in a professional studio. The transcript was collected from "Aljazeera Learn" (Aljazeera 2015), a language learning website chosen because it contains fully diacritised text, which makes it easier to phonetise. The transcript was split into utterances based on punctuation, to make it easier for the speaker during the recording sessions. Speech synthesized from this corpus has produced a high-quality, natural voice.
The corpus consists of 1813 utterances for a total of 3.7 hours:
- 2.1 hours of normal utterances,
- 1.6 hours of nonsense utterances (utterances that are not semantically, orthographically or syntactically correct).
This package corresponds to version 2.0 of the corpus and includes:
- 1813 .wav files containing spoken utterances,
- 1813 .lab files containing text utterances,
- 1813 .TextGrid files containing the phoneme labels with time stamps of the boundaries where these occur in the .wav files. These files can be opened using Praat software (see http://www.fon.hum.uva.nl/praat/),
- phonetic transcriptions, gathered in a single text file with the form "[wav_filename]" "[Phoneme Sequence]" on every line,
- orthographic transcriptions, gathered in a single text file with the form "[wav_filename]" "[Orthographic Transcript]" on every line. Orthography is in Buckwalter format (see http://www.qamus.org/transliteration.htm), which is easier to handle in software that does not read Arabic script and can easily be converted back to Arabic,
- an extra set of 18 minutes of fully annotated corpus, used to evaluate the corpus (provided separately from the above but with the same structure).
Arabic Speech Corpus by Nawar Halabi is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
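The transcription files described above hold two quoted fields per line, and the orthographic text is in Buckwalter transliteration. As a minimal sketch, a line can be parsed and its text mapped back to Arabic script as shown below. Assumptions: the fields are delimited by literal double quotes as in the description, the files follow the standard Buckwalter table, and the function names and the sample file name are hypothetical and not part of the corpus.

```python
import re

# Standard Buckwalter-to-Unicode table. Whether the corpus files follow it
# exactly is an assumption; check against the actual transcripts.
BUCKWALTER_TO_ARABIC = {
    "'": "\u0621", "|": "\u0622", ">": "\u0623", "&": "\u0624",
    "<": "\u0625", "}": "\u0626", "A": "\u0627", "b": "\u0628",
    "p": "\u0629", "t": "\u062a", "v": "\u062b", "j": "\u062c",
    "H": "\u062d", "x": "\u062e", "d": "\u062f", "*": "\u0630",
    "r": "\u0631", "z": "\u0632", "s": "\u0633", "$": "\u0634",
    "S": "\u0635", "D": "\u0636", "T": "\u0637", "Z": "\u0638",
    "E": "\u0639", "g": "\u063a", "_": "\u0640", "f": "\u0641",
    "q": "\u0642", "k": "\u0643", "l": "\u0644", "m": "\u0645",
    "n": "\u0646", "h": "\u0647", "w": "\u0648", "Y": "\u0649",
    "y": "\u064a", "F": "\u064b", "N": "\u064c", "K": "\u064d",
    "a": "\u064e", "u": "\u064f", "i": "\u0650", "~": "\u0651",
    "o": "\u0652", "`": "\u0670", "{": "\u0671",
}

# Two double-quoted fields separated by whitespace: "name" "text".
LINE_RE = re.compile(r'^\s*"([^"]*)"\s+"([^"]*)"\s*$')


def parse_transcript_line(line):
    """Return (wav_filename, text) for one transcript line, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


def buckwalter_to_arabic(text):
    """Map each Buckwalter symbol to Arabic script; other characters pass through unchanged."""
    return "".join(BUCKWALTER_TO_ARABIC.get(ch, ch) for ch in text)


# Hypothetical sample line, not taken from the corpus.
name, text = parse_transcript_line('"sample.wav" "kitAb"')
arabic = buckwalter_to_arabic(text)
```

A regular expression is used instead of a shell-style tokenizer because the apostrophe is itself a Buckwalter symbol (hamza) and would be misread as a quote character.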
Identifier:ELRA-S0384
http://catalog.elra.info/product_info.php?products_id=1276
Language:Arabic
Language (ISO639):ara
Publisher:ELRA (European Language Resources Association)
Type (DCMI):Sound
Type (OLAC):primary_text

OLAC Info

Archive:  ELRA Catalogue of Language Resources
Description:  http://www.language-archives.org/archive/catalogue.elra.info
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:catalogue.elra.info:ELRA-S0384
DateStamp:  2016-08-19
GetRecord:  OAI-PMH request for simple DC format
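The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch, such a request URL can be assembled from the record's OAI identifier. Assumptions: the base URL below is a placeholder, since the repository endpoint is not given in this record, and the `olac` metadata prefix is the one conventionally used by OLAC repositories (`oai_dc` is the simple Dublin Core prefix defined by OAI-PMH). The function name is hypothetical.

```python
from urllib.parse import urlencode


def get_record_url(base_url, identifier, metadata_prefix="olac"):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Placeholder endpoint: the real base URL is not stated in this record.
BASE_URL = "https://example.org/oai"

olac_url = get_record_url(BASE_URL, "oai:catalogue.elra.info:ELRA-S0384")
dc_url = get_record_url(BASE_URL, "oai:catalogue.elra.info:ELRA-S0384", "oai_dc")
```

The identifier is percent-encoded by `urlencode`, so the colons in the OAI identifier are safe to pass as-is.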

Search Info

Citation: n.a. 2016. ELRA (European Language Resources Association).
Terms: dcmi_Sound iso639_ara olac_primary_text


http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-S0384
Up-to-date as of: Wed Oct 2 8:22:55 EDT 2019