OLAC Record: GlobalPhone Thai Pronunciation Dictionary

OLAC Record
oai:catalogue.elra.info:ELRA-S0372

Metadata

Title: GlobalPhone Thai Pronunciation Dictionary

Access Rights: Rights available for: nonCommercialUse, commercialUse

Date Available (W3CDTF): 2014-11-25

Date Issued (W3CDTF): 2014-11-25

Date Modified (W3CDTF): 2014-11-25

Description: The GlobalPhone pronunciation dictionaries, created within the frame¬work of the multilingual speech and language corpus GlobalPhone, were developed in collaboration with the Karlsruhe Institute of Technology (KIT). The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The pronunciation dictionaries are currently available in 18 languages: Arabic (29230 entries/27059 words), Bulgarian (20193 entries), Croatian (23497 entries/20628 words), Czech (33049 entries/32942 words), French (36837 entries/20710 words), German (48979 entries/46035 words), Hausa (42662 entries/42079 words), Japanese (18094 entries), Polish (36484 entries), Portuguese (Brazilian) (54146 entries/54130 words), Russian (28818 entries/27667 words), Spanish (Latin American) (43264 entries/33960 words), Swedish (about 25000 entries), Turkish (31330 entries/31087 words), Vietnamese (38504 entries/29974 words), Chinese-Mandarin (73388 pronunciations), Korean (3500 syllables), and Thai (a small set with 12,420 pronunciation entries of 12,420 different words, and does not include pronunciation variants, and a larger set which contains 25,570 pronunciation entries of 22,462 different words units, and includes 3,108 entries of up to four pronunciation variants). 1) Dictionary Encoding: The pronunciation dictionary entries consist of full word forms and are either given in the original script of that language, mostly in UTF-8 encoding (Bulgarian, Croatian, Czech, French, Polish, Russian, Spanish, Thai) corresponding to the trl-files of the GlobalPhone transcriptions or in Romanized script (Arabic, German, Hausa, Japanese, Korean, Mandarin, Portuguese, Swedish, Turkish, Vietnamese) corresponding to the rmn-files of the GlobalPhone transcriptions, respectively. In the latter case the documentation mostly provides a mapping from the Romanized to the original script. 2) Dictionary Phone set: The phone sets for each language were derived individually from the literature following best practices for automatic speech processing. Each phone set is explained and described in the documentation using the international standards of the International Phonetic Alphabet (IPA). For most languages a mapping to the language independent GlobalPhone naming conventions (indicated by “M_”) is provided for the purpose of data sharing across languages to build multilingual acoustic models. 3) Dictionary Generation:Whenever the grapheme-to-phoneme relationship allowed, the dictionaries were created semi-automatically in a rule-based fashion using a set of grapheme-to-phoneme mapping rules. The number of rules highly depends on the language. After the automatic creation process, all dictionaries were manually cross-checked by native speakers, correcting potential errors of the automatic pronunciation generation process. Most of the dictionaries have been applied to large vocabulary speech recognition. In many cases the GlobalPhone dictionaries were compared to straight-forward grapheme-based speech recognition and to alternative sources, such as Wiktionary and usually demonstrated to be superior in terms of quality, coverage, and accuracy.4) Format: The format of the dictionaries is the same across languages and is straight-forward. Each line consists of one word form and its pronunciation separated by blank. The pronunciation consists of a concatenation of phone symbols separated by blanks. Both, words and their pronunciations are given in tcl-script list format, i.e. enclosed in “{}”, since phones can carry tags, indicating the tone and length of a vowel, or the word boundary tag “WB”, indicating the boundary of a dictionary unit. The WB tag can for example be included as a standard question in the decision tree questions for capturing crossword models in context-dependent modeling. Pronunciation variants are indicated by () with n = 2, 3, 4,… indicating the number of variants per word. The order in which variants occur in the dictionary is not necessarily related to their frequency in the corpus.{word} {{w WB} o r {d WB}}5) Documentation: The pronunciation dictionaries for each language are complemented by a documentation that describes the format of the dictionary, the phone set including its mapping to the International Phonetic Alphabet (IPA), and the frequency distribution of the phones in the dictionary. Most of the pronunciation dictionaries have been successfully applied to large vocabulary speech recognition and references to publications are given when available.

Identifier: ELRA-S0372

ISLRN: 639-494-159-021-0

Identifier (URI): https://catalog.elra.info/en-us/repository/browse/ELRA-S0372/

Language: Thai

Language (ISO639): tha

Medium: Not specified

Publisher: ELRA (European Language Resources Association)

Type (DCMI): Text

Sound

Type (OLAC): lexicon

OLAC Info

Archive: ELRA Catalogue of Language Resources

Description: http://www.language-archives.org/archive/catalogue.elra.info

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:catalogue.elra.info:ELRA-S0372

DateStamp: 2014-11-25

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: n.a. 2014. ELRA (European Language Resources Association).
Terms: area_Asia country_TH dcmi_Sound dcmi_Text iso639_tha olac_lexicon

http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-S0372
Up-to-date as of: Wed Oct 1 0:56:17 EDT 2025

Metadata
Title:		GlobalPhone Thai Pronunciation Dictionary
Access Rights:		Rights available for: nonCommercialUse, commercialUse
Date Available (W3CDTF):		2014-11-25
Date Issued (W3CDTF):		2014-11-25
Date Modified (W3CDTF):		2014-11-25
Description:		The GlobalPhone pronunciation dictionaries, created within the frame¬work of the multilingual speech and language corpus GlobalPhone, were developed in collaboration with the Karlsruhe Institute of Technology (KIT). The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The pronunciation dictionaries are currently available in 18 languages: Arabic (29230 entries/27059 words), Bulgarian (20193 entries), Croatian (23497 entries/20628 words), Czech (33049 entries/32942 words), French (36837 entries/20710 words), German (48979 entries/46035 words), Hausa (42662 entries/42079 words), Japanese (18094 entries), Polish (36484 entries), Portuguese (Brazilian) (54146 entries/54130 words), Russian (28818 entries/27667 words), Spanish (Latin American) (43264 entries/33960 words), Swedish (about 25000 entries), Turkish (31330 entries/31087 words), Vietnamese (38504 entries/29974 words), Chinese-Mandarin (73388 pronunciations), Korean (3500 syllables), and Thai (a small set with 12,420 pronunciation entries of 12,420 different words, and does not include pronunciation variants, and a larger set which contains 25,570 pronunciation entries of 22,462 different words units, and includes 3,108 entries of up to four pronunciation variants). 1) Dictionary Encoding: The pronunciation dictionary entries consist of full word forms and are either given in the original script of that language, mostly in UTF-8 encoding (Bulgarian, Croatian, Czech, French, Polish, Russian, Spanish, Thai) corresponding to the trl-files of the GlobalPhone transcriptions or in Romanized script (Arabic, German, Hausa, Japanese, Korean, Mandarin, Portuguese, Swedish, Turkish, Vietnamese) corresponding to the rmn-files of the GlobalPhone transcriptions, respectively. In the latter case the documentation mostly provides a mapping from the Romanized to the original script. 2) Dictionary Phone set: The phone sets for each language were derived individually from the literature following best practices for automatic speech processing. Each phone set is explained and described in the documentation using the international standards of the International Phonetic Alphabet (IPA). For most languages a mapping to the language independent GlobalPhone naming conventions (indicated by “M_”) is provided for the purpose of data sharing across languages to build multilingual acoustic models. 3) Dictionary Generation:Whenever the grapheme-to-phoneme relationship allowed, the dictionaries were created semi-automatically in a rule-based fashion using a set of grapheme-to-phoneme mapping rules. The number of rules highly depends on the language. After the automatic creation process, all dictionaries were manually cross-checked by native speakers, correcting potential errors of the automatic pronunciation generation process. Most of the dictionaries have been applied to large vocabulary speech recognition. In many cases the GlobalPhone dictionaries were compared to straight-forward grapheme-based speech recognition and to alternative sources, such as Wiktionary and usually demonstrated to be superior in terms of quality, coverage, and accuracy.4) Format: The format of the dictionaries is the same across languages and is straight-forward. Each line consists of one word form and its pronunciation separated by blank. The pronunciation consists of a concatenation of phone symbols separated by blanks. Both, words and their pronunciations are given in tcl-script list format, i.e. enclosed in “{}”, since phones can carry tags, indicating the tone and length of a vowel, or the word boundary tag “WB”, indicating the boundary of a dictionary unit. The WB tag can for example be included as a standard question in the decision tree questions for capturing crossword models in context-dependent modeling. Pronunciation variants are indicated by () with n = 2, 3, 4,… indicating the number of variants per word. The order in which variants occur in the dictionary is not necessarily related to their frequency in the corpus.{word} {{w WB} o r {d WB}}5) Documentation: The pronunciation dictionaries for each language are complemented by a documentation that describes the format of the dictionary, the phone set including its mapping to the International Phonetic Alphabet (IPA), and the frequency distribution of the phones in the dictionary. Most of the pronunciation dictionaries have been successfully applied to large vocabulary speech recognition and references to publications are given when available.
Identifier:		ELRA-S0372
Identifier:		ISLRN: 639-494-159-021-0
Identifier (URI):		https://catalog.elra.info/en-us/repository/browse/ELRA-S0372/
Language:		Thai
Language (ISO639):		tha
Medium:		Not specified
Publisher:		ELRA (European Language Resources Association)
Type (DCMI):		Text
Type (DCMI):		Sound
Type (OLAC):		lexicon
OLAC Info
Archive:		ELRA Catalogue of Language Resources
Description:		http://www.language-archives.org/archive/catalogue.elra.info
GetRecord:		OAI-PMH request for OLAC format
GetRecord:		Pre-generated XML file
OAI Info
OaiIdentifier:		oai:catalogue.elra.info:ELRA-S0372
DateStamp:		2014-11-25
GetRecord:		OAI-PMH request for simple DC format
Search Info
Citation:		n.a. 2014. ELRA (European Language Resources Association).
Terms:		area_Asia country_TH dcmi_Sound dcmi_Text iso639_tha olac_lexicon