OLAC Record
oai:lindat.mff.cuni.cz:11234/1-4839

Metadata
Title:HinDialect 1.1: 26 Hindi-related languages and dialects of the Indic Continuum in North India
Bibliographic Citation:http://hdl.handle.net/11234/1-4839
Creator:Bafna, Niyati
Žabokrtský, Zdeněk
España-Bonet, Cristina
van Genabith, Josef
Kumar, Lalit "Samyak Lalit"
Suman, Sharda
Shivay, Rahul
Date (W3CDTF):2022-09-16T14:57:43Z
Date Available:2022-09-16T14:57:43Z
Description:HinDialect: 26 Hindi-related languages and dialects of the Indic Continuum in North India
Languages
This is a collection of folksongs for 26 languages that form a dialect continuum in North India and nearby regions, namely Angika, Awadhi, Baiga, Bengali, Bhadrawahi, Bhili, Bhojpuri, Braj, Bundeli, Chhattisgarhi, Garhwali, Gujarati, Haryanvi, Himachali, Hindi, Kanauji, Khadi Boli, Korku, Kumaoni, Magahi, Malvi, Marathi, Nimadi, Panjabi, Rajasthani, Sanskrit. The data was originally collected by the Kavita Kosh Project at http://www.kavitakosh.org/ .
Main characteristics of the languages in this collection:
- They are all Indic languages except for Korku.
- The majority of them are genealogically closely related to standard Hindi (for example Haryanvi and Bhojpuri), although the collection also contains more distant relatives such as Bengali and Gujarati.
- They are all primarily spoken in (North) India (Bengali is also spoken in Bangladesh).
- All except Sanskrit are living languages.
Data
Categorised by the NLP resources already available for them, the languages fall into three bands:
* Band 1 languages: Hindi, Panjabi, Gujarati, Bengali, Nepali. These languages already have other large standard datasets available. Kavita Kosh may have very little data for these languages.
* Band 2 languages: Bhojpuri, Magahi, Awadhi, Braj. These languages are attracting growing interest and have some datasets, which are relatively small compared to Band 1 language resources.
* Band 3 languages: All other languages in the collection, which were previously zero-resource languages. These are the languages for which this dataset is the most relevant.
Script
This dataset is entirely in Devanagari. Content in languages not normally written in Devanagari (such as Bengali and Gujarati) has been transliterated by the Kavita Kosh Project.
Format
The dataset contains one text file per language, holding that language's folksongs. Folksongs are separated from each other by an empty line. The first line of a new piece is the title of the folksong, and line separation within folksongs is preserved.
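The file layout described under Format (pieces separated by an empty line, title on the first line of each piece, line breaks preserved) can be read with a few lines of Python. This is a minimal sketch under those stated rules only; the function name `parse_folksongs` and the sample text are hypothetical and are not part of the dataset or its tooling.

```python
from typing import List, Tuple


def parse_folksongs(text: str) -> List[Tuple[str, List[str]]]:
    """Split one language file into (title, lines) pairs.

    Assumes the layout described in the record: pieces are separated by
    empty lines and the first line of each piece is its title.
    """
    songs: List[Tuple[str, List[str]]] = []
    block: List[str] = []
    for line in text.splitlines():
        if line.strip():
            # Non-empty line: still inside the current piece.
            block.append(line)
        elif block:
            # Empty line closes the current piece.
            songs.append((block[0], block[1:]))
            block = []
    if block:
        # Last piece when the file does not end with an empty line.
        songs.append((block[0], block[1:]))
    return songs


# Hypothetical sample in the described layout (not real dataset content).
sample = "Title A\nline 1\nline 2\n\nTitle B\nline 3\n"
songs = parse_folksongs(sample)
```

Runs of several empty lines are treated as a single separator here, which is a design choice of this sketch rather than something the record specifies.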
Identifier (URI):http://hdl.handle.net/11234/1-4839
Language:Hindi
Marathi
Magahi
Awadhi
Bhojpuri
Braj
Haryanvi
Rajasthani
Korku
Garhwali
Chhattisgarhi
Bhili
Sanskrit
Angika
Bundeli
Kumaoni
Bhadrawahi
Bengali
Gujarati
Panjabi
Nimadi
Kanauji
Malvi
Uncoded languages
Language (ISO639):hin
mar
mag
awa
bho
bra
bgc
raj
kfq
gbm
hne
bhb
san
anp
bns
kfy
bhd
ben
guj
pan
noe
bjj
mup
mis
Publisher:Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Kavita Kosh Project
Replaces (URI):http://hdl.handle.net/11234/1-4787
Rights:Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
http://creativecommons.org/licenses/by-nc-sa/4.0/
Subject:dialect continuum
dialect variation
Indic
Indo-Aryan
Indian
Hindi
Type:corpus
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Description:  http://www.language-archives.org/archive/lindat.mff.cuni.cz
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:lindat.mff.cuni.cz:11234/1-4839
DateStamp:  2022-09-16
GetRecord:  OAI-PMH request for simple DC format
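The GetRecord entries in this record refer to OAI-PMH requests. In the OAI-PMH protocol a GetRecord request takes the arguments `verb`, `identifier` and `metadataPrefix`, with `oai_dc` as the prefix for simple Dublin Core and `olac` as the prefix OLAC archives use for the OLAC format. The sketch below only builds the request URLs; the base URL is a hypothetical placeholder, since the archive's actual OAI-PMH endpoint is not given in this record.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: substitute the archive's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:lindat.mff.cuni.cz:11234/1-4839"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")    # OLAC format
dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")    # simple DC format
```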

Search Info

Citation: Bafna, Niyati; Žabokrtský, Zdeněk; España-Bonet, Cristina; van Genabith, Josef; Kumar, Lalit "Samyak Lalit"; Suman, Sharda; Shivay, Rahul. 2022. HinDialect 1.1: 26 Hindi-related languages and dialects of the Indic Continuum in North India. Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL).
Terms: area_Asia country_BD country_IN dcmi_Text iso639_anp iso639_awa iso639_ben iso639_bgc iso639_bhb iso639_bhd iso639_bho iso639_bjj iso639_bns iso639_bra iso639_gbm iso639_guj iso639_hin iso639_hne iso639_kfq iso639_kfy iso639_mag iso639_mar iso639_mis iso639_mup iso639_noe iso639_pan iso639_raj iso639_san olac_primary_text


http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11234/1-4839
Up-to-date as of: Wed Nov 30 4:30:33 EST 2022