OLAC Record
oai:catalogue.elra.info:ELRA-W0320

Metadata
Title:Parallel Corpora for 6 Indian Languages
Access Rights: Rights available for: attribution
Date Available (W3CDTF):2022-02-16
Date Issued (W3CDTF):2022-02-16
Description:The Parallel Corpora for 6 Indian Languages contains data sets for Bengali (540,000 words – 20,000 parallel sentences), Hindi (1,200,000 words – 37,000 parallel sentences), Malayalam (660,000 words – 29,000 parallel sentences), Tamil (747,000 words – 35,000 parallel sentences), Telugu (951,000 words – 43,000 parallel sentences), and Urdu (1,200,000 words – 33,000 parallel sentences), translated into English. Each data set was created by taking around 100 Indian-language Wikipedia pages and obtaining four independent English translations of each sentence in those documents from non-professional translators hired through crowdsourcing on Amazon Mechanical Turk.
All data sets are provided in plain text format. For each of the 6 Indian languages, the directory contains:
- A metadata file organized into rows of four columns each. The rows correspond to the original documents that were translated, and the columns give (1) the (internal) segment ID assigned to the document, (2) the document's original title, (3) a translation of the title, and (4) the category manually assigned to the document.
- The data splits, which were constructed by manually assigning each document to one of eight categories (Technology, Sex, Language and Culture, Religion, Places, People, Events, and Things), then selecting about 10% of the documents in each category for each of the dev, devtest, and test sets (that is, roughly 30% of the data), and using the remainder as training data.
- Dictionaries created in a separate Mechanical Turk job.
- Votes files containing the results of a separate Mechanical Turk task in which new Turkers were asked to vote on which of the four translations of a given sentence was the best. This information is available for all languages except Malayalam.
Identifier:ELRA-W0320
ISLRN: 657-350-757-058-6
Identifier (URI):http://catalog.elra.info/en-us/repository/browse/ELRA-W0320/
Language:Tamil
Urdu
English
Telugu
Bengali
Malayalam
Hindi
Language (ISO639):tam
urd
eng
tel
ben
mal
hin
Medium:Not specified
Publisher:ELRA (European Language Resources Association)
Type (DCMI):Text
Type (OLAC):primary_text
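
The metadata layout and the per-category split described above can be sketched roughly as follows. This is only an illustration, not the procedure used to build the corpus: the tab delimiter, the field names in `DocMeta`, the random (seeded) selection, and the helper names `parse_metadata` and `split_by_category` are all assumptions made for the example, since the record states only that each row has four columns and that about 10% of each category went to each held-out set.

```python
import random
from collections import defaultdict
from typing import NamedTuple


class DocMeta(NamedTuple):
    """One metadata row; field names are hypothetical labels for the four columns."""
    segment_id: str
    title: str
    title_translation: str
    category: str


def parse_metadata(text, delimiter="\t"):
    """Parse four-column rows. The tab delimiter is an assumption."""
    docs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(delimiter)
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
        docs.append(DocMeta(*fields))
    return docs


def split_by_category(docs, fraction=0.1, seed=0):
    """Put about `fraction` of each category into dev, devtest, and test.

    The remaining documents go to train. Random selection is an assumption;
    the record does not say how documents were picked within a category.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for doc in docs:
        by_category[doc.category].append(doc)

    splits = {"train": [], "dev": [], "devtest": [], "test": []}
    for category in sorted(by_category):
        group = list(by_category[category])
        rng.shuffle(group)
        n = round(len(group) * fraction)  # documents per held-out set
        splits["dev"].extend(group[:n])
        splits["devtest"].extend(group[n:2 * n])
        splits["test"].extend(group[2 * n:3 * n])
        splits["train"].extend(group[3 * n:])
    return splits
```

With 10 documents in a category and the default fraction, one document lands in each of dev, devtest, and test, and the other seven go to train, which matches the "roughly 30%" held out overall.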

OLAC Info

Archive:  ELRA Catalogue of Language Resources
Description:  http://www.language-archives.org/archive/catalogue.elra.info
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:catalogue.elra.info:ELRA-W0320
DateStamp:  2022-02-16
GetRecord:  OAI-PMH request for simple DC format
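
As a rough sketch of what such a GetRecord request looks like, an OAI-PMH GetRecord URL is built from a repository base URL plus the `verb`, `identifier`, and `metadataPrefix` arguments. The base URL below is a placeholder, not the real endpoint for this archive, and the helper name `get_record_url` is hypothetical. `oai_dc` is the simple Dublin Core prefix defined by OAI-PMH; `olac` as the prefix for the OLAC format is an assumption here.

```python
from urllib.parse import urlencode

# Placeholder endpoint: substitute the repository's actual OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Simple DC request for this record.
dc_url = get_record_url(BASE_URL, "oai:catalogue.elra.info:ELRA-W0320", "oai_dc")
# OLAC-format request; the "olac" prefix is assumed.
olac_url = get_record_url(BASE_URL, "oai:catalogue.elra.info:ELRA-W0320", "olac")
```

`urlencode` percent-encodes the colons in the identifier, so the value survives as a single query argument.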

Search Info

Citation: n.a. 2022. ELRA (European Language Resources Association).
Terms: area_Asia area_Europe country_BD country_GB country_IN country_PK dcmi_Text iso639_ben iso639_eng iso639_hin iso639_mal iso639_tam iso639_tel iso639_urd olac_primary_text


http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-W0320
Up-to-date as of: Sun Dec 4 3:47:22 EST 2022