OLAC Record: Parallel Corpora for 6 Indian Languages

OLAC Record
oai:catalogue.elra.info:ELRA-W0320

Metadata

Title: Parallel Corpora for 6 Indian Languages

Access Rights: Rights available for: attribution

Date Available (W3CDTF): 2022-02-16

Date Issued (W3CDTF): 2022-02-16

Description: The Parallel Corpora for 6 Indian Languages contains data sets for Bengali (540,000 words – 20,000 parallel sentences), Hindi (1,200,000 words – 37,000 parallel sentences), Malayalam (660,000 words – 29,000 parallel sentences), Tamil (747,000 words – 35,000 parallel sentences), Telugu (951,000 words – 43,000 parallel sentences), and Urdu (1,200,000 words – 33,000 parallel sentences), each translated into English. Each data set was created by taking around 100 Indian-language Wikipedia pages and obtaining four independent English translations of each sentence in those documents from non-professional translators hired through crowdsourcing on Amazon Mechanical Turk. All data sets are provided in plain text format. For each of the 6 Indian languages, the directory contains:
- A metadata file organized into rows with four columns each. The rows correspond to the original documents that were translated, and the columns denote (1) the (internal) segment ID assigned to the document, (2) the document's original title, (3) a translation of the title, and (4) the category manually assigned to the document.
- The data splits, which were constructed by manually assigning each document to one of eight categories (Technology, Sex, Language and Culture, Religion, Places, People, Events, and Things), then selecting about 10% of the documents in each category for each of the dev, devtest, and test sets (roughly 30% of the data in total) and using the remainder as training data.
- Dictionaries created in a separate Mechanical Turk job.
- Votes files containing the results of a separate Mechanical Turk task in which new Turkers were asked to vote on which of the four translations of a given sentence was the best. This information is available for all languages except Malayalam.
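The per-category split described above can be sketched as follows. This is a minimal illustration, not the procedure used to build the corpus: the function name `split_by_category`, the `(doc_id, category)` input shape, and the use of seeded random selection are all assumptions, since the record does not say how the documents within each category were chosen.

```python
import random
from collections import defaultdict


def split_by_category(docs, fraction=0.10, seed=0):
    """Assign documents to train/dev/devtest/test, category by category.

    docs: iterable of (doc_id, category) pairs (hypothetical input shape).
    fraction: share of each category that goes to each held-out set.
    Random selection is an assumption made for this sketch.
    """
    rng = random.Random(seed)

    # Group document IDs by their manually assigned category.
    by_category = defaultdict(list)
    for doc_id, category in docs:
        by_category[category].append(doc_id)

    splits = {"train": [], "dev": [], "devtest": [], "test": []}
    for category in sorted(by_category):
        ids = sorted(by_category[category])
        rng.shuffle(ids)
        # About 10% of the category for each held-out set, at least one document.
        k = max(1, round(len(ids) * fraction))
        splits["dev"].extend(ids[:k])
        splits["devtest"].extend(ids[k:2 * k])
        splits["test"].extend(ids[2 * k:3 * k])
        # Everything that remains is training data.
        splits["train"].extend(ids[3 * k:])
    return splits
```

Splitting within each category, rather than over the whole document list, keeps the category mix of the dev, devtest, and test sets close to that of the training data.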

Identifier: ELRA-W0320

ISLRN: 657-350-757-058-6

Identifier (URI): https://catalog.elra.info/en-us/repository/browse/ELRA-W0320/

Language: Tamil; Urdu; English; Telugu; Bengali; Malayalam; Hindi

Language (ISO639): tam; urd; eng; tel; ben; mal; hin

Medium: Not specified

Publisher: ELRA (European Language Resources Association)

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: ELRA Catalogue of Language Resources

Description: http://www.language-archives.org/archive/catalogue.elra.info

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:catalogue.elra.info:ELRA-W0320

DateStamp: 2022-02-16

GetRecord: OAI-PMH request for simple DC format
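The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch of how such a request URL is formed, the helper below combines the `GetRecord` verb with the OaiIdentifier listed in this record. The function name `get_record_url` and the base URL `https://example.org/oai` are placeholders (this record does not give the archive's actual endpoint), and the metadata prefixes `oai_dc` for simple DC and `olac` for the OLAC format are assumed to be the ones the archive accepts.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real OAI-PMH base URL is not given in this record.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL (no network access is performed)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Simple DC and OLAC requests for the identifier shown in this record.
dc_url = get_record_url("oai:catalogue.elra.info:ELRA-W0320", "oai_dc")
olac_url = get_record_url("oai:catalogue.elra.info:ELRA-W0320", "olac")
```

The helper only assembles the query string, so it can be checked without contacting any server; fetching the resulting URL would require substituting the archive's real endpoint.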

Search Info
Citation: n.a. 2022. ELRA (European Language Resources Association).
Terms: area_Asia area_Europe country_BD country_GB country_IN country_PK dcmi_Text iso639_ben iso639_eng iso639_hin iso639_mal iso639_tam iso639_tel iso639_urd olac_primary_text

http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-W0320
Up-to-date as of: Wed Oct 1 0:57:33 EDT 2025
