OLAC Record: HindMonoCorp 0.5

OLAC Record
oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-6260-A

Metadata

Title: HindMonoCorp 0.5

Bibliographic Citation: http://hdl.handle.net/11858/00-097C-0000-0023-6260-A

Creator: Bojar, Ondřej

Diatka, Vojtěch

Rychlý, Pavel

Straňák, Pavel

Suchomel, Vít

Tamchyna, Aleš

Zeman, Daniel

Date (W3CDTF): 2014-03-21T22:36:19Z

Date Available: 2014-03-21T22:36:19Z

Description: Hindi monolingual corpus. It is based primarily on web crawls performed using various tools and at various times. Since the web is a living data source, we treat these crawls as completely separate sources, even though they may overlap. To estimate the magnitude of this overlap, we compared the total number of segments obtained by concatenating the individual sources (each source deduplicated on its own) with the number of segments obtained by deduplicating all sources together. The difference is only around 1%, confirming that the various web crawls (or their subsequent processing) differ significantly. HindMonoCorp contains data from:
- Hindi web texts: a monolingual corpus containing mainly Hindi news articles, already collected and released by Bojar et al. (2008). We use the HTML files as crawled for this corpus in 2010, add a small crawl performed in 2013, and re-process both with the current pipeline. These sources are denoted HWT 2010 and HWT 2013 in the following.
- Hindi corpora in W2C: collected by Martin Majliš during his project to automatically collect corpora in many languages (Majliš and Žabokrtský, 2012). There are in fact two Hindi corpora available: one from a web harvest (W2C Web) and one from Wikipedia (W2C Wiki).
- SpiderLing: a web crawl carried out during November and December 2013 using SpiderLing (Suchomel and Pomikálek, 2012). The pipeline includes extraction of plain texts and deduplication at the level of documents, see below.
- CommonCrawl: a non-profit organization that regularly crawls the web and provides anyone with the data. We are grateful to Christian Buck for extracting plain-text Hindi segments from the 2012 and 2013-fall crawls for us.
- Intercorp: 7 books with their translations, scanned and manually aligned per paragraph.
- RSS feeds from Webdunia.com and the Hindi version of BBC International, followed by our custom crawler from September 2013 until January 2014.
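The overlap estimate described above can be illustrated with a minimal sketch. It assumes that segments are compared as exact strings and that each source fits in memory. The record does not say how the authors matched segments, and the function name and toy data are hypothetical.

```python
def concatenated_vs_joint(sources):
    """Compare per-source deduplication with joint deduplication.

    `sources` maps a source name to an iterable of segment strings.
    Returns (concatenated, joint, relative_difference).
    Assumption: two segments are duplicates only if their strings are identical.
    """
    # Deduplicate each source on its own.
    per_source = {name: set(segments) for name, segments in sources.items()}

    # Size of the concatenation of the individually deduplicated sources.
    concatenated = sum(len(unique) for unique in per_source.values())

    # Size after deduplicating all sources together.
    joint = len(set().union(*per_source.values()))

    # Share of segments lost only when sources are merged (cross-source overlap).
    difference = (concatenated - joint) / concatenated if concatenated else 0.0
    return concatenated, joint, difference


if __name__ == "__main__":
    # Toy data, not taken from the corpus.
    toy = {
        "crawl_a": ["x", "y", "x"],
        "crawl_b": ["y", "z"],
    }
    print(concatenated_vs_joint(toy))
```

A small relative difference, such as the roughly 1% reported in the description, means that few segments are shared across sources.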

LM2010013,

Identifier (URI): http://hdl.handle.net/11858/00-097C-0000-0023-6260-A

Language: Hindi

Language (ISO639): hin

Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)

Replaces (URI): http://hdl.handle.net/11858/00-097C-0000-0001-CC1E-B

Rights: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)

http://creativecommons.org/licenses/by-nc-sa/3.0/

Subject: corpus

Type: corpus

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

Description: http://www.language-archives.org/archive/lindat.mff.cuni.cz

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-6260-A

DateStamp: 2021-06-29

GetRecord: OAI-PMH request for simple DC format
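The GetRecord entries in this record refer to OAI-PMH requests whose links are not preserved here. As a hedged sketch, such a request URL can be assembled from the OaiIdentifier above and the standard OAI-PMH query parameters. The base URL below is a placeholder, not the actual endpoint, and the function name is hypothetical. `oai_dc` is the metadata prefix that OAI-PMH defines for simple Dublin Core; `olac` is assumed here as the prefix for the OLAC format.

```python
from urllib.parse import urlencode


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL.

    `base_url` is the repository's OAI-PMH endpoint (a placeholder in this sketch).
    """
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # Placeholder endpoint; substitute the real OAI-PMH base URL of the archive.
    base = "https://example.org/oai"
    oai_id = "oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-6260-A"
    print(get_record_url(base, oai_id, "oai_dc"))  # simple DC format
    print(get_record_url(base, oai_id, "olac"))    # OLAC format (assumed prefix)
```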

Search Info
Citation: Bojar, Ondřej; Diatka, Vojtěch; Rychlý, Pavel; Straňák, Pavel; Suchomel, Vít; Tamchyna, Aleš; Zeman, Daniel. 2014. Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL).
Terms: area_Asia country_IN dcmi_Text iso639_hin olac_primary_text

http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11858/00-097C-0000-0023-6260-A
Up-to-date as of: Mon Jun 16 1:03:21 EDT 2025
