OLAC Record oai:catalogue.elra.info:ELRA-W0091 |
Metadata | ||
Title: | Linguatools Webcrawl Parallel Corpus German-English 2015 | |
Abstract: | The corpus consists of 10 million German-English parallel sentences that were crawled from the internet between 10/2013 and 04/2015. Web pages have been automatically categorized for subject area. The corpus is available in TMX and Moses format (encoding UTF-8). | |
Access Rights: | Rights available for: Research Use, Commercial Use | |
Coverage: | between 10/2013 and 04/2015 | |
Date Available (W3CDTF): | 2016-03-07 | |
Date Issued (W3CDTF): | 2016-03-07 | |
Date Modified (W3CDTF): | 2016-03-07 | |
Description: | Written Corpora | |
The corpus consists of 10 million German-English parallel sentences that were crawled from the internet between 10/2013 and 04/2015. The sentences were gathered from over 112,000 different hosts. Elaborate multi-step quality filtering was applied, including a language identification filter, a machine translation filter, and a grammaticality filter, among others, to obtain data that is as clean as possible. There are no duplicate sentence pairs, and there is no overlap with existing publicly available corpora such as Europarl and DGT-TM. Web pages have been automatically categorized by subject area. The corpus is available in TMX and Moses format (encoding UTF-8). | ||
Identifier: | ELRA-W0091 | |
http://catalog.elra.info/product_info.php?products_id=1262 | ||
Language: | German | |
English | ||
Language (ISO639): | deu | |
eng | ||
Publisher: | ELRA (European Language Resources Association) | |
Type (DCMI): | Text | |
Type (OLAC): | primary_text | |
OLAC Info |
||
Archive: | ELRA Catalogue of Language Resources | |
Description: | http://www.language-archives.org/archive/catalogue.elra.info | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info |
||
OaiIdentifier: | oai:catalogue.elra.info:ELRA-W0091 | |
DateStamp: | 2016-03-07 | |
GetRecord: | OAI-PMH request for simple DC format | |
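As a rough illustration of the GetRecord requests listed above, the following sketch builds OAI-PMH GetRecord URLs for this record. The base URL is a placeholder assumption, since the record does not state the repository endpoint. `oai_dc` is the standard OAI-PMH prefix for simple Dublin Core; `olac` is assumed here as the prefix for OLAC format.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the record does not give the repository's base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
RECORD_ID = "oai:catalogue.elra.info:ELRA-W0091"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    # urlencode percent-encodes reserved characters such as ":".
    return f"{base_url}?{urlencode(params)}"


dc_url = get_record_url(RECORD_ID, "oai_dc")   # simple DC format
olac_url = get_record_url(RECORD_ID, "olac")   # OLAC format (assumed prefix)
print(dc_url)
print(olac_url)
```

Swapping `BASE_URL` for the archive's real endpoint would be the only change needed before sending the request with an HTTP client.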
Search Info | ||
Citation: | n.a. 2016. ELRA (European Language Resources Association). | |
Terms: | area_Europe country_DE country_GB dcmi_Text iso639_deu iso639_eng olac_primary_text |
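The description mentions Moses format, which conventionally means two plain-text files aligned line by line, with one sentence per line. A minimal sketch of reading such a pair of files is given below. The function name and any file names are hypothetical, and the actual file layout of the distributed corpus may differ.

```python
from typing import Iterator, Tuple


def read_moses_pairs(de_path: str, en_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (German, English) sentence pairs from two line-aligned UTF-8 files.

    Assumes line i of the German file is the translation of line i of the
    English file, as is conventional for Moses-style parallel text.
    """
    with open(de_path, encoding="utf-8") as de_file, \
         open(en_path, encoding="utf-8") as en_file:
        # zip stops at the shorter file; a length mismatch would indicate
        # a broken alignment and should be checked separately.
        for de_line, en_line in zip(de_file, en_file):
            yield de_line.rstrip("\n"), en_line.rstrip("\n")
```

Because the function is a generator, a 10-million-pair corpus can be streamed without loading both files into memory, for example with `for de, en in read_moses_pairs("corpus.de", "corpus.en"): ...` (file names assumed).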