OLAC Record: Web 1T 5-gram, 10 European Languages Version 1

OLAC Record
oai:www.ldc.upenn.edu:LDC2009T25

Metadata

Title: Web 1T 5-gram, 10 European Languages Version 1

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Brants, Thorsten, and Alex Franz. Web 1T 5-gram, 10 European Languages Version 1 LDC2009T25. Web Download. Philadelphia: Linguistic Data Consortium, 2009

Contributor: Brants, Thorsten

Franz, Alex

Date (W3CDTF): 2009

Date Issued (W3CDTF): 2009-10-20

Description: *Introduction*

Web 1T 5-gram, 10 European Languages Version 1 was created by Google, Inc. It consists of word n-grams and their observed frequency counts for ten European languages: Czech, Dutch, French, German, Italian, Polish, Portuguese, Romanian, Spanish and Swedish. The length of the n-grams ranges from unigrams (single words) to five-grams.

The n-gram counts were generated from approximately one hundred billion word tokens of text for each language, or approximately one trillion total tokens. The n-grams were extracted from publicly-accessible web pages from October 2008 to December 2008. This data set contains only n-grams that appeared at least 40 times in the processed sentences. Less frequent n-grams were discarded. While the aim was to identify and collect pages from the specific target languages only, it is likely that some text from other languages may be in the final data. This dataset will be useful for statistical language modeling, including machine translation, speech recognition and other uses.

*Data*

The input encoding of documents was automatically detected, and all text was converted to UTF-8. The following table contains statistics for the entire release.

File sizes (entire corpus): approximately 27.9 GB compressed (bzip2) text files
Total number of tokens: 1,306,807,412,486
Total number of sentences: 150,727,365,731
Total number of unigrams: 95,998,281
Total number of bigrams: 646,439,858
Total number of trigrams: 1,312,972,925
Total number of fourgrams: 1,396,154,236
Total number of fivegrams: 1,149,361,413
Total number of n-grams: 4,600,926,713

*Samples*

For an example of the data in this corpus please examine this sample file.
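The description states that the release consists of bzip2-compressed UTF-8 text files and that only n-grams seen at least 40 times were kept. The following is a minimal sketch of how such a file might be streamed in Python. The line layout (space-separated tokens, a tab, then the count) is an assumption that this record does not confirm, and the function name `read_ngrams` is hypothetical; check the corpus documentation before relying on it.

```python
import bz2
from typing import Iterator, Tuple

MIN_COUNT = 40  # frequency threshold stated in the description


def read_ngrams(
    path: str, min_count: int = MIN_COUNT
) -> Iterator[Tuple[Tuple[str, ...], int]]:
    """Yield (tokens, count) pairs from one bzip2-compressed n-gram file.

    Assumed (not confirmed by this record) line layout:
        token_1 token_2 ... token_n<TAB>count
    """
    # Text mode decompresses and decodes UTF-8 lazily, line by line,
    # so the whole file is never held in memory.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            ngram, sep, count_text = line.rpartition("\t")
            if not sep:
                continue  # skip blank lines and lines without a tab
            count = int(count_text)
            if count >= min_count:
                yield tuple(ngram.split(" ")), count
```

Streaming matters here because the release holds several billion n-grams; a caller would typically aggregate or filter the yielded pairs rather than collect them in a list.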

Extent: Corpus size: 29255270 KB
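As a quick consistency check on the figures in this record, the sketch below converts the Extent value to gigabytes and sums the per-order n-gram counts from the Description. It assumes that "KB" means 1024 bytes and "GB" means 1024³ bytes, which the record does not state; the function names are illustrative only.

```python
KB_PER_GB = 1024 ** 2  # assumption: binary units (1 GB = 1024 * 1024 KB)

EXTENT_KB = 29_255_270  # from the Extent field

# Per-order n-gram counts from the Description statistics.
NGRAM_COUNTS = {
    1: 95_998_281,
    2: 646_439_858,
    3: 1_312_972_925,
    4: 1_396_154_236,
    5: 1_149_361_413,
}
STATED_TOTAL = 4_600_926_713


def extent_in_gb(extent_kb: int) -> float:
    """Convert a size in KB to GB under the binary-unit assumption."""
    return extent_kb / KB_PER_GB


def total_ngrams(counts: dict) -> int:
    """Sum the n-gram counts over all orders."""
    return sum(counts.values())


print(round(extent_in_gb(EXTENT_KB), 1))           # 27.9
print(total_ngrams(NGRAM_COUNTS) == STATED_TOTAL)  # True
```

Under that unit assumption, the Extent field agrees with the "approximately 27.9 GB" figure in the Description, and the five per-order counts add up to the stated total number of n-grams.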

Identifier: LDC2009T25

https://catalog.ldc.upenn.edu/LDC2009T25

ISBN: 1-58563-525-1

ISLRN: 930-499-840-946-0

DOI: 10.35111/mesn-fv79

Language: Swedish

Spanish

Romanian

Portuguese

Polish

Dutch

Italian

French

German

Czech

Language (ISO639): swe

spa

ron

por

pol

nld

ita

fra

deu

ces

License: Web 1T 5-gram, 10 European Languages Version 1 Agreement: https://catalog.ldc.upenn.edu/license/web-1t-5-gram-10-european-languages-version-1.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2009T25

Rights Holder: Portions © 2009 Google Inc., © 2009 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2009T25

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
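The GetRecord entries above refer to OAI-PMH requests. A minimal sketch of how such a request URL can be assembled is shown below. The `verb`, `identifier` and `metadataPrefix` parameters come from the OAI-PMH protocol, with `oai_dc` as the prefix for simple Dublin Core and `olac` for the OLAC format. The base URL is a placeholder, since this record does not give the repository endpoint, and `get_record_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real repository base URL is not given in this record.
BASE_URL = "https://example.org/oai"

OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2009T25"  # from the OaiIdentifier field


def get_record_url(
    identifier: str, metadata_prefix: str, base_url: str = BASE_URL
) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# One URL per format named in this record.
simple_dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
```

`urlencode` percent-encodes the colons in the identifier, which keeps the query string valid regardless of which characters an identifier contains.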

Search Info

Citation: Brants, Thorsten; Franz, Alex. 2009. Linguistic Data Consortium.

Terms: area_Europe country_CZ country_DE country_ES country_FR country_IT country_NL country_PL country_PT country_RO country_SE dcmi_Text iso639_ces iso639_deu iso639_fra iso639_ita iso639_nld iso639_pol iso639_por iso639_ron iso639_spa iso639_swe olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2009T25
Up-to-date as of: Wed Oct 29 7:01:09 EDT 2025
