OLAC Record

Title: Danish Gigaword Corpus
Access Rights: Rights available for: attribution
Date Available (W3CDTF): 2022-01-28
Date Issued (W3CDTF): 2022-01-28
Description: The Danish Gigaword Project (DAGW) maintains a corpus for Danish with over a billion words. The general goals are to create a dataset that is:
1. representative;
2. accessible;
3. a suitable common starting point for Danish NLP models.
The present version 1.0 was collected from various websites. Domains are distributed as follows:
- Legal: 308.8 million words
- Social Media: 261.4 million words
- Subtitles: 130.1 million words
- Debates: 108.4 million words
- Conversations: 0.7 million words
- Web: 101.02 million words
- Encyclopedia: 55.6 million words
- Literature: 31.3 million words
- Manuals: 2.6 million words
- Books: 2.1 million words
- Religion: 600k words
- News: 40 million words
- Other: 1.2 million words
Data is presented in plaintext, UTF-8, one file per document. Accompanying metadata gives information about (among others) the author, the time or location of the document's creation, and an API hook for re-retrieval of the document.
ISLRN: 024-504-318-388-3
Identifier (URI): https://catalog.elra.info/en-us/repository/browse/ELRA-W0318/
Language (ISO639): dan
Medium: Not specified
Publisher: ELRA (European Language Resources Association)
Type (DCMI): Text
Type (OLAC): primary_text
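
The per-domain word counts in the Description field can be checked against the "over a billion words" claim with a few lines of arithmetic. The following is a minimal sketch, not part of the record itself: the figures are copied from the Description, while the names DOMAIN_MILLIONS, total_millions, and shares are illustrative assumptions.

```python
# Domain sizes in millions of words, copied from the Description field.
DOMAIN_MILLIONS = {
    "Legal": 308.8,
    "Social Media": 261.4,
    "Subtitles": 130.1,
    "Debates": 108.4,
    "Conversations": 0.7,
    "Web": 101.02,
    "Encyclopedia": 55.6,
    "Literature": 31.3,
    "Manuals": 2.6,
    "Books": 2.1,
    "Religion": 0.6,  # listed in the record as "600k words"
    "News": 40.0,
    "Other": 1.2,
}


def total_millions(domains):
    """Return the summed corpus size in millions of words."""
    return sum(domains.values())


def shares(domains):
    """Return each domain's fraction of the summed corpus size."""
    total = total_millions(domains)
    return {name: size / total for name, size in domains.items()}


if __name__ == "__main__":
    print(f"Total: {total_millions(DOMAIN_MILLIONS):.2f} million words")
    # List domains from largest to smallest share.
    ranked = sorted(shares(DOMAIN_MILLIONS).items(), key=lambda kv: -kv[1])
    for name, share in ranked:
        print(f"{name:<14}{share:6.1%}")
```

Under these figures the domains sum to roughly 1,044 million words, which is consistent with the stated size of over a billion words.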


Archive:  ELRA Catalogue of Language Resources
Description:  http://www.language-archives.org/archive/catalogue.elra.info
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:catalogue.elra.info:ELRA-W0318
DateStamp:  2022-01-28
GetRecord:  OAI-PMH request for simple DC format
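
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch of how such a request URL is typically assembled: the OaiIdentifier is taken from this record, and oai_dc is the Dublin Core prefix that OAI-PMH requires every repository to support. The base URL below is a hypothetical placeholder (the record does not state the endpoint), and the "olac" prefix is assumed to be the one the archive uses for OLAC-format records.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the record does not give the OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:catalogue.elra.info:ELRA-W0318"


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # "olac" is an assumed prefix for the OLAC format; "oai_dc" is simple DC.
    print(get_record_url(BASE_URL, OAI_IDENTIFIER, "olac"))
    print(get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc"))
```

The sketch only builds the URLs and sends no request, so it can be run without network access; the real endpoint would have to be substituted for BASE_URL.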

Search Info

Citation: n.a. 2022. ELRA (European Language Resources Association).
Terms: area_Europe country_DK dcmi_Text iso639_dan olac_primary_text

Up-to-date as of: Fri Mar 8 7:27:10 EST 2024