OLAC Record
oai:www.clarin.si:11356/1141

Metadata
Title:Beseda Corpus Lemmatisation Lexicon
Bibliographic Citation:http://hdl.handle.net/11356/1141
Creator:Jakopin, Primož
Date (W3CDTF):2017-09-25T10:21:07Z
Date Available:2017-09-25T10:21:07Z
Description:The Beseda Corpus Lemmatisation Lexicon for the Slovenian language was generated at the Fran Ramovš Institute of Slovenian Language, primarily by inflecting open-class words from the Dictionary of Standard Slovenian (Slovar slovenskega knjižnega jezika), augmented by the wordforms, part-of-speech tags and lemmas used during PoS tagging and lemmatisation of the Beseda corpus.
The corpus initially (2000) comprised 1 million words from the following texts:
Ciril Kosmač Opus - 408,000 words
Tomo Križnar: O iskanju ljubezni / On Search for Love or Around the World by Bicycle - 132,000 words
George Orwell: 1984 / 1984 - 91,000 words
Plato: Država / Republic - 93,000 words
Sveto pismo Nove zaveze / The Bible - New Testament - 150,000 words
Gustave Flaubert: Bouvard in Pécuchet / Bouvard and Pécuchet - 86,000 words
Časopis DELO na internetu (vzorec iz 6.5.1997 - 17.6.1997) / Newspaper DELO on Internet (a sample from 6 May 1997 to 17 June 1997) - 52,000 words
After 2000 the following texts were added:
Marko Uršič: Štirje časi / Four Seasons - 171,000 words
Državni zbor RS 3. sklica - dobesedni zapisi sej: 29. redna seja, zasedanje 01.10.2003 / National Assembly of the Republic of Slovenia - session transcripts: 29th regular session, meeting of 1 October 2003 - 47,000 words
Časopis DELO za 3.1.2004 / Newspaper DELO for 3 January 2004 - 75,000 words
These additions brought the corpus to roughly 1,300,000 words.
The current lexicon was taken from the database of the Institute's online "Determination of Lemmas and PoS Tags for a List of Words" service, available at http://bos.zrc-sazu.si/dol_lem1.html
Wordform frequencies were compiled from the latest update of the above-mentioned corpus (version 138, 1,300,626 words, August 2017) and are therefore approximate.
The lexicon is UTF-8 encoded and has 3,228,128 lines, each with the following 4 tab-separated data fields:
1. wordform
2. lemma (102,346 different lemmas)
3. PoS tag (explained at http://bos.zrc-sazu.si/bibliografija/o_oznake.html - in Slovenian)
4. approximate corpus frequency; wordform-lemma-PoS entries not in the corpus have zero frequency
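The four-field, tab-separated layout described above could be read roughly as follows. This is a minimal Python sketch, not part of the record: the names `Entry`, `parse_lexicon` and `build_index`, the sample rows and the placeholder tags `TAG1`/`TAG2` are all made up for illustration, and the frequency field is assumed to be a plain integer.

```python
import io
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, NamedTuple


class Entry(NamedTuple):
    wordform: str
    lemma: str
    pos: str   # PoS tag, kept as an opaque string
    freq: int  # approximate corpus frequency; 0 if not in the corpus


def parse_lexicon(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield one Entry per well-formed line (4 tab-separated fields)."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 4:
            continue  # skip lines that do not match the described layout
        wordform, lemma, pos, freq = fields
        # Assumption: the frequency is written as a plain integer.
        yield Entry(wordform, lemma, pos, int(freq))


def build_index(entries: Iterable[Entry]) -> Dict[str, List[Entry]]:
    """Group entries by wordform; one wordform may have several analyses."""
    index: Dict[str, List[Entry]] = defaultdict(list)
    for entry in entries:
        index[entry.wordform].append(entry)
    return index


# Made-up sample rows in the described layout (tags are placeholders).
sample = io.StringIO(
    "hiše\thiša\tTAG1\t12\n"
    "hiše\thiša\tTAG2\t0\n"
    "miza\tmiza\tTAG1\t3\n"
)
index = build_index(parse_lexicon(sample))

# Pick the analysis with the highest corpus frequency for a wordform.
best = max(index["hiše"], key=lambda e: e.freq)
print(best.lemma, best.pos, best.freq)
```

For the real file, the same functions could be fed from `open(path, encoding="utf-8")`, where `path` is wherever the downloaded lexicon was saved. With over 3 million lines, the in-memory index is a convenience for lookups and may be worth replacing with streaming if memory is tight.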
Identifier (URI):http://hdl.handle.net/11356/1141
Language:Slovenian
Language (ISO639):slv
Publisher:ZRC SAZU
Rights:Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
Subject:morphology
inflection
word forms
lemmatisation
Slovenian language
Subject (ISO639):slv
Type:lexicalConceptualResource
Type (DCMI):Text
Type (OLAC):lexicon

OLAC Info

Archive:  Slovenian language resource repository CLARIN.SI
Description:  http://www.language-archives.org/archive/clarin.si
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.clarin.si:11356/1141
DateStamp:  2017-09-25
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Jakopin, Primož. 2017. ZRC SAZU.
Terms: area_Europe country_SI dcmi_Text iso639_slv olac_lexicon

Inferred Metadata

Country: Slovenia
Area: Europe


http://www.language-archives.org/item.php/oai:www.clarin.si:11356/1141
Up-to-date as of: Wed Jul 17 9:50:45 EDT 2019