OLAC Record

Title:Keywords and n-grams from a textbook corpus
Bibliographic Citation:http://hdl.handle.net/11356/1215
Creator:Kosem, Iztok
Pori, Eva
Arhar Holdt, Špela
Date (W3CDTF):2019-03-08T13:37:07Z
Date Available:2019-03-08T13:37:07Z
Description:Wordlists, keywords and n-grams were extracted from a corpus of textbooks for Slovenian elementary and secondary schools. The corpus contains 4,302,857 words (5,373,268 tokens) and consists of 127 textbooks from 16 different subjects:
- Biology (6 textbooks; 293,935 words)
- State, society and ethics (1 textbook; 21,881 words)
- Society (4 textbooks; 64,126 words)
- Physics (5 textbooks; 185,171 words)
- Geography (7 textbooks; 202,101 words)
- Music (8 textbooks; 224,034 words)
- Home Economics (3 textbooks; 33,803 words)
- Chemistry (7 textbooks; 282,543 words)
- Art (3 textbooks; 146,681 words)
- Mathematics (23 textbooks; 764,012 words)
- Science (5 textbooks; 226,191 words)
- Science and technology (6 textbooks; 183,749 words)
- Slovene language (37 textbooks; 1,437,945 words)
- Environmental Education (7 textbooks; 38,645 words)
- Technology (1 textbook; 24,733 words)
- History (4 textbooks; 173,307 words)
The lists were manually cleaned: most items not found in the reference morphological lexicon Sloleks (http://hdl.handle.net/11356/1039), which were mainly conversion errors, were removed. The lists include only those words, keywords or n-grams that were found in at least 8 different subjects. Keyword lists were extracted using the Sketch Engine tool, with the minimum frequency set to 5 and average relative frequency as the statistic. The minimum frequency for n-grams was 10.
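The coverage and frequency thresholds in the description can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function name filter_by_coverage, the (item, subject, frequency) input layout, and the toy rows are assumptions made for illustration, and the sketch assumes the n-gram minimum frequency of 10 applies to the total across subjects, which the record does not state. The keyword statistic computed in Sketch Engine is not reproduced here.

```python
from collections import defaultdict

MIN_SUBJECTS = 8     # from the record: found in at least 8 different subjects
MIN_NGRAM_FREQ = 10  # from the record: minimum frequency for n-grams


def filter_by_coverage(rows, min_subjects=MIN_SUBJECTS, min_freq=MIN_NGRAM_FREQ):
    """Keep items attested in enough subjects and frequent enough overall.

    rows is an iterable of (item, subject, frequency) tuples; this layout is
    hypothetical and does not describe the published files.
    """
    totals = defaultdict(int)
    subjects = defaultdict(set)
    for item, subject, freq in rows:
        if freq > 0:
            totals[item] += freq
            subjects[item].add(subject)
    # Assumption: the frequency threshold is applied to the corpus-wide total.
    return {
        item: total
        for item, total in totals.items()
        if len(subjects[item]) >= min_subjects and total >= min_freq
    }


# Toy data, invented for illustration only.
rows = [("na primer", f"subject_{i}", 2) for i in range(8)]
rows += [("v tem", f"subject_{i}", 50) for i in range(3)]

kept = filter_by_coverage(rows)
print(kept)
```

In the toy data, "na primer" occurs in 8 subjects with a total frequency of 16 and is kept, while "v tem" is far more frequent but occurs in only 3 subjects and is dropped.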
Identifier (URI):http://hdl.handle.net/11356/1215
Language (ISO639):slv
Publisher:Centre for Language Resources and Technologies, University of Ljubljana
Rights:Creative Commons - Attribution 4.0 International (CC BY 4.0)
Subject:textbook corpus
Slovenian language
Subject (ISO639):slv
Type (DCMI):Text
Type (OLAC):lexicon


Archive:  Slovenian language resource repository CLARIN.SI
Description:  http://www.language-archives.org/archive/clarin.si
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.clarin.si:11356/1215
DateStamp:  2019-04-19
GetRecord:  OAI-PMH request for simple DC format
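
The GetRecord entries above refer to OAI-PMH requests whose link targets are not preserved in this record. As a hedged sketch, such a request is an HTTP query with the verb, identifier and metadataPrefix arguments defined by the OAI-PMH protocol. The base URL below is a placeholder, not the real endpoint, and the helper name get_record_url is hypothetical; the prefixes olac and oai_dc are assumed to correspond to the "OLAC format" and "simple DC format" named in the record.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real base URL is not given in this record.
BASE_URL = "https://example.org/oai"
OAI_ID = "oai:www.clarin.si:11356/1215"  # OaiIdentifier from this record


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


olac_url = get_record_url(BASE_URL, OAI_ID, "olac")  # OLAC format
dc_url = get_record_url(BASE_URL, OAI_ID, "oai_dc")  # simple Dublin Core
print(olac_url)
print(dc_url)
```

urlencode percent-encodes the colons and the slash in the identifier, which keeps the identifier intact as a single query argument.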

Search Info

Citation: Kosem, Iztok; Pori, Eva; Arhar Holdt, Špela. 2019. Centre for Language Resources and Technologies, University of Ljubljana.
Terms: area_Europe country_SI dcmi_Text iso639_slv olac_lexicon

Inferred Metadata

Country: Slovenia
Area: Europe

Up-to-date as of: Fri Jan 10 9:22:55 EST 2020