Title:Janes corpus n-grams 1.0
Bibliographic Citation:http://hdl.handle.net/11356/1192
Creator:Dobrovoljc, Kaja
Date (W3CDTF):2018-08-01T17:32:32Z
Date Available:2018-08-01T17:32:32Z
Description:A collection of n-grams extracted from version 1.0 of the Janes corpus of Slovenian user-generated content (cf. http://nl.ijs.si/janes/). Three sets of lists are provided for lowercased word n-grams of length 1 to 5: (1) extensive frequency lists of all extracted n-grams; (2) filtered frequency lists of n-grams with a minimum frequency of 10/mil.; (3) adjusted frequency lists of all n-grams with a minimum frequency of 10/mil. Only n-grams within sentences have been counted, ignoring punctuation. For the filtered and adjusted lists, only n-grams occurring in at least 2 different texts have been extracted. Key references: (1) K. Dobrovoljc, 2018. N-gram frequency lists for reference corpora of Slovenian language. Proceedings of the Language Technologies & Digital Humanities Conference 2018. (2) T. Erjavec, N. Ljubešić, D. Fišer, 2018. Korpus slovenskih spletnih uporabniških vsebin Janes. In: D. Fišer (ed.), Viri, orodja in metode za analizo spletne slovenščine. Znanstvena založba Filozofske fakultete Univerze v Ljubljani. https://e-knjige.ff.uni-lj.si/znanstvena-zalozba/catalog/book/111 (3) M. B. O’Donnell, 2010. The adjusted frequency list: A method to produce cluster-sensitive frequency lists. ICAME Journal 35, 135–169.
Identifier (URI):http://hdl.handle.net/11356/1192
Language (ISO639):slv
Publisher:Centre for Language Resources and Technologies, University of Ljubljana
Rights:Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Subject:multiword expressions
Subject:Slovenian language
Subject (ISO639):slv
Type (DCMI):Text
Type (OLAC):lexicon
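
The counting rules stated in the Description (lowercased word n-grams of length 1 to 5, counted only within sentences, punctuation ignored, then filtered by minimum frequency and by number of texts) can be illustrated with a minimal sketch. This is not the procedure used to build the resource: the regex tokenizer, the function names, the input layout (a list of texts, each a list of sentence strings), and the reading of "10/mil." as occurrences per million word tokens are all assumptions made for illustration. The adjusted frequency lists (O'Donnell 2010) are not covered.

```python
import re
from collections import Counter

# Hypothetical tokenizer (an assumption): runs of word characters count as
# tokens, so punctuation is dropped before n-grams are formed.
TOKEN_RE = re.compile(r"\w+")


def sentence_ngrams(sentence, n):
    """Return the lowercased word n-grams of one sentence."""
    tokens = TOKEN_RE.findall(sentence.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(texts, max_n=5):
    """Count n-grams per sentence, so no n-gram crosses a sentence boundary.

    texts: list of texts, each text being a list of sentence strings.
    Returns (frequency, text_frequency, total_tokens).
    """
    freq = {n: Counter() for n in range(1, max_n + 1)}
    text_freq = {n: Counter() for n in range(1, max_n + 1)}
    total_tokens = 0
    for sentences in texts:
        seen = {n: set() for n in range(1, max_n + 1)}
        for sentence in sentences:
            total_tokens += len(TOKEN_RE.findall(sentence))
            for n in range(1, max_n + 1):
                grams = sentence_ngrams(sentence, n)
                freq[n].update(grams)
                seen[n].update(grams)
        # Each text contributes at most 1 to an n-gram's text frequency.
        for n in seen:
            text_freq[n].update(seen[n])
    return freq, text_freq, total_tokens


def filter_ngrams(freq, text_freq, total_tokens,
                  min_per_million=10.0, min_texts=2):
    """Keep n-grams meeting both thresholds (threshold reading is assumed)."""
    kept = {}
    for n, counter in freq.items():
        kept[n] = {
            gram: count
            for gram, count in counter.items()
            if count * 1_000_000 / total_tokens >= min_per_million
            and text_freq[n][gram] >= min_texts
        }
    return kept


# Toy input: two texts, three sentences in total.
texts = [["Dober dan, kako si?", "Dober dan."], ["Dober dan vsem!"]]
freq, text_freq, total = count_ngrams(texts)
kept = filter_ngrams(freq, text_freq, total)
```

On this toy input the per-million threshold is trivially met because the corpus is tiny, so only the two-text requirement removes anything: `kept[2]` contains just "dober dan".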


