OLAC Record

Title: Twitter corpus Janes-Tweet 1.0
Bibliographic Citation: http://hdl.handle.net/11356/1142
Creator: Ljubešić, Nikola; Erjavec, Tomaž; Fišer, Darja
Date (W3CDTF): 2017-09-05T14:23:23Z
Date Available: 2017-09-05T14:23:23Z
Description: Janes-Tweet is an annotated corpus of almost 10 million tweets posted from 2013-06 to 2017-06 by approx. 9,000 users who tweet mostly in Slovene. The corpus is structured into individual tweets, together with their metadata. The tweets in the corpus are tokenised, sentence segmented, word normalised, morphosyntactically tagged, lemmatised and annotated with named entities. Because of Twitter's terms of service, the corpus is distributed in an encoded version. The included tweetpub program (also available and documented at https://github.com/clarinsi/tweetpub) should be used to decode it; the program fetches the original tweets and applies a diff operation to the distributed corpus. Note that the retrieved corpus can have fewer tweets than the distributed version if some have been removed from Twitter by their authors in the meantime.
Identifier (URI): http://hdl.handle.net/11356/1142
Language (ISO639): slv
Publisher: Jožef Stefan Institute
Rights: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Subject: computer-mediated communication; word normalisation; named entities
Type (DCMI): Text
Type (OLAC): primary_text


Archive:  Slovenian language resource repository CLARIN.SI
Description:  http://www.language-archives.org/archive/clarin.si
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.clarin.si:11356/1142
DateStamp:  2019-10-10
GetRecord:  OAI-PMH request for simple DC format
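The GetRecord requests listed in this record follow the OAI-PMH pattern of a base URL plus `verb`, `identifier` and `metadataPrefix` query parameters. The sketch below shows how such a request URL could be assembled for the OaiIdentifier given above. The base URL is a hypothetical placeholder, since this record does not state the repository's OAI-PMH endpoint, and `olac` is assumed to be the metadata prefix the repository uses for the OLAC format (`oai_dc` is the standard prefix for simple DC).

```python
from urllib.parse import urlencode

# Hypothetical endpoint: this record does not state the real OAI-PMH base URL.
BASE_URL = "https://example.org/oai/request"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.clarin.si:11356/1142"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# "olac" is an assumed prefix for the OLAC format; "oai_dc" is simple Dublin Core.
olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")
dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")
print(olac_url)
print(dc_url)
```

Replacing `BASE_URL` with the repository's actual endpoint would yield the two requests named in this record, one per metadata format.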

Search Info

Citation: Ljubešić, Nikola; Erjavec, Tomaž; Fišer, Darja. 2017. Jožef Stefan Institute.
Terms: area_Europe country_SI dcmi_Text iso639_slv olac_primary_text

Up-to-date as of: Thu Dec 5 9:50:18 EST 2019