OLAC Record

Title:Engineering job ads corpus
Bibliographic Citation:http://hdl.handle.net/11234/1-2673
Creator:Cardenas Acosta, Ronald
Bello Medina, Kevin
Coronado, Alberto
Villota, Elizabeth
Date (W3CDTF):2018-04-09T07:56:28Z
Date Available:2018-04-09T07:56:28Z
Description:The corpus consists of Spanish-language job ads for engineering positions in Peru. The documents were preprocessed and annotated for POS tagging, NER, and topic modeling tasks. The corpus is divided into two components:
- POS tagging/NER training data: 800 job ads, each tokenized and manually annotated with POS tags (EAGLE format) and entity labels in BIO format.
- Topic modeling training data: 9000 documents with stopwords removed, provided in two formats:
  * Whole-text documents: contain all the information originally posted in the ad.
  * Extracted-chunk documents: contain chunks extracted by custom NER models (expected skills, tasks to perform, and preferred major), as described in "Improving Topic Coherence Using Entity Extraction Denoising" (to appear).
Identifier (URI):http://hdl.handle.net/11234/1-2673
Language (ISO639):spa
Publisher:National University of Engineering, Peru
Rights:Creative Commons - Attribution 4.0 International (CC BY 4.0)
Subject:PoS tagging
text corpora
Type (DCMI):Text
Type (OLAC):primary_text
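
The Description states that entity labels are given in BIO format. As a minimal sketch of how such labels can be read, the Python below groups a tagged token sequence into entity spans. The record does not specify the file layout or the label inventory, so the sample tokens, the SKILL label, and the function name bio_to_spans are all assumptions made for illustration only.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) spans.

    B-X opens a span of type X, I-X continues an open span of the same
    type, and anything else (O, or an I- tag that does not match the
    open span) closes the current span and is itself skipped.
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; flush any span still open.
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the open span.
            current_tokens.append(token)
        else:
            # Outside any entity, or an inconsistent I- tag.
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_tokens = []
            current_label = None

    # Flush a span that runs to the end of the sequence.
    if current_tokens:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


# Hypothetical sample: not taken from the corpus.
sample_tokens = ["Conocimiento", "de", "AutoCAD", "y", "gestión", "de", "proyectos"]
sample_tags = ["O", "O", "B-SKILL", "O", "B-SKILL", "I-SKILL", "I-SKILL"]
sample_spans = bio_to_spans(sample_tokens, sample_tags)
print(sample_spans)
```

The actual corpus files may use a different column layout or label names; only the B-/I-/O prefix convention is assumed here.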


Archive:  LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Description:  http://www.language-archives.org/archive/lindat.mff.cuni.cz
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:lindat.mff.cuni.cz:11234/1-2673
DateStamp:  2021-06-29
GetRecord:  OAI-PMH request for simple DC format
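
The GetRecord entries above refer to OAI-PMH requests. A rough sketch of how such a request URL is put together, using the OaiIdentifier from this record, might look as follows. The OAI-PMH protocol defines the verb, identifier, and metadataPrefix parameters, and oai_dc is its standard prefix for simple Dublin Core. The base URL below is a placeholder, not the archive's real endpoint, and the olac prefix is assumed to be the one the repository exposes for OLAC format.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder endpoint: the real base URL is not given in this record.
BASE_URL = "https://example.org/oai"

# Identifier as listed under OaiIdentifier in this record.
OAI_IDENTIFIER = "oai:lindat.mff.cuni.cz:11234/1-2673"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL with percent-encoded parameters."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Simple Dublin Core request (oai_dc is defined by the protocol itself).
dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")

# OLAC-format request; the "olac" prefix is an assumption about the repository.
olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")

print(dc_url)
print(olac_url)
```

Fetching either URL from a real endpoint would return an XML response; the sketch stops at URL construction because the endpoint is not stated here.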

Search Info

Citation: Cardenas Acosta, Ronald; Bello Medina, Kevin; Coronado, Alberto; Villota, Elizabeth. 2018. National University of Engineering, Peru.
Terms: area_Europe country_ES dcmi_Text iso639_spa olac_primary_text

Up-to-date as of: Thu Oct 5 0:40:52 EDT 2023