OLAC Record: SumeCzech-NER

OLAC Record
oai:lindat.mff.cuni.cz:11234/1-3505

Metadata

Title: SumeCzech-NER

Bibliographic Citation: http://hdl.handle.net/11234/1-3505

Creator: Marek, Petr

Müller, Štěpán

Date (W3CDTF): 2021-02-03T08:30:28Z

Date Available: 2021-02-03T08:30:28Z

Description: SumeCzech-NER

SumeCzech-NER contains named entity annotations of SumeCzech 1.0 (Straka et al. 2018, SumeCzech: Large Czech News-Based Summarization Dataset).

Format

The dataset is split into four files. Files are in jsonl format. There is one JSON object on each line of the file. The most important fields of JSON objects are:

- dataset: train, dev, test, oodtest
- ne_abstract: list of named entity annotations of article's abstract
- ne_headline: list of named entity annotations of article's headline
- ne_text: list of named entity annotations of article's text
- url: article's URL that can be used to match article across SumeCzech and SumeCzech-NER

Annotations

We used SpaCy's NER model trained on CoNLL-based extended CNEC 2.0. The model achieved a 78.45 F-Score on the dataset's testing set. The annotations are in IOB2 format. The entity types are: Numbers in addresses, Geographical names, Institutions, Media names, Artifact names, Personal names, and Time expressions.

Tokenization

We used the following Python code for tokenization:

    from typing import List
    from nltk.tokenize import word_tokenize

    def tokenize(text: str) -> List[str]:
        # Pad each punctuation mark with spaces so it becomes its own token.
        for mark in ('.', ',', '?', '!', '-', '–', '/'):
            text = text.replace(mark, f' {mark} ')
        tokens = word_tokenize(text)
        return tokens
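The Format and Annotations parts of the description can be illustrated with a minimal sketch that reads jsonl lines and groups IOB2 tags into entity spans. Several things here are assumptions and not confirmed by the record: that each ne_* field is a flat list of IOB2 tag strings aligned with the tokens, the tag labels PER and GEO (placeholders, since the record lists entity types only by description), and the sample line itself, which is invented for illustration. The helper names read_jsonl and iob2_to_spans are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple


def read_jsonl(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group IOB2 tags into (entity_type, start, end) spans, end exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # An I- tag of the same type extends the open span.
        if tag.startswith("I-") and current == tag[2:]:
            continue
        # Any other tag closes the open span, if there is one.
        if current is not None:
            spans.append((current, start, i))
            start, current = None, None
        # A B- tag opens a new span; a stray I- tag is ignored.
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Invented sample line; the real files may structure annotations differently.
sample = (
    '{"dataset": "dev", "url": "https://example.org/a", '
    '"ne_headline": ["B-PER", "I-PER", "O", "B-GEO", "O"]}'
)

for record in read_jsonl([sample]):
    print(record["dataset"], record["url"], iob2_to_spans(record["ne_headline"]))
```

With a real file, the same generator could be fed an open file handle instead of a list, and the url field could serve as the key for joining against SumeCzech records.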

Identifier (URI): http://hdl.handle.net/11234/1-3505

Language: Czech

Language (ISO639): ces

Publisher: Czech Technical University in Prague

Rights: Mozilla Public License 2.0

http://opensource.org/licenses/MPL-2.0

Subject: SumeCzech

named entity recognition

named entity corpus

summarization

Type: corpus

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

Description: http://www.language-archives.org/archive/lindat.mff.cuni.cz

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:lindat.mff.cuni.cz:11234/1-3505

DateStamp: 2021-06-29

GetRecord: OAI-PMH request for simple DC format
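As a rough illustration of what the GetRecord entries above refer to, an OAI-PMH GetRecord request is a URL carrying the verb, the record identifier, and a metadata prefix. The sketch below builds such a URL. The base URL is a placeholder and not the archive's actual endpoint, which this record does not state; the helper name get_record_url is hypothetical. The prefixes oai_dc (simple DC) and olac (OLAC format) are the conventional values.

```python
from urllib.parse import urlencode


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Placeholder endpoint; substitute the archive's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

url = get_record_url(BASE_URL, "oai:lindat.mff.cuni.cz:11234/1-3505", "oai_dc")
print(url)
```

Passing "olac" as the metadata prefix would request the OLAC-format record instead, provided the endpoint supports it.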

Search Info
Citation: Marek, Petr; Müller, Štěpán. 2021. Czech Technical University in Prague.
Terms: area_Europe country_CZ dcmi_Text iso639_ces olac_primary_text

http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11234/1-3505
Up-to-date as of: Mon Jun 16 1:05:36 EDT 2025
