OLAC Record
oai:lindat.mff.cuni.cz:11234/1-5047

Metadata
Title:DaMuEL 1.0: A Large Multilingual Dataset for Entity Linking
Bibliographic Citation:http://hdl.handle.net/11234/1-5047
Creator:Kubeša, David
Straka, Milan
Date (W3CDTF):2023-06-19T09:03:22Z
Date Available:2023-06-19T09:03:22Z
Description:We present DaMuEL, a large Multilingual Dataset for Entity Linking containing data in 53 languages. DaMuEL consists of two components: a knowledge base that contains language-agnostic information about entities, including their claims from Wikidata and named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED); and Wikipedia texts with entity mentions linked to the knowledge base, along with language-specific text from Wikidata such as labels, aliases, and descriptions, stored separately for each language. The Wikidata QID is used as a persistent, language-agnostic identifier, enabling the combination of the knowledge base with language-specific texts and information for each entity. Wikipedia documents deliberately annotate only a single mention for every entity present; we further automatically detect all mentions of named entities linked from each document. The dataset contains 27.9M named entities in the knowledge base and 12.3G tokens from Wikipedia texts. The dataset is published under the CC BY-SA licence.
Identifier (URI):http://hdl.handle.net/11234/1-5047
Language:Afrikaans
Arabic
Armenian
Basque
Belarusian
Bulgarian
Catalan
Chinese
Croatian
Czech
Danish
Dutch
English
Estonian
Finnish
French
Galician
German
Hebrew
Hindi
Hungarian
Indonesian
Irish
Italian
Japanese
Korean
Latin
Latvian
Lithuanian
Maltese
Marathi
Modern Greek (1453-)
Northern Sami
Norwegian Nynorsk
Persian
Polish
Portuguese
Romanian
Russian
Scottish Gaelic
Serbian
Slovak
Slovenian
Spanish
Swedish
Tamil
Telugu
Uighur
Ukrainian
Urdu
Vietnamese
Wolof
Language (ISO639):afr
ara
hye
eus
bel
bul
cat
zho
hrv
ces
dan
nld
eng
est
fin
fra
glg
deu
heb
hin
hun
ind
gle
ita
jpn
kor
lat
lav
lit
mlt
mar
ell
sme
nno
fas
pol
por
ron
rus
gla
srp
slk
slv
spa
swe
tam
tel
uig
ukr
urd
vie
wol
Publisher:Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Rights:Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
http://creativecommons.org/licenses/by-sa/4.0/
Subject:entity linking
NEL
NER
dataset
knowledge base
Type:corpus
Type (DCMI):Text
Type (OLAC):primary_text
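
The Description above states that the Wikidata QID is the persistent, language-agnostic key that connects the knowledge base to the per-language texts. The following is a minimal sketch of that join, not the dataset's actual loader: this record does not describe the file layout, so the in-memory dictionaries, the field names, the `combine` function, and the sample values are all hypothetical and used for illustration only.

```python
from typing import Any, Dict, Optional

# Hypothetical stand-in for the language-agnostic knowledge base,
# keyed by Wikidata QID (entity type plus Wikidata claims).
knowledge_base: Dict[str, Dict[str, Any]] = {
    "Q42": {"type": "PER", "claims": {"P31": ["Q5"]}},
}

# Hypothetical stand-in for the language-specific texts, stored
# separately per language and again keyed by QID.
language_texts: Dict[str, Dict[str, Dict[str, Any]]] = {
    "en": {"Q42": {"label": "Douglas Adams",
                   "aliases": ["Douglas Noel Adams"],
                   "description": "English writer"}},
    "cs": {"Q42": {"label": "Douglas Adams",
                   "aliases": [],
                   "description": "anglický spisovatel"}},
}


def combine(qid: str, lang: str) -> Optional[Dict[str, Any]]:
    """Merge language-agnostic and language-specific data for one QID.

    Returns None when the QID is missing from either side.
    """
    entity = knowledge_base.get(qid)
    texts = language_texts.get(lang, {}).get(qid)
    if entity is None or texts is None:
        return None
    # The QID is the only shared key; everything else is merged around it.
    return {"qid": qid, **entity, **texts}


if __name__ == "__main__":
    print(combine("Q42", "cs"))
```

Keeping the knowledge base separate from the per-language texts means the entity type and claims are stored once, while labels, aliases, and descriptions can be looked up for whichever language is needed.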

OLAC Info

Archive:  LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Description:  http://www.language-archives.org/archive/lindat.mff.cuni.cz
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:lindat.mff.cuni.cz:11234/1-5047
DateStamp:  2023-06-19
GetRecord:  OAI-PMH request for simple DC format
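
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch, such a request URL can be assembled from the OaiIdentifier shown above and a metadata prefix. The base URL below is a placeholder (the repository's real OAI-PMH endpoint is not given in this record), `build_getrecord_url` is a hypothetical helper, and the prefixes `olac` and `oai_dc` are assumed to be the ones the repository exposes for the OLAC and simple DC formats.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the actual OAI-PMH base URL is an assumption
# and must be replaced with the repository's real one.
BASE_URL = "https://example.org/oai/request"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:lindat.mff.cuni.cz:11234/1-5047"


def build_getrecord_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL (no network access)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # One request per metadata format listed in this record.
    for prefix in ("olac", "oai_dc"):
        print(build_getrecord_url(BASE_URL, OAI_IDENTIFIER, prefix))
```

The sketch only constructs the URL; fetching and parsing the XML response is left out because the endpoint is not confirmed here.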

Search Info

Citation: Kubeša, David; Straka, Milan. 2023. DaMuEL 1.0: A Large Multilingual Dataset for Entity Linking. Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL).
Terms: area_Africa area_Asia area_Europe country_AM country_BG country_BY country_CN country_CZ country_DE country_DK country_ES country_FI country_FR country_GB country_GR country_HR country_HU country_ID country_IE country_IL country_IN country_IT country_JP country_KR country_LT country_MT country_NL country_NO country_PK country_PL country_PT country_RO country_RS country_RU country_SE country_SI country_SK country_SN country_UA country_VA country_VN country_ZA dcmi_Text iso639_afr iso639_ara iso639_bel iso639_bul iso639_cat iso639_ces iso639_dan iso639_deu iso639_ell iso639_eng iso639_est iso639_eus iso639_fas iso639_fin iso639_fra iso639_gla iso639_gle iso639_glg iso639_heb iso639_hin iso639_hrv iso639_hun iso639_hye iso639_ind iso639_ita iso639_jpn iso639_kor iso639_lat iso639_lav iso639_lit iso639_mar iso639_mlt iso639_nld iso639_nno iso639_pol iso639_por iso639_ron iso639_rus iso639_slk iso639_slv iso639_sme iso639_spa iso639_srp iso639_swe iso639_tam iso639_tel iso639_uig iso639_ukr iso639_urd iso639_vie iso639_wol iso639_zho olac_primary_text


http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11234/1-5047
Up-to-date as of: Thu Oct 5 0:43:33 EDT 2023