OLAC Record
oai:www.ldc.upenn.edu:LDC2025T08

Metadata
Title:LoReHLT Uzbek Representative Language Pack
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Tracey, Jennifer, et al. LoReHLT Uzbek Representative Language Pack LDC2025T08. Web Download. Philadelphia: Linguistic Data Consortium, 2025
Contributor:Tracey, Jennifer
Strassel, Stephanie
Graff, David
Wright, Jonathan
Chen, Song
Ryant, Neville
Kulick, Seth
Delgado, Dana
Arrigo, Michael
Date (W3CDTF):2025
Date Issued (W3CDTF):2025-07-15
Description:*Introduction*
LoReHLT Uzbek Representative Language Pack consists of Uzbek monolingual text, Uzbek-English parallel text, annotations, audio recordings, supplemental resources and related software tools developed by the Linguistic Data Consortium for LoReHLT, a companion project of the DARPA LORELEI program. The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks. Linguistic resources for LORELEI include Representative Language Packs and Incident Language Packs for over two dozen low resource languages, comprising data, annotations, basic natural language processing tools, lexicons and grammatical resources. Representative languages were selected to provide broad typological coverage, while incident languages were selected to evaluate system performance on a language whose identity was disclosed at the start of the evaluation.
*Data*
Uzbek is spoken across Central Asia; it is the official language of Uzbekistan. This release is the result of a pilot effort preceding the LORELEI program. Text data was collected in the following genres: news, discussion forum, reference, social network, and weblogs. Both monolingual text collection and parallel text creation involved a combination of manual and automatic methods. Also collected were broadcast news recordings and amateur web audio recordings related to disaster events covered in the text data.
Data volumes are as follows:
* 47 million words of Uzbek monolingual text, over 886,000 of which were translated into English
* 563,000 words of found Uzbek-English parallel text
* 100,000 Uzbek words translated from English text
* 6.41 hours of Uzbek audio recordings (broadcast news, amateur web recordings)
Approximately 151,000 words were annotated for named entities, and over 28,000 words were annotated for full entity annotation, including nominals and pronouns. Noun-phrase chunking was applied to more than 13,000 words and over 20,890 words were labeled with simple semantic annotation. Topic annotation was applied to the audio recordings.
Lexical resources and software tools are also included in this release. The tools recreate original source data from the processed XML material, condition text data that users download from Twitter, apply sentence segmentation to raw text, and support named entity tagging.
Monolingual and parallel text are presented in XML with associated DTDs. Annotation data is presented as tab-delimited files or XML. All text is UTF-8 encoded. The audio recordings are presented in FLAC-compressed MS-WAV and .mp4 format.
*Sponsorship*
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-15-C-0123. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA.
*Samples*
Please view these samples:
* English LTF XML
* English PSM XML
* Uzbek LTF XML
* Uzbek PSM XML
* Full Entity Annotation XML
* Semantic Annotation XML
* Noun Phrase XML
* Audio file (mp4)
* Audio file (flac)
*Updates*
No updates at this time.
Extent:Corpus size: 2411724 KB
Format:Sampling Rate: 16000, 44100
Sampling Format: flac, mp4
Identifier:LDC2025T08
https://catalog.ldc.upenn.edu/LDC2025T08
ISLRN: 370-274-581-227-7
DOI: 10.35111/t5qx-jc85
Language:English
Uzbek
Language (ISO639):eng
uzb
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2025T08
Rights Holder:Portions © 2005 12us.com, © 2012 21Asr.uz, © 2002-2007, 2009-2010 Agence France Presse, © 2013 ajoyib.net, © 2013, 2014 AKIpress News Agency, © 2014 albuxority.com, © 2000 American Broadcasting Company, © 2013 amuziyo.com, © 2014 anon.uz, © 2012, 2014 ARXIV, © 2014 BePuL.NeT, © 2013 bil.uz, © 2013 biznes.daily.uz, © 2013 bizstrener.uz, © 2000 Cable News Network LP, LLLP, © 2012 CDMEP, © 2008 Central News Agency (Taiwan), © 2009 Centre of Hydrometerological Service at Cabinet Ministers of the Republic of Uzbekistan (Uzhydromed), © 2013, 2014 championat.asia, © 2014 darakchi.uz, © 2009, 2011 Daryo, © 2013 Distlik Bayrogi, © 2013 diyormedia.uz, © 2014 DMP under DPE, © 1989 Dow Jones & Company, Inc., © 2010 econews.uz, © 2013 Embassy of the Republic Uzbekistan to the United Kingdom of Great Britain and Northern Ireland, © 2007, 2011 Ferghana News Agency, Moscow, © 2007-2010, 2012-2014 Google LLC, © 2014 Gooper.uz, © 2004-2006 Harakat, © 2012 Human Rights Society of Uzbekistan, © 2011 Huquq, © 2014 Huquq Burch, © 2012 intiqom.uz, © 2009 Islambio.com, © 2006 islom.uz, © 2010 jamiyatgzt.uz, © 2012 kamolon.uz, © 2014 Karachik, © 2014 Kokand, © 2011-2014 Kun.uz, © 2005 Los Angeles Times - Washington Post News Service, Inc., © 2013 LUKOIL Uzbekistan Operating Company LLC, © 2004, 2006 Marifat, © 2013 Medislam, © 2014 megauz.uz, © 2014 mirjahon.weebly.com, © 2013 MoDISaNyntymak, © 2010 Mohiyat, © 2014 Mp3lar.com, © 2014 Mulkdor.com, © 2014 Muloqot, © 2012 muslimaat.uz, © 2000 National Broadcasting Company, Inc., © 2014 National Television and Radio Company of Uzbekistan, © 2011 Navoiy Press, © 2014 news24.uz, © 1999, 2005, 2006, 2010 New York Times, © 2013 Odnoklassniki, © 2014 Oila Davrasida, © 2013 Olam Asia, © 2009 oriftolib.uz, © 2001, 2012 Ozbekiston Elektron Ommaviy Axborot Vositalari Milliy Assotsiatsiyasi, © 2014 pressnews.uz, © 2010 Public Health of Uzbekistan, © 2000 Public Radio International, © 2013 Qadriyat.uz, © 2012 Qashqadaryogz, © 2014 Questpedia, © 2013, 2014 Qulnoma, © 2014 quvnoq.com, © 2011 Rambler, © 2014 Sadolar.net, © 2014 Shamsutdinovs Business Group, © 2014 shejot.com, © 2005 sof-olam.6te.net, © 2012 Software.uz, © 2014 Soglik.Uz, © 2014 Soyabon Group, © 2014 Sports.uz, © 2014 Takewap Group, © 2014 Tarona.net, © 2008 Tashkentskaya Pravda, © 2014 TDPU, © 2009, 2010 Termiz Okshomi, © 2003, 2005-2008, 2010 The Associated Press, © 2013 The GEF Small Grants Program, © 2009 usfayl.com, © 2011, 2012 uskinozal.com, © 2011, 2014 us-world.ru, © 2014 uz24.uz, © 2007-2012 UzA, © 2012 Uzbaby.uz, © 2012 Uzbegim, © 2013 Uzbek.Fm, © 2014 Uzbek Huquq, © 2012 Uzbekislam.com, © 2014 Uzbekistan news- UzReport.uz, © 2012 UZBnews, © 2014 Uzclub.Net, © 2011 UzCinema, © 2011, 2013 Uzfunfactory & Sayyod Media Group, © 2010, 2013 uzhurriyat.com, © 2013 UzLider.Mobi, © 2007, 2011 UZNEWS.NET, © 2012, 2014 Vatandosh, Inc., © 2013 Vatanparvar, © 2013 viloyat-arm.uz, © 2012 www.welcomebackuz.com, © 2014 www.zamonaviy.uz, © 2011 xabar.org, © 2011, 2014 xayol.uz, © 2003, 2005-2008 Xinhua News Agency, © 2012 xorazamtibbiyoti.com, © 2014 xs.uz, © 2012 Yangi Dunya, © 2013 zamondosh.uz, © 2014, 2025 Trustees of the University of Pennsylvania
Subject:Uzbek language
Subject (ISO639):uzb
Type (DCMI):Software
Sound
Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2025T08
DateStamp:  2025-07-15
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Tracey, Jennifer; Strassel, Stephanie; Graff, David; Wright, Jonathan; Chen, Song; Ryant, Neville; Kulick, Seth; Delgado, Dana; Arrigo, Michael. 2025. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Software dcmi_Sound dcmi_Text iso639_eng iso639_uzb olac_primary_text

Inferred Metadata

Country: 
Area: 


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2025T08
Up-to-date as of: Wed Jul 16 1:19:49 EDT 2025