OLAC Record
oai:www.ldc.upenn.edu:LDC2025S03

Metadata
Title:MATERIAL Kazakh-English Language Pack
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Bekkozhanova, Gulnar, et al. MATERIAL Kazakh-English Language Pack LDC2025S03. Web Download. Philadelphia: Linguistic Data Consortium, 2025
Contributor:Bekkozhanova, Gulnar
Bills, Aric
Chouder, Sarra
Jaralve, Vanessa
Corey, Cassian
Dubinski, Eyal
Ellis, Corinna
Gibby, Paul
Kazi, Michael
Lam, Julie
Le, Hanh
Malyska, Nicolas
Marcucci, Giorgia
Marvi, Sarah
McConnell, Sara
Melot, Jennifer
Mensch, Alyssa
Morrison, Michelle
Paget, Shelley
Ramizo, Katerina
Richardson, Frederick
Roberts, Annette
Rubino, Carl
Sarseke, Gulnar
Taubayev, Zharas
Date (W3CDTF):2025
Date Issued (W3CDTF):2025-04-15
Description:*Introduction* MATERIAL Kazakh-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 57 hours of Kazakh conversational telephone speech, transcripts, English translations, annotations and queries. The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries. *Data* The Kazakh speech in this release represents that spoken in the Northern and Southern dialect regions of Kazakhstan. Speakers were 18 years of age or older. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle. Transcripts cover approximately 17% of the speech data, and all transcripts were translated into English. Further information about transcription and translation methodologies is contained in the documentation accompanying this release. The language pack also includes English queries and their relevance annotations. Annotators marked transcripts by query type (simple, conceptual, hybrid) and by their relevance to query search terms. Speech data is presented mostly as two-channel WAV or single-channel SPHERE files, both in 8 kHz A-law format. Some WAV files are 48 kHz PCM. All text data is UTF-8 encoded. *Samples* * Kazakh Transcription Sample (TXT) * Romanized Kazakh Transcription Sample (TXT) * English Translation Sample (TXT) * Audio Sample (WAV) *Updates* None at this time.
Extent:Corpus size: 16425268 KB
Format:Sampling Rate: 8000
Sampling Format: alaw
Identifier:LDC2025S03
https://catalog.ldc.upenn.edu/LDC2025S03
ISLRN: 798-646-667-992-4
DOI: 10.35111/k4ey-kj75
Language:English
Kazakh
Language (ISO639):eng
kaz
License:MATERIAL Kazakh-English Agreement (For-Profit): https://catalog.ldc.upenn.edu/license/material-kazakh-english-agreement-for-profit.pdf
MATERIAL Kazakh-English Agreement (Non-Member): https://catalog.ldc.upenn.edu/license/material-kazakh-english-agreement-non-member.pdf
MATERIAL Kazakh-English Agreement (Not-For-Profit): https://catalog.ldc.upenn.edu/license/material-kazakh-english-agreement-not-for-profit.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2025S03
Rights Holder:Portions © 2025 U.S. Government, © 2025 Trustees of the University of Pennsylvania
Type (DCMI):Sound
Text
Type (OLAC):primary_text
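
The description above notes that most WAV files are 8 kHz A-law while some are 48 kHz PCM, so it can help to check each file's encoding before processing. The following is a minimal sketch, not part of the LDC release: the function name `read_wav_format` is hypothetical, and it assumes standard RIFF/WAVE headers, in which format tag 1 means PCM and tag 6 means A-law. It parses the header directly because Python's built-in `wave` module reads only PCM data. It does not handle the SPHERE files, which use a different header.

```python
import struct

# Format tags defined by the RIFF/WAVE format.
FORMAT_NAMES = {1: "PCM", 6: "A-law", 7: "mu-law"}


def read_wav_format(stream):
    """Return (encoding name, channel count, sample rate) for a RIFF/WAVE stream.

    `stream` is a binary file-like object positioned at the start of the file.
    """
    riff, _size, wave_id = struct.unpack("<4sI4s", stream.read(12))
    if riff != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    while True:
        header = stream.read(8)
        if len(header) < 8:
            raise ValueError("no fmt chunk found")
        chunk_id, chunk_size = struct.unpack("<4sI", header)
        if chunk_id == b"fmt ":
            fmt = stream.read(chunk_size)
            # The fmt chunk starts with: format tag, channels, sample rate.
            tag, channels, rate = struct.unpack("<HHI", fmt[:8])
            return FORMAT_NAMES.get(tag, f"unknown ({tag})"), channels, rate
        # Skip any other chunk; chunk data is padded to an even length.
        stream.seek(chunk_size + (chunk_size & 1), 1)
```

For a file opened with `open(path, "rb")`, a result of `("A-law", 2, 8000)` would correspond to the two-channel 8 kHz A-law case described above, and `("PCM", ..., 48000)` to the 48 kHz PCM case.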

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2025S03
DateStamp:  2025-04-15
GetRecord:  OAI-PMH request for simple DC format
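
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch of how such a request URL is formed, the helper below assembles the three arguments the protocol defines for GetRecord (`verb`, `identifier`, `metadataPrefix`). The function name and `BASE_URL` are placeholders, since the endpoint address is not given in this record. `oai_dc` is the standard prefix for simple Dublin Core; `olac` is assumed to be the prefix for the OLAC format.

```python
from urllib.parse import urlencode


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Placeholder endpoint (assumption): substitute the repository's real base URL.
BASE_URL = "https://example.org/oai"

print(get_record_url(BASE_URL, "oai:www.ldc.upenn.edu:LDC2025S03", "oai_dc"))
```

The colons in the identifier are percent-encoded as `%3A` by `urlencode`, which is the encoding OAI-PMH expects for reserved characters in request arguments.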

Search Info

Citation: Bekkozhanova, Gulnar; Bills, Aric; Chouder, Sarra; Jaralve, Vanessa; Corey, Cassian; Dubinski, Eyal; Ellis, Corinna; Gibby, Paul; Kazi, Michael; Lam, Julie; Le, Hanh; Malyska, Nicolas; Marcucci, Giorgia; Marvi, Sarah; McConnell, Sara; Melot, Jennifer; Mensch, Alyssa; Morrison, Michelle; Paget, Shelley; Ramizo, Katerina; Richardson, Frederick; Roberts, Annette; Rubino, Carl; Sarseke, Gulnar; Taubayev, Zharas. 2025. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_GB country_KZ dcmi_Sound dcmi_Text iso639_eng iso639_kaz olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2025S03
Up-to-date as of: Wed Apr 16 0:09:24 EDT 2025