OLAC Record
oai:www.ldc.upenn.edu:LDC2023S01

Metadata
Title:AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Delgado, Dana, et al. AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts LDC2023S01. Web Download. Philadelphia: Linguistic Data Consortium, 2023
Contributor:Delgado, Dana
Walker, Kevin
Graff, David
Strassel, Stephanie
Date (W3CDTF):2023
Date Issued (W3CDTF):2023-01-17
Description:*Introduction*
AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts was developed by the Linguistic Data Consortium (LDC) and comprises approximately 156 hours of Ukrainian conversational telephone speech (CTS) and broadcast news (BN) audio with 1.2 million words of corresponding orthographic transcripts.
The broadcast recordings and transcripts were produced to support the DARPA AIDA (Active Interpretation of Disparate Alternatives) program, which aimed to develop a multi-hypothesis semantic engine to generate explicit alternative interpretations of events, situations and trends from a variety of unstructured sources. LDC supported AIDA by collecting, creating and annotating multimodal linguistic resources in multiple languages.
The telephone speech audio recordings were collected to support the NIST 2011 Language Recognition Evaluation (LRE), which focused on pair discrimination for 24 languages/dialects. These recordings are also contained in Multi-Language Conversational Telephone Speech 2011 – Slavic Group (LDC2016S11). The goal of NIST's LRE series is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field.
*Data*
The CTS audio data was generated from telephone calls by native Ukrainian speakers to acquaintances in their social network. It was collected using LDC's telephone infrastructure, which consisted of three computer telephony systems. Human auditors labeled calls for callee gender, dialect type and noise. All CTS audio files were originally collected as 2-channel u-law and were converted to 8 kHz 16-bit PCM and FLAC-compressed for release.
The BN data was taken from 87 news recordings broadcast by various Ukrainian sources. All BN audio files were originally collected as MP3 via web download or as live streaming broadcast captures and were downsampled to either 16 kHz or 22 kHz 16-bit PCM and FLAC-compressed for release.
Native Ukrainian speakers manually segmented the data into sentence-level units as part of the transcription process. All transcripts are delivered as tab-delimited *.tsv files that include metadata and statistics.
*Sponsorship*
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract Nos. HR0011-15-C-0123 and FA8750-18-C-0013. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA.
*Samples*
Please view these samples:
* Audio Sample (FLAC)
* Transcript Sample (TSV)
*Updates*
None at this time.
Extent:Corpus size: 10246124 KB
Format:Sampling Rate: CTS 8 kHz 16-bit PCM, BN 16 kHz or 22 kHz 16-bit PCM
Sampling Format: CTS 8 kHz 16-bit PCM, BN 16 kHz or 22 kHz 16-bit PCM
Identifier:LDC2023S01
https://catalog.ldc.upenn.edu/LDC2023S01
ISLRN: 699-485-644-732-3
DOI: 10.35111/qge4-4f15
Language:Ukrainian
Language (ISO639):ukr
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2023S01
Rights Holder:Portions © 2017 Crimean Radio and Television Company, © 2017-2018 Hromadske Radio, © 2017-2018 LiveOnlineRadio.Net, © 2017-2018 Radio of Ukraine, © 2017-2018 Radio Vesti, © 2017-2018 RFE/RL, Inc., © 2016, 2018, 2022, 2023 Trustees of the University of Pennsylvania
Type (DCMI):Sound
Text
Type (OLAC):primary_text
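
Note on the CTS format conversion: the description above states that CTS audio was originally captured as 2-channel u-law and converted to 16-bit PCM. The sketch below illustrates the standard G.711 u-law expansion for that kind of conversion. It is not LDC's actual pipeline; the function names are hypothetical, and the input is assumed to be headerless u-law bytes with the two channels interleaved sample by sample.

```python
from typing import List, Tuple


def ulaw_to_pcm16(byte: int) -> int:
    """Expand one G.711 u-law byte into a signed 16-bit linear PCM sample."""
    u = ~byte & 0xFF              # u-law bytes are stored bit-inverted
    sign = u & 0x80               # top bit: 1 means negative
    exponent = (u >> 4) & 0x07    # 3-bit segment number
    mantissa = u & 0x0F           # 4-bit step within the segment
    # 0x84 (132) is the bias added during encoding; remove it after scaling.
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_stereo_ulaw(raw: bytes) -> Tuple[List[int], List[int]]:
    """Decode interleaved 2-channel u-law bytes into two PCM sample lists.

    Assumption: even-indexed bytes belong to the first channel and
    odd-indexed bytes to the second, with no file header present.
    """
    samples = [ulaw_to_pcm16(b) for b in raw]
    return samples[0::2], samples[1::2]


if __name__ == "__main__":
    first, second = decode_stereo_ulaw(bytes([0xFF, 0x00, 0x80, 0xFF]))
    print(first)   # [0, 32124]
    print(second)  # [-32124, 0]
```

The released files are already FLAC-compressed 16-bit PCM, so a step like this would only matter when working with raw u-law captures of the same kind.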

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2023S01
DateStamp:  2024-01-01
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Delgado, Dana; Walker, Kevin; Graff, David; Strassel, Stephanie. 2023. Linguistic Data Consortium.
Terms: area_Europe country_UA dcmi_Sound dcmi_Text iso639_ukr olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2023S01
Up-to-date as of: Mon Mar 25 7:21:18 EDT 2024