OLAC Record: USC-SFI MALACH Interviews and Transcripts English

OLAC Record
oai:www.ldc.upenn.edu:LDC2019S11

Metadata

Title: USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Ramabhadran, Bhuvana, et al. USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition LDC2019S11. Web Download. Philadelphia: Linguistic Data Consortium, 2019

Contributor: Ramabhadran, Bhuvana

Gustman, Samuel

Byrne, William

Hajič, Jan

Oard, Douglas

Olsson, J. Scott

Picheny, Michael

Psutka, Josef

Date (W3CDTF): 2019

Date Issued (W3CDTF): 2019-06-17

Description: *Introduction*
USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition was developed by IBM as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. This edition augments USC-SFI MALACH Interviews and Transcripts English (LDC2012S05) by modifying and updating a subset of the original corpus for use with the Kaldi toolkit in speech recognition work; it is also easily portable to other speech recognition systems. It contains approximately 168 hours of interviews from 682 Holocaust witnesses, along with transcripts, a lexicon, Kaldi-specific files, and other documentation.

Inspired by his experience making Schindler’s List, Steven Spielberg established the Survivors of the Shoah Visual History Foundation in 1994 to gather video testimonies from survivors and other witnesses of the Holocaust. While most of those who gave testimony were Jewish survivors, the Foundation also interviewed homosexual survivors, Jehovah’s Witness survivors, liberators and liberation witnesses, political prisoners, rescuers and aid providers, Roma and Sinti (Gypsy) survivors, survivors of eugenics policies, and war crimes trials participants. The Foundation’s Visual History Archive holds nearly 55,000 video testimonies in 43 languages, representing 65 countries; it is the largest archive of its kind in the world. In 2006, the Foundation became part of the Dana and David Dornsife College of Letters, Arts and Sciences at the University of Southern California in Los Angeles and was renamed the USC Shoah Foundation Institute for Visual History and Education.

The goal of the MALACH project was to develop methods for improved access to large multinational spoken archives; the focus was advancing the state of the art in automatic speech recognition and information retrieval. The characteristics of the USC-SFI collection -- unconstrained, natural speech filled with disfluencies, heavy accents, age-related coarticulations, un-cued speaker and language switching, and emotional speech -- were considered well suited to that task. The work centered on five languages: English, Czech, Russian, Polish and Slovak. LDC has also released USC-SFI MALACH Interviews and Transcripts Czech (LDC2014S04).

*Data*
The original MALACH English data set (LDC2012S05) consists of unsegmented audio interviews in mp2 format and speaker-turn, time-marked transcripts in Transcriber (.trs) format presented in a single flat file. In this release, the speech files are segmented and converted to flac format, and the transcripts are updated to an utterance-by-utterance format. Additionally, a lexicon mapping words to phonemes is provided, and the data is divided into development and training sets. See the included documentation for more details on these changes, and the documentation and catalog entry for LDC2012S05 for further information about the source files.

*Samples*
Please view the following samples. Approximately 40 seconds of silence was left at the start of the speech file to keep the time stamps accurate.
* Speech
* Segments
* Transcript

*Updates*
None at this time.
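As a rough illustration of the segmented, utterance-by-utterance layout described under *Data*, the sketch below reads lines in the conventional Kaldi `segments` layout (`<utterance-id> <recording-id> <start-seconds> <end-seconds>`) and totals the speech duration. That the files in this release follow exactly this layout is an assumption; the included documentation is authoritative. The utterance and recording IDs in the usage lines are invented for illustration.

```python
from typing import Iterable, List, NamedTuple


class Segment(NamedTuple):
    utt_id: str   # utterance identifier (hypothetical naming)
    rec_id: str   # recording the utterance was cut from
    start: float  # start time in seconds within the recording
    end: float    # end time in seconds within the recording


def parse_segments(lines: Iterable[str]) -> List[Segment]:
    """Parse lines assumed to follow the Kaldi 'segments' convention."""
    segments = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        utt_id, rec_id, start, end = parts
        segments.append(Segment(utt_id, rec_id, float(start), float(end)))
    return segments


def total_seconds(segments: Iterable[Segment]) -> float:
    """Sum of segment durations in seconds."""
    return sum(s.end - s.start for s in segments)


# Made-up example lines, not taken from the corpus.
example = [
    "utt-0001 rec-A 40.0 43.5",
    "utt-0002 rec-A 43.5 50.0",
]
segs = parse_segments(example)
print(len(segs), total_seconds(segs))
```

Keeping start and end offsets relative to the full recording is what makes the leading silence mentioned under *Samples* matter: trimming it would shift every time stamp.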

Extent: Corpus size: 13481632 KB

Format: Sampling Rate: 16000

Sampling Format: pcm

Identifier: LDC2019S11

https://catalog.ldc.upenn.edu/LDC2019S11

ISBN: 1-58563-889-7

ISLRN: 465-555-380-050-7

DOI: 10.35111/mq64-hm19

Language: English

Language (ISO639): eng

License: USC-SFI MALACH Interviews and Transcripts English - Speech Recognition Edition For-Profit Member Agreement: https://catalog.ldc.upenn.edu/license/usc-sfi-malach-interviews-and-transcripts-english-speech-recognition-edition-for-profit-member-agreement.pdf

USC-SFI MALACH Interviews and Transcripts English - Speech Recognition Edition Non-Member Agreement: https://catalog.ldc.upenn.edu/license/usc-sfi-malach-interviews-and-transcripts-english-speech-recognition-edition-non-member-agreement.pdf

USC-SFI MALACH Interviews and Transcripts English - Speech Recognition Edition Not-for-Profit Member Agreement: https://catalog.ldc.upenn.edu/license/usc-sfi-malach-interviews-and-transcripts-english-speech-recognition-edition-not-for-profit-member-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2019S11

Rights Holder:
Portions © 2012, 2019 USC Shoah Foundation Institute, © 2012, 2019 Trustees of the University of Pennsylvania

The USC-SFI Malach Data is from the archive of the University of Southern California Shoah Foundation Institute for Visual History and Education.

Type (DCMI): Sound

Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2019S11

DateStamp: 2022-12-22

GetRecord: OAI-PMH request for simple DC format
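
The GetRecord entries above correspond to OAI-PMH requests. A minimal sketch of how such a request URL is assembled follows, using the protocol's `verb`, `identifier`, and `metadataPrefix` parameters. The base URL below is a placeholder, not the archive's actual endpoint, and the prefixes `oai_dc` (simple DC) and `olac` (OLAC format) are assumed to be the ones this repository accepts.

```python
from urllib.parse import urlencode


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Hypothetical endpoint; substitute the repository's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

for prefix in ("oai_dc", "olac"):
    print(get_record_url(BASE_URL, "oai:www.ldc.upenn.edu:LDC2019S11", prefix))
```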

Search Info
Citation: Ramabhadran, Bhuvana; Gustman, Samuel; Byrne, William; Hajič, Jan; Oard, Douglas; Olsson, J. Scott; Picheny, Michael; Psutka, Josef. 2019. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Sound dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2019S11
Up-to-date as of: Wed Oct 29 7:01:54 EDT 2025
