OLAC Record: Broadcast News Lattices

OLAC Record
oai:www.ldc.upenn.edu:LDC2011T06

Metadata

Title: Broadcast News Lattices

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Zweig, Geoffrey, Damianos Karakos, and Patrick Nguyen. Broadcast News Lattices LDC2011T06. Web Download. Philadelphia: Linguistic Data Consortium, 2011

Contributor: Zweig, Geoffrey

Karakos, Damianos

Nguyen, Patrick

Date (W3CDTF): 2011

Date Issued (W3CDTF): 2011-04-15

Description: *Introduction*

Broadcast News Lattices, Linguistic Data Consortium (LDC) catalog number LDC2011T06 and ISBN 1-58563-578-2, was developed by researchers at Microsoft and Johns Hopkins University (JHU) for the Johns Hopkins 2010 Summer Workshop on Speech Recognition with Conditional Random Fields. The lattices were generated using the IBM Attila speech recognition toolkit and were derived from transcripts of approximately 400 hours of English broadcast news recordings. They are intended to be used for training and decoding with SCARF, Microsoft's segmental CRF toolkit for speech recognition.

The goal of the JHU 2010 workshop was to advance the state of the art in core speech recognition by developing new kinds of features for use in a Segmental Conditional Random Field (SCRF). The SCRF approach generalizes Conditional Random Fields to operate at the segment level rather than at the traditional frame level. Every segment is labeled directly with a word. Features are then extracted, each of which measures some form of consistency between the underlying audio and the word hypothesis for a segment. These are combined in a log-linear model (lattice) to produce the posterior probability of a word sequence given the audio.

*Data*

Broadcast News Lattices consists of training and test material, the source data for which was taken from various corpora distributed by LDC.
Training Data

The training lattices total 152251 and were derived from the following data sets:

1996 English Broadcast News Speech LDC97S44
1996 English Broadcast News Transcripts (HUB4) LDC97T22 (104 hours)
1997 English Broadcast News Speech (HUB4) LDC98S71
1997 English Broadcast News Transcripts (HUB4) LDC98T28 (97 hours)
TDT4 Multilingual Broadcast News Speech Corpus LDC2005S11
TDT4 Multilingual Text and Annotations LDC2005T16 (300 hours)

The lattices can be related to the original audio files via the file train.db.gz, which lists for each segment a tag name, segment number, the original audio file, channel (always 0), start time, and end time (in seconds). A sample line is as follows:

19960510_NPR_ATC#Ailene_Leblanc 0001 19960510_NPR_ATC.sph 0 76.767 89.404

This sample line corresponds to the release lattice labeled:

19960510_NPR_ATC#Ailene_Leblanc@0001.dc

The file train.Bdc contains denominator lattices. The file train.Bnc has the numerator lattices, which contain the subset of paths consistent with the training transcriptions. The file train.Btr consists of the transcriptions. The file train.Bbase contains the baseline (one-best) word detections from the Attila system.

The lattices were generated from an acoustic model that included LDA+MLLT, VTLN, fMLLR-based SAT training, fMMI and mMMI discriminative training, and MLLR.

The lattices are annotated with a field indicating the results of a second confirmatory decoding made with an independent speech recognizer. When a lattice link corresponded to the 1-best secondary output, the link was annotated with +1. Silence links are annotated with 0 and all others with -1. Correspondence was computed by finding the midpoint of a lattice link and comparing the link label with that of the word in the secondary decoding at that position. Thus, there are some cases where the same word, shifted slightly in time, receives a different confirmation score.
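Two mechanical details of the training data, the layout of a train.db.gz line and the midpoint-based confirmation rule, can be sketched in Python. This is a minimal illustration under stated assumptions, not code shipped with the corpus or with SCARF: the names DbEntry, parse_db_line, lattice_name, word_at and confirm_link are hypothetical, the secondary decoding is assumed to be a list of (start, end, word) tuples in the same time units as the lattice links, the silence label "~SIL" is taken from the sample lattice, and treating an empty label as silence is a guess.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class DbEntry(NamedTuple):
    tag: str         # e.g. "19960510_NPR_ATC#Ailene_Leblanc"
    segment: str     # kept as text so the zero padding ("0001") survives
    audio_file: str  # original audio file
    channel: int     # always 0 according to the description
    start: float     # seconds
    end: float       # seconds


def parse_db_line(line: str) -> DbEntry:
    """Parse one whitespace-separated line of the decompressed db file."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(fields)}")
    # Anything after the sixth field (assumed to be residue) is ignored.
    tag, segment, audio_file, channel, start, end = fields[:6]
    return DbEntry(tag, segment, audio_file, int(channel), float(start), float(end))


def lattice_name(entry: DbEntry, suffix: str = "dc") -> str:
    """Build the lattice label '<tag>@<segment>.<suffix>'."""
    return f"{entry.tag}@{entry.segment}.{suffix}"


Word = Tuple[float, float, str]  # (start, end, word) in the secondary decoding


def word_at(decoding: Sequence[Word], t: float) -> Optional[str]:
    """Return the first secondary-decoding word whose span covers time t."""
    for start, end, word in decoding:
        if start <= t <= end:
            return word
    return None


def confirm_link(start: float, end: float, label: str,
                 decoding: Sequence[Word], silence: str = "~SIL") -> int:
    """Score a link: 0 for silence, +1 if the word at its midpoint matches, else -1."""
    if label in ("", silence):  # empty label treated as silence (assumption)
        return 0
    midpoint = (start + end) / 2.0
    return 1 if word_at(decoding, midpoint) == label else -1


entry = parse_db_line(
    "19960510_NPR_ATC#Ailene_Leblanc 0001 19960510_NPR_ATC.sph 0 76.767 89.404"
)
print(lattice_name(entry))  # 19960510_NPR_ATC#Ailene_Leblanc@0001.dc

# Hypothetical secondary decoding: the same label, shifted slightly, scores differently.
secondary = [(100, 110, "THE"), (111, 120, "CAT")]
print(confirm_link(105, 113, "CAT", secondary))  # -1 (midpoint 109 falls on THE)
print(confirm_link(108, 120, "CAT", secondary))  # 1 (midpoint 114 falls on CAT)
```

The second half of the example shows why a link's score depends only on where its midpoint lands: two links carrying the same word can receive different confirmation values when their boundaries differ by a few units.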
Test Data

The test lattices are derived from the English broadcast news material in 2003 NIST Rich Transcription Evaluation Data LDC2007S10. Bbase and Bdc files are provided, along with the db file rt03.db.gz to link the segments to times in the original waveform. Scoring scripts may be obtained from the NIST Rich Transcription website.

*SCARF Toolkit*

The SCARF toolkit is available for download from the SCARF website.

*Related Publications*

A full description of the lattice generation process can be found in Zweig et al., Speech Recognition with Segmental Conditional Random Fields: Final Report from the 2010 JHU Summer Workshop, MSR Technical Report MSR-TR-2010-173.

*Updates*

Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus at LDC2011T06.

*Samples*

Source Denominator Lattices

20010206_1830_1900_ABC_WNT#aaron_brown@0001.base
# baseline 2 A 5 HALF 20 CENTURY 56 AGO 95 LORRAINE 132 WAGNER 175 WAS 207 A 219 KID 239 WITH 263 A 270 CRUSH 300 THE 376 OBJECT 416 OF 446 HER 458 AFFECTION 497 AND 565 HER 583 CONSIDERABLE 637 ATTENTION 716 WAS 817 A 826 HUNKY 847 YOUNG 880 ACTOR 909 NAMED 934 RONALD 960 REAGAN 995 1012

20010206_1830_1900_ABC_WNT#aaron_brown@0001.dc
1 2 confirm=0
3 5 A confirm=1
6 31 HALF confirm=1
32 77 CENTURY confirm=1
78 110 AGO confirm=1
111 151 LORRAINE confirm=1
111 151 LORAINE confirm=-1
152 196 WAGNER confirm=1
197 212 WAS confirm=1
197 215 WAS confirm=1
213 221 THE confirm=-1
216 219 A confirm=-1
220 253 KIT confirm=-1
220 254 KIT confirm=-1
220 255 KID confirm=-1
222 253 KIT confirm=-1
222 254 KIT confirm=-1
222 255 KID confirm=1
254 265 WITH confirm=1
254 267 WITH confirm=1
255 265 WITH confirm=1
255 267 WITH confirm=1
256 265 WITH confirm=1
256 267 WITH confirm=1
266 272 THE confirm=-1
268 270 A confirm=-1
271 327 CRUSH confirm=-1
271 327 CRASH confirm=-1
273 327 CRUSH confirm=-1
328 360 ~SIL confirm=0

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.
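For readers who want to inspect the denominator lattice sample programmatically, the following Python sketch parses its links. It assumes, as the sample suggests but the record does not state, that each link occupies one line of the form `start end [WORD] confirm=N`, with the word missing on some links. The names Link, LINK_RE and parse_link are hypothetical, and the units of the start and end fields are not given in this record.

```python
import re
from collections import Counter
from typing import NamedTuple


class Link(NamedTuple):
    start: int
    end: int
    label: str    # empty string when the link carries no word
    confirm: int  # 1, 0 or -1, per the confirmation annotation


# start, end, optional word, then the confirm field (an optional sign is allowed).
LINK_RE = re.compile(r"^(\d+)\s+(\d+)\s+(?:(\S+)\s+)?confirm=([+-]?\d+)$")


def parse_link(line: str) -> Link:
    """Parse one link line; raise ValueError if it does not fit the assumed layout."""
    match = LINK_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized link line: {line!r}")
    start, end, label, confirm = match.groups()
    return Link(int(start), int(end), label or "", int(confirm))


# A few links copied from the .dc sample above.
sample = """\
1 2 confirm=0
3 5 A confirm=1
111 151 LORRAINE confirm=1
111 151 LORAINE confirm=-1
220 255 KID confirm=-1
222 255 KID confirm=1
328 360 ~SIL confirm=0
"""

links = [parse_link(line) for line in sample.splitlines() if line.strip()]
counts = Counter(link.confirm for link in links)
print(counts[1], counts[0], counts[-1])  # 3 2 2
```

The two KID links in this excerpt differ only in their start position yet carry different confirm values, which is the effect the Training Data section attributes to the midpoint comparison.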

Extent: Corpus size: 945152 KB

Identifier: LDC2011T06

https://catalog.ldc.upenn.edu/LDC2011T06

ISBN: 1-58563-578-2

ISLRN: 990-903-200-829-9

DOI: 10.35111/nvbn-sz63

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2011T06

Rights Holder: Portions ©1996-1998, 2000-2001 American Broadcasting Company, Inc., © 1996-1998, 2000-2001 Cable News Network LP, LLLP, © 2000-2001 National Broadcasting Company, © 1996-1998 National Public Radio, Inc., © 1996-1998 National Satellite Cable Corporation, © 1996-1998, 2005, 2007, 2011 Trustees of the University of Pennsylvania

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2011T06

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Zweig, Geoffrey; Karakos, Damianos; Nguyen, Patrick. 2011. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2011T06
Up-to-date as of: Wed Oct 29 7:01:15 EDT 2025
