OLAC Record: Arabic Broadcast News Transcripts

OLAC Record
oai:www.ldc.upenn.edu:LDC2006T20

Metadata

Title: Arabic Broadcast News Transcripts

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Maamouri, Mohamed, David Graff, and Christopher Cieri. Arabic Broadcast News Transcripts LDC2006T20. Web Download. Philadelphia: Linguistic Data Consortium, 2006

Contributor: Maamouri, Mohamed

Contributor: Graff, David

Contributor: Cieri, Christopher

Date (W3CDTF): 2006

Date Issued (W3CDTF): 2006-12-19

Description: *Introduction* Arabic Broadcast News Transcripts was developed by the Linguistic Data Consortium (LDC) and consists of 10 hours of transcribed speech from Voice of America satellite radio news broadcasts in Arabic, recorded by LDC between June 2000 and January 2001. The corresponding speech files are available in Arabic Broadcast News Speech (LDC2006S46).

This work was undertaken in the Networking Data Centers (NetDC) project (MLIS-5017, NSF IIS-9982201) in conjunction with the European Language Resources Association (ELRA). ELRA transcribed 22.5 hours of Arabic broadcast data from Radio Orient (France), which is available in NetDC Arabic BNSC (Broadcast News Speech Corpus) (ELRA-S0157). The goal of the NetDC project was to improve the infrastructure for language resources by designing and implementing new modes of cooperation between LDC and ELRA.

*Data* The text is encoded entirely in ASCII; the Arabic content is rendered in Buckwalter transliteration. Time alignment and structural markup are rendered via "pseudo-SGML" tags, which are presented one tag per line, with the first character of the line being an open angle bracket. The lines of transcription text (i.e., the speech and annotation content between the time-stamp tags) all begin with a single space character and present exactly one token per line. A "token" may be a spoken Arabic word, a punctuation mark, or a single Arabic word enclosed by "(%" and ")", which represents an annotation of a non-speech condition or event (e.g. "music", "noise", "laugh").

*Updates* None at this time.

*Samples* Please view this transcript sample.
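The line layout described under *Data* can be read with a very small classifier. The following is a minimal sketch, not LDC's own tooling: it relies only on the stated rules (tag lines start with "<", token lines start with one space, non-speech annotations are wrapped in "(%" and ")"). The names `Item` and `classify_lines` are hypothetical, and the sample lines, including the tag, are invented for illustration and are not taken from the corpus.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Item:
    kind: str  # "tag", "token", or "nonspeech"
    text: str  # raw tag line, token text, or annotation word


def classify_lines(lines: Iterable[str]) -> Iterator[Item]:
    """Classify transcript lines by their first character."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("<"):
            # Markup line: kept verbatim, because the tag inventory is
            # not specified in this record.
            yield Item("tag", line)
        elif line.startswith(" "):
            token = line[1:].strip()
            if not token:
                continue
            if token.startswith("(%") and token.endswith(")"):
                # Non-speech annotation: strip the "(%" and ")" wrapper.
                yield Item("nonspeech", token[2:-1].strip())
            else:
                # Spoken word or punctuation mark, still in Buckwalter form.
                yield Item("token", token)
        # Any other line (for example, an empty one) is skipped.


# Invented sample lines; the tag syntax here is an assumption.
sample = [
    "<time sec=0.00>",
    " (%mwsyqY)",
    " mrHbA",
    " .",
]
items = list(classify_lines(sample))
```

Words and punctuation are deliberately left in one "token" class: Buckwalter transliteration uses several ASCII punctuation characters as letters, so separating the two reliably would need the corpus documentation.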

Extent: Corpus size: 3584 KB

Identifier: LDC2006T20

Identifier: https://catalog.ldc.upenn.edu/LDC2006T20

ISBN: 1-58563-420-4

ISLRN: 476-762-568-967-9

DOI: 10.35111/afrz-9s73

Language: Standard Arabic

Language (ISO639): arb

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2006T20

Rights Holder: Portions © 2000, 2001, 2002, 2005, 2006 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2006T20

DateStamp: 2021-02-05

GetRecord: OAI-PMH request for simple DC format
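
The GetRecord entries above refer to OAI-PMH requests. As a rough sketch, such a request URL is built from the `GetRecord` verb, the record identifier, and a metadata prefix. The base URL below is a placeholder, not the archive's actual endpoint, and the helper name `get_record_url` is hypothetical. `oai_dc` is the standard OAI-PMH prefix for simple Dublin Core; the `olac` prefix is an assumption based on the "OLAC format" label.

```python
from urllib.parse import urlencode

# Placeholder endpoint: substitute the repository's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    # urlencode percent-encodes reserved characters such as ":".
    return f"{base_url}?{urlencode(params)}"


record_id = "oai:www.ldc.upenn.edu:LDC2006T20"
dc_url = get_record_url(record_id, "oai_dc")   # simple DC format
olac_url = get_record_url(record_id, "olac")   # assumed OLAC prefix
```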

Search Info
Citation: Maamouri, Mohamed; Graff, David; Cieri, Christopher. 2006. Linguistic Data Consortium.
Terms: area_Asia country_SA dcmi_Text iso639_arb olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2006T20
Up-to-date as of: Fri Aug 8 0:27:41 EDT 2025
