OLAC Record: Hansard French/English

OLAC Record
oai:www.ldc.upenn.edu:LDC95T20

Metadata

Title: Hansard French/English

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Roukos, Salim, David Graff, and Dan Melamed. Hansard French/English LDC95T20. Web Download. Philadelphia: Linguistic Data Consortium, 1995

Contributor: Roukos, Salim

Graff, David

Melamed, Dan

Date (W3CDTF): 1995

Description: The Hansard Corpus consists of parallel texts in English and Canadian French, drawn from official records of the proceedings of the Canadian Parliament. While the content is therefore limited to legislative discourse, it spans a broad assortment of topics, and the stylistic range includes spontaneous discussion and written correspondence along with legislative propositions and prepared speeches.

The collection presented here was assembled by the LDC from archives obtained from two distinct secondary sources. Material from one time period of parliamentary proceedings was acquired through the IBM T. J. Watson Research Center, while material from another period was acquired through Bell Communications Research Inc. (Bellcore). The combined collection covers a time span from the mid-1970s through 1988, with no apparent duplication between the two data sources.

Aside from covering different time periods, the two archives are organized differently and have undergone different amounts and kinds of processing in being prepared as a parallel language resource. In addition, the Bellcore set itself comprises two distinct types of data: one appears to be the main parliamentary proceedings (similar in nature to the IBM set), while the other consists of transcripts from committee hearings. The three sets have been kept distinct in this publication, and each is described in greater detail in separate documentation files.

What the three sets have in common:
* They are rendered here using the 8-bit ISO-Latin1 character encoding standard.
* They use a minimal amount of SGML tagging to identify sentences or paragraphs.
* All sets are organized using a parallel file structure, in which the content of a given English text file is matched by the content of a corresponding French text file.
* The SGML text files for the IBM and the Bellcore committee-hearings data are published in compressed form, using the public-domain GNU-Zip utility (gzip). The Bellcore main-session files are not compressed.

How the three sets differ:
* The IBM collection is presented as a sequence of parallel sentences (there are nearly 2.87 million parallel sentence pairs in the set).
* The Bellcore data are presented as sequences of paragraphs.
* The Bellcore main-session data are accompanied by mapping files that provide computed paragraph alignments and word-token correspondences; no additional alignment data are provided for the Bellcore committee texts (and none are needed for the IBM sentences).
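The encoding and compression details in the description can be illustrated with a minimal Python sketch for reading one English/French file pair. It rests on assumptions that this record does not confirm: the function names are hypothetical, the paths are supplied by the caller, and the two files are assumed to correspond line by line. The actual file layout and SGML tagging are described in the corpus documentation, and this sketch does not parse the tags.

```python
import gzip
from itertools import zip_longest


def read_lines(path, compressed=True):
    """Yield lines from one corpus file, decoded as ISO-8859-1 (ISO-Latin1).

    Set compressed=False for files that are not gzip-compressed, such as the
    Bellcore main-session files mentioned in the description.
    """
    opener = gzip.open if compressed else open
    with opener(path, mode="rt", encoding="iso-8859-1") as handle:
        for line in handle:
            yield line.rstrip("\n")


def read_parallel(english_path, french_path, compressed=True):
    """Yield (english, french) line pairs.

    Assumption: line N of the English file matches line N of the French file.
    """
    english = read_lines(english_path, compressed)
    french = read_lines(french_path, compressed)
    for en, fr in zip_longest(english, french):
        if en is None or fr is None:
            raise ValueError("files have different line counts")
        yield en, fr
```

Decoding explicitly as ISO-8859-1 avoids the errors or mis-rendered accented characters that can result from reading these files with a UTF-8 default.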

Identifier: LDC95T20

https://catalog.ldc.upenn.edu/LDC95T20

ISBN: 1-58563-048-9

ISLRN: 711-183-299-010-5

DOI: 10.35111/jhgn-rv21

Language: French

English

Language (ISO639): fra

eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC95T20

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC95T20

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
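
The GetRecord entries above refer to OAI-PMH requests. As a rough sketch, such a request is an HTTP GET whose query string carries the `verb`, `identifier`, and `metadataPrefix` arguments defined by the protocol. The base URL below is a hypothetical placeholder, not the real endpoint for this archive, and `get_record_url` is an illustrative helper name. The prefixes `olac` and `oai_dc` are assumed to correspond to the OLAC and simple DC formats named in this record.

```python
from urllib.parse import urlencode

# Hypothetical placeholder; substitute the repository's actual OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Example: request this record in simple Dublin Core.
url = get_record_url("oai:www.ldc.upenn.edu:LDC95T20", "oai_dc")
```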

Search Info
Citation: Roukos, Salim; Graff, David; Melamed, Dan. 1995. Linguistic Data Consortium.
Terms: area_Europe country_FR country_GB dcmi_Text iso639_eng iso639_fra olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC95T20
Up-to-date as of: Wed Oct 29 7:00:34 EDT 2025
