OLAC Record: C-ORAL-ROM - Integrated reference corpora for spoken romance languages. Multi-media edition; tools of analysis; standard linguistic measurements for validation in HLT

OLAC Record
oai:catalogue.elra.info:ELRA-S0172

Metadata

Title: C-ORAL-ROM - Integrated reference corpora for spoken romance languages. Multi-media edition; tools of analysis; standard linguistic measurements for validation in HLT

Access Rights: Rights available for: nonCommercialUse, commercialUse

Date Available (W3CDTF): 2004-12-23

Date Issued (W3CDTF): 2004-12-23

Date Modified (W3CDTF): 2009-01-14

Description: The C-ORAL-ROM resource is a multilingual corpus of spontaneous (1) speech for the main Romance languages, totalling around 1,200,000 words (IST 2000-26228). The resource comprises three components:
a) Multimedia corpus;
b) Speech software;
c) Appendix.

The corpus consists of four comparable collections of recorded spontaneous speech sessions in Italian, French, Portuguese and Spanish (around 300,000 words for each language). The collections are delivered respectively by the following providers:
* Università di Firenze (Dipartimento di Italianistica, LABLITA);
* Université de Provence (Description Linguistique Informatisée sur Corpus);
* Fundação da Universidade de Lisboa/Centro de Linguística da Universidade de Lisboa;
* Universidad Autónoma de Madrid (Departamento de Lingüística, Lenguas Modernas, Lógica y F. de la Ciencia, Laboratorio de Lingüística Informática).

The C-ORAL-ROM corpus provides the acoustic source of each session together with the following main annotations:
* the orthographic transcription, in CHAT format, enriched with the tagging of terminal and non-terminal prosodic breaks;
* session metadata;
* the text-to-speech synchronization, in WIN PITCH CORPUS format, based on the alignment of each transcribed utterance.

The multimedia corpus comes with the speech software Win Pitch Corpus (© Pitch France. Minimal configuration: Pentium III, 1 GHz, 252 MB RAM, S-blaster or compatible sound card, running under Windows 2000 or XP only; GDPLUS.dll must be installed in the same directory as the program). (2)

A series of appendices is also provided, containing:
a) the purely textual corpus in .TXT and .XML format;
b) the PoS tagging of all transcriptions and the corresponding frequency lists of lemmas, in .TXT files;
c) a set of linguistic measurements extracted from the main corpus annotations, in Excel files;
d) the specifications and validation of the resource;
e) corpus metadata.

Package

1. DVDs 1 to 8 contain the multimedia corpus edition (DVDs 1-2 French; DVDs 3-4 Italian; DVDs 5-6 Portuguese; DVDs 7-8 Spanish). All collections have the same folder structure, which directly mirrors the C-ORAL-ROM corpus design (see below). For each session, the following files are delivered in folders:
* the uncompressed .WAV files (Windows PCM: 22,050 Hz; 16 bit);
* the .TXT file of the transcripts;
* the .XML file defining the text-to-speech alignment in WIN PITCH CORPUS format, and its .DTD.

2. The CD contains the speech software and the Appendix:
a) Speech software
* The speech software Win Pitch Corpus (10 licenses)
b) Appendix
* The C-ORAL-ROM transcription files in .TXT and .XML format
* The C-ORAL-ROM transcription files with PoS tagging in .TXT files
* The frequency list of lemmas for each language collection in .TXT files
* Measurements of spoken language variability in Excel files
* The corpus specifications: a) Corpus design; b) Metadata description; c) Dialogue representation format; d) Prosodic tagging; e) Alignment format; f) XML format; g) PoS tagging and lemma formats; h) Glossaries
* Resource validation reports
* Multimedia sample files

Main Features

The resource aims to represent the variety of speech acts performed in everyday language and to enable the induction of prosodic and syntactic structures in the four Romance languages, from both a quantitative and a qualitative point of view. The resource has been designed for prosodic modeling, test-bed procedures in HLT and corpus-based studies of spontaneous speech. C-ORAL-ROM offers added value at the following levels:
* Corpus design
* Metadata
* Dialogue representation
* Prosodic annotation
* PoS tagging
* Multimedia storage
* Speech analysis

CORPUS DESIGN

The corpus design of the C-ORAL-ROM resource aims to ensure that a large variety of speech act typologies and natural prosodic contours, the most characteristic linguistic features of spontaneous speech, have a chance to occur. To this end, the main variation parameters of the spoken domain (channel variation, dialogue structure, sociological domain of use, and semantic domain of application) are represented in a corpus design schema covering a wide range of semantic and pragmatic domains of application.

The four language collections are considered comparable insofar as they fit the corpus design schema. More specifically, each language collection in the C-ORAL-ROM corpus is consistent with the following average structure (check the documentation for deviations):

INFORMAL/150,000 words from at least 64 texts of 1500 words each and 10 texts of 4500 words each
INFORMAL/Family-Private context/124,500 words
INFORMAL/Family-Private context/Monologues/42,000 words
INFORMAL/Family-Private context/Dialogues-Conversations/82,500 words
INFORMAL/Public context/25,500 words
INFORMAL/Public context/Monologues/6,000 words
INFORMAL/Public context/Dialogues-Conversations/19,500 words
FORMAL/150,000 words
FORMAL/Formal in natural context/2 or 3 samples of 3000 words on average for each of the following typical domains of use, for 65,000 words in total
FORMAL/Formal in natural context/political speech
FORMAL/Formal in natural context/political debate
FORMAL/Formal in natural context/preaching
FORMAL/Formal in natural context/teaching
FORMAL/Formal in natural context/professional explanation
FORMAL/Formal in natural context/conference
FORMAL/Formal in natural context/business
FORMAL/Formal in natural context/law (through media allowed)
FORMAL/Media context/2 or 3 samples of 3000 words on average for each of the following typical domains of use, for 60,000 words in total
FORMAL/Media context/news (small sample)
FORMAL/Media context/meteo (small sample)
FORMAL/Media context/interviews
FORMAL/Media context/reportage
FORMAL/Media context/scientific press
FORMAL/Media context/sport talk shows
FORMAL/Media context/political debate
FORMAL/Media context/talk shows thematic discussions
FORMAL/Media context/talk shows culture
FORMAL/Media context/talk shows science
FORMAL/Telephone/25,000 words (3)
FORMAL/Telephone/private conversations
FORMAL/Telephone/phone to call services or man-machine interaction (10,000 words) (4)

METADATA

For each session a rich series of metadata is delivered in CHAT format, ensuring multitask exploitation of the resource for linguistics and human language technologies. The metadata contain essential information regarding the speakers, the recording situation, the topic, the acoustic quality and the source of the collected data.

DIALOGUE REPRESENTATION

The corpora are orthographically transcribed in a standard textual format (CHAT format; MacWhinney, 1994) with annotation of speaker turns. The textual string is divided into utterances. The main non-linguistic and paralinguistic acoustic events in the speech flow are reported in the transcripts.

PROSODIC ANNOTATION

The four Romance collections are completely tagged with respect to prosodic breaks. Terminal and non-terminal breaks are discriminated through perceptual judgments and reported in the transcripts. The level of inter-annotator agreement on prosodic tag assignment has been validated by an external institution.

MULTIMEDIA STORAGE

The multimedia storage ensures a natural and meaningful text/sound correspondence for prosodic modeling, test-bed procedures and corpus-based studies of spontaneous speech.

SPEECH SOFTWARE

Win Pitch Corpus is an innovative software program for computer-aided alignment of large corpora. It provides a method for easy and precise selection of alignment units, ranging from syllables to whole sentences, in a hierarchical storage system for aligned data. The method is based on the ability to visually link a moving target with the perception of the corresponding speech sound, played back at a rate reduced by at least 30%.

Segments derived from alignment can be defined on 8 independent layers, with automatic generation of the corresponding database, which can be saved directly in both XML and Excel formats. Besides text-to-speech alignment, Win Pitch Corpus, which is Unicode compliant, has numerous features allowing easy and efficient acoustic analysis of speech, such as real-time fundamental frequency tracking, spectrographic display and re-synthesis after editing of prosodic parameters.

For more information: http://www.elda.org/en/proj/coralrom.html
___________________
(1) As defined by C-ORAL-ROM: comprising formal and informal speech.
(2) ELDA does not take responsibility for software products that come with the distributed resources. Pitch France is fully responsible for this software.
(3) Text length not defined (preferably an upper limit of 1500 words, no lower limit).
(4) Field not present in the Portuguese corpus. The texts in this field are not delivered aligned to the acoustic source.
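
The word counts quoted in the corpus design schema above can be cross-checked with a short script. The following is a minimal sketch, not part of the distributed resource: the dictionary layout and the `total` helper are illustrative assumptions, and the figures are the nominal averages from the description (individual language collections may deviate, as the description itself notes).

```python
# Nominal word counts per language collection, copied from the corpus
# design schema in the description. The nesting is an illustrative choice.
design = {
    "INFORMAL": {
        "Family-Private context": {
            "Monologues": 42_000,
            "Dialogues-Conversations": 82_500,
        },
        "Public context": {
            "Monologues": 6_000,
            "Dialogues-Conversations": 19_500,
        },
    },
    "FORMAL": {
        "Formal in natural context": 65_000,
        "Media context": 60_000,
        "Telephone": 25_000,
    },
}


def total(node):
    """Sum the word counts of a leaf or of a nested dict of sub-categories."""
    if isinstance(node, int):
        return node
    return sum(total(child) for child in node.values())


informal = total(design["INFORMAL"])
formal = total(design["FORMAL"])
per_language = informal + formal
languages = ["Italian", "French", "Portuguese", "Spanish"]

print("informal:", informal)
print("formal:", formal)
print("per language:", per_language)
print("all collections:", per_language * len(languages))
```

Under these nominal figures the sub-categories add up to the stated 150,000 words each for the informal and formal parts, 300,000 words per language, and about 1,200,000 words overall.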

Identifier: ELRA-S0172

ISLRN: 318-977-046-077-4

Identifier (URI): https://catalog.elra.info/en-us/repository/browse/ELRA-S0172/

Language: French

Portuguese

Italian

Spanish; Castilian

Language (ISO639): fra

por

ita

spa

Medium: Not specified

Publisher: ELRA (European Language Resources Association)

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: ELRA Catalogue of Language Resources

Description: http://www.language-archives.org/archive/catalogue.elra.info

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:catalogue.elra.info:ELRA-S0172

DateStamp: 2004-12-23

GetRecord: OAI-PMH request for simple DC format
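
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch of how such a request URL is formed, the snippet below builds the query string from the record's OAI identifier. The base URL is a placeholder (the record does not state the endpoint), the function name is hypothetical, and the `olac` metadata prefix is assumed to be the one the repository uses for OLAC format; `oai_dc` is the standard OAI-PMH prefix for simple Dublin Core.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the record above does not give the OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# OAI identifier as listed in the record.
OAI_IDENTIFIER = "oai:catalogue.elra.info:ELRA-S0172"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# "olac" is an assumed prefix for the OLAC format; "oai_dc" is simple DC.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")

print(olac_url)
print(dc_url)
```

Replacing `BASE_URL` with the repository's actual endpoint would be needed before fetching anything; the sketch only shows how the three GetRecord arguments fit together.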

Search Info
Citation: n.a. 2004. ELRA (European Language Resources Association).
Terms: area_Europe country_ES country_FR country_IT country_PT dcmi_Sound iso639_fra iso639_ita iso639_por iso639_spa olac_primary_text

http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-S0172
Up-to-date as of: Wed Oct 1 0:55:25 EDT 2025
