OLAC Record: AnnoDIFP Session Audio and Transcripts

OLAC Record
oai:www.ldc.upenn.edu:LDC2025S06

Metadata

Title: AnnoDIFP Session Audio and Transcripts

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Cieri, Christopher, et al. AnnoDIFP Session Audio and Transcripts LDC2025S06. Web Download. Philadelphia: Linguistic Data Consortium, 2025

Contributor: Cieri, Christopher

Fiumara, James

Walker, Kevin

Liberman, Mark

Ryant, Neville

Date (W3CDTF): 2025

Date Issued (W3CDTF): 2025-07-15

Description: *Introduction* AnnoDIFP (Annotated Data for the Investigation of Facets of Personality) Session Audio and Transcripts was developed by the Linguistic Data Consortium (LDC), the Florida Institute of Technology (FIT), and the University of New Haven (UNH) to support algorithm development for predicting personality traits. It contains 438.34 hours of English audio and transcripts from in-person interviews of 366 participants, paired with scores from two self-reported personality assessments: the HEXACO Personality Inventory (Revised) (HEXACO-PI-R) and the Short Dark Triad (SD3). Survey and behavioral data were collected in three phases. Phase 1 consisted of online questionnaires. Selected participants were invited to Phase 2a, which collected behavioral and linguistic data in a laboratory setting. In Phase 2b, participants took part in a telephone speech collection by calling other participants. This release covers the activities in Phase 2a.

*Data* In-person interviews were recorded at LDC, FIT and UNH. In each session, the participant and interviewer sat in separate sound-isolated rooms and communicated through audio/video hardware. Sessions consisted of the following tasks: rapport building, a YouTube task, a map task, and a business task. Further details on collection methodology and session tasks are contained in the documentation accompanying this release. There were a total of 386 participants in Phase 2a. This corpus contains audio data and transcripts from 301 participants and transcripts only for 65 participants. Recordings for the remaining 20 participants were not usable. Each session (or session part, in the case of multipart sessions) is accompanied by a transcript produced automatically using the Rev.ai speech-to-text service. Speech data is presented as 16 kHz, 16-bit mono-channel FLAC-compressed MS-WAV files. Text data is UTF-8 encoded.

*Samples* Please view these samples: Audio (flac), Transcript (tsv), Sections (tsv).

*Updates* No updates at this time.

Extent: Corpus size: 72000000 KB

Format: Sampling Rate: 16000 Hz

Format: Sampling Format: 16-bit FLAC

Identifier: LDC2025S06

https://catalog.ldc.upenn.edu/LDC2025S06

ISLRN: 831-339-304-772-0

DOI: 10.35111/kbj5-9864

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2025S06

Rights Holder: Portions © 2025 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (DCMI): Text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2025S06

DateStamp: 2025-08-11

GetRecord: OAI-PMH request for simple DC format
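
The GetRecord entries in this record refer to OAI-PMH requests whose link targets are not shown here. As a minimal sketch, such a request can be assembled from the OaiIdentifier above and a metadata prefix. The `verb`, `identifier`, and `metadataPrefix` arguments come from the OAI-PMH protocol; the base URL below is an assumption (it is not stated in this record), and the helper name `get_record_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: endpoint of the OAI-PMH service; this record does not state it.
BASE_URL = "http://www.language-archives.org/cgi-bin/olaca3.pl"

# Taken from the OaiIdentifier field of this record.
OAI_ID = "oai:www.ldc.upenn.edu:LDC2025S06"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# "olac" requests the OLAC format; "oai_dc" requests simple Dublin Core.
olac_url = get_record_url(OAI_ID, "olac")
dc_url = get_record_url(OAI_ID, "oai_dc")

print(olac_url)
print(dc_url)
```

The sketch only builds the URLs and does not contact any server. Fetching one of them (for example with `urllib.request.urlopen`) should return an XML document if the assumed endpoint is correct.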

Search Info
Citation: Cieri, Christopher; Fiumara, James; Walker, Kevin; Liberman, Mark; Ryant, Neville. 2025. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Sound dcmi_Text iso639_eng

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2025S06
Up-to-date as of: Wed Oct 29 7:02:17 EDT 2025
