CORPORA from CSLU: Stories v1.2

The Stories Corpus is made up of extemporaneous speech collected from English speakers in the CSLU Multi-language Telephone Speech data collection. Each speaker was asked to speak on a topic of their choice for one minute. These utterances make up the Stories Corpus.

Recording Details:
The data were recorded from an analog line using a Gradient Technologies analog-to-digital conversion box. The file format is 8 kHz, 16-bit linear, with a 1024-byte NIST Sphere header.

File Naming Convention:
File names follow this convention: ENcall-1003-G.story-bt.txt

The first field ("ENcall") is the prefix indicating the corpus to which the data belong, and the second field (the number following the first hyphen) is a unique ID number for the speaker. The remaining fields can be ignored.

The audio and text files are subdivided into directories named for the call number divided by 10, with the remainder discarded. For example, the files for call 103 are found in the /10 subdirectory.
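
As a rough illustration of the naming and directory rules above, the following Python sketch splits a file name into its first two fields and computes a subdirectory name by integer division. The function names are hypothetical, and the assumption that the leading fields are separated by hyphens is taken only from the example file name; none of this is part of the corpus distribution.

```python
def parse_name(filename):
    """Return (corpus_prefix, speaker_id) from a corpus file name.

    Assumes the first two fields are separated by hyphens, as in the
    example name; all remaining fields are ignored.
    """
    prefix, speaker_id = filename.split("-")[:2]
    return prefix, int(speaker_id)


def call_subdirectory(call_number):
    """Subdirectory name: the call number integer-divided by 10."""
    return str(call_number // 10)


print(parse_name("ENcall-1003-G.story-bt.txt"))
print(call_subdirectory(103))
```

With call number 103, the integer division discards the remainder 3 and leaves 10, which matches the /10 subdirectory named above.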

The /trans and /labels directory file structures exactly parallel the structure of the /speech directory.

File Formats:
The .wav files use the standard RIFF file format with 16-bit linear encoding.

The text transcriptions were performed according to the non time-aligned word-level conventions described in the CSLU Labeling Guide.

Phonetic transcriptions are plain text files that carry time-aligned phonetic labels. The first two lines of the file are a header that defines the length of a "frame" in milliseconds. The rest of the file consists of lines that each contain two numbers defining a frame range, followed by the label that applies to that range. For example:


MillisecondsPerFrame: 1.000000
2 113 .pau
113 191 w
191 267 ^
267 395 n

So, we can see here that a frame corresponds to 1 millisecond (ms) of time, and that from 2 to 113 ms into the file, there is a pause (.pau), with the first phoneme (w) starting at 113 ms and stretching to 191 ms.
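
A minimal sketch of a reader for this format, in Python, is given here as an illustration only. It assumes the header line has the form "MillisecondsPerFrame: <number>", silently skips any other line that is not of the form "<begin> <end> <label>", and multiplies frame numbers by the frame length to obtain milliseconds. The names parse_labels and Segment are hypothetical and do not come from any CSLU tool.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    begin_ms: float
    end_ms: float
    label: str


def parse_labels(text: str) -> List[Segment]:
    """Parse a time-aligned label file into segments measured in ms."""
    ms_per_frame = 1.0  # assumption: fall back to 1 ms if no header is found
    segments = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("MillisecondsPerFrame:"):
            ms_per_frame = float(line.split(":", 1)[1])
            continue
        # Split into at most three parts so a label may contain spaces.
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            begin, end = int(parts[0]), int(parts[1])
            segments.append(
                Segment(begin * ms_per_frame, end * ms_per_frame, parts[2])
            )
    return segments


example = """MillisecondsPerFrame: 1.000000
2 113 .pau
113 191 w
191 267 ^
267 395 n
"""

for seg in parse_labels(example):
    print(seg.begin_ms, seg.end_ms, seg.label)
```

Because the word-level files are described as using the same layout, the same sketch should apply to them under the same assumptions.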

The word-level transcription files follow the same format, with word labels in place of the phonetic labels. The .com files that are found with the .wrd files contain information about breathing during the speech. They are in a similar time-aligned format.

The lola files are ASCII "location and label" files. They are similar to the ".phn" files of the TIMIT database except:

  1. the locations are given in a unit of time other than the sample.
  2. there is a short header saying what this unit is.

Each file in this distribution has the header:
MillisecondsPerFrame: 3.0

After that are a series of lines, one per segment, of the form:
[begin frame] [end frame + 1] label

For example:
200 237 ah
237 289 m
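
The segment lines above can be converted to times with a little arithmetic. The Python sketch below assumes the 3.0 ms frame length from the header, treats the second number as one past the last frame of the segment (per the line format given above), and further assumes that frame n begins n frame lengths into the file. The function name segment_span is hypothetical.

```python
MS_PER_FRAME = 3.0  # from the "MillisecondsPerFrame: 3.0" header


def segment_span(begin_frame, end_frame_plus_one, ms_per_frame=MS_PER_FRAME):
    """Return (last_frame, start_ms, end_ms, duration_ms) for one segment."""
    last_frame = end_frame_plus_one - 1         # last frame inside the segment
    start_ms = begin_frame * ms_per_frame       # assumed start time of the segment
    end_ms = end_frame_plus_one * ms_per_frame  # where the next segment begins
    return last_frame, start_ms, end_ms, end_ms - start_ms


print(segment_span(200, 237))
```

For the line "200 237 ah" this gives a last frame of 236 and a length of 37 frames, or 111 ms under the stated assumptions.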

The [ah] segment extends from frame 200 to frame 236 inclusive. The end label is 237 for historical reasons.

The Center for Spoken Language Understanding (CSLU) distributes corpora to commercial entities and academic institutions for a fee. Commercial entities can use these corpora for research and also for creating commercial products, such as acoustic models for speech recognition.


To place your order:

1. Click on the type of license you wish to order: "Academic or non-profit entity" or "Commercial entity".

2. Terms of the license agreement can be viewed by clicking on the word "terms".

3. You agree to the terms of the license agreement when you click on "Add to Order" and proceed to the next screen.

4. If information on the "Order Contents" screen is correct, press "Check out".

5. On the next screen, a brief "Intended Use" statement is required. Under "Recipient Scientist Information", enter your own details or, if you are placing the order for another person, that person's details. We will use this information if we have questions about the order, payment, or shipping address.

6. Once your payment has been received and verified by OHSU, your order will be approved by Technology Transfer & Business Development, and the Center for Spoken Language Understanding will then ship the DVD by FedEx within 5-10 business days.


For demos and more information, visit the CSLU Corpora website at:


Files will be made available by download, which requires customers to set up a free account.

For Information, Contact:
Arvin Paranjpe
Technology Development Manager
Oregon Health & Science University
(503) 494-8200