CORPORA from CSLU: 22 Language

The 22 Language corpus consists of telephone speech from 21 languages: Eastern Arabic, Cantonese, Czech, Farsi, German, Hindi, Hungarian, Japanese, Korean, Malay, Mandarin, Italian, Polish, Portuguese, Russian, Spanish, Swedish, Swahili, Tamil, Vietnamese, and English. French is not available. The corpus contains fixed-vocabulary utterances (e.g. days of the week) as well as fluent continuous speech. We expected at least 300 callers in each language. Each utterance was verified by a native speaker to determine whether the caller followed instructions when answering the prompts. Some of the calls in each language are transcribed orthographically.

Recording Details:
All of the data in this corpus were collected over digital telephone lines and recorded with the CSLU T1 digital data collection system. The files were sampled at 8 kHz with 8-bit resolution and stored as μ-law (ulaw) files.
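
To illustrate what 8-bit ulaw storage implies, the following sketch expands a single ulaw byte into a signed linear sample. It assumes the files follow the standard G.711 μ-law convention, which this description does not state explicitly. The function name ulaw_to_linear is hypothetical and not part of any CSLU tool.

```python
BIAS = 0x84  # 132, the bias used in G.711 mu-law companding


def ulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit mu-law code (0..255) to a signed linear sample.

    Assumption: standard G.711 mu-law, where codes are stored bit-inverted.
    """
    u = ~byte & 0xFF            # undo the bit inversion
    sign = u & 0x80             # top bit marks a negative sample
    exponent = (u >> 4) & 0x07  # 3-bit segment number
    mantissa = u & 0x0F         # 4-bit step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude
```

A whole buffer could be expanded with a list comprehension such as [ulaw_to_linear(b) for b in raw_bytes], where raw_bytes is the content of one ulaw file.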

All of the wave files were converted to RIFF format with 16-bit linear coding.
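
A quick way to confirm that a file matches this description is to read its header with Python's standard wave module. The sketch below is only an illustration: the helper name describe_wav is hypothetical, the test file is synthetic rather than taken from the corpus, and a single (mono) channel is an assumption that the description does not confirm.

```python
import io
import wave


def describe_wav(source):
    """Return (channels, sample width in bytes, sample rate in Hz)."""
    with wave.open(source, "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth(), wav.getframerate()


# Build a tiny in-memory RIFF/WAVE file: mono (assumed), 16-bit, 8 kHz.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)               # 2 bytes = 16-bit linear samples
    out.setframerate(8000)            # 8 kHz telephone sampling rate
    out.writeframes(b"\x00\x00" * 80) # 10 ms of silence
buffer.seek(0)

info = describe_wav(buffer)
```

For a file from the distribution, describe_wav can be given a path instead of the in-memory buffer.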

Directory Structure:
There are several top-level directories in this distribution: docs, labels, misc, speech, trans.

The speech directory contains the speech data files. Each speech filename is composed of three fields:

  • language abbreviation
  • call number
  • utterance type code

For example, one such file holds English speaker 105's answer to the question "What is your native language?"

As participants proceed through the data collection protocol, they are asked a series of questions. Each response is stored as a separate speech file. The utterance type code relates the recorded utterance to the protocol questions. The description of the protocol lists all of the utterance codes.

These audio and text files are subdivided into directories based on their call number divided by 10, with the remainder discarded. So, the files for call 105 would be found in /speech/10.
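
The directory layout described above can be sketched as a small helper. This is a guess under stated assumptions: the directory index is taken to be the call number with its last digit dropped (integer division by 10), since that is the reading under which call 105 lands in /speech/10, and no zero padding is applied. The function name speech_subdir is hypothetical.

```python
def speech_subdir(call_number: int) -> str:
    """Directory for a call, assuming the index is call_number // 10.

    Assumption: integer division, no zero padding. A true remainder
    (call_number % 10) would place call 105 in directory 5 instead.
    """
    return f"/speech/{call_number // 10}"


example_dir = speech_subdir(105)
```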

Each utterance included in the 22 Language Corpus has gone through a process of verification, carried out by native speakers of each language. The verifiers were asked to listen to each utterance and decide whether the speaker responded appropriately to the prompt. In addition, the verifiers made judgements about the age, gender, and dialect of each speaker.

Two native speakers verified the utterances in each language independently. Subsequently, they reexamined each utterance on which they disagreed and produced an info file containing the 'resolved' judgements. Note: we resolved differences in Spanish, Vietnamese and Swahili by choosing the judgements of the verifier whose responses were overwhelmingly correct. For the other languages in the corpus we resolved every disagreement by hand.
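
The first step of this procedure, finding the utterances that need a second look, can be sketched as follows. The data layout is an assumption made for illustration only: each verifier's judgements are held in a dictionary keyed by a made-up utterance ID, and the values "ok" and "bad" are placeholders, not the labels used in the corpus info files.

```python
def find_disagreements(first, second):
    """Return sorted utterance IDs on which the two verifiers differ."""
    shared = first.keys() & second.keys()  # utterances judged by both
    return sorted(utt for utt in shared if first[utt] != second[utt])


# Hypothetical judgements from two independent verifiers.
verifier_a = {"utt1": "ok", "utt2": "ok", "utt3": "bad"}
verifier_b = {"utt1": "ok", "utt2": "bad", "utt3": "bad"}

to_review = find_disagreements(verifier_a, verifier_b)
```

Only the IDs in to_review would then need to be reexamined, either by both verifiers together or by hand.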

Initially we asked the verifiers to make two judgements that are not included in the release:

  • whether or not the speech was cut off at either end of the utterance
  • whether or not there was missing information in the file.

Because these judgements were unreliable, the information regarding cut-off speech and missing information is not included in the distribution.

The Center for Spoken Language Understanding (CSLU) distributes corpora to commercial entities and academic institutions for a fee. Commercial entities can use these corpora for research and also to create commercial products, for example by generating acoustic models for speech recognition.


To place your order:
1. Click on the type of license you wish to order: Academic or non-profit entity or Commercial entity.
2. Terms of the license agreement can be viewed by clicking on the word "terms".
3. You agree to the terms of the license agreement when you click on "Add to Order" and proceed to the next screen.
4. If information on the "Order Contents" screen is correct, press "Check out".
5. On the next screen, a brief "Intended Use" is required. For "Recipient Scientist Information", enter your own details or, if you are placing the order for another person, that person's details. We will use this information if we have questions about the order, payment or shipping address.
6. Once OHSU has received and verified your payment, Technology Transfer & Business Development will approve your order, and the Center for Spoken Language Understanding will send the DVD by FedEx within 5-10 business days.


For demos and more information, visit the CSLU website.

Files will be made available by download from a service that requires customers to set up a free account.

For Information, Contact:
Arvin Paranjpe
Technology Development Manager
Oregon Health & Science University
(503) 494-8200
© 2021. All Rights Reserved. Powered by Inteum