Corpora in the Classroom

All users can Search the corpora.

You can search a corpus for recordings and transcripts of a particular language (or all languages at once).

We hope to soon add the ability to search for specific text within transcripts before downloading. Currently, however, files must be downloaded before their text can be searched.

We have 3 corpora available:

The HerLD corpus of the Heritage Language Variation and Change Project, Naomi Nagy, PI. [Click on link to corpus for details about its contents]
Sali Tagliamonte's Ontario English Corpus
York English Corpus

Select the corpus or corpora to search from the Search drop-down menu.

Searches can be limited to data in a particular language. (If no language is selected, data for all languages will be searched.)

Language     Corpus
English      Ontario English Corpus, York English Corpus
Cantonese    HerLD Corpus
Faetar       HerLD Corpus
Hungarian    HerLD Corpus
Italian      HerLD Corpus
Korean       HerLD Corpus
Polish       HerLD Corpus
Russian      HerLD Corpus
Ukrainian    HerLD Corpus

The search results display an Interview ID, a Speaker ID, the Sex and Age of the speaker, the Language and Date of the interview recording, the Community (for the English corpora), and the names of the corresponding Recording File(s) (zipped .wav format; mp3 coming soon) and Transcript File(s).

The HerLD corpus has transcript files in .eaf format, to be used with ELAN software.

The Ontario English and York English corpora have transcript files in .txt format.
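
Because transcript text can currently be searched only after download, a short script can do the searching locally. The following Python sketch is an illustration, not part of the site. It assumes the .eaf files follow the standard ELAN XML layout (TIER elements containing ANNOTATION_VALUE elements) and that the .txt files can be read as UTF-8 text. The function names are made up for this example.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def eaf_annotations(path):
    """Yield (tier_id, text) for every non-empty annotation in an ELAN .eaf file."""
    root = ET.parse(path).getroot()
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID", "")
        for value in tier.iter("ANNOTATION_VALUE"):
            if value.text:
                yield tier_id, value.text


def search_transcript(path, query):
    """Return the annotations or lines in a transcript that contain `query`.

    Matching is case-insensitive. Files ending in .eaf are parsed as ELAN XML;
    any other file is treated as plain text, one searchable unit per line.
    """
    path = Path(path)
    needle = query.lower()
    if path.suffix.lower() == ".eaf":
        texts = [text for _tier, text in eaf_annotations(path)]
    else:
        # Assumption: the text file is UTF-8; undecodable bytes are replaced.
        texts = path.read_text(encoding="utf-8", errors="replace").splitlines()
    return [t for t in texts if needle in t.lower()]
```

Calling search_transcript on a downloaded transcript file with a word of interest returns every annotation (for .eaf) or line (for .txt) that contains it.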

Example: searching all languages of the HerLD corpus returns a table of information about each file.

This table can be sorted by any column by clicking on the corresponding header. You can sort by multiple columns simultaneously by holding down the Shift key and clicking a second, third or more column headers.
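
For readers who save the results and want the same ordering offline, the effect of a multi-column sort can be sketched in Python. This is only an illustration of the idea, not the site's implementation. The function name is made up, and the rows are invented sample data, not real corpus entries.

```python
def sort_rows(rows, columns):
    """Sort rows by several columns; columns listed earlier take priority."""
    return sorted(rows, key=lambda row: tuple(row[c] for c in columns))


# Invented rows for illustration only.
rows = [
    {"Language": "Italian", "Sex": "M", "Age": 61},
    {"Language": "Cantonese", "Sex": "F", "Age": 45},
    {"Language": "Cantonese", "Sex": "F", "Age": 23},
]

# Comparable to clicking the Language header, then Shift-clicking Age.
by_language_then_age = sort_rows(rows, ["Language", "Age"])
```

Rows are ordered by the first column, and ties within it are broken by the next column in the list.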

The blank fields under each header can be used to filter results. For example, entering "C" in the Speaker ID column returns a list of Cantonese speakers. Entering "C1M" returns only male Cantonese speakers from the first generation. [See Speaker ID information.]
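
The same kind of filtering can be reproduced on saved results. The Python sketch below assumes that the filter matches the beginning of the Speaker ID and that IDs begin with a language letter, a generation digit and a sex letter, as the two examples above suggest. Both points are assumptions. The function name and the sample IDs are invented for illustration.

```python
def filter_by_speaker_id(rows, prefix):
    """Keep rows whose 'Speaker ID' starts with `prefix` (case-insensitive)."""
    wanted = prefix.upper()
    return [row for row in rows if row["Speaker ID"].upper().startswith(wanted)]


# Invented rows for illustration only; these are not real corpus entries.
rows = [
    {"Speaker ID": "C1M01", "Language": "Cantonese"},
    {"Speaker ID": "C2F02", "Language": "Cantonese"},
    {"Speaker ID": "I1M03", "Language": "Italian"},
]

cantonese = filter_by_speaker_id(rows, "C")
first_generation_cantonese_men = filter_by_speaker_id(rows, "C1M")
```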

Transcriptions and sound files are downloaded by clicking on the appropriate link.

More information about each speaker can be obtained by clicking on their Interview ID.

Updated May 16, 2013 by Naomi Nagy.

Linguistics Department • University of Toronto