Voice Match

We have gone through the 'Voices of Berkeley' database and found five audio files that match a randomly selected target talker. Click the map markers to hear the matching audio files and read below about the method we used to find voice matches.

Audio matching method

One central task in speech technology is to match a new audio speech sample against a repository of example audio recordings; this is the basis of how computers recognize words. The Voices of Berkeley system finds matches among the audio recordings in a two-step process.

In step one, the system first trims the leading and trailing silence from each sentence, then converts the audio file into a sequence of "frequency spectra". This representation, technically known as MFCC (Mel-frequency cepstral coefficients), mimics the auditory processing done by the human ear, so that recordings that sound similar to us should also look similar to the computer.
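The MFCC computation described above can be sketched in a few lines. This is a deliberately simplified illustration, not the Voices of Berkeley code: frame the signal, take the power spectrum, pool it through a mel-spaced triangular filterbank, take logs, and apply a DCT. The frame length, hop size, and filter counts below are common defaults chosen for illustration; a production front end would also add pre-emphasis, liftering, and delta features.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale mapping: roughly linear below 1 kHz, log above
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sample_rate, frame_len=400, hop=160,
         n_filters=26, n_coeffs=13, n_fft=512):
    """Simplified MFCC sketch: frames -> power spectrum -> mel filterbank
    -> log -> DCT. Returns an array of shape (frames, n_coeffs)."""
    # Split the signal into overlapping Hamming-windowed frames
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular filters centered at equally spaced points on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2),
                          n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    # Log filterbank energies (small floor avoids log of zero)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the filter outputs; keep the first n_coeffs
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), (2 * k + 1) / (2 * n_filters)))
    return log_energy @ dct.T
```

The DCT at the end is what turns the log spectrum into a "cepstrum": the first dozen or so coefficients capture the broad spectral shape that characterizes a voice, while discarding fine detail.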

In step two, the system compares the MFCC representations against one another, looking for files that are similar. This is done with a dynamic time-warping algorithm, which stretches or shrinks the sequences being compared so that small differences in timing can be ignored.
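The core of dynamic time warping is a simple dynamic program over the two frame sequences. The sketch below (a generic textbook version, not the project's actual code) fills in a table of cumulative alignment costs, where each cell may "stretch" either sequence by repeating a frame or advance both together.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time-warping distance between two feature sequences.

    a, b: 2-D arrays of shape (frames, coefficients), e.g. MFCC matrices.
    Returns the cumulative local distance along the best alignment path.
    """
    n, m = len(a), len(b)
    # cost[i, j] = best cumulative cost aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # repeat a frame of b
                                 cost[i, j - 1],      # repeat a frame of a
                                 cost[i - 1, j - 1])  # advance both
    return cost[n, m]
```

Because repeated frames can be absorbed by the insertion moves at zero extra cost, a recording and a time-stretched copy of it score as nearly identical, which is exactly the timing-invariance the matching step needs.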

If you've ever used a speech recognition system (for example, on a telephone help line), you know that these systems can be really good or really bad. Part of the Voices of Berkeley research project is to study how well MFCC coding and dynamic time warping work with web-based audio capture (versus telephone audio).