logopetria

dot tumblr dot com
Apr 27

Separating form from content in recorded speech

Imagine you’re listening to an audio file of recorded speech, with no background noise. What information are you getting when you listen to it? First, you get the words the speaker is using - what you’d get from a transcript, or what a speech-recognition program might produce. But there’s more data beyond that. There’s timing information: how quickly was each word spoken, and what were the gaps and pauses between words? There’s pronunciation data: what were the precise phonemes used to produce each word? There’s intonation: did the speaker’s voice go up or down? And then there are various other characteristics that don’t fall into any of those categories.

What I’m wondering is how to separate that information into different “channels”, in order to deal with each one on its own. I think you might be able to compress the data more efficiently if it were divided up into packages of similar content - if you compressed the transcript as pure text, and the timing data as just a sequence of intervals, and so on. Also, it might be possible to manipulate each channel separately, which could be interesting. You could change the intonation of speech without changing the underlying content, or have different words spoken but with the same timing and emphasis.
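To make the “channels” idea a bit more concrete, here is a minimal Python sketch covering just the text and timing channels. Everything in it is my own assumption rather than a worked-out design: the `aligned` list stands in for whatever a speech recogniser would output (the millisecond values are made up), the function names are hypothetical, and packing the timings as 16-bit integers assumes no gap or word lasts longer than about 65 seconds.

```python
import struct
import zlib

# Hypothetical word-level alignment: (word, start_ms, end_ms).
# In practice this would come from a speech recogniser; these values are made up.
aligned = [
    ("imagine", 0, 420),
    ("you're", 480, 700),
    ("listening", 730, 1250),
    ("to", 1300, 1390),
    ("speech", 1420, 1900),
]

def split_channels(aligned):
    """Split aligned words into a text channel and a timing channel."""
    words = [word for word, _, _ in aligned]
    timing = []
    previous_end = 0
    for _, start, end in aligned:
        # Store (gap before the word, duration of the word) in milliseconds.
        timing.append((start - previous_end, end - start))
        previous_end = end
    return words, timing

def join_channels(words, timing):
    """Rebuild (word, start_ms, end_ms) tuples from the two channels."""
    rebuilt = []
    cursor = 0
    for word, (gap, duration) in zip(words, timing):
        start = cursor + gap
        end = start + duration
        rebuilt.append((word, start, end))
        cursor = end
    return rebuilt

def pack_text(words):
    """Compress the transcript as plain UTF-8 text."""
    return zlib.compress(" ".join(words).encode("utf-8"))

def pack_timing(timing):
    """Compress the timing channel as a flat run of unsigned 16-bit integers."""
    flat = [value for pair in timing for value in pair]
    return zlib.compress(struct.pack(f"<{len(flat)}H", *flat))
```

Because the channels are stored separately, “different words, same timing” is just `join_channels(new_words, timing)` with a replacement word list of the same length. Whether the separately packed channels really come out smaller than a general-purpose compressor would manage on the combined data is something you would have to measure on real recordings.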

So here’s what I’m thinking for a first approximation: first you pass some rudimentary speech recognition software over the audio, and have it recognise the phonemes. Store this away somewhere. Then apply some kind of degrading filter to the data, to strip away the information that makes it recognisable as words - we don’t need that data any more, since we’ve already extracted and saved it. The remaining audio should sound like a “mwa-mwa-mwa” noise, like Charlie Brown’s teacher! You’ll be able to hear the timing and the tone, but with no verbal content. Since the filters are throwing away information, the resulting file can be compressed to a smaller size.
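I haven’t settled on what the “degrading filter” should be. One obvious candidate is a heavy low-pass filter, which keeps the slow rise and fall of the voice and muffles the higher frequencies that make consonants distinguishable. The sketch below is a plain one-pole low-pass filter in Python, checked only on synthetic sine waves. The function names, the 300 Hz cutoff and the 8000 Hz sample rate are all assumptions of mine, and I haven’t verified that this particular filter gives the Charlie Brown effect on real speech.

```python
import math

def low_pass(samples, sample_rate, cutoff_hz):
    """One-pole low-pass filter: attenuates content above cutoff_hz."""
    dt = 1.0 / sample_rate
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)  # smoothing factor between 0 and 1
    filtered = []
    y = 0.0
    for x in samples:
        # Move the output a fraction alpha of the way toward the input.
        y += alpha * (x - y)
        filtered.append(y)
    return filtered

def sine(freq_hz, sample_rate, seconds):
    """Generate a unit-amplitude test tone."""
    n = int(sample_rate * seconds)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

def rms(samples):
    """Root-mean-square level, used here as a rough loudness measure."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

if __name__ == "__main__":
    rate = 8000  # assumed sample rate in Hz
    low_tone = low_pass(sine(100, rate, 0.5), rate, 300)
    high_tone = low_pass(sine(3000, rate, 0.5), rate, 300)
    print("100 Hz level after filtering: ", round(rms(low_tone), 3))
    print("3000 Hz level after filtering:", round(rms(high_tone), 3))
```

The 100 Hz tone, down in the range of a speaking voice’s pitch, passes through nearly intact, while the 3000 Hz tone is strongly attenuated. A one-pole filter rolls off gently, so a real attempt would probably want something steeper.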

My guess is that the two files together (text + degraded audio) can be compressed to something smaller than the original audio can on its own. The trick, of course, is to re-assemble the components to get back the original audio file (or a fair approximation to it!). That part I don’t know how to do yet.

Afterthought: the problem of re-assembly has similarities to the difficulty of text-to-speech generation.  But whereas there the problem is figuring out what timing and intonation to apply to each word (and then stitching together the right phonemes from a pre-built library, I guess), here we already have that information stored.