"You sick little monkey!"
If you say the exclamation above out loud, you will find that certain words are stressed while others are spoken as if "gliding" through. When this pattern is measured along time, we call it the rhythm of speech. The remark "You sick little monkey!", possibly expressed with some surprise, may also carry a hint of amusement, in which case the speaker will use pitch variation (intonation) to highlight that emotion. These elements of stress, rhythm, and intonation constitute prosody, which, among other things, conveys the emotional mood of the speaker and rhetorical aspects like sarcasm, humor, and irony.
I am hoping to annoy a few linguists here with my very crude understanding of language, speech, and the whole deal! If you are annoyed, please help me understand the concepts better.
So how does one study these elements rigorously? Researchers have found a few features of speech that can be mapped to various prosodic elements.
I will try to explain in plain English what I have understood so far from various research papers.
These features are generally measured over units larger than phonemes. When they are studied as global statistics, researchers look at the mean and standard deviation of the fundamental frequency (pitch), the energy distribution, temporal changes in spectral coefficients, and perturbations like jitter and tremor. When they are studied locally, researchers look at pitch and energy contours.
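As a rough illustration of what "global statistics" means here, the sketch below computes the mean and standard deviation of a pitch track, plus one common definition of jitter (the average change between consecutive pitch periods, relative to the mean period). The numbers are made up, the function names are hypothetical, and treating 0 as "unvoiced frame" is an assumption of this sketch, not something taken from the papers.

```python
from statistics import mean, pstdev

def global_pitch_stats(f0_hz):
    """Mean and standard deviation of F0 over voiced frames only."""
    # Assumption: a value of 0 marks an unvoiced frame and is skipped.
    voiced = [f for f in f0_hz if f > 0]
    return {"mean_hz": mean(voiced), "std_hz": pstdev(voiced)}

def local_jitter(periods_s):
    """Mean absolute difference of consecutive pitch periods,
    divided by the mean period (a unitless ratio)."""
    diffs = [abs(a - b) for a, b in zip(periods_s, periods_s[1:])]
    return mean(diffs) / mean(periods_s)

# Made-up data: one F0 value per frame (Hz), and a few pitch periods (seconds).
f0_track = [0.0, 200.0, 210.0, 190.0, 0.0, 200.0]
periods = [0.0050, 0.0052, 0.0049, 0.0051]

stats = global_pitch_stats(f0_track)
jitter = local_jitter(periods)
print(stats, round(jitter, 4))
```

A real system would first need a pitch tracker to produce the F0 values; that step is left out here.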
So where does emotion fit in the above gobbledygook??
Sadness has been associated with a low standard deviation of pitch and a slow speaking rate, while anger goes with a higher pitch deviation and a faster rate. But by studying only pitch deviation and speaking rate, one is more likely to fall into a trap of prosodic confusion than to discern emotional qualities. For example, an interrogative sentence will most likely have a wider pitch contour than an affirmative one, which means interrogative sentences will have a higher standard deviation, and that has nothing to do with emotional content!
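The trap can be seen in a toy example. The two pitch contours below are invented (a falling statement and a question with a final rise), and the threshold rule is deliberately naive and entirely hypothetical; the point is only that a neutral question ends up on the "anger" side of a rule based on pitch deviation alone.

```python
from statistics import pstdev

# Hypothetical F0 contours in Hz, one value per frame (made-up numbers).
statement = [180, 185, 182, 178, 172, 165, 160]  # gently falling pitch
question = [180, 182, 178, 176, 185, 210, 240]   # rising pitch at the end

def pitch_std(contour):
    """Standard deviation of a pitch contour, in Hz."""
    return pstdev(contour)

def naive_emotion_guess(contour, threshold_hz=15.0):
    """Deliberately naive rule: high pitch deviation means 'anger'.
    The threshold is an arbitrary value chosen for this illustration."""
    return "anger" if pitch_std(contour) > threshold_hz else "sadness"

for name, contour in [("statement", statement), ("question", question)]:
    print(name, round(pitch_std(contour), 1), naive_emotion_guess(contour))
```

Neither sentence was meant to carry any emotion, yet the rule separates them purely because of sentence type.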
So these statistical measures are necessary conditions, but not sufficient ones!
Therefore, researchers are toying with something called Hidden Markov Models (HMMs), which work with state transitions. The idea is to study the temporal behaviour of speech not as stationary statistics but as a concatenation of states, each with its own local feature statistics.
So far, people have experimented with different speech features, with one or more HMMs per emotion, with short-time and semi-continuous models, and with various other mathematical frameworks.
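To give a feel for the "one HMM per emotion" idea, here is a minimal sketch under heavy assumptions: each frame of speech is reduced to a single symbol (0 for low pitch movement, 1 for high), each emotion gets a tiny two-state discrete HMM with hand-picked probabilities, and an utterance is assigned to whichever model gives it the highest likelihood. The likelihood is computed with the standard forward algorithm; every number and name below is hypothetical, and real systems train the parameters from data and use much richer features.

```python
import math

def forward_log_likelihood(obs, start, trans, emit):
    """Log P(obs | model) for a discrete HMM, using the forward
    algorithm with per-step rescaling to avoid underflow."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    scale = sum(alpha)
    log_lik = math.log(scale)
    alpha = [a / scale for a in alpha]
    for o in obs[1:]:
        # Propagate through the transitions, then weight by the emission.
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
            for s in range(n)
        ]
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik

# Hypothetical hand-picked models: (start, transition, emission).
# Observation symbols: 0 = low pitch movement, 1 = high pitch movement.
MODELS = {
    "sadness": ([0.8, 0.2],
                [[0.9, 0.1], [0.3, 0.7]],
                [[0.9, 0.1], [0.6, 0.4]]),
    "anger": ([0.2, 0.8],
              [[0.6, 0.4], [0.1, 0.9]],
              [[0.5, 0.5], [0.1, 0.9]]),
}

def classify(obs, models=MODELS):
    """Pick the emotion whose HMM assigns the highest log-likelihood."""
    return max(models, key=lambda name: forward_log_likelihood(obs, *models[name]))

print(classify([1, 1, 0, 1, 1, 1]))
print(classify([0, 0, 0, 1, 0, 0]))
```

Unlike a single standard deviation over the whole utterance, the model scores the order in which frames occur, which is what "concatenation of states" buys you.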
Frankly, for this class project, I would be happy to use someone else's software off the shelf, the way one would use text-to-speech software. But I cannot find any ready-to-use software for emotion recognition. Therefore, I am hoping to implement one such model before the end of the semester. But so far, no work to show!
Arturo suggested that I look through the eye of an artist, or better yet, "hear from the ear of an artist", rather than work with obscure statistics and mathematical models. Or at least that's what I understood from that inspiring talk with Arturo! :)
I would love for people to suggest different perspectives on this problem of emotion recognition. People with a music background? Sound engineers? Theatre people? Linguists? Psychics?!