Sunday, March 21, 2010

Project Talk and Call For Help

"You sick little monkey!" 

If you speak the exclamatory sentence written above, you will find that certain words are stressed while others are spoken as if 'gliding' through. When this feature is measured over time, we call it the rhythm of speech. The remark "You sick little monkey!", possibly expressed with some surprise, may also carry elements of amusement, in which case the speaker will use pitch variation (intonation) to highlight that emotion. These elements of stress, rhythm, and intonation constitute prosody, which, among other things, conveys the emotional mood of the speaker and rhetorical aspects like sarcasm, humor, and irony.

I am hoping to annoy a few linguists here with my very crude understanding of language, speech, and the whole deal! If you are annoyed, please help me understand the concepts better.

So how does one study these elements rigorously? Researchers have found a few features of speech that can be mapped to various prosodic elements.

I will try to explain in plain English what I have understood so far from various research papers.

Generally taken over units larger than phonemes, these features are studied either globally, as statistics such as the mean and standard deviation of the fundamental frequency, the energy distribution, and temporal changes in spectral coefficients like jitter and tremor, or locally, as pitch and energy contours.

So where does emotion fit in the above gobbledygook?? 

Sadness has been associated with a low standard deviation of pitch and a slow speaking rate, while anger implies a higher pitch deviation and rate. But by studying pitch deviation and speaking rate alone, one is most likely to fall into a trap of prosodic confusion rather than discern emotional qualities. For example, an interrogative sentence will most likely have a wider pitch contour than an affirmative one, which means interrogative sentences will have a higher standard deviation, and that has nothing to do with emotional content!

So these measures of deviation and other statistics are necessary conditions, but not sufficient conditions!
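To make the pitch-statistics idea concrete, here is a minimal sketch in Python. It assumes you already have a frame-by-frame f0 (pitch) contour from some pitch tracker; the numbers below are synthetic, invented purely to illustrate the interrogative-vs-affirmative confusion described above.

```python
import statistics

def pitch_stats(f0_contour):
    """Global pitch statistics over an f0 contour (Hz), one value per
    analysis frame; unvoiced frames are marked as None and skipped."""
    voiced = [f0 for f0 in f0_contour if f0 is not None]
    return statistics.mean(voiced), statistics.pstdev(voiced)

# Synthetic contours, purely for illustration: a flat "affirmative"
# contour versus one with a question-final pitch rise.
affirmative   = [120, 118, 121, 119, 120, 117, 119, 120]
interrogative = [120, 118, 121, 119, 125, 140, 165, 190]

mean_aff, sd_aff = pitch_stats(affirmative)
mean_int, sd_int = pitch_stats(interrogative)

# The question-final rise inflates the standard deviation even though
# nothing emotional is going on: exactly the confusion described above.
print(sd_aff < sd_int)  # True
```

The point of the toy numbers is only that a rising question contour dominates the standard deviation, so that one statistic alone cannot separate sentence type from emotion.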

Therefore, researchers are toying with something called Hidden Markov Models (HMMs), which work with state transitions, studying the temporal behaviour of speech not as stationary statistics but as a concatenation of states of local feature statistics.

So far, people have experimented with different features of speech, with one or more HMMs per emotion, with short-time and semi-continuous models, and with various kinds of mathematical frameworks.
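For the curious, the core of an HMM likelihood computation is small. Below is a toy sketch of the classic forward algorithm over crudely quantized pitch frames, with two hand-made models standing in for "sadness" and "anger". The state names, probabilities, and L/H quantization are all invented for illustration, not taken from any paper.

```python
def forward(obs, start_p, trans_p, emit_p):
    """Likelihood of an observation sequence under a discrete HMM,
    computed with the classic forward algorithm."""
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in start_p}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * trans_p[r][s] for r in alpha) * emit_p[s][o]
                 for s in start_p}
    return sum(alpha.values())

# Toy "sadness" model: tends to sit in a low-pitch state.
sad = dict(
    start={'lo': 0.9, 'hi': 0.1},
    trans={'lo': {'lo': 0.9, 'hi': 0.1}, 'hi': {'lo': 0.5, 'hi': 0.5}},
    emit={'lo': {'L': 0.9, 'H': 0.1}, 'hi': {'L': 0.2, 'H': 0.8}},
)
# Toy "anger" model: jumps around and favors high pitch.
angry = dict(
    start={'lo': 0.3, 'hi': 0.7},
    trans={'lo': {'lo': 0.4, 'hi': 0.6}, 'hi': {'lo': 0.4, 'hi': 0.6}},
    emit={'lo': {'L': 0.6, 'H': 0.4}, 'hi': {'L': 0.1, 'H': 0.9}},
)

frames = ['L', 'L', 'L', 'H', 'L', 'L']  # a mostly low-pitched utterance
p_sad = forward(frames, sad['start'], sad['trans'], sad['emit'])
p_angry = forward(frames, angry['start'], angry['trans'], angry['emit'])
print('sad' if p_sad > p_angry else 'angry')  # sad
```

Classification then amounts to training one model per emotion and picking the model that assigns the utterance the highest likelihood, which is the "one or more HMMs per emotion" setup mentioned above.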

Frankly, for this class project, I would be happy to just use someone else's software, much as one uses off-the-shelf text-to-speech software. But I cannot find any ready-to-use software for emotion recognition. Therefore, I am hoping to implement one such model before the end of the semester. But so far, no work to show!

Arturo suggested that I look through the eye of an artist or, better yet, "hear through the ear of an artist", rather than work with obscure statistics and mathematical models. Or at least that's what I understood from that inspiring talk with Arturo! :)

I would like to request that people suggest different perspectives on this problem of emotion recognition. People with a music background? Sound engineers? Theatre? Linguists? Psychics?!

12 comments:

Audrey Green said...

I took a forensic phonetics class, and from what I remember, taking a look at the articulatory and phonatory systems might help you, since vocal fold movements are controlled by muscle groups. These muscle groups can become tense under psychological stress. Besides that, these two systems change systematically when the speaker is experiencing different cognitive states.
In terms of software, there's a product called Layered Voice Analysis (LVA) that claims to detect negative feelings, anxiety, arousal, lie stress, tension, thinking level, etc., and claims to predict possible alcohol intoxication, temporary mental illness, sarcasm, imagination level, overconfidence, physical attraction, white noise, and you name it. It seems quite like a mind reader to me, and quite difficult to believe. There has been an attempt to examine it, but very little information is given. LVA claims to work on the basis of voice frequency, applying "8,000 mathematical algorithms to 129 voice frequencies".

Ashu said...

Thanks for the comment. I will definitely read about things you suggested here.

LVA does seem like a mind reader! Have you used this product? It is not freely available, and the technology it uses seems to be a closely guarded secret! I cannot find much information on Nemesysco's website apart from the claim that it does some patented voodoo magic by measuring “brain activity traces”! Do you know what that is all about?

toha said...

I believe that there are several solutions, but I would like to suggest one that is easy to implement.
You need to create a system that can look for patterns in different parameters of the sound. You can start with pitch and amplitude. Pitch and amplitude are parameters. A pattern can be how frequently a pitch or amplitude changes. Another pattern can be how fast a pitch or amplitude changes. And so on.
Next, you need a database of "collected statistics" that contains patterns that were detected by the program and "emotions" that were reported by speakers.
The quality of the system may be improved by adding new parameters and new patterns that can be detected.
Moreover, the parameters that your system uses may come not only from the analysis of the sound, but also from external bio-devices, analysis of video captured during speaking, etc.
The challenge here is that the system should be abstract enough to work with very different parameters, yet be able to detect "patterns".
The structure of the system might look like:
1. Parameter collectors.
2. Parameter analyzers (pattern detectors).
3. Database
4. User interface to "teach" the system.
5. User interface to "detect emotions".

P.S. Sorry if this method is not "challenging enough" :-)
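The five-part structure sketched in the comment above might be skeletoned in Python like this. This is only a sketch of one possible reading: the pluggable collector/detector functions and the simple count-based weight scheme are my guesses at the intent, not anything the comment specifies.

```python
class EmotionPatternSystem:
    """Skeleton of the five-part structure above. Collector and detector
    functions are pluggable, so the system stays abstract over what the
    parameters actually are (pitch, amplitude, bio-sensors, video, ...)."""

    def __init__(self):
        self.collectors = []  # 1. parameter collectors: audio -> series
        self.detectors = []   # 2. pattern detectors: series -> patterns
        self.database = {}    # 3. database: pattern -> {emotion: weight}

    def _patterns(self, audio):
        found = []
        for collect in self.collectors:
            series = collect(audio)
            for detect in self.detectors:
                found.extend(detect(series))
        return found

    def teach(self, audio, reported_emotion):
        # 4. "teach": store detected patterns against the emotion
        #    the speaker reported.
        for p in self._patterns(audio):
            weights = self.database.setdefault(p, {})
            weights[reported_emotion] = weights.get(reported_emotion, 0) + 1

    def detect_emotion(self, audio):
        # 5. "detect": vote with the collected statistics.
        scores = {}
        for p in self._patterns(audio):
            for emotion, w in self.database.get(p, {}).items():
                scores[emotion] = scores.get(emotion, 0) + w
        return max(scores, key=scores.get) if scores else None
```

With a trivial collector (the raw series itself) and a detector that emits consecutive pairs, teaching the series [1, 2, 1] as "sarcasm" makes the same series come back labeled "sarcasm"; quality then improves, as the comment says, by plugging in better collectors and detectors.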

Ashu said...

@Anton
When you talk about patterns of change in amplitude and pitch, the question to ask is, change with respect to what?

There is no initial state from which you try to measure change in amplitude.

Unless you are suggesting that every sentence that has to be analysed should be stored with a sort of "neutral" pattern of pitch and amplitude.

I don't see any objective state (neutral speech) of amplitude and pitch which can be used to compare the spoken sentence.

This is further aggravated by the inherent intonation of the English language.

With this approach, I will have to look at the overall pattern of amplitude and pitch change, which is what I talked about when I mentioned the standard deviation or mean of such patterns, along with all the challenges that method involves.

By all means, I am hoping to find a method which is not "challenging enough" :)

So please don't stop with your ideas!

toha said...

For example, you can decide that if you detect changes in the pitch that relate to each other as 1.0/2.0/1.0 during a time interval of 5 seconds, it means that the person is speaking sarcastically. So you have to try to match all your patterns starting at every single new sample of the sound. If you detect that the pitch was 440Hz for 1 second, then changed to 880Hz (x 2.0) for 1 second, and then changed back to 440Hz for 1 second, you can add a "sarcasm weight" to this piece.
But you should analyze different parameters and look for different patterns.
The program should be able to look for the patterns. For example:
1. You analyze piece N1 (4 seconds) and you know that it is sarcastic.
2. You analyze piece N2 (5 seconds) and you know that it is sarcastic.
3. You try to detect that the pitch in the middle of those pieces changes according to the relation 1.0/2.0/1.0.
4. You record this pattern to the database and add a minimum weight to it (because you found only two examples).

...
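The ratio-pattern matching described above can be sketched as follows. Note that the 1.0/2.0/1.0 → sarcasm mapping is the commenter's illustration rather than an established finding, and the tolerance value is my own assumption.

```python
def matches_ratio_pattern(pitches, pattern, tol=0.05):
    """Check whether per-segment pitches (Hz) follow a ratio pattern,
    relative to the first segment, within a fractional tolerance."""
    if len(pitches) != len(pattern):
        return False
    base = pitches[0] / pattern[0]
    return all(abs(p / base - r) <= tol * r for p, r in zip(pitches, pattern))

# The comment's illustration: 440 Hz -> 880 Hz -> 440 Hz, one second each.
SARCASM_PATTERN = (1.0, 2.0, 1.0)

print(matches_ratio_pattern([440, 880, 440], SARCASM_PATTERN))  # True
print(matches_ratio_pattern([440, 500, 440], SARCASM_PATTERN))  # False
```

In a full system this check would be run against every pattern in the database at every new position in the audio, accumulating the matched patterns' weights.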

toha said...

To detect a new pattern, you need to do something like this:

1. set blockSize to 10ms
2. divide your audio piece into small blocks of blockSize.
3. detect the pitch for all these small blocks.
4. detect which pitch is "dominant" for the whole piece, which is "the second dominant", and which is "the third dominant".
5. Record the pattern of a relation between the three most frequent pitches. You will get something like 1.0/2.0/1.0/1.3/2.0/1.3/2.0.
6. Mark this pattern as "possible pattern".
7. set blockSize = blockSize * 2;
8. goto step 2.

After these steps you will get a list of "possible" patterns.
The next step is to analyze the next piece that is known to be in the same "mood" (for example, sarcasm). After that analysis you will get another list of "possible" patterns. Now you try to detect which patterns can be found in both lists, and mark them as "real" patterns.
Of course you need to analyze more than two pieces.
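Steps 1-8 above, plus the cross-piece confirmation, might be sketched like this. This is one possible reading of the steps: "dominant pitch" is taken here as the most frequent block-level pitch, and the pattern is recorded as ratios to the overall dominant pitch, both of which are assumptions on my part.

```python
from collections import Counter

def possible_patterns(frame_pitches, base_block=1, max_doublings=4):
    """frame_pitches holds one pitch estimate per 10 ms frame. For each
    block size (doubling each pass), take the dominant (most frequent)
    pitch of each block, then record the block pitches as ratios to the
    overall dominant pitch."""
    patterns = set()
    block = base_block
    for _ in range(max_doublings):
        blocks = [frame_pitches[i:i + block]
                  for i in range(0, len(frame_pitches), block)]
        per_block = [Counter(b).most_common(1)[0][0] for b in blocks if b]
        overall = Counter(per_block).most_common(1)[0][0]
        patterns.add(tuple(round(p / overall, 2) for p in per_block))
        block *= 2
    return patterns

def real_patterns(pieces, **kw):
    """Patterns marked "possible" in every piece sharing the same mood."""
    return set.intersection(*(possible_patterns(p, **kw) for p in pieces))

# A toy piece: 440 Hz, then 880 Hz, then 440 Hz again.
piece = [440] * 4 + [880] * 4 + [440] * 4
print((1.0, 2.0, 1.0) in possible_patterns(piece, base_block=4))  # True
```

Intersecting the "possible" sets across several pieces of the same mood is what promotes a pattern to "real", as the comment describes.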
During the phase of trying to detect the "mood", you will have an initial list of moods with their weights set to 0, like: sad: 0, happy: 0, sarcasm: 0, lie: 0, etc. Then you go through all your patterns and increase the weights of the moods whose patterns from the database you matched. Finally you will get a list of weights, for example:
sad: 0.5
happy: 0.7
sarcasm: 0.9
lie: 0.1
And the final response will be "probably sarcasm".
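This final weight-accumulation step reads like a straightforward voting scheme; here is a sketch, with the database contents, weights, and mood names all made up for illustration.

```python
def score_moods(detected_patterns, database,
                moods=('sad', 'happy', 'sarcasm', 'lie')):
    """Start every mood at 0, add the stored weight of every matched
    pattern, and report the highest-scoring mood as the best guess."""
    weights = {m: 0.0 for m in moods}
    for pattern in detected_patterns:
        for mood, w in database.get(pattern, {}).items():
            weights[mood] = weights.get(mood, 0.0) + w
    best = max(weights, key=weights.get)
    return weights, "probably " + best

# Hypothetical pattern database with made-up weights:
db = {
    (1.0, 2.0, 1.0): {'sarcasm': 0.9, 'happy': 0.2},
    (1.0, 1.0, 0.5): {'sad': 0.5},
    (2.0, 1.0, 2.0): {'happy': 0.5, 'lie': 0.1},
}
weights, verdict = score_moods([(1.0, 2.0, 1.0), (2.0, 1.0, 2.0)], db)
print(verdict)  # probably sarcasm
```

The verdict is only as good as the accumulated weights, so the scheme degrades gracefully: unfamiliar patterns simply contribute nothing.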

arturo said...

The conversation is getting "HOT":-)

BTW, your comment seems to suggest that I recommended a (politically incorrect spoiler) artsy-fartsy approach to your quest. I rather suggested that you look at the problem more as a designer and not as a scientist or engineer. That way you are free from artificial constraints and can approach the knowledge of other, very disparate fields. The challenge, of course, is using those "lenses" to look at the problem anew.

As we know, we humans can easily detect the changes in intonation, cadence, and other parameters of the voice when it expresses sarcasm or anger, etc.

I believe that a rather simple acoustical analysis, of the type used to determine the acoustical properties of a room or space, can be applied to the problem, because our head is a "resonance box", and when we change our voice, for some reason, we change the acoustics of our box by "moving" it to the nasal region, or to the back of the neck (like throat singers do), etc. I don't know why we change our voice in that way, and I don't know if it is a cultural or biological trait, but in any case I believe this is the type of analysis that can give you some insights.

Like this example, there are other fields where you can find useful parallels if you approach the problem with the mind of a designer looking to solve it. Good questin'!

toha said...

So as I understand it, Arturo is suggesting analyzing the way we produce the sound instead of the sound itself. I think that in this case the algorithm (method) will be less usable, because different people, even those sharing the same first language, can speak physically differently. Not to mention people speaking in a second language.
On the other hand, my method is pretty universal, because it can actually detect "patterns" that are related to the mechanism of producing the sound and, as a result, it includes Arturo's method.
Moreover, Arturo mentioned a "designer's" approach. Does that mean a "graphic" designer, or a designer in general? If we are talking about designers in general, scientists and engineers are no less "designers" than "art/design people".
And "being" free from "artificial constraints" is a good thing... if you can actually do something that produces valuable results.

arturo said...

I understand designer as someone (in whatever area of expertise) who is able to think outside of the box. And the "box" is of course what keeps some experts in a particular field from enriching their experience. By collaborating with others they might be able to look at the problem in a fresh and novel way.

toha said...

Arturo, so break your own box of thinking that somebody who is good at programming does not think outside of the box. Some people are able to evaluate ideas pretty quickly and build an implementation of an idea in their mind pretty fast... only to realize that it is not interesting, for example. That does not mean that such a person is not thinking outside of the box. It just means that the problem is pretty simple and not interesting to that person...

arturo said...

Anton, you seem to think we are talking about you! But in fact we are not. I know you are able to think outside the box because I work with you. You do not represent everyone else in the world, so there is no need to take this personally, especially when you are not the subject of the conversation but rather a participant who is "thinking outside the box". Jeezz!

Audrey Green said...

OK, to break the ice and go back to what I was talking about in comment number ONE, lol...
LVA "can read the DNA of thought": that's one of V's biggest claims. V is the company that patented LVA.

Anyway, I have never used LVA, but we did some analysis of it during class. My professor, Dr. James Harnsberger, in collaboration with Dr. Harry Hollien, has done some research on it, and they found that LVA was not sensitive to either deception or stress. But then again, deception and stress were the only subjects they were focusing on.
I remember my professor saying how V didn't want to release any information about the product... However, AFRL did some studies funded by the Department of Justice and came to the conclusion that the product is 90% reliable (http://www.docstoc.com/docs/26424595/How-does-LVA-Compare-to-other-voice-analysis-systems). But there's no actual report that supports such a claim. I think it is a conspiracy to make people confess... since pretty much everybody is aware of how unreliable lie detectors are. LVA claiming to detect the "DNA of thought" can actually create fear in people, making them confess...