EMBODIED AGENTS

IN AUGMENTED & VIRTUAL REALITIES

Course E6998-004, Dept. of Computer Science, Columbia University, Fall 2002
Prof. Kris Thórisson, Ph.D.
 
 
 

 

LECTURE NOTES

LECTURE 2 PART 3

Multimodal Perception

September 19, 2002

 

 
     








Prosody

 
 

The “form” of the speech - not what, but how it’s said

 
 

Prosody is carried by a continuous acoustic signal: pitch, volume, and timbre

 
 

Goal: Identify pitch accents, pauses, “rhythm”

 
 

Types of information recognizable from intonation only:

  • fillers (relatively short and flat pitch pattern)

  • questions (final rise)

  • commands (final fall)

  • some noun/verb pairs in English (convict, conduct)

 
 

Problem: Speakers differ in pitch range and in baseline pitch
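
One common way to reduce speaker differences is to express pitch relative to each speaker's own typical pitch. The sketch below is an illustration only, not the method used in the lecture: it converts a pitch track in Hz to semitones relative to the speaker's median pitch. The function names and the convention that unvoiced frames are marked with None or 0 are assumptions.

```python
import math
from statistics import median


def hz_to_semitones(f0_hz, reference_hz):
    """Convert a pitch in Hz to semitones relative to a reference pitch."""
    return 12.0 * math.log2(f0_hz / reference_hz)


def normalize_pitch_track(f0_track_hz):
    """Express voiced frames relative to the speaker's own median pitch.

    Assumption: unvoiced frames are marked with None or 0 and are passed
    through as None.
    """
    voiced = [f for f in f0_track_hz if f]
    if not voiced:
        return [None] * len(f0_track_hz)
    reference = median(voiced)
    return [hz_to_semitones(f, reference) if f else None for f in f0_track_hz]
```

With this normalization, a one-octave rise comes out as 12 semitones whether the speaker starts at 100 Hz or at 200 Hz, so the same contour produced by a low and a high voice yields the same normalized track.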

 
     











Prosody & Intonation

 
 

Examples of segmentation

  • Real-time analysis: 40 ms time delay, 300 ms window

 

 
 

Example of intonation for the utterance “Take me to Jupiter” plotted on a logarithmic frequency scale.

  • The right-hand plot shows segmentation of pitch direction, marked with vertical bars, timing (in msec) and direction of the intonation.

 

 

 

 
 

 

 

Output of real-time analysis of the utterance “What planet is that?”

  • Direction of intonation and timestamps are shown
    Questions most often have a rising intonation at the end, which is reliably detected here
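
The segmentation of pitch direction described on this slide could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the analysis used to produce the plots: it reads the 300 ms figure as a sliding-window length and the 40 ms figure as the step between windows, fits a straight line to the pitch in each window, and merges neighbouring windows that share a direction. The threshold for "flat" and all function names are hypothetical.

```python
def slope(points):
    """Least-squares slope of (time_ms, pitch) points, in pitch units per ms."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    denom = sum((t - mean_t) ** 2 for t, _ in points)
    if denom == 0:
        return 0.0
    return sum((t - mean_t) * (p - mean_p) for t, p in points) / denom


def segment_directions(track, window_ms=300, step_ms=40, flat_threshold=0.005):
    """Label pitch direction in sliding windows and merge equal neighbours.

    track: list of (time_ms, pitch) pairs for voiced frames, sorted by time.
    Returns a list of (start_ms, direction), where direction is one of
    "rise", "fall" or "flat".
    """
    if not track:
        return []
    segments = []
    t = track[0][0]
    end = track[-1][0]
    while t + window_ms <= end:
        window = [(ts, p) for ts, p in track if t <= ts < t + window_ms]
        if len(window) >= 2:
            s = slope(window)
            if s > flat_threshold:
                direction = "rise"
            elif s < -flat_threshold:
                direction = "fall"
            else:
                direction = "flat"
            # Start a new segment only when the direction changes.
            if not segments or segments[-1][1] != direction:
                segments.append((t, direction))
        t += step_ms
    return segments
```

A question detector in this style would look at the direction of the last segment: a final "rise" is evidence for a question, a final "fall" for a command.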

 
     










Unimodal Perceptors

 
 

 
 

 

 













Unimodal Perceptors

 
 

Object code example (1)

  • NAME: facing-work-screen

  • TYPE: body-sensor-fix-ref

  • DATA-1: nil

  • DATA-2: work-screen

  • INDEX-1: get-head-direction

  • INDEX-2: nil

  • FUNCTION: facing?
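
One possible reading of this record, sketched in Python: each INDEX slot names a function that fetches a live sensor value, each DATA slot holds a fixed reference, and FUNCTION compares the two. This interpretation, the 30-degree tolerance, and the representation of the work screen as a direction vector are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional


def facing(direction, target_direction, max_angle_deg=30.0):
    """Hypothetical 'facing?' test: the angle between two vectors is small."""
    dot = sum(a * b for a, b in zip(direction, target_direction))
    norm = math.hypot(*direction) * math.hypot(*target_direction)
    if norm == 0:
        return False
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine)) <= max_angle_deg


@dataclass
class UnimodalPerceptor:
    name: str
    type: str
    function: Callable[[object, object], bool]
    data_1: object = None
    data_2: object = None
    index_1: Optional[Callable[[], object]] = None
    index_2: Optional[Callable[[], object]] = None

    def evaluate(self):
        # An INDEX slot, when present, is called for a fresh sensor value;
        # otherwise the fixed DATA slot is used.
        first = self.index_1() if self.index_1 else self.data_1
        second = self.index_2() if self.index_2 else self.data_2
        return self.function(first, second)


head_direction = [0.0, 0.0, 1.0]  # stand-in for a live sensor reading

facing_work_screen = UnimodalPerceptor(
    name="facing-work-screen",
    type="body-sensor-fix-ref",
    function=facing,
    data_2=(0.1, 0.0, 1.0),  # assumed direction toward the work screen
    index_1=lambda: tuple(head_direction),
)
```

Calling `facing_work_screen.evaluate()` re-reads the head direction each time, so the result changes as the sensor value changes while the reference stays fixed.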

 
 

 

 











Unimodal Perceptors

 
 

Object code example (2)

  • NAME: looking-at-l-hand

  • TYPE: body-sensor-var-ref

  • DATA-1: nil

  • DATA-2: nil

  • INDEX-1: get-gaze-direction

  • INDEX-2: get-l-wrist-position

  • FUNCTION: u-looking-at-hand?
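
In this second record both DATA slots are nil and both INDEX slots are filled, which suggests that a "var-ref" perceptor compares two live sensor readings rather than one reading against a fixed reference. The sketch below illustrates that reading only; the gaze representation (eye position plus direction), the 10-degree tolerance, and the function bodies are assumptions.

```python
import math


def u_looking_at_hand(gaze, wrist_position, max_angle_deg=10.0):
    """Hypothetical test: does the gaze ray pass close to the wrist?

    Assumption: gaze is a pair (eye_position, gaze_direction).
    """
    eye_position, gaze_direction = gaze
    to_wrist = [w - e for w, e in zip(wrist_position, eye_position)]
    dot = sum(a * b for a, b in zip(gaze_direction, to_wrist))
    norm = math.hypot(*gaze_direction) * math.hypot(*to_wrist)
    if norm == 0:
        return False
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine)) <= max_angle_deg


def poll_looking_at_l_hand(get_gaze_direction, get_l_wrist_position):
    """Read both sensors on every call, then apply the test."""
    return u_looking_at_hand(get_gaze_direction(), get_l_wrist_position())
```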

 
     











Multimodal Perceptors

 
 

 
     









Multimodal Perceptors

 
 

Module example (1)

  • NAME: communicative-gesture

  • Type: Static-MI

  • Post: Functional-Sketchboard

  • Read: Functional-Sketchboard

  • CONDITIONS: (hand-in-gest-space T) AND (speaking T)
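
A module of this kind can be pictured as a small function that reads named values from a blackboard and posts its own conclusion back. The sketch below is an assumption-laden illustration: the `Blackboard` class, its `post` and `read` methods, and the choice to treat unknown values as false are hypothetical and not taken from the lecture.

```python
class Blackboard:
    """Minimal stand-in for the Functional-Sketchboard."""

    def __init__(self):
        self.state = {}
        self.log = []

    def post(self, name, value, timestamp):
        # Record a message only when the value actually changes.
        if self.state.get(name) != value:
            self.state[name] = value
            self.log.append((name, value, timestamp))

    def read(self, name):
        # Assumption: a value nobody has posted yet counts as false.
        return self.state.get(name, False)


def communicative_gesture(board, timestamp):
    """CONDITIONS: (hand-in-gest-space T) AND (speaking T)."""
    value = board.read("hand-in-gest-space") and board.read("speaking")
    board.post("communicative-gesture", value, timestamp)
    return value
```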

 
     








Multimodal Perceptors

 
 

Module example (2)

  • NAME: giving-turn

  • Type: Static-MI

  • Post: Functional-Sketchboard

  • Read: Functional-Sketchboard

  • CONDITIONS:
    ((speaking F) AND (looking-at-me T) AND (facing-me T)) OR
    ((gesturing F) AND (speaking F)) OR
    ((looking-at-me T) AND (speaking F)) OR
    ((gesturing F) AND (facing-me T)) OR
    ((facing-me T) AND (speaking F)) OR
    ((looking-at-me T) AND (gesturing F))
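
The CONDITIONS of giving-turn read as a disjunction of conjunctions, which can be stored as data and evaluated generically. The sketch below assumes that every clause is joined to the next by OR and that a value missing from the state counts as false; both are assumptions, as are the names `GIVING_TURN` and `holds`.

```python
# Each dict is one AND-clause; the list as a whole is OR-ed together.
GIVING_TURN = [
    {"speaking": False, "looking-at-me": True, "facing-me": True},
    {"gesturing": False, "speaking": False},
    {"looking-at-me": True, "speaking": False},
    {"gesturing": False, "facing-me": True},
    {"facing-me": True, "speaking": False},
    {"looking-at-me": True, "gesturing": False},
]


def holds(conditions, state):
    """True if at least one clause has all of its required values in state."""
    return any(
        all(state.get(name, False) == required for name, required in clause.items())
        for clause in conditions
    )
```

Under this reading the first clause is redundant, since any state that satisfies it also satisfies the third clause; keeping it does no harm and matches the slide.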

 
     








Blackboards

 
 

Blackboards simplify design

 
 
 
     
 

Blackboards solve:

  • The ‘wire’ problem - having to ‘string’ hundreds of wires between modules, manually

  • Different levels of description, e.g. ‘index-finger-extended’ vs. ‘deictic-gesture’, living in the same space without overwhelming complexity

  • Separation of processes from data, which lets us trace the progression of a solution and possibly improve the system
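
A rough sketch of how a blackboard avoids the wiring problem: modules register interest in named values instead of being connected to each other, and every change is appended to a trace. The class, the `subscribe` mechanism, and the `arm-extended` feature are hypothetical; only the names index-finger-extended and deictic-gesture come from the slide.

```python
from collections import defaultdict


class Blackboard:
    def __init__(self):
        self.state = {}
        self.trace = []                    # every change, in posting order
        self.readers = defaultdict(list)   # value name -> interested modules

    def subscribe(self, name, module):
        self.readers[name].append(module)

    def post(self, name, value, timestamp):
        if self.state.get(name) == value:
            return
        self.state[name] = value
        self.trace.append((name, value, timestamp))
        # Modules are notified through the board, not through direct wires.
        for module in self.readers[name]:
            module(self, timestamp)


def deictic_gesture(board, timestamp):
    """Higher-level description derived from lower-level ones."""
    value = bool(
        board.state.get("index-finger-extended")
        and board.state.get("arm-extended")
    )
    board.post("deictic-gesture", value, timestamp)
```

Low-level and high-level descriptions end up in the same `state` and `trace`, so the path from index-finger-extended to deictic-gesture can be inspected after the fact.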

 
     









Blackboard Example

 
 

Example of blackboard data stream with timestamps

 
 
(TAKING-TURN T 8804072)
(SPEAKING T 8804071)
(TURNED-TO-ME T 8804070)
(FACING-ME T 8804069)
(FACING-DOMAIN NIL 8804069)
(TURNED-TO-ME NIL 8803965)
(FACING-ME NIL 8803939)
(FACING-DOMAIN T 8803886)
(COMPLETE-PRAGM NIL 8803862)
(COMPLETE-SYNT T 8803804)
(COMPLETE-GRAM T 8803804)
(COMPLETE-PRAGM T 8803803)
(R-DEICTIC-MORPH NIL 8803719)
(FACING-DOMAIN NIL 8803717)
(TAKING-TURN NIL 8803695)
(R-DEICTIC-MORPH T 8803694)
(FACING-ME T 8803692)
(WANTING-TURN NIL 8803664)
(GESTURING NIL 8803663)
(HAND-IN-GEST-SPACE NIL 8803663)
(COMPLETE-SYNT NIL 8803662)
(COMPLETE-GRAM NIL 8803662)
(SPEAKING NIL 8803661)
(RHAND-IN-GEST-SPACE NIL 8803660)
(WANTING-TURN T 8803632)
(GESTURING T 8803631)
(HAND-IN-GEST-SPACE T 8803631)
(LOOKING-AT-HANDS NIL 8803630)
(TAKING-TURN T 8803630)
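
A stream in this format is easy to parse and replay. The sketch below, offered as an illustration, parses (NAME T|NIL TIMESTAMP) entries, sorts them oldest first (the listing above is newest first), and reconstructs the blackboard state at a given timestamp. The function names are hypothetical, and the unit of the timestamps is not stated in the source.

```python
import re

ENTRY = re.compile(r"\(([A-Z-]+) (T|NIL) (\d+)\)")


def parse_stream(text):
    """Return (name, value, timestamp) tuples sorted oldest first."""
    entries = [
        (name, flag == "T", int(stamp))
        for name, flag, stamp in ENTRY.findall(text)
    ]
    return sorted(entries, key=lambda entry: entry[2])


def state_at(entries, timestamp):
    """Replay sorted entries up to and including the given timestamp."""
    state = {}
    for name, value, stamp in entries:
        if stamp > timestamp:
            break
        state[name] = value
    return state


sample = """
(TAKING-TURN T 8804072)
(SPEAKING T 8804071)
(TAKING-TURN NIL 8803695)
(SPEAKING NIL 8803661)
"""
entries = parse_stream(sample)
```

Replaying `entries` up to an intermediate timestamp shows which values were true at that moment, which is the kind of tracing a blackboard makes possible.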
 
     








Perception Modules + BBs

 
 

Blackboards and Perception Modules enable us to...

 
 
  • Monitor information from multiple sources, and integrate these as necessary

  • Integrate multiple data types in a modular fashion

  • Construct modules that look at morphology, at functionality, and at the integration of the two

  • Take advantage of redundancies in the data and avoid breaking down when data are deficient (graceful degradation)
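
Graceful degradation can be illustrated with a small evidence combiner that simply skips cues that are unavailable. This is a hypothetical sketch, not a mechanism described in the lecture: the weighting scheme, the cue names used in the example, and the use of None for "no data" are all assumptions.

```python
def combine_evidence(cues, weights):
    """Weighted vote over the cues that are currently available.

    cues: name -> True, False, or None (None means the sensor gave no data).
    Returns a score between 0 and 1, or None if nothing is available.
    """
    available = {name: value for name, value in cues.items() if value is not None}
    if not available:
        return None
    total = sum(weights[name] for name in available)
    return sum(weights[name] * float(value) for name, value in available.items()) / total
```

For example, with weights {"looking-at-me": 2.0, "facing-me": 1.0}, losing the facing-me cue still leaves a usable score based on looking-at-me alone, instead of a failure.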

 
 
 










Broad-Stroke Hypothesis of Real-Time Perception

 
 

Collect ‘evidence’ from raw & processed data

  • Isolate broad strokes of all modes first
    - a gesture might reveal the meaning of a seemingly meaningless utterance
    - a nod might indicate the direction to look to grasp the meaning of the speech
    - intonation might indicate sarcasm, etc.

Broad-stroke is not the same as top-down! We can use evidence from bottom-up AND top-down to find broad strokes

 
 

 
 

Example

 
 

Two people talking, Alan and Beth
Beth moves her hand up and looks to her right
We, the viewers, know that she’s surprised to see an elephant in the middle of Manhattan, and that in 460 ms her hand motion will turn into a deictic gesture, her eyebrows will rise, and her mouth will open with surprise. At that point Alan will probably recognize the signs and look over at the elephant. But before all that happens, in the next 460 ms, Alan has to decide what to do.

At t-minus-460 ms, Alan has to decide:
1: Whether this constitutes a communicative gesture
2: If so, what kind of gesture
3: Because Alan is speaking, and has the turn, he’s reluctant to let himself be interrupted

 
 

4: Based on Beth’s expression, he’s persuaded to pause his speech (t-minus-350 ms)
5: Using Beth’s gaze and the state of the dialogue, Alan decides he will try to figure out what Beth’s multimodal actions mean, and thus delay his utterance further (t-minus-250 ms)
6: Alan figures out that Beth is making a deictic gesture (he’s not sure, but “it’s worth a gaze”) so, based on the direction of Beth’s gaze, he looks over in the direction of the elephant (t-minus-150 ms)
7: Beth’s gesture becomes fully-fledged pointing
8: Alan should have delayed looking, because he had just reached out for a beer, and now he knocks it over

 
     
 

Use top-down hypotheses to help differentiate between alternative interpretations, based on broad-stroke information, using...

  • interaction and dialogue history

  • world knowledge

  • models of the user
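
One simple way to picture top-down hypotheses helping to choose between alternatives is to weight bottom-up scores by top-down expectations and renormalize. The sketch below is only an illustration of that idea; the multiplicative scheme, the function name, and the example interpretation labels other than deictic-gesture are assumptions.

```python
def rank_interpretations(bottom_up, top_down_priors):
    """Combine bottom-up evidence with top-down expectations.

    bottom_up: interpretation -> evidence score from broad-stroke analysis.
    top_down_priors: interpretation -> multiplier derived from dialogue
    history, world knowledge, or a user model (1.0 if unspecified).
    Returns (interpretation, normalized score) pairs, best first.
    """
    scores = {
        name: evidence * top_down_priors.get(name, 1.0)
        for name, evidence in bottom_up.items()
    }
    total = sum(scores.values())
    if total == 0:
        return []
    ranked = ((name, score / total) for name, score in scores.items())
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

In the Alan and Beth example, bottom-up evidence alone might leave a deictic gesture and some other hand movement equally likely; Beth's sudden gaze shift and the dialogue state would then act as the top-down multiplier that tips the ranking toward the deictic reading.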

 
     

 






2002©K.R.Thórisson