EMBODIED AGENTS

IN AUGMENTED & VIRTUAL REALITIES

Course E6998-004, Dept. of Computer Science, Columbia University, Fall 2002
Prof. Kris R. Thórisson, Ph.D.
 
 
 

 

LECTURE NOTES

LECTURE 5 PART 2

Multimodal Knowledge Representation

November 7, 2002

 

 
     









1

Overview

 

 

Knowledge Bases

 
 

Content Layer & Content Blackboard

 

 

Multimodal Content Interpretation, Full-turn and larger

 
 

Multimodal Content generation, Full-turn and larger

 
     










2

Knowledge Bases

 

 

Q: What is a ‘knowledge base’?

 
 

A: A ______ _______ plus a set of _______ for how to manipulate that data to achieve a certain set of _________

 

 

There are a minimum of two knowledge bases in Ymir:

 
 
  • A Dialogue Knowledge Base
  • One or more Topic Knowledge Bases
 
  Content knowledge: Procedural, interpretive & declarative knowledge related to a TOPIC such as dialog, ditch-digging, etc.  
  Ymir has a special layer dedicated to topic knowledge - high-level interpretive and generative knowledge  
     










3

Content Knowledge
Related to Dialog

 

Dialog knowledge in a Content Layer contains information about:

 
 
  • How to parse gestures related to the dialog process, and how they are generated {beats, deictics, symbolics}
  • Speech and gesture used in reference to the dialogue itself, participants, body parts, etc.
  • Dialogue structure - incl. openings, closings
  • Participants, context (spatial and propositional)
 
     










4

Content Knowledge
Related to Other Topics

 

A Topic Knowledge Base contains all necessary information about a particular topic, including

  • Language related to that topic, cue words, key words, etc.
  • Action repertoire related to the topic, e.g. digging ditches, moon landings, web searches, etc.
  • Spatial information regarding topic (e.g. “on top of”, “south of”, etc. as well as metric data - sizes, shapes and so forth)
  • Methods to parse topic-related gestures such as iconic, pantomimic, metaphoric

 










5

Multimodal 'Parsing': A Continuous Stream of Data

 

Speech recognizer -> words

 

Prosody analyzer -> intonation, pauses, volume

 
 

Gesture -> spatial+symbolic events

 

All streaming in @ 30 Hz!

 
     
 

Q: How do we make sense of it all?

 
 

A: By ________________ the stream

 
     










6

Making Sense of Multimodal Data

 
 

Use turn-taking to segment the stream:

 
 

You find the turns by looking at perceptual data, then you use the perceptually-guided turns to organize the perceptual data -- "Tautological Processing"?

 
 

Not really: The best place to take a turn in real-time is found using perceptual data, but the turn itself is a decision, not a perception

 
 

Turns are mutually defined
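
A minimal sketch of using turn boundaries to segment a timestamped multimodal stream. The names Event and segment_by_turns, and the sample timestamps, are hypothetical and not part of Ymir; the sketch assumes the turn start times have already been decided elsewhere.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Event:
    ts: float      # timestamp in seconds
    source: str    # e.g. "speech", "prosody", "gesture"
    value: str


def segment_by_turns(events, turn_starts):
    """Group events into turns; turn_starts is a sorted list of start times."""
    turns = [[] for _ in turn_starts]
    for ev in sorted(events, key=lambda e: e.ts):
        idx = bisect_right(turn_starts, ev.ts) - 1
        if idx >= 0:  # events before the first turn are dropped
            turns[idx].append(ev)
    return turns


events = [
    Event(0.2, "speech", "what"),
    Event(0.5, "gesture", "point"),
    Event(0.6, "speech", "is that"),
    Event(2.1, "speech", "that is grass"),
]
turns = segment_by_turns(events, turn_starts=[0.0, 2.0])
```

Each inner list then holds the speech, prosody and gesture events that belong to one turn, which is the unit that later interpretation works on.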

 
     









7

Real-Time Multimodal Parsing

 

Responses during real-time dialogue can only be made using a time/accuracy tradeoff

 
 

Turns and ‘thought-units’ are necessarily based on fragmented multimodal data

 
 

Certainty is only possible after integration and interpretation have happened, after the turn, or a series of several turns

 

 

 

 







8

Dialogue Structure

 


 
 

 

 
 

 

 

 

 
 

... use dialog structure to segment "thought stream".

 
 

 

 



 

 




9

Multimodal Content Parsing, Using Frames

 
 

Manual-Gesture-Frame:

 
 
  • Period [begin-ts, end-ts]
  • Type: oneof {deictic, iconic, pantomimic, beat, butterworth, emblematic/symbolic}
  • References [list-of-objects [obj-1, obj-2, ... obj-n]]
  • Segmented-Movement-Data [data]
 
 

Rules for each slot in the frame determine what you can put in them
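
One way to picture the Manual-Gesture-Frame and its slot rules is as a small data class whose setters reject values the slot does not allow. This is a sketch under assumptions: the class name, method names and the specific checks are illustrative, not Ymir's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Allowed values for the Type slot, as listed on the slide.
GESTURE_TYPES = {
    "deictic", "iconic", "pantomimic",
    "beat", "butterworth", "emblematic/symbolic",
}


@dataclass
class ManualGestureFrame:
    period: Optional[Tuple[float, float]] = None   # (begin-ts, end-ts)
    type: Optional[str] = None
    references: List[str] = field(default_factory=list)
    segmented_movement_data: Optional[list] = None

    def set_period(self, begin_ts: float, end_ts: float) -> None:
        # Hypothetical slot rule: a period cannot end before it begins.
        if end_ts < begin_ts:
            raise ValueError("end-ts precedes begin-ts")
        self.period = (begin_ts, end_ts)

    def set_type(self, gesture_type: str) -> None:
        # Slot rule: Type must be one of the listed gesture types.
        if gesture_type not in GESTURE_TYPES:
            raise ValueError(f"unknown gesture type: {gesture_type}")
        self.type = gesture_type


frame = ManualGestureFrame()
frame.set_period(0.4, 0.9)
frame.set_type("deictic")
frame.references.append("obj-1")
```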

 
 

 

 



 




10

Multimodal Content Parsing, Using Frames

 
 

Speech-Frame:

 
 
  • Period [begin-ts, end-ts]
  • Words [word-1 ts, word-2 ts, ... word-n ts]
  • Type: oneof {question, request, command, informative, promise}*
  • Command [ACT: Enlarge]
    • Words [“make” ts-1, “bigger” ts-3]
  • References [list-of-objects [obj-1, obj-2, ... obj-n]]
    • Words [“those” ts-2]
 
 

Rules for each slot in the frame determine what you can put in them

 
 

*Translating these to Speech Act Theory:

  • informative = Assertive speech act
  • request + promise = Commissive speech act
  • question + command = Directive speech act
 
 

[Thórisson 1996a, Searle 1975]
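
The Speech-Frame example and the speech-act translation can be sketched as follows. The mapping is copied from the slide; the class layout, the dictionary keys and the timestamps are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Frame type -> speech act, as given in the translation on the slide.
SPEECH_ACT = {
    "informative": "assertive",
    "request": "commissive",
    "promise": "commissive",
    "question": "directive",
    "command": "directive",
}

Word = Tuple[str, float]  # (word, timestamp); timestamps below are made up


@dataclass
class SpeechFrame:
    period: Tuple[float, float]
    words: List[Word]
    type: str
    command: Optional[Dict] = None
    references: Optional[Dict] = None

    def __post_init__(self):
        # Slot rule: Type must be one of the five listed act types.
        if self.type not in SPEECH_ACT:
            raise ValueError(f"unknown speech frame type: {self.type}")

    def speech_act(self) -> str:
        return SPEECH_ACT[self.type]


# "make those bigger", filled in as on the slide.
frame = SpeechFrame(
    period=(0.0, 1.2),
    words=[("make", 0.1), ("those", 0.5), ("bigger", 0.9)],
    type="command",
    command={"act": "Enlarge", "words": [("make", 0.1), ("bigger", 0.9)]},
    references={"objects": [], "words": [("those", 0.5)]},
)
```

Note that the References slot still has an empty object list here: the word “those” is known to refer to something, but speech alone does not say what.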

 
     



 




11

Multimodal Content Parsing, Using Frames

 
 

Multimodal-Action-Frame:

 
 
  • Period [begin-ts, end-ts]
  • Type: oneof {question, request, command, informative, promise}
  • Topic: [Label, which-KB]
  • Perceptual-Data
    • Speech [Speech-Frame]
    • Communicative manual gesture [Manual-gesture-frame]
    • Communicative body gesture [Body-gesture-frame]
    • Communicative facial gesture [Facial-gesture-frame]
 
 

Rules for each slot in the frame determine what you can put in them
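
A sketch of how the Multimodal-Action-Frame might nest the per-modality frames. Sub-frames are shown as plain dictionaries to keep the example short; the field names and the modalities_present helper are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MultimodalActionFrame:
    period: Tuple[float, float]
    type: Optional[str] = None                 # question, request, command, ...
    topic: Optional[Tuple[str, str]] = None    # (label, which-KB)
    # Perceptual-Data slots; each holds a sub-frame or None if not yet filled.
    speech: Optional[dict] = None
    manual_gesture: Optional[dict] = None
    body_gesture: Optional[dict] = None
    facial_gesture: Optional[dict] = None

    def modalities_present(self) -> List[str]:
        """Names of the perceptual slots that have been filled so far."""
        slots = {
            "speech": self.speech,
            "manual_gesture": self.manual_gesture,
            "body_gesture": self.body_gesture,
            "facial_gesture": self.facial_gesture,
        }
        return [name for name, value in slots.items() if value is not None]


frame = MultimodalActionFrame(
    period=(0.0, 1.5),
    type="question",
    speech={"words": ["what", "is", "that"]},
    manual_gesture={"type": "deictic"},
)
```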

 
     

 




12

Support for Multimodal Frames

 
 

Turn-constructional frames

 
 

Topic frames

 
 

Encounter frames

 
 

 

 

 

 

 

 

 




13

Multimodal Content Interpretation

 
 

Bottom-up: Tease out data that supports different types of Interpretation Frames

  • e.g. if there is a manual gesture that looks a lot like a pointing gesture, pull out a Manual-Gesture-Frame and mark it “Deictic”
 
 

Fill in the frame the best you can

 
 

Give it a score depending on how well it is filled out

 
 

If there are conflicting frames, take the one with the highest score

 
 

If the frames you have are only partially filled out after the turn is over, use Deciders to find the missing data, either by asking the user or by searching memory
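
The fill, score and select steps can be sketched as below. The scoring rule used here (fraction of slots filled) is an assumption chosen for simplicity; the lecture does not specify how scores are computed, and the function names are hypothetical.

```python
def score(frame: dict) -> float:
    """Fraction of slots that are filled (not None)."""
    if not frame:
        return 0.0
    return sum(value is not None for value in frame.values()) / len(frame)


def best_interpretation(candidates):
    """Among conflicting frames, take the one with the highest score."""
    return max(candidates, key=score)


def missing_slots(frame: dict):
    """Slots a Decider would still need to fill, e.g. by asking the user."""
    return [name for name, value in frame.items() if value is None]


# Two conflicting readings of the same hand movement.
deictic = {"period": (0.4, 0.9), "type": "deictic",
           "references": ["obj-1"], "movement": None}
beat = {"period": (0.4, 0.9), "type": "beat",
        "references": None, "movement": None}

winner = best_interpretation([deictic, beat])
todo = missing_slots(winner)
```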

 
     

 

 

 

 

 



14

Content Blackboard

 
 

A place for processes in the Content Layer to inform the Process Control Layer what is happening at the content level

 
 

Examples of messages posted for this purpose on the Content Blackboard:

  • Rcv-Speech
  • Speech-Data-Avail
  • KB-Succ-Parse
  • KB-Exec-Act
  • CL-Act-Avail
  • TKB-Act-Avail
  • TKB-Exec-Speech-Act
  • TKB-Exec-World-Act
  • DKB-Exec-Act
  • Exec-Done
 
 

Rcv = received
Act = action
Exec = executing
Succ = successful

 
     

 

 

 

 

 

 

 

15

Content Blackboard

 
 

A place for the Process Control Layer to let the Content Layer know what’s going on at the process level

 
 

Examples of messages posted on the Content Blackboard for this purpose:

  • User-Taking-Turn
  • User-Giving-Turn
  • User-Wanting-Turn
  • I-See-User
  • I-Take-Turn
  • I-Give-Turn
  • I-Want-Turn
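
A minimal sketch of a blackboard that both layers post to and read from. The message names are taken from the slides; the Blackboard class, its post and read methods, and the timestamps are assumptions for illustration, not Ymir's actual interface.

```python
import time
from collections import deque


class Blackboard:
    """Shared message store: any process may post, any process may read."""

    def __init__(self):
        self._messages = deque()

    def post(self, sender, message, ts=None):
        # Use the supplied timestamp, or the current time if none is given.
        stamp = ts if ts is not None else time.time()
        self._messages.append((stamp, sender, message))

    def read(self, since=0.0, sender=None):
        """Messages posted at or after `since`, optionally from one sender."""
        return [
            msg for (stamp, who, msg) in self._messages
            if stamp >= since and (sender is None or who == sender)
        ]


cb = Blackboard()
cb.post("PCL", "User-Giving-Turn", ts=1.0)   # process level -> content level
cb.post("CL", "Speech-Data-Avail", ts=1.2)   # content level -> process level
cb.post("CL", "KB-Succ-Parse", ts=1.5)
```

A Decider in the Process Control Layer could then poll with something like cb.read(sender="CL") to see what the Content Layer has reported.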
 
 

 

 

 

 

 

 

 

 

16

From Interpretation to Generation

 
 

When the user takes the turn, the agent must make sure that all necessary processes start interpreting the content at run-time, in real time

 
 

High-level content interpretation is done in the Content Layer: Processes in the Content Layer (CL) decide what content to generate (and how) in response to a perceived multimodal act

 
 

Once the interpretation has produced something meaningful in the current context, the agent needs to generate a multimodal action in response to the content

 
     

 

 

 

 

 


17

Multimodal Content Generation

 
 

Some action is generated during parsing and interpretation

 
 

Hopefully this action is mostly appropriate

 
 

As soon as interpretations start forming, no matter how likely they seem right then ...

 
 
  • start generating a response based on the data
  • refine this response as more information keeps flowing in through your senses
  • if you end up with more than one valid but conflicting interpretation, and therefore more than one response, choose one over the other
 
     

 

 

 

 



18

Multimodal Content Generation - Example

 
 

Given the question “What is [deictic gesture] that?”, which contains speech & gesture, do:

  • Classify the utterance: [WH-Sentence]
  • Find referents to the utterance [“that”] -> missing referent
  • Gesture: [type: deictic; direction{x,y,z}; referent: green patch; name: “grass”]
  • Recall response format to WH-Sentence:
    • “That is [X]”, where X is filled in with the name of the object being asked about
 
 

Processes in the Content Layer mirror the hierarchical structure of the input frame containing the parsed multimodal input:

  • From the multimodal act types, {question, request, command, informative, promise}, select the appropriate top-level generation process
  • Generate a top-level response frame (e.g. if the input frame is of type 'question', select 'answer')
  • For each of the filled slots in the multimodal input frame, select the appropriate process, based on how much is filled in and what
  • Let the processes operate on the input frame to fill in the output frame
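
The “What is that?” example can be sketched end to end as below. The frame layout, the function names and the wording of the clarification request are hypothetical; only the 'question' to 'answer' pairing and the “That is [X]” format come from the slide.

```python
def resolve_referent(frame):
    """Use the deictic gesture to supply the referent that speech left open."""
    gesture = frame.get("gesture") or {}
    if gesture.get("type") == "deictic":
        return gesture.get("name")
    return None


def answer_wh_question(frame):
    referent = resolve_referent(frame)
    if referent is None:
        # Missing information: ask for it instead of answering.
        return {"type": "request", "text": "What are you referring to?"}
    return {"type": "answer", "text": f"That is {referent}"}


# Top-level generation process selected by the type of the input frame.
GENERATORS = {"question": answer_wh_question}


def generate(frame):
    return GENERATORS[frame["type"]](frame)


input_frame = {
    "type": "question",
    "speech": {"class": "WH-Sentence",
               "words": ["what", "is", "that"], "referent": None},
    "gesture": {"type": "deictic", "referent": "green patch", "name": "grass"},
}
response = generate(input_frame)
```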
 
 

A Decider in the Content Layer subsequently cues the response when it's ready ...

 
     

 

 

 

 



19

Multimodal Content Execution

 
 

... As the action is cued, a Decider in the Process Control Layer (PCL) monitors the Content Blackboard to see if any responses are cued

 
 

When the turn is taken (as decided by other Deciders), the Decider delivers the cued response, ...

 
 

... which is sent to the Action Scheduler (AS), which

 
 

...determines how, exactly, the delivery will look by choosing a morphology from the Behavior Lexicon

 
     

 

 

 

 

 

 

 

20

Multimodal Content Execution

 
 

The Content Layer has to make sure that the Action Scheduler is not inundated with behavior requests, so ...

 
 

The CL sends behavior requests to the AS in increments of relatively short sentences, between 500 ms and 5 seconds long, so ...

 
 

The CL has planners that keep track of large chunks of behavior, spanning anywhere from several seconds to minutes to hours.

 
 

This makes it easy for the CL to re-plan the latter part of plans when circumstances change, without going to the Motor Feedback Blackboard
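
A sketch of sending a long plan to the Action Scheduler in short increments, so that the untransmitted tail can still be replaced. The plan representation (name, duration in seconds), the 5-second cap as a parameter, and the function names are assumptions for illustration.

```python
def next_increment(plan, max_seconds=5.0):
    """Remove and return the leading steps that fit within max_seconds.

    Always returns at least one step if the plan is non-empty, so a single
    over-long step cannot block the plan forever.
    """
    chunk, total = [], 0.0
    while plan and (not chunk or total + plan[0][1] <= max_seconds):
        step = plan.pop(0)
        chunk.append(step)
        total += step[1]
    return chunk


def replan(plan, new_tail):
    """Replace everything not yet sent; already-sent steps are unaffected."""
    plan[:] = new_tail


plan = [("greet", 1.0), ("explain-step-1", 3.0),
        ("explain-step-2", 2.5), ("farewell", 1.0)]

first = next_increment(plan)          # sent to the Action Scheduler
replan(plan, [("apologize", 1.5)])    # circumstances changed
second = next_increment(plan)
```

Because only the first increment has left the Content Layer, the remaining steps can be swapped out locally without recalling anything from the scheduler.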

 
 

 

 

 

 

 

 

 

 

 

21

Multimodal Knowledge Representation - Summary

 
 

Frames in the Content Layer are used to collect data into coherent interpretations

 
 

Automatic processes in the RL and PCL (Unimodal Perceptors and Multimodal Integrators) produce data “chunks” that are used to fill the frames

 
 

UPs and MIs can be turned on and off by processes in the Content Layer

 
 

At first the frames serve as hypotheses about the incoming data

 
 

As they fill up (some better than others) they contain the agent’s understanding of what’s going on

 
 

Depending on the type of user action (request, command, etc.), fill in a multimodal response frame (or a sequence, in case of a long plan) that relates to the input

 
 

If there is missing information, request the missing information

 
 

Otherwise execute the multimodal act in increments, sending parts of it (typically 1-5 seconds long) to the Action Scheduler

 
 

The Action Scheduler posts the progress of its action execution on the Motor Feedback Blackboard (MFB)

 
 

By reading the MFB the Content Layer can cancel and modify unscheduled actions up to their point of execution

 
 

The Content Layer can re-plan non-transmitted parts of a plan, when it is clear, by reading the MFB, how a particular earlier part of the plan was actually executed by the AS

 
     

 

 

 




2002©K.R.Thórisson