EMBODIED AGENTS IN AUGMENTED & VIRTUAL REALITIES
Course E6998-004, Dept. of Computer Science, Columbia University, Fall 2002
Prof. Kris R. Thórisson, Ph.D. |
1 |
Overview |
|
| |
Knowledge Bases |
|
Content Layer & Content Blackboard |
||
| |
Multimodal Content Interpretation, Full-turn and larger |
|
Multimodal Content generation, Full-turn and larger |
||
2 |
Knowledge Bases |
|
| |
Q: What is a ‘knowledge base’? |
|
A: A
______ _______ plus a set of _______ for how to manipulate that data
to achieve a certain set of _________ |
||
| |
There are a minimum of two databases in Ymir |
|
|
||
| Content knowledge: Procedural, interpretive & declarative knowledge related to a TOPIC such as dialog, ditch-digging, etc. | ||
| Ymir has a special layer dedicated to topic knowledge - high-level interpretive and generative knowledge | ||
3 |
Content
Knowledge |
|
Dialog knowledge
in a Content Layer contains information about: |
||
|
||
4 |
Content
Knowledge |
|
A
Topic Knowledge Base contains all necessary information about a particular
topic, including |
||
|
||
| |
5 |
Multimodal 'Parsing': A Continuous Stream of Data |
|
Speech
recognizer -> words |
||
Prosody
analyzer -> intonation, pauses, volume |
||
Gesture -> spatial+symbolic events |
||
All streaming in @ 30 Hz! |
||
Q: How do we make sense of it all? |
||
A: By ________________ the stream |
||
6 |
Making Sense of Multimodal Data |
|
Use turn-taking
to segment the stream: |
||
You find the turns by looking at perceptual data, then you use the perceptually-guided turns to organize the perceptual data -- "Tautological Processing"? |
||
Not really: The best place to take a turn in real-time is found using perceptual data, but the turn itself is a decision, not a perception |
||
Turns are mutually defined |
||
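The idea of using turns to segment the stream can be sketched as follows. This is a minimal illustration, not Ymir code: it assumes every perceptual event carries a timestamp, and that turn boundaries arrive as timestamps from a separate decision process. The names `Event` and `segment_by_turns` are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    t: float        # timestamp in seconds
    modality: str   # "speech", "prosody" or "gesture"
    value: str


def segment_by_turns(events, turn_boundaries):
    """Group events by turn: an event at or after a boundary belongs to the next turn."""
    boundaries = sorted(turn_boundaries)
    segments = [[] for _ in range(len(boundaries) + 1)]
    for ev in sorted(events, key=lambda e: e.t):
        # bisect_right counts how many boundaries lie at or before the event
        segments[bisect_right(boundaries, ev.t)].append(ev)
    return segments


# Hypothetical stream: one user turn followed by the start of the next turn
stream = [
    Event(0.10, "speech", "what is"),
    Event(0.40, "gesture", "deictic"),
    Event(0.45, "speech", "that"),
    Event(1.30, "speech", "a box"),
]
turns = segment_by_turns(stream, turn_boundaries=[1.0])
```

Note that the boundaries are an input here: the sketch organizes perceptual data by turns, but does not decide where the turns are.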
7 |
Real-Time Multimodal Parsing |
|
Responses during
real-time dialogue can only be made using a time/accuracy tradeoff |
||
Turns and ‘thought-units’ are necessarily based on fragmented multimodal data |
||
Certainty is only possible after integration and interpretation have happened, after the turn or a series of several turns |
||
|
|
8 |
Dialogue Structure |
|
| |
||
| |
||
... use dialog structure to segment the "thought stream". |
||
9 |
Multimodal
Content |
|
Manual-Gesture-Frame: |
||
|
||
Rules for each slot in the frame determine what you can put in them |
||
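A frame whose slots are guarded by rules might look like the sketch below. It is only an illustration under stated assumptions: the slot names (`type`, `target`) and their rules are hypothetical, and the class `Frame` is not taken from Ymir.

```python
class Frame:
    """A named set of slots; each slot has a rule deciding what may go in it."""

    def __init__(self, name, slot_rules):
        self.name = name
        self.slot_rules = slot_rules  # slot name -> predicate on candidate values
        self.slots = {}               # slot name -> accepted value

    def fill(self, slot, value):
        """Store value in slot if the slot exists and its rule accepts the value."""
        rule = self.slot_rules.get(slot)
        if rule is None or not rule(value):
            return False
        self.slots[slot] = value
        return True


# Hypothetical manual-gesture frame with two slots
gesture = Frame(
    "manual-gesture",
    {
        "type": lambda v: v in {"deictic", "iconic"},
        "target": lambda v: isinstance(v, tuple) and len(v) == 3,  # x, y, z
    },
)
accepted = gesture.fill("type", "deictic")
rejected = gesture.fill("type", "wave")
```

Keeping the rules next to the slots means a perceptual process can offer data to any frame and let the frame decide whether it fits.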
10 |
Multimodal Content Parsing, Using Frames |
|
Speech-Frame: |
||
|
||
Rules for each slot in the frame determine what you can put in them |
||
*Translating these to Speech Act Theory:
|
||
[Thórisson 1996a, Searle 1975 ] |
||
11 |
Multimodal Content Parsing, Using Frames |
|
Multimodal-Action-Frame: |
||
|
||
Rules for each slot in the frame determine what you can put in them |
||
12 |
Support for Multimodal Frames |
|
Turn-constructional frames |
||
Topic frames |
||
Encounter frames |
||
13 |
Multimodal Content Interpretation |
|
Bottom-up: Tease out data that supports different types of Interpretation Frames
|
||
Fill in the frame as best you can |
||
Give it a score
depending on how well it is filled out |
||
If there are
conflicting frames, take the one with the highest score |
||
If the frames you have are only partially filled out after the turn is over, use Deciders to find the missing data, either by asking the user or by searching memory |
||
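The scoring and selection steps above can be sketched as follows. The scoring rule used here (fraction of slots filled) is an assumption chosen for simplicity, and the frame layout and slot names are hypothetical.

```python
def score(frame):
    """Assumed scoring rule: the fraction of the frame's slots that hold a value."""
    slots = frame["slots"]
    if not slots:
        return 0.0
    return sum(v is not None for v in slots.values()) / len(slots)


def best_frame(frames):
    """Among conflicting frames, take the one with the highest score."""
    return max(frames, key=score)


def missing_slots(frame):
    """Slots still empty after the turn; candidates for asking the user or searching memory."""
    return [name for name, v in frame["slots"].items() if v is None]


# Two hypothetical interpretations of the same turn
question = {"type": "question",
            "slots": {"wh_word": "what", "gesture": "deictic", "referent": None}}
command = {"type": "command",
           "slots": {"verb": None, "object": None}}

winner = best_frame([question, command])
to_resolve = missing_slots(winner)
```

A real scoring rule would likely weight slots differently; the point is only that partially filled frames can still be compared.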
14 |
Content Blackboard |
|
A place for processes in the Content Layer to inform the Process Control Layer what is happening at the content level |
||
Examples of messages posted for this purpose on the Content Blackboard:
|
||
Rcv = received |
||
15 |
Content Blackboard |
|
A place for the Process Control Layer to let the Content Layer know what’s going on at the process level |
||
Examples of messages posted on the Content Blackboard for this purpose:
|
||
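The two-way use of the Content Blackboard can be sketched as a shared message list that each layer posts to and reads from. This is a simplified illustration: the class `Blackboard` and the message names `Rcv-Input` and `User-Has-Turn` are hypothetical and are not the actual Ymir messages.

```python
import itertools
from collections import namedtuple

Message = namedtuple("Message", "seq sender kind")


class Blackboard:
    """A shared place where layers post messages for each other."""

    def __init__(self):
        self._messages = []
        self._seq = itertools.count()

    def post(self, sender, kind):
        msg = Message(next(self._seq), sender, kind)
        self._messages.append(msg)
        return msg

    def read(self, exclude_sender, since=-1):
        """Messages newer than `since` that were posted by another layer."""
        return [m for m in self._messages
                if m.seq > since and m.sender != exclude_sender]


cb = Blackboard()
cb.post("CL", "Rcv-Input")        # Content Layer informs the Process Control Layer
cb.post("PCL", "User-Has-Turn")   # Process Control Layer informs the Content Layer

for_pcl = [m.kind for m in cb.read(exclude_sender="PCL")]
for_cl = [m.kind for m in cb.read(exclude_sender="CL")]
```

Passing the sequence number of the last message seen as `since` lets a reader poll the blackboard without re-processing old messages.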
16 |
From Interpretation to Generation |
|
When the user takes the turn, the agent must make sure that all necessary processes start interpreting the content - at run-time, in real-time |
||
High-level
content
interpretation is done in the Content Layer: Processes in the Content
Layer (CL) decide what content to generate (and how) in response to
a perceived multimodal act |
||
Once the interpretation has produced something meaningful in the current context, the agent needs to generate a multimodal action in response to the content |
||
17 |
Multimodal Content Generation |
|
Some action is generated during parsing and interpretation |
||
Hopefully this
action is mostly appropriate |
||
As soon as interpretations start forming, no matter how likely they seem right then ... |
||
|
||
18 |
Multimodal Content Generation - Example |
|
Given the question “What is [deictic gesture] that?”, which contains speech & gesture, do:
|
||
Processes in the Content Layer mirror the hierarchical structure of the input frame containing the parsed multimodal input:
|
||
A Decider in the Content Layer subsequently cues the response when it's ready ... |
||
19 |
Multimodal Content Execution |
|
... As the action is cued, a Decider in the Process Control Layer (PCL) monitors the Content Blackboard to see if any responses are cued |
||
When the turn is taken (as decided by other Deciders), the Decider delivers the cued response, ... |
||
... which is sent to the Action Scheduler (AS), which |
||
... determines how, exactly, the delivery will look by choosing a morphology from the Behavior Lexicon |
||
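The path from a cued response to its delivery can be sketched as below. This is a rough illustration under several assumptions: a plain list stands in for the Content Blackboard, the lexicon entries are invented, and the scheduler simply picks the first listed morphology, whereas a real Action Scheduler would choose based on context.

```python
# Hypothetical lexicon: behavior name -> possible morphologies
BEHAVIOR_LEXICON = {
    "say": ["speak"],
    "point-at": ["extend-arm", "turn-head"],
}


class ActionScheduler:
    """Turns a behavior request into a concrete morphology."""

    def __init__(self, lexicon):
        self.lexicon = lexicon
        self.executed = []

    def schedule(self, behavior, argument):
        morphology = self.lexicon[behavior][0]  # assumption: first option is good enough
        self.executed.append((morphology, argument))


class DeliveryDecider:
    """Holds cued responses and delivers them once the agent has the turn."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.cued = []  # stands in for cued responses on the Content Blackboard

    def cue(self, behavior, argument):
        self.cued.append((behavior, argument))

    def update(self, agent_has_turn):
        """Deliver all cued responses if the turn is ours; return how many were sent."""
        if not agent_has_turn:
            return 0
        delivered = len(self.cued)
        for behavior, argument in self.cued:
            self.scheduler.schedule(behavior, argument)
        self.cued.clear()
        return delivered


scheduler = ActionScheduler(BEHAVIOR_LEXICON)
decider = DeliveryDecider(scheduler)
decider.cue("say", "That is a box")
sent_early = decider.update(agent_has_turn=False)  # nothing delivered yet
sent_on_turn = decider.update(agent_has_turn=True)
```

Separating cueing from delivery is what lets the response be prepared while the user is still speaking.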
20 |
Multimodal Content Execution |
|
The Content Layer has to make sure that the Action Scheduler is not inundated with behavior requests, so ... |
||
The CL sends behavior requests to the AS in increments of relatively short sentences, between 500ms and 5 seconds long, so ... |
||
The CL has planners that keep track of large chunks of behavior, spanning from several seconds to minutes to hours. |
||
This makes it easy for the CL to re-plan the latter part of plans when circumstances change, without going to the Motor Feedback Blackboard |
||
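Incremental delivery with re-planning of the unsent remainder can be sketched like this. The class `ContentPlanner` and the example sentences are hypothetical, and a plain list stands in for the Action Scheduler's request queue.

```python
from collections import deque


class ContentPlanner:
    """Keeps a long plan but hands it to the scheduler one short chunk at a time."""

    def __init__(self, chunks):
        self.pending = deque(chunks)  # not yet transmitted, still free to change
        self.sent = []                # already transmitted to the scheduler

    def send_next(self, scheduler_queue):
        """Transmit the next chunk, if any, and return it."""
        if not self.pending:
            return None
        chunk = self.pending.popleft()
        scheduler_queue.append(chunk)
        self.sent.append(chunk)
        return chunk

    def replan(self, new_chunks):
        """Replace the untransmitted part of the plan; sent chunks are untouched."""
        self.pending = deque(new_chunks)


as_queue = []  # stand-in for the Action Scheduler's inbox
planner = ContentPlanner(["Go to the door.", "Open it.", "Walk outside."])
planner.send_next(as_queue)
planner.replan(["Knock first.", "Then open it."])  # circumstances changed
planner.send_next(as_queue)
```

Because only one chunk is in the scheduler's hands at a time, changing the rest of the plan needs no coordination with the scheduler.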
21 |
Multimodal Knowledge Representation - Summary |
|
Frames in the Content Layer are used to collect data into coherent interpretations |
||
Automatic processes in the RL and PCL (Unimodal Perceptors and Multimodal Integrators) produce “chunks” of information that are used to fill the frames |
||
UPs and MIs
can be turned on and off by processes in the Content Layer |
||
At first the
frames serve as hypotheses about the incoming data |
||
As they fill up (some better than others) they contain the agent’s understanding of what’s going on |
||
Depending on the type of user action (request, command, etc.), fill in a multimodal response frame (or a sequence, in case of a long plan) that relates to the input |
||
If there is
missing information, request the missing information |
||
Otherwise execute the multimodal act in increments, sending parts of it (typically 1-5 seconds long) to the Action Scheduler |
||
The Action
Scheduler posts the progress of its action execution on the Motor
Feedback Blackboard (MFB) |
||
By reading
the MFB the Content Layer can cancel and modify unscheduled actions
up to their point of execution |
||
The Content Layer can re-plan non-transmitted parts of a plan, when it is clear, by reading the MFB, how a particular earlier part of the plan was actually executed by the AS |
||
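Cancelling an action by consulting the Motor Feedback Blackboard might be sketched as follows. The status values (`pending`, `executing`, `done`, `cancelled`) and the function `cancel_if_possible` are assumptions made for illustration; the actual MFB message format is not given here.

```python
class MotorFeedbackBoard:
    """Records the latest known execution state of each action."""

    def __init__(self):
        self.status = {}  # action id -> "pending" | "executing" | "done" | "cancelled"

    def post(self, action_id, state):
        self.status[action_id] = state


def cancel_if_possible(mfb, action_id):
    """Cancel an action only if it has not yet reached its point of execution."""
    if mfb.status.get(action_id) == "pending":
        mfb.post(action_id, "cancelled")
        return True
    return False


mfb = MotorFeedbackBoard()
mfb.post("say-1", "executing")   # already under way
mfb.post("point-2", "pending")   # not yet started

cancelled_say = cancel_if_possible(mfb, "say-1")
cancelled_point = cancel_if_possible(mfb, "point-2")
```

In this simplification the Content Layer writes the cancellation directly; in a fuller design it would send a request and the Action Scheduler would post the outcome.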
© 2002 K. R. Thórisson