HUMANOBS Videos

The videos below show the AERA/S1 system learning to conduct a (simplified) TV interview via goal-level learning, through observation and imitation. The agent S1 infers the goals of the two human agents by observing their behavior, creates models intended to predict how they achieve those goals, and then tests these models via (internal) simulation – that is, given some newly observed behavior, it uses the models to predict ahead of time what the agents will do next, with the hypothesized goal achievement as the termination event for each model. To learn complex tasks from observation using such goal-directed models, the system must construct hierarchies of these models.
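As a rough illustration of this test-by-simulation idea, the sketch below shows a goal-directed model being scored against an observed behavior stream, with goal achievement terminating the run. It is a minimal Python schematic under assumed names (Event, GoalModel, test_by_simulation); AERA's actual models are Replicode structures, not Python objects.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Event:
        t: float        # timestamp of the observed production
        kind: str       # e.g. "speak", "point-at", "look-at", "grab"
        payload: dict   # e.g. the words spoken or the object referenced

    @dataclass
    class GoalModel:
        """A hypothesized goal of an observed agent, plus a forward
        model predicting the next action taken in pursuit of it."""
        goal_reached: Callable[[Event], bool]        # termination test
        predict_next: Callable[[List[Event]], str]   # next event kind
        hits: int = 0
        misses: int = 0

    def test_by_simulation(model: GoalModel, observed: List[Event]) -> None:
        """Replay newly observed behavior, predicting one step ahead;
        the hypothesized goal achievement ends the model's run."""
        history: List[Event] = []
        for actual in observed:
            if model.predict_next(history) == actual.kind:
                model.hits += 1
            else:
                model.misses += 1
            history.append(actual)
            if model.goal_reached(actual):   # termination event
                break

In a hierarchy, a higher-level model would treat a complete, successful run of a lower-level model as a single predicted step; this is what lets complex tasks be composed from simpler goal-directed models.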

The behavior of the two people is tracked in realtime with specialized sensors and used to re-create, in realtime, their interaction in a virtual environment. The two people interact as they would in a video conference call, but instead of a video image of the other person, each sees the other’s avatar on their screen. While somewhat simplified compared to a real human-human interview, the dialogue contains all the key organizing principles and observable behaviors present in real-world interaction.

In this first evaluation, the full system, including the AERA-based mind of S1, was run on a 6-core desktop machine. Given 20 hours of observation-based learning of human-human dialogue, S1 can take the role of either interviewer or interviewee and continue the dialogue exactly as before. Apart from the fact that S1 has a synthesized voice, evaluations of S1’s performance after these 20 hours show virtually no difference between its performance and that of the humans. These videos were produced immediately after the system had observed the 20 hours of human-human interaction, without any extra computation time, and the interaction shown in the videos proceeded in realtime, with S1 interacting in realtime with a human. S1 can immediately assume either role, interviewer or interviewee, with a human taking the other. No prior technology exists that can perform the task demonstrated here by our AI.

Human-Human Interaction

This is the input to the AERA agent S1 when learning psycho-social dialogue skills

http://www.youtube.com/watch?v=2NQtEJbQCdw

Human-human interaction, as observed by the S1 AERA agent. Two humans,
Kris and Eric, interact in a virtual environment. Their behavior is
tracked in realtime by sensors, and they speak to each other via
microphones. S1 observes their gestures and speech via off-the-shelf
speech recognition software and prosody tracking. After observing for
sufficiently long, S1 can take over either avatar and carry on the
interview in precisely the same fashion (see videos MH.no_interrupt.mp4,
HM.no_interrupt.mp4, and HM.interrupt.mp4; in the "interrupt" scenario
S1 has learned to use interruption as a method to keep the interview
from going over a pre-defined time limit).

What S1 is Given at the Outset

This is a complete list of what is in the seed (initial code) given to the system as it starts to observe the human-human interaction:

- actions: grab, release, point-at, look-at (defined as event types constrained by geometric relationships)
- stopping the interview clock ends the session
- objects: glass-bottle, plastic-bottle, cardboard-box, wooden-cube, newspaper
- objects have properties (e.g. made-of)
- interviewee-role
- interviewer-role
- Model for interviewer
– top-level goal of interviewer: prompt interviewee to communicate
– in interruption case: an imposed interview duration time limit
- Models for interviewee
– top-level goal of interviewee: to communicate
– never communicate unless prompted
– communicate about properties of objects being asked about, for as long as there still are properties available
– don’t communicate about properties that have already been mentioned

This information is encapsulated in the system as Replicode programs. S1 then observes about 20 hours of the type of interaction shown in the human-human video above. Observation is done by monitoring the event streams produced by the two avatars and the world they interact in, that is, timed productions of changes in word output (via speech recognition and prosody tracking) and geometric changes in the orientation and positions of named objects.
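The seed itself is written in Replicode; purely for illustration, a Python-flavored schematic of the kinds of entities it contains, and of the timed event records S1 monitors, might look as follows. All names, and the made-of values, are assumptions inferred from the object names, not the project's actual code.

    from dataclasses import dataclass

    # Illustrative seed ontology: actions, objects with properties, roles.
    ACTIONS = {"grab", "release", "point-at", "look-at"}
    OBJECTS = {
        "glass-bottle":   {"made-of": "glass"},
        "plastic-bottle": {"made-of": "plastic"},
        "cardboard-box":  {"made-of": "cardboard"},
        "wooden-cube":    {"made-of": "wood"},
        "newspaper":      {"made-of": "paper"},
    }
    ROLES = {"interviewer", "interviewee"}

    @dataclass
    class TimedEvent:
        """One entry in the event streams S1 monitors: a timed production
        of a change in word output or a geometric change of an object."""
        t: float       # time of the production
        source: str    # which avatar (or named object) produced it
        channel: str   # "word" | "prosody" | "position" | "orientation"
        value: object  # e.g. a recognized word, or a new pose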

Due to frequent errors of commission from the speech recognizer (that is, the recognizer outputting words that were not actually uttered by the users), a set of accepted words served as a filter to weed these out. This filtering was done outside of the S1 agent, so the accepted-word list was not part of the seed and was never used by the AERA-based S1 agent in any way.
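A minimal sketch of that external filtering step, assuming a fixed accepted-word list (the contents shown are hypothetical):

    # Hypothetical accepted-word list; the actual list is not given here.
    ACCEPTED_WORDS = {"glass", "plastic", "cardboard", "wooden", "bottle",
                      "box", "cube", "newspaper", "made", "of"}

    def filter_recognized(words):
        """Drop recognizer errors of commission: any word not on the
        accepted list never reaches the S1 agent."""
        return [w for w in words if w.lower() in ACCEPTED_WORDS]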

Human-S1 Interaction

Here S1, having learned a number of basic psycho-social dialogue skills from observation, takes the role of interviewer

http://www.youtube.com/watch?v=SH6tQ4fgWA4

After having observed two humans interact in a simulated TV interview
for some time, the AERA agent S1 takes the role of interviewer,
continuing the interview in precisely the same fashion as before, asking
questions of the human interviewee (see videos HH.no_interrupt.mp4 and
HH.interrupt.mp4 for the human-human interaction that S1 observed;
see MH.no_interrupt.mp4 and HM.interrupt.mp4 for other examples of the
skills that S1 has acquired by observation). In the "interrupt" scenario
(MH.interrupt.mp4) S1 has learned to use interruption as a method to
keep the interview from going over the allowed time limit.

What S1 Learns by Observation and Imitation

After 20 hours of watching two humans in a simulated TV interview like the one above, S1 has learned the following via goal-level imitation (a schematic sketch of a few of these rules follows the lists):

GENERAL INTERVIEW PRINCIPLES
- word order in sentences (with no a-priori grammar)
- disambiguation via co-verbal deictic references
- roles of interviewer and interviewee
- an interview involves serialization of joint actions (a series of Qs and As by each participant)

MULTIMODAL COORDINATION & JOINT ACTION
- taking turns speaking
- co-verbal deictic reference
- manipulation as deictic reference
- looking as deictic reference
- pointing as deictic reference

INTERVIEWER
- asking a series of questions, without repeating questions about objects already addressed
- “thank you” stops the interview clock
- interruption condition: “hold on, let’s go to the next question” can be used to keep the interview within its time limit

INTERVIEWEE
- what to answer based on what is asked
- an object property is not spoken of if it is not asked for
- silence from the interviewer means “go on”
- a nod from the interviewer means “go on”
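Purely as a schematic of what such learned rules amount to, a couple of them could be written as the following Python sketch. The function names, the 0.9 safety margin, and the phrasing of the reply are illustrative assumptions; AERA represents these rules as learned Replicode models, not hand-written code.

    def interviewee_reply(asked, mentioned):
        """Answer only what is asked; never repeat a property
        that has already been given."""
        obj, prop, value = asked      # e.g. ("wooden-cube", "made-of", "wood")
        if (obj, prop) in mentioned:
            return None               # already covered: stay silent
        mentioned.add((obj, prop))
        return f"the {obj} is made of {value}"

    def interviewer_should_interrupt(elapsed_s, limit_s=240.0, margin=0.9):
        """Interrupt ("hold on, let's go to the next question") when an
        answer threatens the time limit (4 minutes in the interrupt
        scenario); the 0.9 margin is an assumed parameter."""
        return elapsed_s > margin * limit_s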

S1-Human Interaction

Here S1, having learned a number of basic psycho-social dialogue skills from observation, takes the role of interviewee

http://www.youtube.com/watch?v=x96HXLPLORg

After having observed two humans interact in a simulated TV interview
for some time, the AERA agent S1 takes the role of interviewee, continuing
the interview in precisely the same fashion as before, answering the
questions of the human interviewer (see videos HH.no_interrupt.mp4 and
HH.interrupt.mp4 for the human-human interaction that S1 observed;
see HM.no_interrupt.mp4 and HM.interrupt.mp4 for other examples of the
skills that S1 has acquired by observation). In the "interrupt" scenario
S1 has learned to use interruption as a method to keep the interview
from going over a pre-defined time limit.

Human-Human Interaction with Interruption (Example)

Two humans interacting via a virtual world in realtime; this interaction
provides the AERA agent S1 with an example of how to use interruption to
move the interview forward and meet deadlines

http://www.youtube.com/watch?v=AWIqOOCCvqg

This video clip shows two humans interacting, with the interviewer using
interruption to keep the interview below a pre-defined time limit of 4 minutes.

Human-S1 Interaction with Interruption

In this video S1 has learned to use interruption to move the dialogue
along so that the interview can finish on time

http://www.youtube.com/watch?v=Cyd-EueNKqE&feature=youtu.be

Having observed a human interviewer use interruption to move the interview
forward, S1 takes the role of interviewer and demonstrates the acquisition
of this skill by interrupting the human interviewee to meet pre-defined
time limits for the interview. (See HH.interrupt.mp4 for what S1
observed to learn this technique; see HH.no_interrupt.mp4 for the general
human-human interaction that S1 learned interview skills from; see
HM.no_interrupt.mp4 and HM.interrupt.mp4 for other examples of the skills
that S1 has acquired automatically by observation.)