[visionlist] PhD scholarship : Directing virtual actors by interaction and mutual imitation
Remi Ronfard
remi.ronfard at inria.fr
Tue May 15 12:03:33 GMT 2012
We are selecting candidates for a PhD scholarship as part of LABEX
PERSYVAL in Grenoble, France, on the topic of Directing virtual actors
by interaction and mutual imitation.
This is a joint project between the IMAGINE team at INRIA and the
GIPSA-LAB at Grenoble University.
The PhD topic requires a strong background in machine learning and
computer graphics, excellent academic records, and good programming skills.
Interested students should send their curriculum vitae before May 31 to:
Gérard BAILLY, GIPSA-Lab, Gerard.bailly at gipsa-lab.grenoble-inp.fr,
+33 4 76 57 47 11
Rémi RONFARD, LJK, Remi.ronfard at inria.fr, +33 4 76 61 53 94
The scholarship will be awarded on June 30, based on the quality of
the applications.
Abstract:
The challenge of this project is to propose a system that allows a
director to control and modify the performance of a virtual actor by
demonstration. The system will operate an action-perception loop over
video input: the director plays the scene in front of a camera, and
the system analyzes his diction, facial expressions and head
movements. The system then creates a virtual copy of his performance
by chaining statistical models that drive a virtual character, with
multimodal speech synthesis and animation of a talking head imitating
the director. In general, this first attempt will not correspond
exactly to the sought effect. The director then repeats the sequence,
changing his speech and gestures to be better understood. He can also
give the system rewards ("better", "worse") and indications ("faster",
"quieter"). The system will converge on the desired result through a
series of such interactions, learning the required sequence of actions
to perform and how to parameterize them.
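To make the interaction concrete, here is a minimal sketch of that
action-perception loop in Python. All component names (camera,
analyzer, actor_model, director) are hypothetical placeholders for the
modules described above, not an existing implementation.

    def direct_virtual_actor(camera, analyzer, actor_model, director,
                             max_takes=20):
        """Iteratively refine a virtual actor's performance by imitation.

        Hypothetical interfaces: camera records a take, analyzer extracts
        multimodal features, actor_model imitates and learns, director
        reviews the result and gives rewards or indications.
        """
        performance = None
        for _ in range(max_takes):
            # 1. The director plays the scene in front of the camera.
            video = camera.record_take()

            # 2. Analyze diction, facial expressions and head movements.
            features = analyzer.extract_features(video)

            # 3. Chain statistical models to produce an imitation:
            #    multimodal speech synthesis + talking-head animation.
            performance = actor_model.imitate(features)

            # 4. The director reacts with rewards ("better"/"worse") and
            #    indications ("faster", "quieter"), or accepts the take.
            feedback = director.review(performance)
            if feedback.accepted:
                break

            # 5. Update the statistical models from demonstration + reward.
            actor_model.update(features, feedback)
        return performance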
The originality of the project is to consider the behavior of the two
interlocutors (multimodal movements of the head and eyes, facial
expressions and speech) as coupled systems, and to study the rhythmic
coordination of all these gestures.
Context:
The virtual actor model that we will develop will behave like a kind
of Eliza Doolittle repeating her diction exercises to reproduce (with
difficulty) the instructions of Dr. Higgins in My Fair Lady. Another
useful reference that motivates and illustrates this project is
theater exercises, in which the same phrase is repeated with all
possible intonations: Marcel Pagnol's Le Schpountz provides a familiar
example, when Fernandel repeats the sentence "Anyone sentenced to
death will have his head cut off" with a wide range of expressive
attitudes, from fear, incredulity and disgust to sarcasm and doubt.
It is assumed that the virtual actor has a rich inventory of
multimodal behavior and can express a wide variety of mental states.
The beginning of this thesis will thus consist in formally modeling a
large part of the 412 emotions organized into 24 functional groups by
S. Baron-Cohen (2008), based on the inventory recorded by six English
speakers, which we will adapt to French and have played by a
professional human actor. A 3D virtual clone will be endowed with the
ability to reproduce those emotions on any statement, based on a
multimodal decomposition of these behaviors into elementary gestures.
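As a purely illustrative sketch of how such an inventory might be
represented, the following Python fragment organizes emotions into
functional groups and decomposes each one into timed elementary
gestures. All labels, groupings and timing values below are invented
for illustration; they are not taken from the Baron-Cohen inventory.

    from dataclasses import dataclass, field

    @dataclass
    class ElementaryGesture:
        modality: str           # "gaze", "head", "face" or "speech"
        name: str               # e.g. "eyebrow_raise", "pitch_rise"
        onset: float            # start time relative to the utterance (s)
        duration: float         # gesture duration (s)
        intensity: float = 1.0  # gradience parameter in [0, 1]

    @dataclass
    class Emotion:
        label: str              # e.g. "incredulity"
        group: str              # one of the 24 functional groups
        gestures: list = field(default_factory=list)

    # Example entry: "doubt" as a coordinated head/face/speech pattern.
    doubt = Emotion(
        label="doubt",
        group="unsure",
        gestures=[
            ElementaryGesture("head", "tilt", onset=0.0, duration=0.8),
            ElementaryGesture("face", "eyebrow_raise", 0.1, 0.6),
            ElementaryGesture("speech", "pitch_rise", 0.4, 0.4, 0.7),
        ],
    )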
The aim of this thesis is to orchestrate these gestures (selection,
sequencing, phasing and gradience), as well as to extend the
inventory, in order to mimic the communicative intentions of an
actor's director operating through demonstration and reward.
Scientific challenges:
The scientific challenges are twofold: technological and cognitive.
Technological challenges include the analysis and synthesis of
multimodal behavior, statistical modeling (including stochastic
processes such as POMDPs (Young 2010; Jurcicek et al. 2011)), and
learning by demonstration. A central issue is the orchestration of the
different dimensions of the animation, both in terms of the sequencing
of actions (see for example the ability of CRFs to learn complex
syntactic relations (Sutton and McCallum 2006)) and in terms of the
fine tuning of kinematic trajectories (see for example the trajectory
hidden Markov models introduced by Tokuda et al. (Zen et al. 2004; Zen
et al. 2011)). The aim of the project is to model the coupling between
director and actor so as to directly take into account both
satisfactory experiences and unsuccessful imitations.
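As an illustration of the POMDP machinery cited above, here is a
minimal sketch of a discrete belief update, the core operation such
models repeat at every turn of the interaction (Young 2010). The tiny
state, action and observation spaces are toy assumptions chosen for
illustration, not project data.

    import numpy as np

    def belief_update(b, a, o, T, O):
        """b'(s') is proportional to O[a, s', o] * sum_s T[a, s, s'] * b(s)."""
        predicted = b @ T[a]              # predict next-state distribution
        updated = O[a, :, o] * predicted  # weight by observation likelihood
        return updated / updated.sum()    # renormalize to a distribution

    # Toy model: 2 hidden "director intention" states, 2 actions,
    # 2 observations.
    T = np.array([[[0.9, 0.1], [0.2, 0.8]],   # T[a, s, s']
                  [[0.7, 0.3], [0.4, 0.6]]])
    O = np.array([[[0.8, 0.2], [0.3, 0.7]],   # O[a, s', o]
                  [[0.6, 0.4], [0.1, 0.9]]])

    b = np.array([0.5, 0.5])                  # uniform prior belief
    b = belief_update(b, a=0, o=1, T=T, O=O)  # after acting and observing
    print(b)                                  # approx [0.26 0.74]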
Cognitive challenges concern the study and modeling of
perception-action links. According to Gallese et al. (Gallese et al.
1996; Gallese and Goldman 1998), the activity of mirror neurons (MN)
in the brain of a primate or human observer performs an automatic
motor simulation of the movements performed by an agent, in order to
represent the intended actions of the latter. Following this
hypothesis, MN support "mindreading" (a concept introduced by S.
Baron-Cohen, quoted above), that is to say, the psychological
understanding of others (or mentalization): the ability to "read" the
minds of others and thus to represent their mental states so as to
understand and predict their behaviors. Through the analysis of loops
of reciprocal imitation and targeted questionnaires, the project aims
to study the mental simulations and behavioral strategies implemented
by human or virtual observers to discover and exploit the mindreading
capabilities of their conversational partner.
The global scientific challenge is to provide a comprehensive cognitive
architecture that can simulate the dynamic coupling between a real human
and a virtual human in the context of multimodal interactions. The
originality of this architecture is to account for phenomena of
synchrony, imitation and variability specific to human interactions in
order to increase the credibility of the humanoid.
References
Bailly, G., O. Govokhina, F. Elisei and G. Breton (2009). "Lip-synching
using speaker-specific articulation, shape and appearance models."
EURASIP Journal on Audio, Speech, and Music Processing, special issue
on "Animating Virtual Speakers or Singers from Audio: Lip-Synching
Facial Animation", article ID 769494: 11 pages.
Bailly, G. and B. Holm (2005). "SFC: a trainable prosodic model." Speech
Communication 46(3-4): 348-364.
Baron-Cohen, S. (2008). Mind Reading: The Interactive Guide to Emotions.
London, Jessica Kingsley Publishers: 29 pages.
Bérar, M., G. Bailly, M. Chabanas, M. Desvignes, F. Elisei, M. Odisio
and Y. Payan (2006). Towards a generic talking head. Towards a better
understanding of speech production processes. J. Harrington and M.
Tabain. New York, Psychology Press: 341-362.
Gallese, V., L. Fadiga, L. Fogassi and G. Rizzolatti (1996). "Action
recognition in the premotor cortex." Brain 119: 593-609.
Gallese, V. and A. I. Goldman (1998). "Mirror neurons and the
simulation theory of mindreading." Trends in Cognitive Sciences 2(12):
493-501.
Gao, X., Y. Su, X. Li and D. Tao (2010). "A Review of Active Appearance
Models." IEEE Transactions on Systems Man and Cybernetics 40(2): 145-158.
Jurcicek, F., B. Thomson and S. Young (2011). "Natural Actor and Belief
Critic: Reinforcement algorithm for learning parameters of dialogue
systems modelled as POMDPs." ACM Transactions on Speech and Language
Processing 7(3).
Lelong, A. and G. Bailly (2011). Study of the phenomenon of phonetic
convergence thanks to speech dominoes. Analysis of Verbal and
Nonverbal Communication and Enactment: The Processing Issue. A.
Esposito, A. Vinciarelli, K. Vicsi, C. Pelachaud and A. Nijholt.
Berlin, Springer Verlag: 280-293.
Sutton, C. and A. McCallum (2006). An Introduction to Conditional Random
Fields for Relational Learning. Introduction to Statistical Relational
Learning. L. Getoor and B. Taskar. Cambridge, MA, MIT Press.
Young, S. (2010). "Cognitive user interfaces." IEEE Signal Processing
Magazine 27(3): 128-140.
Zen, H., Y. Nankaku and K. Tokuda (2011). "Continuous stochastic feature
mapping based on trajectory HMMs." IEEE Trans. on Audio, Speech, and
Language Processing 19(2): 417-430.
Zen, H., K. Tokuda and T. Kitamura (2004). An introduction of trajectory
model into HMM-based speech synthesis. ISCA Speech Synthesis Workshop.
Pittsburgh, PA: 191-196.
--
Rémi Ronfard, IMAGINE team, INRIA / LJK, Grenoble
Tel. +33 4 76 61 53 03, Cell +33 6 71 08 88 81