Software Framework for Parsing and Interpreting Gestures in a Multimodal Virtual Environment Context
Overview
Human-computer interaction (HCI) is a research area whose eventual outcome will allow users of
computer systems to interact through more natural interfaces than the traditional keyboard and
mouse. Ideally, those interfaces would exploit the same communication channels used in everyday
life: speech, gestures, or any other expressive feature of the human body. In this thesis, a continuous
dynamic gesture recognition system is presented. Positioning devices such as a mouse, a data glove
or a video camera are used as input streams to the recognition module, which extracts the
most relevant features from them. Gestures are recognized continuously, which means that no prior
temporal segmentation is necessary. Training facilities are also available in order to build gesture
models from known segmented gesture sequences. In order for users to effectively reuse code and
build new modules that share common interfaces, a software framework was built that allows for
multimodal inputs and outputs as well as configuring the virtual world and how data coming
from the real world influences virtual world entities. The data flow originating from the real world
uses a common data format that is standard from the configuration file to the network packets. The
virtual world is modeled such that the actions that affect the virtual entities given input data from
the real world are configurable and extensible. Sharing the environment over the network is also
possible, allowing users from different locations to work on the same virtual world. Preliminary
tests on the gesture recognition performance are presented given several different input modalities
and setups. An experimental application is also described, showing the flexibility and extensibility
features of the software framework.
Related Work
Considerable research effort has been invested over the past few decades in improving the
interaction between humans and computers. One of the objectives of a user interface is that it should
be natural to use. Gestures have been shown to play an important role in everyday communications
between humans in order to express emotions or to augment information conveyed through
other communication channels. Some examples of common culturally specific gestures would be the
“okay” sign, the “thumbs up” sign, the large amplitude gesture people make to catch a taxi, the
salutation gesture, and many others. Also, people tend to gesticulate in order to mimic concepts
that have a spatial dimension which cannot be as easily described with speech. An example of
gestures augmenting speech can be found when a person describes her weekend: “I killed a caribou
that big”, while performing a gesture indicating the size of the caribou. The most famous use of
gestures in communication is obviously sign language.
Gesture recognition software includes several components that depend on the type of hardware
that is chosen. For vision-based gesture recognition systems, the first step in the processing pipeline is
the image analysis, which is meant to extract distinguishable features from a large data set. There
are essentially two ways of solving the problem of feature detection: model-based detection and
appearance-based detection.
The classification phase, also called the recognition process, is one of the critical parts for
a reliable gesture recognition system. It is possible to recognize many types of gestures: static,
dynamic or both at the same time. For static gestures, model matching is usually employed in
order to compare incoming data with a previously trained template. For instance, artificial neural
networks (ANN) can classify incoming data given some previously trained network models.
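As a minimal illustration of template matching for static gestures, the sketch below classifies a feature vector by finding the nearest previously trained template. The feature values and posture names are hypothetical, and a real system would use a trained ANN or a richer distance measure rather than this plain nearest-neighbour comparison.

```python
import math

def classify_static(feature_vec, templates):
    """Return the label of the closest trained template (nearest neighbour)."""
    best_label, best_dist = None, float("inf")
    for label, template in templates.items():
        dist = math.dist(feature_vec, template)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical finger-flexion features for two hand postures
templates = {"okay": [0.1, 0.9, 0.9], "thumbs_up": [0.9, 0.1, 0.1]}
print(classify_static([0.2, 0.8, 0.85], templates))  # -> okay
```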
For dynamic gesture recognition, several statistical methods can be used. Dynamic time warping
(DTW) is employed to align the incoming data stream with a template. Time-delay neural
networks (TDNN), a dynamic extension of ANNs, can classify incoming data given a large amount
of training data. Variants of TDNN have also been developed. CONDENSATION-based
gesture recognition can be used in order to match a dynamic CONDENSATION model with the
incoming data. One of the most popular statistical classifiers used for gesture recognition is
hidden Markov models (HMMs). These have been successfully applied to speech recognition and
can be adapted for gesture recognition.
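To make the DTW idea concrete, the following sketch computes the classic dynamic-time-warping cost between two one-dimensional feature sequences; a gesture would be recognized by picking the template with the lowest cost. This is a textbook formulation, not the thesis's actual recognizer, which operates continuously without prior segmentation.

```python
def dtw_distance(seq_a, seq_b):
    """Minimal dynamic-time-warping cost between two 1-D sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j]: cheapest alignment of seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_b
                                 cost[i][j - 1],      # stretch seq_a
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]

# A time-stretched copy of a template aligns at zero cost
print(dtw_distance([0, 1, 2, 3], [0, 1, 1, 2, 3]))  # -> 0.0
```

The warping lets a slowly performed gesture match a fast template, which is exactly why DTW is attractive when users gesture at varying speeds.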
An important aspect of virtual environments is how the software is built and how it is possible
to build upon existing software components in order to accommodate future applications. Software
frameworks provide a set of abstract interfaces from which a user inherits in order to include
application-specific code. A framework is therefore a large piece of code that is extensible for particular
applications, while providing building blocks for theoretically any supported type of application.
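The inheritance pattern described above can be sketched as follows. The class names (`InputModality`, `MouseModality`) and the returned data shape are illustrative assumptions, written in Python for brevity; the idea is simply that the framework defines the abstract interface and the application supplies the device-specific subclass.

```python
from abc import ABC, abstractmethod

class InputModality(ABC):
    """Abstract interface the framework exposes for any input device."""

    @abstractmethod
    def poll(self):
        """Return the latest sample in the framework's common data format."""

class MouseModality(InputModality):
    """Application-specific subclass plugging a mouse into the framework."""

    def __init__(self):
        self.position = (0, 0)

    def poll(self):
        # A real implementation would read the OS event queue here.
        return {"device": "mouse", "features": list(self.position)}

sample = MouseModality().poll()
print(sample["device"])  # -> mouse
```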
Current Results
I am writing papers for conference submission on the work I have done for my thesis.
I ported my software framework to Linux for later use in the SRE lab, and implemented OpenSceneGraph world entities and output modality in order to integrate my work with the SoundScape project.
I submitted my thesis on June 21st, 2005. The verdict was a pass, so I am now making the corrections in order to do the final submission very soon.
The status of my current work is a software framework that allows for vision-, glove- and mouse-based gesture interaction with a very basic virtual environment (video demonstration).
I spent the fall 2004 semester at the CVSL where I pursued some work on glove-based gesture recognition in the context of virtual environments and collaborative work. I built a generic and configurable framework that allows multiple input modalities to affect world entities by generating tokens that are associated with a set of actions applicable on entities when some conditions are met.
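The token/condition/action dispatch described above can be sketched roughly as below. Every name here (`Entity`, the rule table, the `resize_gesture` token) is a hypothetical illustration of the mechanism, not the framework's actual API: an input modality emits a token, and any rule whose token type matches and whose condition holds applies its action to the entity.

```python
class Entity:
    """A minimal virtual-world entity with one mutable property."""
    def __init__(self, name):
        self.name = name
        self.scale = 1.0

def grow(entity, token):
    entity.scale *= 1.0 + token["magnitude"]

# Each rule: (token type, condition on the token, action on the entity)
rules = [("resize_gesture", lambda t: t["magnitude"] > 0.1, grow)]

def dispatch(token, entity):
    for token_type, condition, action in rules:
        if token["type"] == token_type and condition(token):
            action(entity, token)

caribou = Entity("caribou")
dispatch({"type": "resize_gesture", "magnitude": 0.5}, caribou)
print(caribou.scale)  # -> 1.5
```

Because the rule table is data rather than code, new gestures or new entity behaviours can be added through configuration, which is the extensibility property the framework aims for.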
I recently travelled to Europe to present our work on "The Modellers' Apprentice": at IHM'04, where I presented a poster, and at HCI 2004, where I presented a short paper.
Here is what I presented on February 23, 2004 on recognition techniques used in gesture recognition: Literature review of gesture recognition techniques
Here is the current status of my literature review mind map (requires Java plugin...), and the associated BibTeX file.
Future Work
Finish porting my framework to Linux and develop more components that could be used for interesting applications.
We will also do user testing involving my gesture recognition framework, as well as continue the work on user interface metaphors done last year.
Publications
F. Rioux, F. Rudzicz and M. Wozniewski, Palettes transparentes hybrides appliquées aux environnements immersifs, Communications informelles de IHM'04, Namur, Belgique, 2004 (Poster)
F. Rioux, F. Rudzicz and M. Wozniewski, The Modellers' Apprentice -- The Toolglass Metaphor in an Immersive Environment, Proceedings of the 18th British HCI Group Annual Conference, Leeds, England, 2004
Y. Boussemart, F. Rioux, F. Rudzicz, M. Wozniewski and J. Cooperstock, A Framework for 3D Visualization and Manipulation in an
Immersive Space using an Untethered Bimanual Gestural Interface, Proceedings of the ACM Symposium on Virtual Reality Software and Technology, Hong Kong, 2004
Last update: 22 June 2005