Software Framework for Parsing and Interpreting Gestures in a Multimodal Virtual Environment Context


Human-computer interaction (HCI) is a research topic whose eventual outcome will allow users of computer systems to interact through more natural interfaces than the traditional keyboard and mouse. Ideally, those interfaces would exploit the same communication channels used in everyday life, namely speech, gestures and other expressive features of the human body. In this thesis, a continuous dynamic gesture recognition system is presented. Positioning devices such as a mouse, a data glove or a video camera serve as input streams to the recognition module, from which the most relevant features are extracted. Gestures are recognized continuously, which means that no prior temporal segmentation is necessary. Training facilities are also available in order to build gesture models from known, segmented gesture sequences. So that users can effectively reuse code and build new modules that share common interfaces, a software framework was built that allows for multimodal inputs and outputs, as well as configuration of the virtual world and of how data coming from the real world influences virtual world entities. The data flow originating from the real world uses a common data format that remains consistent from the configuration file down to the network packets. The virtual world is modeled such that the actions that affect virtual entities in response to real-world input data are configurable and extensible. Sharing the environment over the network is also possible, allowing users in different locations to work on the same virtual world. Preliminary gesture recognition performance tests are presented for several input modalities and setups. An experimental application is also described, demonstrating the flexibility and extensibility of the software framework.

Related Work

Considerable research has been invested over the past few decades in order to improve the interaction between humans and computers. One of the objectives of a user interface is that it should be natural to use. Gestures have been shown to play an important role in everyday communications between humans in order to express emotions or to augment information conveyed through other communication channels. Some examples of common culturally specific gestures would be the “okay” sign, the “thumbs up” sign, the large amplitude gesture people make to catch a taxi, the salutation gesture, and many others. Also, people tend to gesticulate in order to mimic concepts that have a spatial dimension which cannot be as easily described with speech. An example of gestures augmenting speech can be found when a person describes her weekend: “I killed a caribou that big”, while performing a gesture indicating the size of the caribou. The most famous use of gestures in communication is obviously sign language.

Gesture recognition software includes several components that depend on the type of hardware that is chosen. For vision-based gesture recognition systems, the first step in the processing pipeline is the image analysis, which is meant to extract distinguishable features from a large data set. There are essentially two ways of solving the problem of feature detection: model-based detection and appearance-based detection.

The classification phase, also called the recognition process, is one of the critical components of a reliable gesture recognition system. It is possible to recognize many types of gestures: static, dynamic or both at the same time. For static gestures, model matching is usually employed in order to compare incoming data with a previously trained template. For instance, artificial neural networks (ANNs) can classify incoming data given some previously trained network models. For dynamic gesture recognition, several statistical methods can be used. Dynamic time warping (DTW) is employed to align the incoming data stream with a template. Time-delay neural networks (TDNNs), the dynamic counterpart of ANNs, can classify incoming data given a large amount of training data, and variants of TDNNs have also been developed. CONDENSATION-based gesture recognition can be used in order to match a dynamic CONDENSATION model with the incoming data. One of the most popular statistical classifiers used for gesture recognition is the hidden Markov model (HMM). HMMs have been successfully applied to speech recognition and can be adapted for gesture recognition.
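To make the DTW idea above concrete, here is a minimal sketch of the classic dynamic-programming alignment between an incoming feature sequence and a stored template. The 1-D feature values and the `dtw_distance` name are illustrative assumptions, not part of the actual system; a real recognizer would use multi-dimensional feature vectors and compare the cost against every trained template.

```python
# Minimal dynamic time warping (DTW) sketch: aligns an incoming gesture
# sample with a stored template and returns the cumulative alignment cost.
# Lower cost means a better match, regardless of speed differences.

def dtw_distance(sample, template):
    n, m = len(sample), len(template)
    INF = float("inf")
    # cost[i][j] = best cumulative cost aligning sample[:i] with template[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(sample[i - 1] - template[j - 1])  # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # sample step repeated
                                 cost[i][j - 1],      # template step repeated
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# A slower, stretched performance of the same gesture still aligns cheaply.
template = [0.0, 1.0, 2.0, 1.0, 0.0]
sample = [0.0, 0.0, 1.0, 2.0, 2.0, 1.0, 0.0]
print(dtw_distance(sample, template))  # 0.0: a perfect warped match
```

Because the warping path may repeat steps on either axis, the stretched sample aligns with the template at zero cost even though the sequences have different lengths.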

An important aspect of virtual environments is how the software is built and how it is possible to build upon existing software components in order to accommodate future applications. Software frameworks provide a set of abstract interfaces from which a user inherits in order to include application-specific code. A framework is therefore a large piece of code that is extensible for particular applications, while providing building blocks for theoretically any supported type of application.
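The inheritance mechanism described above can be sketched as follows. This is a hypothetical illustration of the framework idea, not the actual framework API: the class and method names (`InputModality`, `read_features`, `MouseModality`) are assumptions for the example.

```python
# Sketch: applications extend abstract interfaces supplied by the
# framework rather than modifying the framework itself.
from abc import ABC, abstractmethod

class InputModality(ABC):
    """Abstract input stream (mouse, data glove, camera, ...)."""
    @abstractmethod
    def read_features(self):
        """Return the next feature vector from the device."""

class MouseModality(InputModality):
    """Application-specific code plugged into the framework."""
    def __init__(self, positions):
        self._positions = iter(positions)

    def read_features(self):
        x, y = next(self._positions)
        return [x, y]

# Framework code only ever sees the abstract interface, so any new
# modality that implements it can be used without framework changes.
def poll(modality):
    return modality.read_features()

mouse = MouseModality([(10.0, 20.0)])
print(poll(mouse))  # [10.0, 20.0]
```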

Current Results

I am writing papers for conference submission based on the work I have done for my thesis.

I ported my software framework to Linux for later use in the SRE lab, and implemented OpenSceneGraph world entities and an output modality in order to integrate my work with the SoundScape project.

I submitted my thesis on June 21st, 2005. The verdict was a pass, so I am now making the corrections for the final submission, which should follow very soon.

The status of my current work is a software framework that allows for vision, glove and mouse-based gesture interaction with a very basic virtual environment (video demonstration).

I spent the fall 2004 semester at the CVSL, where I pursued work on glove-based gesture recognition in the context of virtual environments and collaborative work. I built a generic, configurable framework that allows multiple input modalities to affect world entities: modalities generate tokens, and each token is associated with a set of actions that are applied to entities when certain conditions are met.
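The token/condition/action mechanism described above can be sketched roughly as follows. All names here (`Entity`, `move_action`, the token dictionary layout) are illustrative assumptions, not the actual framework code.

```python
# Sketch of the token dispatch idea: input modalities emit tokens, and a
# bound action fires on a world entity when the token's condition holds.

class Entity:
    """A minimal virtual-world entity with a 3-D position."""
    def __init__(self, name):
        self.name = name
        self.position = [0.0, 0.0, 0.0]

def move_action(entity, token):
    # Apply the token's payload as a translation of the entity.
    entity.position = [p + d for p, d in zip(entity.position, token["delta"])]

# Bindings: (token type, condition on the token, action to run).
# In the real framework these would come from the configuration file.
bindings = [
    ("move", lambda tok: tok.get("selected", False), move_action),
]

def dispatch(entity, token):
    for token_type, condition, action in bindings:
        if token["type"] == token_type and condition(token):
            action(entity, token)

cube = Entity("cube")
dispatch(cube, {"type": "move", "selected": True, "delta": [1.0, 0.0, 0.0]})
print(cube.position)  # [1.0, 0.0, 0.0]
```

Keeping the bindings as data rather than code is what makes the mapping from real-world input to virtual-world behaviour configurable and extensible.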

I recently travelled to Europe to present our work on "The Modellers' Apprentice": at IHM'04, where I presented a poster, and at HCI 2004, where I presented a short paper.

Here is what I presented on February 23, 2004 on recognition techniques used in gesture recognition: Literature review of gesture recognition techniques.

Here is the current status of my literature review mind map (requires a Java plugin), along with the associated BibTeX file.

Future Work

Finish the port of my framework to Linux and develop more components that could be used in interesting applications.

We will also conduct user testing involving my gesture recognition framework, as well as the user interface metaphor work done last year.


  • F. Rioux, F. Rudzicz and M. Wozniewski, Palettes transparentes hybrides appliquées aux environnements immersifs, Communications informelles de IHM'04, Namur, Belgique, 2004 (Poster)
  • F. Rioux, F. Rudzicz and M. Wozniewski, The Modellers' Apprentice -- The Toolglass Metaphor in an Immersive Environment, Proceedings of the 18th British HCI Group Annual Conference, Leeds, England, 2004
  • Y. Boussemart, F. Rioux, F. Rudzicz, M. Wozniewski and J. Cooperstock, A Framework for 3D Visualization and Manipulation in an Immersive Space using an Untethered Bimanual Gestural Interface, Proceedings of the ACM Symposium on Virtual Reality Software and Technology, Hong Kong, 2004

    Last update: 22 June 2005