Guide

User Manual:

Open the PDF directly: View PDF .
Page Count: 38

Download
Open PDF In Browser	View PDF

REAL TIME SIGN LANGUAGE GESTURE
RECOGNITION FROM VIDEO SEQUENCES
A PROJECT REPORT
Submitted in the partial fulfilment of the requirements
for the award of the degree of

BACHELOR OF TECHNOLOGY
IN
COMPUTER ENGINEERING

Under the Supervision of
Mr. Sarfaraz Masood
Assistant Professor,
Dept. of Computer Engg.
Jamia Millia Islamia

Submitted by
Harish Chandra Thuwal
(13BCS0027)
Adhyan Srivastava
(13BCS0007)

DEPARTMENT OF COMPUTER ENGINEERING
FACULTY OF ENGINEERING AND TECHNOLOGY
JAMIA MILLIA ISLAMIA, NEW DELHI  110017
(Year2017)

DECLARATION

We, Harish Chandra Thuwal and Adhyan Srivastava, students of
Bachelor of Technology, Computer Engineering hereby declare that the
project entitled “Real Time Sign Language Gesture Recognition from
Video Sequences” which is submitted by us to the Department of
Computer Engineering, Faculty of Engineering and Technology, Jamia
Millia Islamia, New Delhi in partial fulfilment of requirements for the
award of the degree of Bachelor of Technology in Computer Engineering,
has not been formed the basis for the award of any degree, diploma or
other similar title or recognition.

Place: New Delhi

Harish Chandra Thuwal
(13BCS0027)

Date:
Adhyan Srivastava
(13BCS0007)

1

CERTIFICATE

This is to certify that the dissertation/project report(Course Code) entitled
“Real Time Sign Language Gesture Recognition from Video Sequences”, is an
authentic work carried out by Harish Chandra Thuwal and Adhyan
Srivastava.
The work is submitted in partial fulfilment of the requirements for the award
of the degree of Bachelor of Technology in Computer Engineering under my
guidance. The matter embodied in this project work has not been submitted
earlier for the award of any degree or diploma to the best of my knowledge
and belief.

Prof. M.N Doja,
Professor and Head,
Dept. of Computer Engg.
Jamia Millia Islamia

Sarfaraz Masood,
Assistant Professor,
Dept. of Computer Engg.
Jamia Millia Islamia

2

ACKNOWLEDGMENT

We would like to thank our mentor Mr. Sarfaraz Masood ( Assistant
Professor, Department of Computer Engineering) for giving us the opportunity
to undertake the project. We thank them for their immense guidance, and
appreciate their timely engagement.
We would like to extend special gratitude to the the assistants and lab
coordinators of the Department for providing us the infrastructural facilities
necessary to sustain the project.

3

ABSTRACT
Inability to speak is considered to be true disability. People with this disability
use different modes to communicate with others, there are number of methods
available for their communication one such common method of
communication is sign language.
Developing sign language application for deaf people can be very important,
as they’ll be able to communicate easily with even those who don’t understand
sign language. Our project aims at taking the basic step in bridging the
communication gap between normal people, deaf and dumb people using sign
language.
The main focus of this work is to create a vision based system to identify sign
language gestures from the video sequences. The reason for choosing a system
based on vision relates to the fact that it provides a simpler and more intuitive
way of communication between a human and a computer. In this report, 46
different gestures have been considered.
We used the following approach for the classification of sign language
gestures:
Video sequences contain both the temporal as well as the spatial features. So
we have used two different models to train both the temporal as well as the
spatial features. To train the model on the spatial features of the video
sequences we have used Inception model[14] which is a deep
CNN(convolutional neural net). CNN was trained on the frames obtained from
the video sequences of train data. We have used RNN(recurrent neural
network) to train the model on the temporal features. Trained CNN model was
used to make predictions for individual frames to obtain a sequence of
predictions or pool layer outputs for each video. Now this sequence of
prediction or pool layer outputs was given to RNN to train on the temporal
features.The data set[7] used consists of Argentinian Sign Language(LSA)
Gestures, with around 2300 videos belonging to 46 gestures categories. Using
the predictions by CNN as input for RNN 93.3% accuracy was obtained and
by using pool layer output as input for RNN an accuracy of 95.217% was
obtained.

4

TABLE OF CONTENTS

1 Introduction

8

1.1 Sign Language

8

______________________________________________
2 Literature Survey

10

2.1 Vision Based

10

2.1.1 Handshape recognition for Argentinian Sign Language using
ProbSom

11

2.1.2 Automatic Indian Sign Language Recognition for Continuous
Video sequence.
2.1.3 Continuous Indian Sign Language Gesture Recognition and
Sentence Formation
2.1.4 Recognition of isolated Indian Sign Language Gesture in Real
TIme

12
12
14

______________________________________________
3 Algorithms

15

3.1 Convolutional Neural Network

15

3.1.1 CNN Summarised in 6 Steps

15

3.1.1.1 Convolution

16

3.1.1.2 Subsampling

19

3.1.1.3 Pooling

19

3.1.1.4 Activation

20

3.1.1.5 Fully Connected

20

3.1.1.6 Loss (During Training)

20

3.1.2 Implementation

20

3.1.3 Inception

21

3.2 Recurrent Neural Network

22

3.2.1 Recurrent Neural Networks have Loops

23

3.2.2 How memory of previous inputs Carried Forward

24

3.2.3 Exploding and Vanishing Gradient

25

3.2.3.1 Vanishing Gradient

25
5

3.2.3.2 Exploding Gradient

26

3.2.4 Long Short Term Memory Units

26

3.2.5 Our RNN Model

27

______________________________________________
4 Experimental Design

28

4.1 Dataset Used

28

4.2 First Approach

29

4.2.1 Methodology

29

4.2.1.1 Frame Extraction and Background REmoval

30

4.2.1.2 Training CNN (Spatial Features) and Prediction

31

4.2.1.3 Training RNN (Temporal Features)

32

4.2.2 Limitation

32

4.3 Second Approach

32

______________________________________________
5 Results

33

5.1 Result of Approach 1

33

5.2 Result of Approach 2

33

______________________________________________
6 Conclusion and Future Work

35

______________________________________________
7 References

36

______________________________________________

6

LIST OF FIGURES
Fig 1
Fig 2
Fig 3
Fig 4
Fig 5
Fig 6
Fig 7
Fig 8
Fig 9
Fig 10
Fig 11
Fig 12
Fig 13
Fig 14
Fig 15
Fig 16
Fig 17
Fig 18
Fig 19
Fig 20
Fig 21
Fig 22
Fig 23
Fig 24
Fig 25
Fig 26
Fig 27
Fig 28
Fig 29
Fig 30

American Sign Language
Block Diagram Vision Based Recognition System
LSA Hand Gesture Recognition System block diagram
System Overview
General Diagram of the Work
Gesture of Sentence It is Closed Today
Methodology for real time ISL classification
Convolutional Neural Network
Convolving Wally with a circle filter
Dot product of filter with single chunk of input image
Dot product or Convolve over all all possible 5x5
Input Image Convolving with layer of 6 filters
Input Image with two Conv layer having 6 & 10 filters
Subsampling Wally by 10 times
Pooling to reduce size
Max Pooling
Inception v3 model
A chunk of RNN
An Unrolled recurrent neural network
Memory of previous inputs being carried forward
Vanishing Sigmoid
Our RNN model
Approach 1 Methodology
One of the Extracted Frames
Frame after extracting hands
Train CNN and prediction
Train CNN and prediction illustration
Train RNN illustration
Results Approach 1
Results Approach 2

7

09
10
11
12
13
13
14
15
16
17
17
18
18
19
19
20
22
23
23
24
25
27
29
30
30
31
31
32
33
34

1. INTRODUCTION

________________________________________________

Motion of any body part like face, hand is a form of gesture. Here for gesture
recognition we are using image processing and computer vision. Gesture
recognition enables computer to understand human actions and also acts as an
interpreter between computer and human. This could provide potential to
human to interact naturally with the computers without any physical contact of
the mechanical devices. Gestures are performed by deaf and dumb community
to perform sign language. This community used sign language for their
communication when broadcasting audio is impossible, or typing and writing
is difficult, but there is the vision possibility. At that time sign language is the
only way for exchanging information between people. Normally sign language
is used by everyone when they do not want to speak, but this is the only way of
communication for deaf and dumb community. Sign language is also serving
the same meaning as spoken language does. This is used by deaf and dumb
community all over the world but in their regional form like ISL, ASL. Sign
language can be performed by using Hand gesture either by one hand or two
hands. It is of two type Isolated sign language and continuous sign language.
Isolated sign language consists of single gesture having single word while
continuous ISL or Continuous Sign language is a sequence of gestures that
generate a meaningful sentence. In this report we performed isolated ASL
gesture recognition technique.

1.1 Sign Language
Deaf people around the world communicate using sign language as distinct
from spoken language in their every day a visual language that uses a system
of manual, facial and body movements as the means of communication. Sign
language is not an universal language, and different sign languages are used in
different countries, like the many spoken languages all over the world. Some
countries such as Belgium, the UK, the USA or India may have more than one
sign language. Hundreds of sign languages are in used around the world, for
instance, Japanese Sign Language, British Sign Language (BSL), Spanish Sign
Language, Turkish Sign Language.
Sign language is a visual language and consists of 3 major components:

8

Fingerspelling

Word level sign
vocabulary

Nonmanual features

Used to spell words Used for the majority Facial expressions and
letter by letter .
of communication.
tongue, mouth and
body position.

Fig 1: Finger Spelling American Sign Language[11]

9

2. LITERATURE SURVEY

________________________________________________

In

the recent years, there has been tremendous research on the hand sign

language gesture recognition. The technology for gesture recognition is given
below.

2.1 Visionbased
In visionbased methods computer camera is the input device for observing the
information of hands or fingers. The Vision Based methods require only a
camera, thus realizing a natural interaction between humans and computers
without the use of any extra devices. These systems tend to complement
biological vision by describing artificial vision systems that are implemented
in software and/or hardware. This poses a challenging problem as these
systems need to be background invariant, lighting insensitive, person and
camera independent to achieve real time performance. Moreover, such systems
must be optimized to meet the requirements, including accuracy and
robustness.
The vision based hand gesture recognition system is shown in fig.:

Fig 2: Block Diagram of vision based recognition system
Vision based analysis, is based on the way human beings perceive information
about their surroundings, yet it is probably the most difficult to implement in a
satisfactory way. Several different approaches have been tested so far.
1. One is to build a threedimensional model of the human hand. The model is
matched to images of the hand by one or more cameras, and parameters
corresponding to palm orientation and joint angles are estimated. These
parameters are then used to perform gesture classification.

10

2. Second one to capture the image using a camera then extract some feature
and those features are used as input in a classification algorithm for
classification.

2.1.1 Handshape recognition for Argentinian Sign Language using
ProbSom [1]
In this paper, a method for hand gesture recognition of Argentinian sign
language (LSA) is proposed. This paper offers two main contributions: first,
the creation of a database of handshapes for the Argentinian Sign Language
(LSA). Secondly, a technique for image processing, descriptor extraction and
subsequent handshape classification using a supervised adaptation of
selforganizing maps that is called ProbSom. This technique is compared to
others in the state of the art, such as Support Vector Machines (SVM), Random
Forests, and Neural Networks. The ProbSombased neural classifier, using the
proposed descriptor, achieved an accuracy rate above 90%.

Fig 3: Block Diagram of Hand Gesture Recognition System for LSA

11

2.1.2 Automatic Indian Sign Language Recognition for Continuous
Video Sequence [2]
The proposed system comprises of four major modules: Data Acquisition,
Preprocessing, Feature Extraction and Classification. Preprocessing stage
involves Skin Filtering and histogram matching after which Eigenvector
based Feature Extraction and Eigen value weighted Euclidean distance based
Classification Technique was used. 24 different alphabets were considered in
this paper where 96% recognition rate was obtained.

Fig 4: System Overview [2]

2.1.3 Continuous Indian Sign Language Gesture Recognition and
Sentence Formation [3]
Recognizing a sign language gestures from continuous gestures is a very
challenging research issue. The researchers solved this problem using gradient
based key frame extraction method. These key frames were helpful in splitting
continuous sign language gestures into sequence of signs as well as for
removing uninformative frames. After splitting of gestures each sign has been
treated as an isolated gesture. Then features of preprocessed gestures were
extracted using Orientation Histogram (OH) with Principal Component
Analysis (PCA) is applied for reducing dimension of features obtained after
OH. Experiments were performed on their own continuous ISL dataset which
was created using canon EOS camera in Robotics and Artificial Intelligence
laboratory (IIITA). Probes were tested using various types of classifiers like
12

Euclidean distance, Correlation, Manhattan distance, city block distance etc.
Comparative analysis of their proposed scheme was performed with various
types of distance classifiers. From the above analysis they found that the
results obtained from Correlation and Euclidean distance gives better accuracy
than other classifiers.

Fig 5: General Diagram of the Work [3]

Fig 6: Gesture of Sentence It is Closed Today [3]

13

2.1.4 Recognition of isolated Indian Sign Language Gesture in Real
Time [4]
This paper demonstrates the statistical techniques for recognition of ISL
gestures in real time which comprises both the hands. A video database was
created by the authors and utilized which contained several videos for large
number of signs. Direction histogram is the feature used for classification due
to its appeal for illumination and orientation invariance. Two different
approaches utilized for recognition were Euclidean distance and Knearest
neighbor metrics.

Fig 7: Methodology for real time ISL classification [4]

14

3. ALGORITHMS

________________________________________________
3.1 Convolutional Neural Network (CNN)
Neural networks, as its name suggests, is a machine learning technique which
is modeled after the brain structure. It comprises of a network of learning units
called neurons. These neurons learn how to convert input signals (e.g. picture
of a cat) into corresponding output signals (e.g. the label “cat”), forming the
basis of automated recognition.
A convolutional neural network (CNN, or ConvNet) is a type of feedforward
artificial neural network in which the connectivity pattern between its neurons
is inspired by the organization of the animal visual cortex.
CNNs have repetitive blocks of neurons that are applied across space (for
images) or time (for audio signals etc). For images, these blocks of neurons
can be interpreted as 2D convolutional kernels, repeatedly applied over each
patch of the image. For speech, they can be seen as the 1D convolutional
kernels applied across timewindows. At training time, the weights for these
repeated blocks are 'shared', i.e. the weight gradients learned over various
image patches are averaged.

3.1.1 CNN Summarized in 4 Steps
There are four main steps in CNN: convolution, subsampling, activation and
full connectedness.

Fig 8: Convolutional neural network [13]

15

3.1.1.1 Convolution
The first layers that receive an input signal are called convolution filters.
Convolution is a process where the network tries to label the input signal by
referring to what it has learned in the past. If the input signal looks like
previous cat images it has seen before, the “cat” reference signal will be mixed
into, or convolved with, the input signal. The resulting output signal is then
passed on to the next layer.

Fig 9: Convolving Wally with a circle filter. The circle filter responds
strongly to the eyes.
Convolution has the nice property of being translational invariant.
Intuitively, this means that each convolution filter represents a feature of
interest (e.g whiskers, fur), and the CNN algorithm learns which features
comprise the resulting reference (i.e. cat). The output signal strength is not
dependent on where the features are located, but simply whether the features
are present. Hence, a cat could be sitting in different positions, and the CNN
algorithm would still be able to recognize it.
For e.g suppose we convolve a 32x32x3 (32x32 image with 3 channels R,G
and B respectively) with a 5x5x3 filter. We take the 5*5*3 filter and slide it
over the complete image and along the way take the dot product between the
filter and chunks of the input image.

16

Fig 10: Dot Product of Filter with single chunk of Input Image[12]

Fig 11: Dot Product or Convolve over all possible 5x5 spatial location in
Input Image[12]
The convolution layer is the main building block of a convolutional neural
network. The convolution layer comprises of a set of independent filters (6 in
the example shown). Each filter is independently convolved with the image
and we end up with 6 feature maps of shape 28*28*1.

17

Fig 12: Input Image Convolving with a Convolutional layer of 6
independent filters[12]
The CNN may consists of several Convolutional layers each of which can
have similar or different number of independent filters. For example the
following diagram shows the effect of two Convolutional layers having 6 and
10 filters respectively.

Fig 13: Input Image Convolving with two Convolutional layers having 6
and 10 filters respectively[12]
All these filters are initialized randomly and become our parameters which
will be learned by the network subsequently.

18

3.1.1.2 Subsampling
Inputs from the convolution layer can be “smoothened” to reduce the
sensitivity of the filters to noise and variations. This smoothing process is
called subsampling, and can be achieved by taking averages or taking the
maximum over a sample of the signal. Examples of subsampling methods (for
image signals) include reducing the size of the image, or reducing the color
contrast across red, green, blue (RGB) channels.

Fig 14: Sub sampling Wally by 10 times. This creates a lower resolution
image.
3.1.1.3 Pooling
A pooling layer is another building block of a CNN.

Fig 15: Pooling to reduce size from 224x224 to 112x112 [12]
Its function is to progressively reduce the spatial size of the representation to
reduce the amount of parameters and computation in the network. Pooling
layer operates on each feature map independently.
The most common approach used in pooling is max pooling in which
maximum of a region taken as its representative. For example in the following
diagram a 2x2 region is replaced by the maximum value in it.

19

Fig 16: Max Pooling [12]
3.1.1.4 Activation
The activation layer controls how the signal flows from one layer to the next,
emulating how neurons are fired in our brain. Output signals which are
strongly associated with past references would activate more neurons,
enabling signals to be propagated more efficiently for identification.
CNN is compatible with a wide variety of complex activation functions to
model signal propagation, the most common function being the Rectified
Linear Unit (ReLU), which is favored for its faster training speed.
3.1.1.5 Fully Connected
The last layers in the network are fully connected, meaning that neurons of
preceding layers are connected to every neuron in subsequent layers. This
mimics high level reasoning where all possible pathways from the input to
output are considered.
3.1.1.6 (During Training) Loss
When training the neural network, there is additional layer called the loss
layer. This layer provides feedback to the neural network on whether it
identified inputs correctly, and if not, how far off its guesses were. This helps
to guide the neural network to reinforce the right concepts as it trains. This is
always the last layer during training.

3.1.2 Implementation
Algorithms used in training CNN are analogous to studying for exams using
flash cards. First, you draw several flashcards and check if you have mastered
the concepts on each card. For cards with concepts that you already know,
20

discard them. For those cards with concepts that you are unsure of, put them
back into the pile. Repeat this process until you are fairly certain that you
know enough concepts to do well in the exam. This method allows you to
focus on less familiar concepts by revisiting them often. Formally, these
algorithms are called gradient descent algorithms for forward pass learning.
Modern deep learning algorithm uses a variation called stochastic gradient
descent, where instead of drawing the flashcards sequentially, you draw them
at random. If similar topics are drawn in sequence, the learners might
overestimate how well they know the topic. The random approach helps to
minimize any form of bias in the learning of topics.

Learning algorithms require feedback. This is done using a validation set
where the CNN would make predictions and compare them with the true
labels or ground truth. The predictions which errors are made are then fed
backwards to the CNN to refine the weights learned, in a so called backwards
pass. Formally, this algorithm is called backpropagation of errors, and it
requires functions in the CNN to be differentiable (almost).
CNNs are too complex to implement from scratch. Today, machine learning
practitioners often utilize toolboxes developed such as Caffe, Torch,
MatConvNet and Tensor flow for their work.

3.1.3 Inception [14]
We’ve used the Inception v3 model of the Tensor Flow library. Inception is a
huge image classification model with millions of parameters that can
differentiate a large number of kinds of images. We only trained the final layer
of that network, so training will end in a reasonable amount of time.
Inceptionv3 is trained for the ImageNet Large Visual Recognition Challenge
using the data from 2012 where it reached a top5 error rate of as low as
3.46%.

21

Fig 17: Inception v3 model Architecture
We performed transfer learning on Inception model that is we downloaded the
pretrained Inception v3 model (trained on ImageNet Dataset consisting of
1000 classes) , added a new final layer corresponding to the number of
categories and then trained the final layer on the dataset.
The kinds of information that make it possible for the model to differentiate
among 1,000 classes are also useful for distinguishing other objects. By using
this pretrained network, we are using that information as input to the final
classification layer that distinguishes our dataset.

3.2 Recurrent Neural Network (RNN)
Humans don’t start their thinking from scratch every second. We don’t throw
everything away and start thinking from scratch again. Our thoughts have
persistence. Traditional neural networks can’t do this but Recurrent Neural
Networks can. There is information in the sequence itself, and recurrent nets
use it to perform tasks that feedforward networks can’t. Recurrent networks
are distinguished from feedforward networks by the fact that they have
feedback loop, ingesting their own outputs moment after moment as input.
They’re especially useful with sequential data because each neuron or unit can
use its internal memory to maintain information about the previous input.
For example in case of a network that is suppose to classify what kind of
event is happening at every point in a movie. It requires the network to use its
reasoning about previous events in the film to inform later ones. Another
example in case of language, “I had washed my house” is much more different
than “I had my house washed”. This allows the network to gain a deeper
understanding of the statement. This is important to note because reading
22

through a sentence even as a human, you’re picking up the context of each
word from the words before it.

3.2.1 Recurrent Neural Networks have Loops
A RNN has loops in them that allow information to be carried across neurons
while reading in input. In the following diagram, a chunk of Recurrent neural
network, A, looks at some input xt and outputs a value ht. The loop allows
information to be passed from one step of the network to the next. The
decision a recurrent net reached at time step t1 affects the decision it will
reach one moment later at time step t. So recurrent networks have two sources
of input, the present and the recent past, which combine to determine how they
respond to new data.

Fig 18: A chunk of Recurrent Neural Network
A recurrent neural network can be thought of as multiple copies of the same
network, each passing a message to a successor. Consider what happens if we
unroll the loop:

Fig 19: An Unrolled recurrent neural network
This chainlike nature reveals that recurrent neural networks are intimately
related to sequences and lists. They’re the natural architecture of neural
network to use for such data. The sequential information is preserved in the
recurrent network’s hidden state, which manages to span many time steps as it
cascades forward to affect the processing of each new example
23

3.2.2 How Memory of previous inputs Carried forward

The hidden state at time step t is ht. It is a function of the input at the same
time step xt, modified by a weight matrix W, added to the hidden state of the
previous time step h_t1 multiplied by its own hiddenstatetohiddenstate
matrix U called transition matrix. The weight matrices are filters that
determine how much importance to accord to both the present input and the
past hidden state. The error they generate can be used to adjust their weights
using Backpropagation through Time (BPTT). The sum of the weight input
and hidden state is squashed by the function φ – either a logistic sigmoid
function or tanh.

Fig 20: Memory of previous inputs being carried forward
Because this feedback loop occurs at every time step in the series, each hidden
state contains traces not only of the previous hidden state, but also of all those
that preceded h_t1 for as long as memory can persist.

24

3.2.3 Exploding and Vanishing Gradient Problem
In theory, RNNs are absolutely capable of handling longterm
dependencies.Sadly, in practice, RNNs don’t seem to be able to learn them as
explained in [5]. The gradient expresses the change in all weights with regard
to the change in error.Since the layers and time steps of deep neural networks
relate to each other through multiplication, gradient is susceptible to vanishing
or exploding.
3.2.3.1 Vanishing Gradient
The gradients of the network's output with respect to the parameters in the
early layers become extremely small. In other words even a large change in
the value of parameters for the early layers doesn't have a big effect on the
output. Hence the network can’t learn the parameter effectively.
This happens because the activation functions (sigmoid or tanh) squash their
input into a very small output range in a very nonlinear fashion. For example,
sigmoid maps the real number line onto a "small" range of [0, 1]. As a result,
there are large regions of the input space which are mapped to an extremely
small range. In these regions of the input space, even a large change in the
input will produce a small change in the output  hence the gradient is small.

Fig 21: Vanishing Sigmoid (Analogous to Vanishing Gradient)
25

This becomes much worse when we stack multiple layers of such
nonlinearities on top of each other. For instance, first layer will map a large
input region to a smaller output region, which will be mapped to an even
smaller region by the second layer, which will be mapped to an even smaller
region by the third layer and so on. As a result, even a large change in the
parameters of the first layer doesn't change the output much.
In the Fig _ we can see the effects of applying a sigmoid function over and
over again. The data is flattened until, for large stretches, it has no detectable
slope. This is analogous to a gradient vanishing as it passes through many
layers.
3.2.3.2 Exploding Gradient
Exploding gradients treat every weight as though it were the proverbial
butterfly whose flapping wings cause a distant hurricane. Those weights’
gradients become saturated on the high end; i.e. they are presumed to be too
powerful.
Exploding gradients can be solved relatively easily, because they can be
truncated or squashed. Vanishing gradients can become too small for
computers to work with or for networks to learn – a harder problem to solve.

3.2.4 Long ShortTerm Memory Units (LSTMs)
A variation of of recurrent net with “Long ShortTerm Memory Units”
LSTMs, was proposed by the German researchers Sepp Hochreiter and
Juergen Schmidhuber [6] as a solution to the vanishing gradient problem.
LSTMs help preserve the error that can be backpropagated through time and
layers. By maintaining a more constant error, they allow recurrent nets to
continue to learn over many time steps (over 1000), thereby opening a channel
to link causes and effects remotely.
LSTMs are explicitly designed to avoid the longterm dependency problem.
Remembering information for long periods of time is practically their default
behavior, not something they struggle to learn!

26

3.2.5 Our RNN Model
We have create a RNN model based on LSTMs. The first layer is an input
layer used to feed input to the upcoming layers. Its size is determined by the
size of the input being fed. Our Model is a wide network consisting of single
layer of 256 LSTM units. This Layer is followed by a fully connected layer
with softmax activation. In Fully Connected every neuron is connected to
every neuron of previous layer. The fully connected layer consists of as many
neurons as there are categories/classes. Finally a regression layer to apply a
regression (linear or logistic) to the provided input. We used adam[8]
(Adaptive Moment Estimation) which is a stochastic optimizer, as a gradient
descent optimizer to minimize the provided loss function
“categorical_crossentropy” (which calculate the errors).

Fig 22: Our RNN Model
We also tried a wider RNN network with 512 LSTM units and another deep
RNN network with three layers of 64 LSTM units each. We tested these on a
sample of the dataset and found that wide model with 256 LSTM units
performed the best and therefore only the wide model was used for training
and testing on complete dataset.

27

4. EXPERIMENTAL DESIGN

________________________________________________
We have used two approaches to train the model on the temporal and the
spatial features. Both approaches differ by the inputs given to RNN to train it
on the temporal features.

4.1 Data Set Used
The data set[7] used for both the approaches consists of Argentinian Sign
Language(LSA) Gestures, with around 2300 videos belonging to 46 gestures
categories. 10 nonexpert subjects executed the 5 repetitions of every gesture
thereby producing 50 videos per category or gesture.
Id

Name

Id

Name

Id

Name

Id

Name

1

Son

13

Enemy

25

Country

37

ToLand

2

Food

14

Dance

26

Red

38

Yellow

3

Trap

15

Green

27

Call

39

Give

4

Accept

16

Coin

28

Run

40

Away

5

Opaque

17

Where

29

Bitter

41

Copy

6

Water

18

Breakfast

30

Map

42

Skimmer

7

Colors

19

Catch

31

Milk

43

SweetMilk

8

Perfume

20

Name

32

Uruguay

44

Chewing gum

9

Born

21

Yogurt

33

Barbeque

45

Photo

10

Help

22

Man

34

Spagheti

46

Thanks

11

None

23

Drawer

35

Patience

12

Deaf

24

Bathe

36

Rice

28

Out of the 50 gestures per category, 75% i.e. 40 were used for training and
25% i.e. 10 were used for testing

4.2 First Approach
In this approach we extracted spatial features for individual frames using
inception model (CNN) and temporal features using RNN. Each video (a
sequence of frames) was then represented by a sequence of predictions made
by CNN for each of the individual frames. This sequence of predictions was
given as input to the RNN.

4.2.1 Methodology
● First, we will extract the frames from the multiple video sequences of
each gesture.
● After the first step, noise from the frames i.e background, body parts
other than hand are removed to extract more relevant features from the
frame.
● Frames of the train data are given to the CNN model for training on the
spatial features. We have used inception model for this purpose which
is a deep neural net.
● Store the train and test frame predictions. We’ll use the model obtained
in the above step for the prediction of frames.
● The predictions of the train data are now given to the RNN model for
training on the temporal features. We have used LSTM model for this
purpose.

Fig 23
In further subsections of this section, each step of the methodology has been
shown diagrammatically for better understanding of that step.

29

4.2.1.1 Frame Extraction and Background Removal
Each video gesture video is broken down into a sequence of frames. Frames
are then processed to remove all the noise from the image that is everything
except hands.
The final image consists of grey scale image of hands to avoid color specific
learning of the model

Fig 24: One of the Extracted Frames

Fig 25: Frame after extracting hands (Background Removal)
30

4.2.1.2 Train CNN(Spatial Features) and Prediction

Fig 26
The first row in the below illustration is the video of a gesture Elephant. The
second row shows the set of frames extracted from it. The third row shows the
sequence of predictions for each frame by CNN after training it.

Fig 27

31

4.2.1.3 Training RNN (Temporal Features)

Fig 28

4.2.2 Limitations
The length of a probabilistic prediction by CNN in the sequences of
predictions of frames is equal to the number of classes to be classified. In our
case it is equal to 46 because we have 46 classes to classify. Therefore the
length of feature vector of each frame for the RNN is dependent upon the
number of classes to be classified. Less are the number of classes, less would
be the length of feature vector for each frame.

4.3 Second Approach
In this approach we have used CNN to train the model on the spatial features
and have given the output of the pool layer, before it’s made into a prediction,
to the RNN. The pool layer gives us a 2048 dimensional vector that represents
the convoluted features of the image, but not a class prediction.
Rest of the steps of this approach are same as that of first approach. Both
approaches only differ by inputs given to RNN.

32

6. RESULTS

________________________________________________
5.1 Result of Approach 1

Fig 29
Average accuracy obtained using this approach is 93.3333%.

5.2 Result of Approach 2
Out of the 460 Gestures (10 Per category) used for testing 438 were
recognized correctly giving an average accuracy of 95.217%.
Category Wise Accuracy is tabulated and is given on the next page.

33

Fig 30: Accuracy
The second approach provided a better accuracy than the first approach
because of the fact that in the first approach the input to the RNN was a
sequence of 46 dimensional prediction while in the second approach the RNN
was being given a sequence of 2048 dimensional pool layer output. This gave
RNN more number of feature points to distinguish among different videos.

34

7. CONCLUSION AND FUTURE WORK

________________________________________________
Hand gestures are a powerful way for human communication, with lots of
potential applications in the area of human computer interaction. Visionbased
hand gesture recognition techniques have many proven advantages compared
with traditional devices. However, hand gesture recognition is a difficult
problem and the current work is only a small contribution towards achieving
the results needed in the field of sign language gesture recognition. This report
presented a visionbased system able to interpret isolated hand gestures from
the Argentinian Sign Language(LSA).
Videos are difficult to classify because they contain both the temporal as well
as the spatial features. We have used two different models to classify on the
spatial and temporal features. CNN was used to classify on the spatial features
whereas RNN was used to classify on the temporal features. We obtained an
accuracy of 95.217 %. This shows that CNN along with RNN can be
successfully used to learn spatial and temporal features and classify Sign
Language Gestures.
We have used two approaches to solve our problem and both of the approaches
only differ by the inputs given to the RNN as explained in the methodologies
above.
We wish to extend our work further in recognising continuous sign language
gestures with better accuracy. This method for individual gestures can also be
extended for sentence level sign language. Also the current process uses two
different models, training inception (CNN) followed by training RNN. For
future work one can focus on combining the two models into a single model.

35

8. REFERENCES

________________________________________________
[1]

Ronchetti, Franco, Facundo Quiroga, César Armando Estrebou, and
Laura Cristina Lanzarini. "Handshape recognition for argentinian
sign language using probsom." Journal of Computer Science &
Technology 16 (2016).

[2]

Singha, Joyeeta, and Karen Das. "Automatic Indian Sign Language
Recognition for Continuous Video Sequence." ADBU Journal of
Engineering Technology 2, no. 1 (2015).

[3]

Tripathi, Kumud, and Neha Baranwal GC Nandi. "Continuous
Indian Sign Language Gesture Recognition and Sentence
Formation." Procedia Computer Science 54 (2015): 523531.

[4]

Nandy, Anup, Jay Shankar Prasad, Soumik Mondal, Pavan
Chakraborty, and Gora Chand Nandi. "Recognition of isolated
indian sign language gesture in real time." Information Processing
and Management (2010): 102107.

[5]

Bengio, Yoshua, Patrice Simard, and Paolo Frasconi. "Learning
longterm dependencies with gradient descent is difficult." IEEE
transactions on neural networks 5, no. 2 (1994): 157166.

[6]

Hochreiter, Sepp, and Jürgen Schmidhuber. "Long shortterm
memory." Neural computation 9, no. 8 (1997): 17351780.

[7]

Ronchetti, Franco, Facundo Quiroga, César Armando Estrebou,
Laura Cristina Lanzarini, and Alejandro Rosete. "LSA64: An
Argentinian Sign Language Dataset." In XXII Congreso Argentino
de Ciencias de la Computación (CACIC 2016). 2016.

[8]

Kingma, Diederik, and Jimmy Ba. "Adam: A method for stochastic
optimization." arXiv preprint arXiv:1412.6980 (2014).

[9]

Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams.
"Learning representations by backpropagating errors." Cognitive
modeling 5, no. 3 (1988): 1

36

[10] Hahnloser, Richard HR, Rahul Sarpeshkar, Misha A. Mahowald,
Rodney J. Douglas, and H. Sebastian Seung. "Digital selection and
analogue amplification coexist in a cortexinspired silicon circuit."
Nature 405, no. 6789 (2000): 947951.12 Bottou, Léon. "Largescale
machine learning with stochastic gradient descent." In Proceedings
of COMPSTAT'2010, pp. 177186. PhysicaVerlag HD, 2010.

[11] Copyright © William Vicars, Sign Language resources at
LifePrint.com, http://lifeprint.com/asl101/topics/wallpaper1.htm

[12] https://medium.com/technologymadeeasy/thebestexplanationofco
nvolutionalneuralnetworksontheinternetfbb8b1ad5df8

[13] https://www.quora.com/WhatisanintuitiveexplanationofConvol
utionalNeuralNetworks

[14] Abadi, Martín, Ashish Agarwal, Paul Barham, Eugene Brevdo,
Zhifeng Chen, Craig Citro, Greg S. Corrado et al. "Tensorflow:
Largescale machine learning on heterogeneous distributed
systems." arXiv preprint arXiv:1603.04467 (2016).

[15] Cooper, Helen, Brian Holt, and Richard Bowden. "Sign language
recognition." In Visual Analysis of Humans, pp. 539562. Springer
London, 2011.

[16] Zhang, Chenyang, Xiaodong Yang, and YingLi Tian. "Histogram of
3D facets: A characteristic descriptor for hand gesture recognition."
In Automatic Face and Gesture Recognition (FG), 2013 10th IEEE
International Conference and Workshops on, pp. 18. IEEE, 2013.

[17] Cooper, Helen, EngJon Ong, Nicolas Pugeault, and Richard
Bowden. "Sign language recognition using subunits." Journal of
Machine Learning Research 13, no. Jul (2012): 22052231.

37

Source Exif Data:

File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.4
Linearized                      : No
Page Count                      : 38
Creator                         : Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36
Producer                        : Skia/PDF m54
Create Date                     : 2017:05:28 18:19:36+00:00
Modify Date                     : 2017:05:28 18:19:36+00:00

EXIF Metadata provided by EXIF.tools

Guide

Navigation menu

Versions of this User Manual:

Views

Navigation