
Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps

Jiajia Luo, Wei Wang, and Hairong Qi

The University of Tennessee, Knoxville

{jluo9, wwang34, hqi}@utk.edu

Abstract

Human action recognition based on the depth information provided by commodity depth sensors is an important yet challenging task. The noisy depth maps, different lengths of action sequences, and free styles in performing actions may cause large intra-class variations. In this paper, a new framework based on sparse coding and temporal pyramid matching (TPM) is proposed for depth-based human action recognition. In particular, a discriminative class-specific dictionary learning algorithm is proposed for sparse coding. By adding the group sparsity and geometry constraints, features can be well reconstructed by the sub-dictionary belonging to the same class, and the geometry relationships among features are also kept in the calculated coefficients. The proposed approach is evaluated on two benchmark datasets captured by depth cameras. Experimental results show that the proposed algorithm repeatedly achieves superior performance to the state-of-the-art algorithms. Moreover, the proposed dictionary learning method also outperforms classic dictionary learning approaches.

1. Introduction

Traditional human action recognition approaches focus on learning distinctive feature representations for actions from labelled videos and recognizing actions from unknown videos. However, it is a challenging task to label unknown RGB sequences due to the large intra-class variability and inter-class similarity of actions, cluttered background, possible camera movements, and illumination changes.

Recently, the introduction of cost-effective depth cameras provides a new possibility to address difficult issues in traditional human action recognition. Compared to monocular video sensors, depth cameras can provide 3D motion information so that the discrimination of actions can be enhanced and the influence of cluttered background and illumination variations can be mitigated. In particular, the work of Shotton et al. [16] provided an efficient human motion capturing technology to accurately estimate the 3D skeleton joint positions from a single depth image, which are more compact and discriminative than RGB or depth sequences.

Figure 1. Sample images obtained by different cameras for the action "drink": RGB images, depth images, and skeletons. The 3D joints are estimated by the method in [16].

As shown in Figure 1, the action "drink" from the MSR DailyActivity3D dataset [19] can be well reflected from the extracted 3D joints by comparing the joints "head" and "hand" in the two frames. However, it is not that straightforward to tell the difference between the two frames from the depth maps or color images.

Although they have strong representation power, the estimated 3D joints also bring challenges to depth-data based action recognition. For example, the estimated 3D joint positions are sometimes unstable due to the noisy depth maps. In addition, the estimated 3D joint positions are frame-based, which requires representation methods to be tolerant to variations in the speed and duration of actions.

2013 IEEE International Conference on Computer Vision

To extract robust features from estimated 3D joint positions, relative 3D joint features have been explored and achieved satisfactory performance [19, 21, 24]. To represent depth sequences with different lengths, previous research mainly focused on temporal alignment of sequences [11, 14, 21] or frequency evolution of extracted features [19] within a given period. However, the limited lengths of sequences, the noisy 3D joint positions, and the relatively small number of training samples may cause the overfitting problem and make the representation unstable.

Figure 2. Illustration of different feature quantization strategies. (a) K-means. (b) Sparse coding. (c) Sparse coding with group sparsity constraint. (d) Proposed method (sparse coding with group sparsity and geometry constraints).

In this paper, a new framework is proposed for depth-based human action recognition. Instead of modeling the temporal evolution of features, our work emphasizes the distributions of representative features within a given time period. To realize this representation, a new dictionary learning (DL) method is proposed, and temporal pyramid matching (TPM) is used for keeping the temporal information. The proposed DL method aims to learn an overcomplete set of representative vectors (atoms) so that any input feature can be approximated by a linear combination of these atoms. The coefficients of the linear combination are referred to as the "sparse codes".

From the DL algorithm design perspective, the recent trend is to develop "discriminative" dictionaries to solve classification problems. For example, Zhang and Li [25] proposed a discriminative K-SVD method by incorporating the classification error into the objective function and learning the classifier together with the dictionary. Jiang et al. [6] further increased the discrimination by adding a label consistency term. Yang et al. [23] proposed to add the Fisher discrimination criterion to the dictionary learning. For these methods, the labels of the inputs should be known before training. However, this requirement cannot be satisfied in our problem: since different actions contain shared local features, assigning labels to these local features would not be proper.

In this paper, we propose a discriminative DL algorithm for depth-based action recognition. Instead of simultaneously learning one overcomplete dictionary for all classes, we learn class-specific sub-dictionaries to increase the discrimination. In addition, the l1,2-mixed norm and a geometry constraint are added to the learning process to further increase the discriminative power. Existing class-specific dictionary learning methods [7, 15] are based on the l1 norm, which may result in randomly distributed coefficients [4]. In this paper, we add the group sparsity regularizer [26], a combination of the l1- and l2-norms, to ensure features are well reconstructed by atoms from the same class. Moreover, the geometry relationships among local features are incorporated during the dictionary learning process, so that features from the same class with high similarity are forced to have similar coefficients.

The process that assigns coefficients to each feature according to a learned dictionary can be defined as "quantization," following the similar definition in the field of image classification. As shown in Figure 2, different quantization methods generate different representations. Atoms from the two classes are marked as circles (class A) and triangles (class B), respectively. We use two similar features to be quantized (both from class A) as an example to illustrate the coefficient distributions produced by the various quantization methods. In K-means, features are assigned to the nearest atoms, which is sensitive to variations of the features. In sparse coding with the l1 norm, features are assigned to the atoms with the lowest reconstruction error, but the distribution of selected atoms can be random and span different classes [4]. In sparse coding with group sparsity, features will choose atoms from the same group (class), but similar features may not choose the same atoms within the group. In our method, features from the same class are forced to choose atoms within the same group, and the selection of atoms also relates to the similarity of the features.

The main contributions of this paper are three-fold. First, a new discriminative dictionary learning algorithm is proposed to realize the quantization of depth features. Both the group sparsity and geometry constraints are incorporated to improve the discriminative power of the learned dictionary. Second, a new framework based on sparse coding and temporal pyramid matching is proposed to solve the temporal alignment problem of depth features. Third, extensive experimental results have shown that both the proposed framework and the dictionary learning algorithm are effective for the task of action recognition based on depth maps.

2. Background of Sparse Coding

Given a dataset Y = [y_1, ..., y_N], sparse coding is a process to solve the optimization problem:

\min_{D, X} \sum_{i=1}^{N} \|y_i - D x_i\|_2^2 + \lambda |x_i|_1    (1)

where the matrix D = [d_1, ..., d_K] is the dictionary with K atoms and the elements of the matrix X = [x_1, ..., x_N] are coefficients. Different from K-means clustering, which assigns every data point to its nearest cluster center, sparse coding uses a linear combination of atoms in the dictionary D to reconstruct the data, and only a sparse number of atoms have nonzero coefficients.
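The coefficient step of Eq. 1 (solving for x with D fixed) is a standard lasso problem. As a minimal illustration only, not the solver used in the paper (the authors later use the feature-sign search method [9]), the following sketch solves it with the iterative shrinkage-thresholding algorithm (ISTA); all function and variable names are our own.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * |.|_1 (element-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code_ista(y, D, lam, n_iter=1000):
    """Approximately solve min_x ||y - D x||_2^2 + lam * |x|_1 (Eq. 1, one sample)."""
    L = 2.0 * np.linalg.norm(D, 2) ** 2   # Lipschitz constant of the smooth term
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ x - y)    # gradient of the reconstruction term
        x = soft_threshold(x - grad / L, lam / L)
    return x
```

With a small lam, the code reconstructs y almost exactly while staying sparse; a larger lam trades reconstruction error for sparsity.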

To increase the discriminative power of the dictionary, class-specific dictionary learning methods have been proposed that learn a sub-dictionary for each class [7, 15]. For example, Eq. 1 can be rewritten as:

\min_{D, X} \sum_{i=1}^{C} \left\{ \|Y^i - D_i X^i\|_F^2 + \lambda \sum_{j=1}^{N_i} |x_j^i|_1 \right\}    (2)

where Y^i = [y_1^i, ..., y_{N_i}^i] and X^i = [x_1^i, ..., x_{N_i}^i] are the dataset and coefficients for class i, respectively. The matrix D_i is the learned sub-dictionary for class i.

Since the sub-dictionaries are trained independently, it is possible that correlated atoms among those sub-dictionaries are generated. In this case, the sparse representation will be sensitive to the variations among features. Even though an incoherence-promoting term \sum_{i \neq j} \|D_i^T D_j\|_F^2 can be added to the dictionary learning, correlated atoms still exist [15].

3. Proposed Method

The proposed depth-based human action recognition framework consists of three components: feature extraction from the 3D joint positions, feature representation using the discriminative DL and temporal pyramid matching, and classification. Our discussion below focuses on the construction of the discriminative dictionary, which is the main contributor to the success of the proposed framework.

3.1. Feature Extraction

Given a depth image, 20 joints of the human body can be tracked by the skeleton tracker [16]. At frame t, the position of each joint i is uniquely defined by three coordinates p_i(t) = (x_i(t), y_i(t), z_i(t)) and can be represented as a 3-element vector. The work of Wang et al. [19] showed that pairwise relative positions result in more discriminative and intuitive features. However, enumerating all the joint pairs introduces redundant and irrelevant information into the classification task [19].

In this paper, only one joint is selected as a reference joint, and its differences to all the other joints are used as features. Since the joint Hip Center has relatively small motions for most actions, it is used as the reference joint. Let the position of the Hip Center be p_1(t); the 3D joint feature at frame t is defined as:

y(t) = \{ p_i(t) - p_1(t) \mid i = 2, \ldots, 20 \}    (3)

Note that both p_1 and p_i are functions of time, and y(t) is a vector with 57 (19 × 3 = 57) elements. For any depth sequence with T frames, there will be T joint features, from y(1) to y(T).

Compared to the work of [19], which uses all 20 joints as references in turn, our experimental results will show that a single reference joint is sufficient for the proposed framework to achieve state-of-the-art accuracies on benchmark datasets.
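Under the assumption that each frame's joints arrive as a 20 × 3 array with the Hip Center in row 0 (the actual index ordering depends on the skeleton tracker), the feature of Eq. 3 is a one-liner:

```python
import numpy as np

def joint_feature(joints):
    """Relative 3D joint feature of Eq. 3.

    joints: (20, 3) array of joint positions for one frame, with the
    reference joint (Hip Center) assumed to be in row 0.
    Returns a 57-dimensional vector (19 joints x 3 coordinates)."""
    return (joints[1:] - joints[0]).reshape(-1)
```

A sequence of T frames then yields T such vectors, one per frame.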

3.2. Group Sparsity and Geometry Constrained Dictionary Learning (DL-GSGC)

The process that generates a vector representation for any depth sequence from a specific number of extracted 3D joint features is referred to as "feature representation." Although the Bag-of-Words representation based on K-means clustering can serve this purpose, it discards all the temporal information, and large vector quantization error can be introduced by assigning each 3D joint feature to its nearest "visual word." Recently, Yang et al. [22] showed that classification accuracies benefit from generalizing vector quantization to sparse coding. However, the discrimination of the representation can be compromised due to the possibly randomly distributed coefficients solved by sparse coding [4]. In this paper, a class-specific dictionary learning method based on group sparsity and geometry constraints is proposed, referred to as DL-GSGC.

Group sparsity encourages the sparse coefficients in the same group to be zero or nonzero simultaneously [2, 4, 26]. Adding the group sparsity constraint to the class-specific dictionary learning has three advantages. First, the intra-class variations among features can be compressed, since features from the same class tend to select atoms within the same group (sub-dictionary). Second, the influence of correlated atoms from different sub-dictionaries can be mitigated, since their coefficients will tend to be zero or nonzero simultaneously. Third, possible randomness in the coefficient distribution can be removed, since the coefficients have group-clustered sparse characteristics. In this paper, the Elastic net regularizer [26] is added as the group sparsity constraint since it has an automatic grouping effect. The Elastic net regularizer is a combination of the l1- and l2-norms: the l1 penalty promotes sparsity, while the l2 norm encourages the grouping effect [26].

Given a learned dictionary that consists of all the sub-dictionaries and an input feature from class i, it is ideal to use atoms from the i-th class to reconstruct it. In addition, similar features should have similar coefficients. Inspired by the work of Gao et al. [5], we propose to add a geometry constraint to the class-specific dictionary learning process. Let Y = [Y^1, ..., Y^C] be the dataset with N features for C classes, where Y^i ∈ R^{f×N_i} is the f-dimensional dataset from class i. DL-GSGC is designed to learn a discriminative dictionary D = [D_1, ..., D_C] with K atoms in total (K = \sum_{i=1}^{C} K_i), where D_i ∈ R^{f×K_i} is the class-specific sub-dictionary associated with class i. The objective function of DL-GSGC is:

\min_{D, X} \sum_{i=1}^{C} \left\{ \|Y^i - D X^i\|_F^2 + \|Y^i - D_{\in i} X^i\|_F^2 + \|D_{\notin i} X^i\|_F^2 + \lambda_1 \sum_{j=1}^{N_i} |x_j^i|_1 + \lambda_2 \|X^i\|_F^2 \right\} + \lambda_3 \sum_{i=1}^{J} \sum_{j=1}^{N} \|\alpha_i - x_j\|_2^2 w_{ij}

subject to \|d_k\|_2^2 = 1, \forall k = 1, 2, \ldots, K    (4)

where X = [X^1, ..., X^C] represents the coefficient matrix, and the coefficient vector for the j-th feature in class i is x_j^i. The matrix D_{\in i} is set to [0, ..., D_i, ..., 0] with K columns, and D_{\notin i} is calculated as D - D_{\in i}. The term \|Y^i - D X^i\|_F^2 represents the minimization of the reconstruction error using the dictionary D. The terms \|Y^i - D_{\in i} X^i\|_F^2 and \|D_{\notin i} X^i\|_F^2 are added to ensure that features from class i can be well reconstructed by atoms in the sub-dictionary D_i but not by atoms belonging to different classes. The group sparsity constraint is represented by \lambda_1 |x_j^i|_1 + \lambda_2 \|x_j^i\|_2^2, and the geometry constraint by \lambda_3 \sum_{i=1}^{J} \sum_{j=1}^{N} \|\alpha_i - x_j\|_2^2 w_{ij}. In the geometry constraint, the elements of the vector \alpha_i are the calculated coefficients for the "template" feature y_i. Here, templates are small sets of features randomly selected from all classes. In total, there are J templates used for the similarity measure. Specifically, the coefficients \alpha_i for a template y_i belonging to class m can be calculated by Eqs. 5 and 6:

\beta = \arg\min_{\beta} \|y_i - D_m \beta\|_2^2 + \lambda_1 |\beta|_1 + \lambda_2 \|\beta\|_2^2    (5)

\alpha_i = [0_{K_1}, \ldots, 0_{K_{m-1}}, \beta, 0_{K_{m+1}}, \ldots, 0_{K_C}]    (6)

In \alpha_i, only the coefficients corresponding to the atoms from the same class m are nonzero. The weight w_{ij} between the query feature y_j and the template feature y_i is defined as:

w_{ij} = \exp(-\|y_i - y_j\|_2^2 / \sigma)    (7)
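A vectorized computation of the template weights of Eq. 7 might look as follows (a sketch; σ is a bandwidth parameter for which the paper does not specify a value):

```python
import numpy as np

def geometry_weights(templates, queries, sigma=1.0):
    """w[i, j] = exp(-||y_i - y_j||_2^2 / sigma) per Eq. 7.

    templates: (J, f) template features; queries: (N, f) query features.
    Returns a (J, N) weight matrix."""
    d2 = ((templates[:, None, :] - queries[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma)
```

Identical template and query features get weight 1; weights decay smoothly toward 0 as the squared distance grows relative to σ.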

3.2.1 Optimization Step - Coefficients

The optimization problem in Eq. 4 can be solved iteratively by optimizing over D or X while fixing the other. After fixing the dictionary D, the coefficient vector x_j^i can be calculated by solving the following convex problem (details are provided in the supplementary material):

\min_{x_j^i} \|s_j^i - \tilde{D}_i x_j^i\|_2^2 + \lambda_1 |x_j^i|_1 + \lambda_3 L(x_j^i)    (8)

where

s_j^i = [y_j^i; y_j^i; 0_{f+K}]    (9)

\tilde{D}_i = [D; D_{\in i}; D_{\notin i}; \sqrt{\lambda_2} I]    (10)

L(x_j^i) = \sum_{m=1}^{A_i} \|\alpha_m - x_j^i\|_2^2 w_{mj}    (11)

where 0_{f+K} is a zero vector of length f + K and I ∈ R^{K×K} is an identity matrix. Note that w_{mj} represents the weight between the feature y_j^i and the template y_m, calculated by Eq. 7. To remove the influence of shared features among classes, we use templates belonging to the same class as the input feature for the similarity measure at this stage. According to Eqs. 6 and 7, the term L(x_j^i) encourages the calculated coefficients to be zero at atoms not from the same class as the input feature. In total, there are A_i templates used to calculate the unknown coefficients x_j^i.

Since the analytical solution of Eq. 8 can be calculated if the sign of each element in x_j^i is known, the feature-sign search method [9] can be used to obtain the coefficients. However, the augmented matrix \tilde{D}_i needs to be normalized before using the feature-sign search method. Let \hat{D}_i be the l2 column-wise normalized version of \tilde{D}_i. By simple derivations, we know that \tilde{D}_i = \sqrt{2 + \lambda_2} \hat{D}_i. Therefore, Eq. 8 can be rewritten as:

\min_{\hat{x}_j^i} \|s_j^i - \hat{D}_i \hat{x}_j^i\|_2^2 + \frac{\lambda_1}{\sqrt{2 + \lambda_2}} |\hat{x}_j^i|_1 + \frac{\lambda_3}{2 + \lambda_2} \sum_{m=1}^{A_i} \|\sqrt{2 + \lambda_2}\, \alpha_m - \hat{x}_j^i\|_2^2 w_{mj}    (12)

where \hat{x}_j^i = \sqrt{2 + \lambda_2}\, x_j^i. Therefore, the feature-sign search method can be applied to Eq. 12 to obtain \hat{x}_j^i, and the coefficients for the input feature y_j^i are \frac{1}{\sqrt{2 + \lambda_2}} \hat{x}_j^i. The detailed derivations can be found in the supplementary material.

3.2.2 Optimization Step - Dictionary

Fixing the coefficients, the atoms in the dictionary can be updated. In this paper, the sub-dictionaries are updated class by class. In other words, while updating the sub-dictionary D_i, all the other sub-dictionaries are fixed. Terms that are independent of the current sub-dictionary can then be omitted from the optimization, and the objective function when updating the sub-dictionary D_i is given as:

\min_{D_i} \|Y^i - D X^i\|_F^2 + \|Y^i - D_{\in i} X^i\|_F^2    (13)

To solve Eq. 13, the atoms in the sub-dictionary D_i are updated one by one. Let d_k^i be the k-th atom in the sub-dictionary D_i. When updating the atom d_k^i, all the other atoms in D are fixed, and the first derivative of Eq. 13 with respect to d_k^i can be represented as:

\nabla f(d_k^i) = (-4 Y^i + 2 M X^i + 4 d_k^i x_i(k)) x_i(k)^T    (14)

Figure 3. Temporal pyramid matching based on sparse coding.

where x_i(k) is the r-th row (r = \sum_{j=1}^{i-1} K_j + k) of the matrix X^i ∈ R^{K×N_i}, corresponding to the coefficients contributed by the atom d_k^i. The matrix M is of the same size as D and is equal to M_1 + M_2, where M_1 is the matrix obtained by replacing the r-th column of D with zeros, and M_2 is the matrix obtained by replacing the r-th column of D_{\in i} with zeros. The updated atom d_k^i can be calculated by setting Eq. 14 to zero, which gives:

d_k^i = (Y^i - 0.5 M X^i) x_i(k)^T / \|x_i(k)\|_2^2    (15)
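A sketch of the closed-form atom update of Eq. 15, with our own variable names; M = M_1 + M_2 is assumed to be precomputed as described above, and the unit-norm constraint of Eq. 4 is re-imposed after the update:

```python
import numpy as np

def update_atom(Y_i, M, X_i, r):
    """Closed-form update of one atom per Eq. 15.

    Y_i: (f, N_i) features of class i; M: (f, K) matrix M1 + M2;
    X_i: (K, N_i) coefficients; r: row index of this atom in X_i."""
    x_r = X_i[r]                                   # coefficients of atom r
    d = (Y_i - 0.5 * M @ X_i) @ x_r / (x_r @ x_r)  # Eq. 15
    return d / np.linalg.norm(d)                   # re-impose ||d_k||_2 = 1
```

Sweeping r over the atoms of each sub-dictionary in turn implements the class-by-class update described above.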

3.3. Representation and Classification

After constructing the discriminative dictionary D, the coefficients for a given feature y can be calculated by solving the following optimization problem:

\min_{x} \|y - D x\|_2^2 + \lambda_1 |x|_1 + \lambda_2 \|x\|_2^2 + \lambda_3 \sum_{i=1}^{J} \|\alpha_i - x\|_2^2 w_i    (16)

Similar to the derivation in Sec. 3.2.1, the feature-sign search method [9] can be used to obtain the coefficients. To keep the temporal information during the feature representation, temporal pyramid matching (TPM) based on a pooling function z = F(X) is used to yield the histogram representation for every depth sequence. In this paper, max pooling is selected, as in much of the literature [20, 22]. TPM divides the video sequence into several segments along the temporal direction. Histograms generated from the segments by max pooling are concatenated to form the representation, as shown in Figure 3. In this paper, the depth sequence is divided into 3 levels containing 1, 2, and 4 segments, respectively.
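The 3-level TPM described above (1 + 2 + 4 segments, max pooling within each) can be sketched as follows, assuming the per-frame sparse codes are stacked into a K × T matrix:

```python
import numpy as np

def tpm_max_pool(X, levels=(1, 2, 4)):
    """Temporal pyramid matching with max pooling.

    X: (K, T) matrix of sparse codes, one column per frame.
    Returns the concatenation of max-pooled histograms over
    1, 2, and 4 temporal segments: a vector of length K * 7."""
    K, T = X.shape
    hists = []
    for n_seg in levels:
        bounds = np.linspace(0, T, n_seg + 1).astype(int)
        for s in range(n_seg):
            seg = X[:, bounds[s]:bounds[s + 1]]
            pooled = np.abs(seg).max(axis=1) if seg.shape[1] > 0 else np.zeros(K)
            hists.append(pooled)
    return np.concatenate(hists)
```

With the paper's setting of 15 atoms per sub-dictionary and 20 classes (K = 300), this produces a 300 × 7 = 2100-dimensional vector per sequence for the linear SVM.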

To speed up the processes of training and testing, a linear SVM classifier [22] is used on the calculated histograms.

4. Experiments

Two benchmark datasets, the MSR Action3D dataset [10] and the MSR DailyActivity3D dataset [19], are used for evaluation. For both datasets, we compare the performance from two aspects: the effectiveness of the proposed framework (i.e., DL-GSGC + TPM) as compared to state-of-the-art approaches, and the effectiveness of the proposed dictionary learning algorithm (i.e., DL-GSGC) as compared to state-of-the-art DL methods. In addition, since the second dataset also contains RGB video sequences, we further compare the performance between using the RGB sequences and the depth map sequences. In all experiments, the proposed approaches consistently outperform the state of the art.

Figure 4. Sample frames from the MSR Action3D dataset. From top to bottom, frames are respectively from the actions Draw X, Draw Circle, and Forward Kick.

4.1. Parameter Settings

For DL-GSGC dictionary learning, there are three parameters, λ1, λ2, and λ3, corresponding to the group sparsity and geometry constraints, respectively. According to our observations, the performance is best when λ1 = 0.1~0.2, λ2 = 0.01~0.02, and λ3 = 0.1~0.2. Initial sub-dictionaries are obtained by solving \|Y^i - D_i X^i\|_F^2 + \lambda_1 \sum_{j=1}^{N_i} |x_j^i|_1 + \lambda_2 \|X^i\|_F^2 using online dictionary learning [12], and the number of atoms is set to 15 for each sub-dictionary. For the geometry constraint, 1500 features are used to build the templates. Note that all these features are collected from a subset of the training samples and cover all the classes. Compared to the total number of training features, the number of templates is relatively small.

4.2. MSR Action3D Dataset

The MSR Action3D dataset [10] contains 567 depth map sequences. There are 20 actions performed by 10 subjects. Each subject performs each action three times. The size of the depth maps is 640×480. Figure 4 shows the depth sequences of three actions, draw x, draw circle, and forward kick, performed by different subjects. For all experiments on this dataset, the 1500 templates used for the geometry constraint are collected from two training subjects.

4.2.1 Comparison with State-of-the-art Algorithms

We first evaluate the proposed algorithm (DL-GSGC + TPM) in terms of recognition rate and compare it with the state-of-the-art algorithms that have been applied on the MSR Action3D dataset. For fair comparison, all results are

Method                           Accuracy
Recurrent Neural Network [13]    42.5%
Dynamic Temporal Warping [14]    54.0%
Hidden Markov Model [11]         63.0%
Bag of 3D Points [10]            74.7%
Histogram of 3D Joints [21]      78.97%
Eigenjoints [24]                 82.3%
STOP Feature [17]                84.8%
Random Occupancy Pattern [18]    86.2%
Actionlet Ensemble [19]          88.2%
DL-GSGC + TPM                    96.7%
DL-GSGC + TPM (λ2 = 0)           95.2%
DL-GSGC + TPM (λ3 = 0)           94.2%

Table 1. Evaluation of algorithms on the cross-subject test for the MSR Action3D dataset.

obtained using the same experimental setting: 5 subjects are used for training and the remaining 5 subjects are used for testing. In other words, it is a cross-subject test. Since subjects are free to choose their own styles to perform actions, there are large variations between training and testing features.

Table 1 shows the experimental results of the various algorithms. Our proposed method achieves the highest recognition accuracy, 96.7%; the accuracy is reduced to 95.2% or 94.2% if only one constraint is kept. Note that the work of [19] required a feature selection process on 3D joint features and a multiple kernel learning process based on the SVM classifier to achieve its accuracy of 88.2%, whereas our algorithm uses the simple 3D joint feature described in Sec. 3.1, combined with the proposed feature representation and a simple linear SVM classifier. Therefore, the proposed dictionary learning method and framework are effective for the task of depth-based human action recognition.

Figure 5 shows the confusion matrix of the proposed method. Actions of high similarity get relatively low accuracies. For example, the action Draw Tick tends to be confused with Draw X.

4.2.2 Comparison with Sparse Coding Algorithms

To evaluate the performance of the proposed DL-GSGC, classic DL methods are used for comparison. These methods include K-SVD [1], sparse coding used for image classification based on spatial pyramid matching (ScSPM) [22], and dictionary learning with structured incoherence (DLSI) [15]. In addition, for all the evaluated DL methods, the feature-sign search method is used for coefficient calculation, TPM and max pooling are used to obtain the vector representation, and the linear SVM classifier is used for classification. We refer to the corresponding algorithms as K-SVD, ScTPM, and DLSI for simplicity.

Comparisons are conducted on three subsets of the MSR Action3D dataset, as described in [10].

Figure 5. Confusion matrix for the MSR Action3D dataset.

For each subset, 8 actions are included. All the subsets (AS1, AS2, and AS3) are deliberately constructed such that similar movements are included within the same group, while AS3 further contains complex actions with large and complicated body movements. On each subset, three tests are performed by choosing different training and testing samples. Since each subject performs the same action 3 times, Test 1 and Test 2 choose 1/3 and 2/3 of the samples for training, respectively. Test 3 uses the cross-subject setting, which is the same as described in Sec. 4.2.1. Compared with Test 1 and Test 2, Test 3 is more challenging since the variations between training and testing samples are larger.

Table 2 shows the results on the three subsets. Note that the overall accuracies based on all 20 actions are also provided for each test. The performance of DL-GSGC is superior to the other sparse coding algorithms in terms of accuracy on all tests. In addition, class-specific dictionary learning methods, such as DL-GSGC and DLSI, perform better than methods learning a whole dictionary simultaneously for all classes (e.g., K-SVD and ScTPM). Moreover, the proposed framework (i.e., sparse coding + TPM) is effective for action recognition, since the accuracies when using different sparse coding methods outperform the literature work in both Tables 1 and 2. In particular, our method outperforms the other algorithms in Table 1 based on 3D joint features by 15%~17% on Test 3.

4.3. MSR DailyActivity3D Dataset

The MSR DailyActivity3D dataset contains 16 daily activities captured by a Kinect device. There are 10 subjects in this dataset, and each subject performs each action twice, once in a standing position and once in a sitting position. In total, there are 320 samples with both depth maps and RGB sequences available. Figure 6 shows sample frames for the activities drink, write, and stand up, from

Method (%)    Test 1: AS1 / AS2 / AS3 / Overall    Test 2: AS1 / AS2 / AS3 / Overall    Test 3: AS1 / AS2 / AS3 / Overall
[10]          89.5 / 89.0 / 96.3 / 91.6            93.4 / 92.9 / 96.3 / 94.2            72.9 / 71.9 / 79.2 / 74.7
[21]          98.5 / 96.7 / 93.5 / 96.2            98.6 / 97.9 / 94.9 / 97.2            87.9 / 85.5 / 63.5 / 79.0
[24]          94.7 / 95.4 / 97.3 / 95.8            97.3 / 98.7 / 97.3 / 97.8            74.5 / 76.1 / 96.4 / 82.3
K-SVD         98.8 / 95.6 / 98.8 / 97.8            100 / 98.0 / 100 / 98.9              92.4 / 91.9 / 95.5 / 92.0
ScTPM         98.8 / 95.6 / 98.8 / 97.3            100 / 98.0 / 100 / 98.9              96.2 / 92.9 / 96.4 / 92.7
DLSI          97.4 / 98.1 / 99.4 / 97.6            98.8 / 97.2 / 100 / 97.9             96.6 / 93.7 / 96.4 / 93.2
DL-GSGC       100 / 98.7 / 100 / 98.9              100 / 98.7 / 100 / 98.9              97.2 / 95.5 / 99.1 / 96.7

Table 2. Performance evaluation (%) of sparse coding based algorithms on the three subsets.

Figure 6. Sample frames from the MSR DailyActivity3D dataset. From top to bottom, frames are from the actions drink, write, and stand up.

top to bottom. As shown in Figure 6, some activities in this dataset contain small body movements, such as drink and write. In addition, the same activity performed in different positions shows large variations in the estimated 3D joint positions. Therefore, this dataset is more challenging than the MSR Action3D dataset. The experiments performed on this dataset are based on the cross-subject test. In other words, 5 subjects are used for training, and the remaining 5 subjects are used for testing. The number of templates is again 1500, collected from 2 training subjects. Table 3 shows the experimental results of the various algorithms.

4.3.1 Comparison with State-of-the-art Algorithms

We first compare the performance of DL-GSGC with literature work that has been conducted on this dataset. As shown in Table 3, the proposed method outperforms the state-of-the-art work [19] by 10%, and the geometry constraint is more effective for the performance improvement. In addition, other DL methods are incorporated into our framework for comparison, referred to as K-SVD, ScTPM, and DLSI. Experimental results show that the performance of DL-GSGC is superior to the other DL methods by 4%~5%. In addition, class-specific dictionary learning methods, e.g., DL-GSGC and DLSI, are better for the classification task than K-SVD and ScTPM. Moreover, the proposed framework outperforms the state-of-the-art work [19] by 5%~10% when different DL methods are used. Considering the large intra-class variations and noisy 3D joint positions in this

Method                           Accuracy
Cuboid + HoG*                    53.13%
Harris3D + HOG/HOF*              56.25%
Dynamic Temporal Warping [14]    54%
3D Joints Fourier [19]           68%
Actionlet Ensemble [19]          85.75%
K-SVD                            90.6%
ScTPM                            90.6%
DLSI                             91.3%
DL-GSGC                          95.0%
DL-GSGC (λ2 = 0)                 93.8%
DL-GSGC (λ3 = 0)                 92.5%

Table 3. Performance evaluation of the proposed algorithm against eight algorithms. Algorithms marked with (*) are applied on RGB videos; all other algorithms are applied on depth sequences.

dataset, the proposed framework is quite robust.

4.3.2 Comparison with RGB Features

Since both depth and RGB videos are available in this dataset, we also compare the performance of RGB features with that of depth features. For the traditional human action recognition problem, spatio-temporal interest point based methods have been heavily explored. Two important steps are spatio-temporal interest point detection and local feature description. For feature representation, the Bag-of-Words representation based on K-means clustering is widely used. In this paper, we follow the same steps to perform action recognition from RGB videos. To be specific, the classic Cuboid [3] and Harris3D [8] detectors are used for feature detection, and the HOG/HOF descriptors are used for description. The Bag-of-Words representation is used for feature representation.

Table 3 provides the recognition rates obtained by using different feature detectors and descriptors on RGB video sequences. Compared with the performance of depth features, the recognition rates on RGB sequences are lower. We argue that the main reason is that this dataset contains many actions with high similarity but small body movements, e.g., Drink, Eat, Write, and Read Book. In this case, the 3D joint features containing depth information are more reliable than RGB features. In addition, the K-means clustering method causes larger quantization error than sparse coding algorithms. Therefore, depth information is important for the task of action recognition, and the sparse coding based representation is better for quantization.

5.Conclusion

This paper presented a new framework to perform hu-man action recognition on depth sequences.To better rep-resent the3D joint features,a new discriminative dictio-nary learning algorithm(DL-GSGC)that incorporated both group sparsity and geometry constraints was proposed.In addition,the temporal pyramid matching method was ap-plied on each depth sequence to keep the temporal infor-mation in the representation.Experimental results showed that the proposed framework is effective that outperformed the state-of-the-art algorithms on two benchmark datasets. Moreover,the performance of DL-GSGC is superior to clas-sic sparse coding methods.Although the DL-GSGC is pro-posed for dictionary learning in the task of depth-based ac-tion recognition,it is applicable to other classi?cation prob-lems,such as image classi?cation and face recognition. 6.Acknowledgement

This work was supported in part by the National Science Foundation under Grant NSF CNS-1017156.

References

[1] M. Aharon, M. Elad, and A. M. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Processing, 54:4311–4322, 2006.
[2] H. Bondell and B. Reich. Simultaneous regression shrinkage, variable selection and supervised clustering of predictors with OSCAR. Biometrics, 64(1):115–123, 2008.
[3] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. VS-PETS, 2005.
[4] Y. Fang, R. Wang, and B. Dai. Graph-oriented learning via automatic group sparsity for data analysis. IEEE 12th International Conference on Data Mining, 2012.
[5] S. Gao, I. Tsang, L. Chia, and P. Zhao. Local features are not lonely - Laplacian sparse coding for image classification. CVPR, 2010.
[6] Z. Jiang, Z. Lin, and L. S. Davis. Learning a discriminative dictionary for sparse coding via label consistent K-SVD. CVPR, 2011.
[7] S. Kong and D. Wang. A dictionary learning approach for classification: separating the particularity and the commonality. ECCV, 2012.
[8] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. CVPR, 2008.
[9] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. NIPS, 2007.
[10] W. Li, Z. Zhang, and Z. Liu. Action recognition based on a bag of 3D points. In Human Communicative Behavior Analysis Workshop (in conjunction with CVPR), 2010.
[11] F. Lv and R. Nevatia. Recognition and segmentation of 3-D human action using HMM and multi-class AdaBoost. ECCV, pages 359–372, 2006.
[12] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. ICML, 2009.
[13] J. Martens and I. Sutskever. Learning recurrent neural networks with Hessian-free optimization, 2011.
[14] M. Muller and T. Roder. Motion templates for automatic classification and retrieval of motion capture data. In Proceedings of the 2006 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 137–146, 2006.
[15] I. Ramirez, P. Sprechmann, and G. Sapiro. Classification and clustering via dictionary learning with structured incoherence and shared features. CVPR, 2010.
[16] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. CVPR, 2011.
[17] A. W. Vieira, E. R. Nascimento, G. Oliveira, Z. Liu, and M. Campos. STOP: space-time occupancy patterns for 3D action recognition from depth map sequences. 17th Iberoamerican Congress on Pattern Recognition.
[18] J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu. Robust 3D action recognition with random occupancy patterns. ECCV, 2012.
[19] J. Wang, Z. Liu, Y. Wu, and J. Yuan. Mining actionlet ensemble for action recognition with depth cameras. CVPR, 2012.
[20] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. CVPR, 2010.
[21] L. Xia, C. C. Chen, and J. K. Aggarwal. View invariant human action recognition using histograms of 3D joints. CVPR Workshop, 2012.
[22] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. CVPR, 2009.
[23] M. Yang, L. Zhang, X. Feng, and D. Zhang. Fisher discrimination dictionary learning for sparse representation. ICCV, 2011.
[24] X. Yang and Y. Tian. EigenJoints-based action recognition using Naive-Bayes-Nearest-Neighbor. CVPR 2012 HAU3D Workshop, 2012.
[25] Q. Zhang and B. Li. Discriminative K-SVD for dictionary learning in face recognition. CVPR, 2010.
[26] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301–320, 2005.