
3D ShapeNets: A Deep Representation for Volumetric Shapes

Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, Jianxiong Xiao
Princeton University · Chinese University of Hong Kong · Massachusetts Institute of Technology

Abstract

3D shape is a crucial but heavily underutilized cue in today's computer vision systems, mostly due to the lack of a good generic shape representation. With the recent availability of inexpensive 2.5D depth sensors (e.g. Microsoft Kinect), it is becoming increasingly important to have a powerful 3D shape representation in the loop. Apart from category recognition, recovering full 3D shapes from view-based 2.5D depth maps is also a critical part of visual understanding. To this end, we propose to represent a geometric 3D shape as a probability distribution of binary variables on a 3D voxel grid, using a Convolutional Deep Belief Network. Our model, 3D ShapeNets, learns the distribution of complex 3D shapes across different object categories and arbitrary poses from raw CAD data, and discovers hierarchical compositional part representations automatically. It naturally supports joint object recognition and shape completion from 2.5D depth maps, and it enables active object recognition through view planning. To train our 3D deep learning model, we construct ModelNet, a large-scale 3D CAD model dataset. Extensive experiments show that our 3D deep representation enables significant performance improvement over the state of the art in a variety of tasks.

1. Introduction

Since the establishment of computer vision as a field five decades ago, 3D geometric shape has been considered to be one of the most important cues in object recognition. Even though there are many theories about 3D representation (e.g. [5, 22]), the success of 3D-based methods has largely been limited to instance recognition (e.g. model-based keypoint matching to nearest neighbors [24, 31]). For object category recognition, 3D shape is not used in any state-of-the-art recognition methods (e.g. [11, 19]), mostly due to the lack of a good generic representation for 3D geometric shapes. Furthermore, the recent availability of inexpensive 2.5D depth sensors, such as the Microsoft Kinect, Intel RealSense, Google Project Tango, and Apple PrimeSense, has led to a renewed interest in 2.5D object recognition from depth maps (e.g. Sliding Shapes [30]). Because the depth from these sensors is very reliable, 3D shape can play a more important role in a recognition pipeline. As a result, it is becoming increasingly important to have a strong 3D shape representation in modern computer vision systems.

* This work was done when Zhirong Wu was a VSRC visiting student at Princeton University.

Figure 1: Usages of 3D ShapeNets. Given a depth map of an object, we convert it into a volumetric representation and identify the observed surface, free space and occluded space. 3D ShapeNets can recognize the object category, complete the full 3D shape, and predict the next best view if the initial recognition is uncertain. Finally, 3D ShapeNets can integrate new views to recognize the object jointly with all views.

Apart from category recognition, another natural and challenging task for recognition is shape completion. Given a 2.5D depth map of an object from one view, what are the possible 3D structures behind it? For example, humans do not need to see the legs of a table to know that they are there and potentially what they might look like behind the visible surface. Similarly, even though we may see a coffee mug from its side, we know that it would have empty space in the middle, and a handle on the side.

Figure 2: 3D ShapeNets. Architecture and filter visualizations from different layers. (a) Architecture of our 3D ShapeNets model; for illustration purposes, we only draw one filter for each convolutional layer. (b) Data-driven visualization: for each neuron, we average the top 100 training examples with the highest responses (>0.99) and crop the volume inside the receptive field. The averaged result is visualized by transparency in 3D (gray) and by the average surface obtained from zero-crossing (red). 3D ShapeNets is able to capture complex structures in 3D space, from low-level surfaces and corners at L1, to object parts at L2 and L3, and whole objects at L4 and above.

In this paper, we study generic shape representation for both object category recognition and shape completion. While there has been significant progress on shape synthesis [7, 17] and recovery [27], such methods are mostly limited to part-based assembly and rely heavily on expensive part annotation. Instead of hand-coding shapes by parts, we desire a data-driven way to learn the complicated shape distributions from raw 3D data across object categories and poses, and to automatically discover a hierarchical compositional part representation. As shown in Figure 1, this would allow us to infer the full 3D volume from a depth map without knowing the object category and pose a priori. Beyond the ability to jointly hallucinate missing structures and predict categories, we also desire to compute the potential information gain for recognition with regard to missing parts. This would allow an active recognition system to choose an optimal subsequent view for observation when the category recognition from the first view is not sufficiently confident.

To this end, we propose 3D ShapeNets to represent a geometric 3D shape as a probability distribution of binary variables on a 3D voxel grid. Our model uses a powerful Convolutional Deep Belief Network (Figure 2) to learn the complex joint distribution of all 3D voxels in a data-driven manner. To train this 3D deep learning model, we construct ModelNet, a large-scale object dataset of 3D computer graphics CAD models. We demonstrate the strength of our model at capturing complex object shapes by drawing samples from the model. We show that our model can recognize objects in single-view 2.5D depth images and hallucinate the missing parts of depth maps. Extensive experiments suggest that our model also generalizes well to real-world data from the NYU depth dataset [23], significantly outperforming existing approaches on single-view 2.5D object recognition. It is also effective for next-best-view prediction in view planning for active object recognition [25].

2. Related Work

There has been a large body of insightful research on analyzing 3D CAD model collections. Most of the works [12, 7, 17] use an assembly-based approach to build deformable part-based models. These methods are limited to a specific class of shapes with small variations, with surface correspondence being one of the key problems in such approaches. Since we are interested in shapes across a variety of objects with large variations, and part annotation is tedious and expensive, assembly-based modeling can be rather cumbersome. For surface reconstruction of corrupted scanning input, most related works [26, 3] are largely based on smooth interpolation or extrapolation. These approaches can only tackle small missing holes or deficiencies. Template-based methods [27] are able to deal with large space corruption but are mostly limited by the quality of available templates and often do not provide different semantic interpretations of reconstructions.

Figure 3: View-based 2.5D Object Recognition. (1) illustrates a depth map taken from a physical object in the 3D world. (2) shows the depth image captured from the back of the chair; a slice is used for visualization. (3) shows the profile of the slice and the different types of voxels: the surface voxels of the chair x_o are in red, and the occluded voxels x_u are in blue. (4) shows the recognition and shape completion result, conditioned on the observed free space and surface.

The great generative power of deep learning models has allowed researchers to build deep generative models for 2D shapes: most notably the DBN [15] to generate handwritten digits and ShapeBM [10] to generate horses, etc. These models are able to effectively capture intra-class variations. We also desire this generative ability for shape reconstruction, but we focus on more complex real-world object shapes in 3D. For 2.5D deep learning, [29] and [13] build discriminative convolutional neural nets to model images and depth maps. Although their algorithms are applied to depth maps, they use depth as an extra 2D channel and do not model in full 3D. Unlike [29], our model learns a shape distribution over a voxel grid. To the best of our knowledge, ours is the first work to build 3D deep learning models. To deal with the dimensionality of high-resolution voxels, we apply the same convolution technique as [21] (whose model is precisely a convolutional DBM where all the connections are undirected, while ours is a convolutional DBN).

Unlike static object recognition in a single image, the sensor in active object recognition [6] can move to new view points to gain more information about the object. Therefore, the Next-Best-View problem [25] of doing view planning based on the current observation arises. Most previous works in active object recognition [16, 9] build their view planning strategy using 2D color information. However, this multi-view problem is intrinsically 3D in nature. Atanasov et al. [1, 2] implement the idea on real-world robots, but they assume that there is only one object associated with each class, reducing their problem to instance-level recognition with no intra-class variance. Similar to [9], we use mutual information to decide the NBV. However, we consider this problem at the precise voxel level, allowing us to infer how voxels in a 3D region would contribute to the reduction of recognition uncertainty.

3. 3D ShapeNets

To study 3D shape representation, we propose to represent a geometric 3D shape as a probability distribution of binary variables on a 3D voxel grid. Each 3D mesh is represented as a binary tensor: 1 indicates the voxel is inside the mesh surface, and 0 indicates the voxel is outside the mesh (i.e., it is empty space). The grid size in our experiments is 30×30×30.
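As a concrete illustration of this representation, the sketch below builds such a binary occupancy tensor in numpy from a list of voxel indices that lie inside the mesh. It is a minimal illustration, not the authors' voxelization code, and the occupied-index list is hypothetical.

```python
import numpy as np

GRID = 30  # 30x30x30 voxel grid, as used in the paper

def voxels_from_indices(occupied_indices):
    """Build a binary occupancy tensor: 1 inside the mesh surface, 0 elsewhere."""
    grid = np.zeros((GRID, GRID, GRID), dtype=np.uint8)
    for i, j, k in occupied_indices:
        grid[i, j, k] = 1
    return grid

# Hypothetical example: a small solid block occupying part of the grid.
block = [(i, j, k) for i in range(10, 20) for j in range(10, 20) for k in range(12, 18)]
v = voxels_from_indices(block)
print(v.shape, int(v.sum()))  # (30, 30, 30), number of occupied voxels
```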

To represent the probability distribution of these binary variables for 3D shapes, we design a Convolutional Deep Belief Network (CDBN). Deep Belief Networks (DBNs) [15] are a powerful class of probabilistic models often used to model the joint probabilistic distribution over pixels and labels in 2D images. Here, we adapt the model from 2D pixel data to 3D voxel data, which imposes some unique challenges. A 3D voxel volume with reasonable resolution (say 30×30×30) would have the same dimensions as a high-resolution image (165×165). A fully connected DBN on such an input would result in a huge number of parameters, making the model intractable to train effectively. Therefore, we propose to use convolution to reduce model parameters by weight sharing. However, different from typical convolutional deep learning models (e.g. [21]), we do not use any form of pooling in the hidden layers: while pooling may enhance the invariance properties for recognition, in our case it would also lead to greater uncertainty for shape reconstruction.

The energy, E, of a convolutional layer in our model can be computed as:

E(\mathbf{v}, \mathbf{h}) = -\sum_{f}\sum_{j}\left( h_j^f \,(W^f * \mathbf{v})_j + c^f h_j^f \right) - \sum_{l} b_l v_l \qquad (1)

where v_l denotes each visible unit, h_j^f denotes each hidden unit in a feature channel f, and W^f denotes the convolutional filter. The "*" sign represents the convolution operation. In this energy definition, each visible unit v_l is associated with a unique bias term b_l to facilitate reconstruction, and all hidden units {h_j^f} in the same convolution channel share the same bias term c^f. Similar to [19], we also allow for a convolution stride.

Figure 4: Next-Best-View Prediction. [Row 1, Col 1]: the observed (red) and unknown (blue) voxels from a single view. [Rows 2-4, Col 1]: three possible completion samples generated by conditioning on (x_o, x_u). [Row 1, Cols 2-4]: three possible camera positions V_i (front top, left-sided, tilted bottom). [Rows 2-4, Cols 2-4]: the predicted new visibility pattern of the object given the possible shape and camera position V_i.
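For illustration, the following minimal numpy/scipy sketch evaluates the energy of Eq. (1) for the stride-1 case; the stride-2 variant used in the actual layers, and the distinction between convolution and correlation (a filter flip), are glossed over here. All shapes and values below are arbitrary, not taken from the trained model.

```python
import numpy as np
from scipy.signal import correlate

def conv_layer_energy(v, W, h, b, c):
    """Energy of one convolutional layer, Eq. (1), stride 1.
    v: (D, D, D) binary visible volume
    W: (F, k, k, k) filters, one per feature channel f
    h: (F, m, m, m) binary hidden units, m = D - k + 1
    b: (D, D, D) per-visible-unit biases
    c: (F,) shared per-channel hidden biases
    """
    E = 0.0
    for f in range(W.shape[0]):
        Wv = correlate(v, W[f], mode='valid')   # (W^f * v)_j at every hidden location j
        E -= np.sum(h[f] * Wv)                  # -sum_j h_j^f (W^f * v)_j
        E -= c[f] * np.sum(h[f])                # -c^f sum_j h_j^f
    E -= np.sum(b * v)                          # -sum_l b_l v_l
    return E

# Tiny random example (shapes only; values are arbitrary).
rng = np.random.default_rng(0)
D, k, F = 8, 3, 2
v = rng.integers(0, 2, size=(D, D, D)).astype(float)
W = rng.normal(size=(F, k, k, k))
h = rng.integers(0, 2, size=(F, D - k + 1, D - k + 1, D - k + 1)).astype(float)
b = rng.normal(size=(D, D, D))
c = rng.normal(size=F)
print(conv_layer_energy(v, W, h, b, c))
```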

A 3D shape is represented as a 24×24×24 voxel grid with 3 extra cells of padding in both directions to reduce the convolution border artifacts. The labels are presented as standard one-of-K softmax variables. The final architecture of our model is illustrated in Figure 2(a). The first layer has 48 filters of size 6 and stride 2; the second layer has 160 filters of size 5 and stride 2 (i.e., each filter has 48×5×5×5 parameters); the third layer has 512 filters of size 4; each convolution filter is connected to all the feature channels in the previous layer; the fourth layer is a standard fully connected RBM with 1200 hidden units; and the fifth and final layer with 4000 hidden units takes as input a combination of multinomial label variables and Bernoulli feature variables. The top layer forms an associative memory DBN as indicated by the bi-directional arrows, while all the other layer connections are directed top-down.
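The spatial sizes implied by these filter sizes and strides can be checked with a small helper; this is a sketch of valid-convolution arithmetic under the stated configuration (stride 1 for the third layer is inferred from the text, which does not state it explicitly).

```python
def conv_output_size(n, k, s):
    """Spatial size after a valid 3D convolution with filter size k and stride s."""
    return (n - k) // s + 1

n = 30  # 24^3 shape plus 3 voxels of padding on each side
layers = [(48, 6, 2), (160, 5, 2), (512, 4, 1)]  # (num_filters, filter_size, stride)
for filters, k, s in layers:
    n = conv_output_size(n, k, s)
    print(f"{filters} filters of size {k}, stride {s} -> {n}x{n}x{n} per channel")
# 30 -> 13 -> 5 -> 2; the 512 x 2 x 2 x 2 responses feed the 1200-unit fully
# connected RBM, whose output joins the label units in the 4000-unit top layer.
```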

We first pre-train the model in a layer-wise fashion followed by a generative fine-tuning procedure. During pre-training, the first four layers are trained using standard Contrastive Divergence [14], while the top layer is trained more carefully using Fast Persistent Contrastive Divergence (FPCD) [32]. Once a lower layer is learned, its weights are fixed and its hidden activations are fed into the next layer as input. Our fine-tuning procedure is similar to the wake-sleep algorithm [15] except that we keep the weights tied. In the wake phase, we propagate the data bottom-up and use the activations to collect the positive learning signal. In the sleep phase, we maintain a persistent chain on the topmost layer and propagate the data top-down to collect the negative learning signal. This fine-tuning procedure mimics the recognition and generation behavior of the model and works well in practice. We visualize some of the learned filters in Figure 2(b).
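For reference, a minimal Contrastive Divergence (CD-1) update for a fully connected binary RBM is sketched below; this is the standard procedure named above, not the authors' code, and it omits the convolutional weight sharing, the FPCD variant used for the top layer, and any particular learning-rate schedule.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.01, rng=np.random.default_rng(0)):
    """One CD-1 update for a binary RBM.
    v0: (batch, n_vis) data batch; W: (n_vis, n_hid); b: visible biases; c: hidden biases."""
    # Positive phase: sample hidden units given the data.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to the visible layer and up again.
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # Approximate gradient of the log-likelihood.
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c
```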

During pre-training of the first layer, we collect the learning signal only from receptive fields that are non-empty. Because of the nature of the data, empty space occupies a large proportion of the whole volume; it carries no information for the RBM and would distract the learning. Our experiments show that ignoring those learning signals during gradient computation results in our model learning more meaningful filters. In addition, for the first layer, we also add sparsity regularization to restrict the mean activation of the hidden units to a small constant (following the method of [20]). During pre-training of the topmost RBM, where the joint distribution of labels and high-level abstractions is learned, we duplicate the label units 10 times to increase their significance.
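A sketch of the two first-layer heuristics described above is given below. It assumes that "non-empty" means a receptive field containing at least one occupied voxel, and that the sparsity term takes the usual mean-activation-target form of [20]; both are interpretations of the text rather than the paper's exact implementation.

```python
import numpy as np
from scipy.signal import correlate

def nonempty_mask(v, k, stride=2):
    """1 where the k x k x k receptive field at a hidden location contains at
    least one occupied voxel, 0 where the field is entirely empty space."""
    counts = correlate(v, np.ones((k, k, k)), mode='valid')[::stride, ::stride, ::stride]
    return (counts > 0).astype(float)

def sparsity_grad(mean_activation, target=0.01):
    """Gradient of a penalty pulling the mean hidden activation toward a small target."""
    return mean_activation - target

# Usage idea: multiply each hidden unit's learning signal by nonempty_mask(v, k)
# so that all-empty receptive fields contribute nothing to the gradient, and
# subtract lr_sparsity * sparsity_grad(h_probs.mean()) from the hidden biases.
```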

4. 2.5D Recognition and Reconstruction

4.1. View-based Sampling

After training the CDBN, the model learns the joint distribution p(x, y) of voxel data x and object category label y ∈ {1, ..., K}. Although the model is trained on complete 3D shapes, it is able to recognize objects in single-view 2.5D depth maps (e.g., from RGB-D sensors). As shown in Figure 3, the 2.5D depth map is first converted into a volumetric representation where we categorize each voxel as free space, surface or occluded, depending on whether it is in front of, on, or behind the visible surface (i.e., the depth value) from the depth map. The free space and surface voxels are considered to be observed, and the occluded voxels are regarded as missing data. The test data is represented by x = (x_o, x_u), where x_o refers to the observed free space and surface voxels, while x_u refers to the unknown voxels. Recognizing the object category involves estimating p(y|x_o).
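A simplified sketch of this conversion is shown below. It assumes the depth map has already been resampled so that each (row, column) indexes a voxel column along the viewing direction; the real pipeline works with camera intrinsics and an arbitrary view direction, which are omitted here.

```python
import numpy as np

FREE, SURFACE, OCCLUDED = 0, 1, 2

def depth_to_voxels(depth, grid=30):
    """depth: (grid, grid) array of depth values in voxel units along the viewing
    axis (np.inf where nothing was measured). Returns a (grid, grid, grid) volume
    labelling each voxel as free space, observed surface, or occluded."""
    vol = np.full((grid, grid, grid), OCCLUDED, dtype=np.uint8)
    for i in range(grid):
        for j in range(grid):
            d = depth[i, j]
            if not np.isfinite(d):
                continue                  # no measurement: leave the column unknown
            d = int(np.clip(d, 0, grid - 1))
            vol[i, j, :d] = FREE          # in front of the visible surface
            vol[i, j, d] = SURFACE        # on the visible surface
    return vol

# x_o = voxels labelled FREE or SURFACE (observed); x_u = voxels labelled OCCLUDED.
```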

We approximate the posterior distribution p(y|x_o) by Gibbs sampling. The sampling procedure is as follows. We first initialize x_u to a random value and propagate the data x = (x_o, x_u) bottom-up to sample a label y from p(y|x_o, x_u). Then the high-level signal is propagated down to sample the voxels x. We clamp the observed voxels x_o in this sample x and do another bottom-up pass. 50 iterations of up-down sampling should suffice to obtain a shape completion x and its corresponding label y. The above procedure runs in parallel for a large number of particles, resulting in a variety of completion results corresponding to potentially different classes. The final category label corresponds to the most frequently sampled class.

Figure 5: ModelNet Dataset. Left: word cloud visualization of the ModelNet dataset based on the number of 3D models in each category; larger font size indicates more instances in the category. Right: examples of 3D chair models.
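The alternating Gibbs procedure described above can be sketched schematically as follows. The two conditional samplers stand in for the bottom-up and top-down passes of the trained network and are passed in as functions; they are placeholders, not part of the paper's code.

```python
import numpy as np
from collections import Counter

def recognize_and_complete(x_obs, obs_mask, sample_label, sample_voxels,
                           n_particles=20, n_iters=50, rng=np.random.default_rng(0)):
    """x_obs: observed voxel values; obs_mask: True where voxels are observed
    (free space or surface). Returns the most frequent label and one completion."""
    labels, completions = [], []
    for _ in range(n_particles):
        # Initialize the unknown voxels x_u randomly.
        x = np.where(obs_mask, x_obs, rng.integers(0, 2, size=x_obs.shape))
        for _ in range(n_iters):
            y = sample_label(x)                 # bottom-up pass: sample a label
            x = sample_voxels(y, x)             # top-down pass: sample all voxels
            x = np.where(obs_mask, x_obs, x)    # clamp the observed voxels x_o
        labels.append(y)
        completions.append(x)
    best = Counter(labels).most_common(1)[0][0]
    return best, completions[labels.index(best)]
```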

4.2. Next-Best-View Prediction

Object recognition from a single view can sometimes be challenging, both for humans and computers. However, if an observer is allowed to view the object from another view point when recognition fails from the first view point, we may be able to significantly reduce the recognition uncertainty. Given the current view, our model is able to predict which next view would be optimal for discriminating the object category.

The inputs to our next-best-view system are the observed voxels x_o of an unknown object captured by a depth camera from a single view, and a finite list of next-view candidates {V_i} representing the camera rotation and translation in 3D. The algorithm chooses the next view from the list that has the highest potential to reduce the recognition uncertainty. Note that during this view planning process, we do not observe any new data, and hence there is no improvement in the confidence of p(y|x_o = x_o).

The original recognition uncertainty, H, is given by the entropy of y conditioned on the observed x_o:

H = H\big(p(y \mid \mathbf{x}_o = x_o)\big) = -\sum_{k=1}^{K} p(y = k \mid \mathbf{x}_o = x_o)\,\log p(y = k \mid \mathbf{x}_o = x_o) \qquad (2)

where the conditional probability p(y|x_o = x_o) can be approximated as before by sampling from p(y, x_u|x_o = x_o) and marginalizing x_u.

When the camera is moved to another view V_i, some of the previously unobserved voxels x_u may become observed, depending on the actual shape. Different views V_i will result in different visibility of these unobserved voxels x_u. A view with the potential to see distinctive parts of objects (e.g. arms of chairs) may be a better next view. However, since the actual shape is partially unknown (if the 3D shape were fully observed, adding more views would not help to reduce the recognition uncertainty in any algorithm purely based on 3D shapes, including our 3D ShapeNets), we hallucinate that region from our model. As shown in Figure 4, conditioning on x_o = x_o, we can sample many shapes to generate hypotheses of the actual shape, and then render each hypothesis to obtain the depth maps observed from the different views V_i. In this way, we can simulate the new depth maps for different views on different samples and compute the potential reduction in recognition uncertainty.

Mathematically, let x_n^i = Render(x_u, x_o, V_i) \ x_o denote the newly observed voxels (both free space and surface) in the next view V_i. We have x_n^i ⊆ x_u, and they are unknown variables that will be marginalized in the following equation. The potential recognition uncertainty for V_i is then measured by the conditional entropy

H_i = H\big(p(y \mid \mathbf{x}_n^i, \mathbf{x}_o = x_o)\big) = \sum_{x_n^i} p(\mathbf{x}_n^i = x_n^i \mid \mathbf{x}_o = x_o)\, H\big(y \mid \mathbf{x}_n^i = x_n^i, \mathbf{x}_o = x_o\big). \qquad (3)

The above conditional entropy can be calculated by first sampling enough x_u from p(x_u|x_o = x_o), doing the 3D rendering to obtain the 2.5D depth map in order to get x_n^i from x_u, and then taking each x_n^i to calculate H(y|x_n^i = x_n^i, x_o = x_o) as before.

According to information theory, the reduction of entropy H − H_i = I(y; x_n^i | x_o = x_o) ≥ 0 is the mutual information between y and x_n^i conditioned on x_o. This meets our intuition that observing more data will always potentially reduce the uncertainty. With this definition, our view planning algorithm simply chooses the view that maximizes this mutual information,

V^* = \arg\max_{V_i} I\big(y; \mathbf{x}_n^i \mid \mathbf{x}_o = x_o\big). \qquad (4)
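A Monte-Carlo sketch of Eqs. (2)-(4) follows: the entropy of the current label posterior, the expected conditional entropy for each candidate view, and the arg max of the resulting mutual information. The sampling and rendering routines are placeholders for the model's completion sampler and the depth renderer, and the number of samples is arbitrary.

```python
import numpy as np

def entropy(p, eps=1e-12):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + eps))

def next_best_view(views, sample_completion, render_new_voxels, label_posterior,
                   x_obs, n_samples=50):
    """Pick V* = argmax_i I(y; x_n^i | x_o), Eqs. (2)-(4).
    sample_completion(x_obs) -> a full-shape hypothesis drawn from p(x_u | x_o)
    render_new_voxels(x, view) -> newly visible voxels x_n^i for that hypothesis
    label_posterior(x_obs, x_new=None) -> p(y | x_o, x_n^i), approximated by sampling."""
    H = entropy(label_posterior(x_obs))              # current uncertainty, Eq. (2)
    scores = []
    for view in views:
        H_i = 0.0
        for _ in range(n_samples):                   # marginalize over x_n^i, Eq. (3)
            x_hyp = sample_completion(x_obs)
            x_new = render_new_voxels(x_hyp, view)
            H_i += entropy(label_posterior(x_obs, x_new))
        H_i /= n_samples
        scores.append(H - H_i)                       # mutual information, always >= 0
    return views[int(np.argmax(scores))]             # Eq. (4)
```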

Our view planning scheme can naturally be extended to a sequence of view planning steps. After deciding the best candidate to move to for the first frame, we physically move the camera there and capture the other object surfaces from that view. The object surfaces from all previous views are merged together as our new observation x_o, allowing us to run our view planning scheme again.

Figure 6: Shape Sampling. Example shapes generated by sampling our 3D ShapeNets for some categories (chair, bed, desk, table, nightstand, sofa, bathtub, toilet).

5. ModelNet: A Large-scale 3D CAD Dataset

Training a deep 3D shape representation that captures intra-class variance requires a large collection of 3D shapes. Previous CAD datasets (e.g., [28]) are limited both in the variety of categories and in the number of examples per category. Therefore, we construct ModelNet, a large-scale 3D CAD model dataset.

To construct ModelNet, we downloaded 3D CAD models from 3D Warehouse and from the Yobi3D search engine, which indexes 261 CAD model websites. We queried common object categories from the SUN database [33] that contain no fewer than 20 object instances per category, removing those with too few search results, resulting in a total of 660 categories. We also include models from the Princeton Shape Benchmark [28]. After downloading, we remove mis-categorized models using Amazon Mechanical Turk: Turkers are shown a sequence of thumbnails of the models and answer "Yes" or "No" as to whether the category label matches the model. The authors then manually checked each 3D model and removed irrelevant objects from each CAD model (e.g., floor, thumbnail image, person standing next to the object, etc.) so that each mesh model contains only one object belonging to the labeled category. We also discarded unrealistic models (overly simplified models or those only containing images of the object) and duplicates. Compared to [28], which consists of 6,670 models in 161 categories, our new dataset is 22 times larger, containing 151,128 3D CAD models belonging to 660 unique object categories. Examples of major object categories and dataset statistics are shown in Figure 5.

Table 1: Shape Classification and Retrieval Results.

10 classes      SPH [18]  LFD [8]  Ours
classification  79.79%    79.87%   83.54%
retrieval AUC   45.97%    51.70%   69.28%
retrieval MAP   44.05%    49.82%   68.26%

40 classes      SPH [18]  LFD [8]  Ours
classification  68.23%    75.47%   77.32%
retrieval AUC   34.47%    42.04%   49.94%
retrieval MAP   33.26%    40.91%   49.23%

6. Experiments

We choose 40 common object categories from ModelNet, with 100 unique CAD models per category. We then augment the data by rotating each model every 30 degrees along the gravity direction (i.e., 12 poses per model), resulting in models in arbitrary poses. Pre-training and fine-tuning each took about two days on a desktop with one Intel Xeon E5-2690 CPU and one NVIDIA K40c GPU. Figure 6 shows some shapes sampled from our trained model.
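The 12-pose augmentation can be sketched as below, assuming the gravity direction is the last voxel axis so that the rotation is applied in the first two axes, and using nearest-neighbour interpolation to keep the grid binary; the paper does not specify how the rotated grids were resampled.

```python
import numpy as np
from scipy.ndimage import rotate

def rotations_about_gravity(voxels, step_deg=30):
    """Return the 360/step_deg rotated copies of a binary voxel grid,
    rotating about the (assumed) vertical axis."""
    poses = []
    for angle in range(0, 360, step_deg):
        r = rotate(voxels.astype(float), angle, axes=(0, 1),
                   reshape=False, order=0, mode='constant', cval=0.0)
        poses.append((r > 0.5).astype(np.uint8))
    return poses  # 12 poses when step_deg == 30
```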

6.1. 3D Shape Classification and Retrieval

Deep learning has been widely used as a feature extraction technique. Here, we are also interested in how well the features learned from 3D ShapeNets compare with other state-of-the-art 3D mesh features. We discriminatively fine-tune 3D ShapeNets by replacing the top layer with class labels and use the 5th layer as features. For comparison, we choose the Light Field descriptor [8] (LFD, 4,700 dimensions) and the Spherical Harmonic descriptor [18] (SPH, 544 dimensions), which performed best among all descriptors in [28]. We conduct 3D classification and retrieval experiments to evaluate our features. Of the 48,000 CAD models (with rotation enlargement), 38,400 are used for training and 9,600 for testing. We also report a smaller-scale result on a 10-category subset of the 40-category data. For classification, we train a linear SVM to classify meshes using each of the features mentioned above, and use average category accuracy to evaluate the performance.

For retrieval, we use L2 distance to measure the similarity of the shapes between each pair of testing samples (for our feature and SPH we use the L2 norm, and for LFD we use the distance measure from [8]). Given a query from the test set, a ranked list of the remaining test data is returned according to the similarity measure. We evaluate retrieval algorithms using two metrics: (1) mean area under the precision-recall curve (AUC) for all the testing queries (we interpolate each precision-recall curve); and (2) mean average precision (MAP), where AP is defined as the average precision each time a positive sample is returned.

Figure 7: 3D Mesh Retrieval. Precision-recall curves at standard recall levels, comparing the Spherical Harmonic descriptor, the Light Field descriptor, and our fine-tuned 5th-layer features.
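A sketch of this retrieval evaluation is given below: rank the remaining test shapes by L2 distance in feature space, compute average precision per query (precision recorded each time a relevant item is returned), and average over queries. It assumes features form an (N, d) array and labels an (N,) array; the curve interpolation used for AUC is omitted.

```python
import numpy as np

def average_precision(ranked_relevance):
    """ranked_relevance: 1/0 array over the ranked list; AP = mean precision
    at each position where a relevant (same-category) shape is returned."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    cum_hits = np.cumsum(rel)
    precision_at_hits = cum_hits[rel == 1] / (np.flatnonzero(rel) + 1)
    return precision_at_hits.mean()

def retrieval_map(features, labels):
    """Mean average precision with L2 distance; each test sample queries the rest."""
    aps = []
    for q in range(len(features)):
        d = np.linalg.norm(features - features[q], axis=1)
        order = np.argsort(d)
        order = order[order != q]                # drop the query itself
        aps.append(average_precision(labels[order] == labels[q]))
    return float(np.mean(aps))
```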

We summarize the results in Table 1 and Figure 7. Since both of the baseline mesh features (LFD and SPH) are rotation invariant, from the performance we have achieved, we believe 3D ShapeNets must have learned this invariance during feature learning. Despite using a significantly lower-resolution mesh compared to the baseline descriptors, 3D ShapeNets outperforms them by a large margin. This demonstrates that our 3D deep learning model can learn better features from 3D data automatically.

6.2. View-based 2.5D Recognition

To evaluate 3D ShapeNets for the 2.5D depth-based object recognition task, we set up an experiment on the NYU RGB-D dataset with Kinect depth maps [23]. We select 10 object categories from ModelNet that overlap with the NYU dataset. This gives us 4,899 unique CAD models for training 3D ShapeNets.

We create each testing example by cropping the 3D point cloud from the 3D bounding boxes. The segmentation mask is used to remove outlier depth in the bounding box. Then we directly apply our model trained on CAD models to the NYU dataset. This is absolutely non-trivial because the statistics of real-world depth are significantly different from those of the synthetic CAD models used for training. In Figure 9, we visualize the successful recognitions and reconstructions. Note that 3D ShapeNets is even able to partially reconstruct the "monitor" despite the bad scanning caused by the reflection problem. To further boost recognition performance, we discriminatively fine-tune our model on the NYU dataset using back propagation. By simply assigning invisible voxels as 0 (i.e., considering occluded voxels as free space and only representing the shape as the voxels on the 3D surface) and rotating training examples every 30 degrees, fine-tuning works reasonably well in practice.

Figure 8: Shape Completion. From left to right: input depth map from a single view, ground truth shape, shape completion results (4 columns), nearest neighbor result (1 column).

As a baseline approach, we use k-nearest-neighbor matching in our low-resolution voxel space. Testing depth maps are converted to the voxel representation and compared with each of the training samples. As a more sophisticated high-resolution baseline, we match the testing point cloud to each of our 3D mesh models using the Iterative Closest Point method [4] and use the top 10 matches to vote for the labels. We also compare our result with [29], which is the state-of-the-art deep learning model applied to RGB-D data. To train and test their model, 2D bounding boxes are obtained by projecting the 3D bounding box to the image plane, and object segmentations are also used to extract features. 1,390 instances are used to train the algorithm of [29] and to perform our discriminative fine-tuning, while the remaining 495 instances are used for testing all five methods. Table 2 summarizes the recognition results. Using only depth without color, our fine-tuned 3D ShapeNets outperforms all other approaches, with or without color, by a significant margin.
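The k-nearest-neighbor baseline in the low-resolution voxel space can be sketched as follows; the value of k and the use of majority voting are assumptions, since the paper only states that testing voxel grids are compared with each training sample.

```python
import numpy as np
from collections import Counter

def knn_voxel_baseline(test_voxels, train_voxels, train_labels, k=10):
    """Classify each test voxel grid by majority vote over its k nearest
    training grids under L2 distance on the flattened 30^3 volumes."""
    X_train = train_voxels.reshape(len(train_voxels), -1).astype(float)
    preds = []
    for v in test_voxels:
        d = np.linalg.norm(X_train - v.reshape(-1).astype(float), axis=1)
        nearest = np.argsort(d)[:k]
        preds.append(Counter(train_labels[i] for i in nearest).most_common(1)[0][0])
    return preds
```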

6.3. Next-Best-View Prediction

For our view planning strategy, computation of the term p(x_n^i|x_o = x_o) is critical. When the observation x_o is ambiguous, samples drawn from p(x_n^i|x_o = x_o) should vary across different categories; when the observation is rich, samples should be limited to very few categories. Since x_n^i is the surface of the completions, we can simply test the shape completion performance p(x_u|x_o = x_o). In Figure 8, our results give reasonable shapes across different categories. We also match the nearest neighbor in the training set to show that our algorithm is not just memorizing the shapes and that it can generalize well.

Figure 9: Successful Cases of Recognition and Reconstruction on the NYU dataset [23]. In each example, we show the RGB color crop, the segmented depth map, and the shape reconstruction from two view points.

Table 2: Accuracy for View-based 2.5D Recognition on the NYU dataset [23]. The first five rows are algorithms that use only depth information. The last two rows are algorithms that also use color information. Our 3D ShapeNets as a generative model performs reasonably well compared to the other methods. After discriminative fine-tuning, our method achieves the best performance by a large margin of over 10%.

                         bathtub  bed    chair  desk   dresser  monitor  nightstand  sofa   table  toilet  all
[29] Depth               0.000    0.729  0.806  0.100  0.466    0.222    0.343       0.481  0.415  0.200   0.376
NN                       0.429    0.446  0.395  0.176  0.467    0.333    0.188       0.458  0.455  0.400   0.374
ICP                      0.571    0.608  0.194  0.375  0.733    0.389    0.438       0.349  0.052  1.000   0.471
3D ShapeNets             0.142    0.500  0.685  0.100  0.366    0.500    0.719       0.277  0.377  0.700   0.437
3D ShapeNets fine-tuned  0.857    0.703  0.919  0.300  0.500    0.500    0.625       0.735  0.247  0.400   0.579
[29] RGB                 0.142    0.743  0.766  0.150  0.266    0.166    0.218       0.313  0.376  0.200   0.334
[29] RGBD                0.000    0.743  0.693  0.175  0.466    0.388    0.468       0.602  0.441  0.500   0.448

Table 3: Comparison of Different Next-Best-View Selections Based on Recognition Accuracy from Two Views. Based on an algorithm's choice, we obtain the actual depth map for the next view and recognize the objects using the two views with our 3D ShapeNets to compute the accuracies.

                  bathtub  bed   chair  desk  dresser  monitor  nightstand  sofa  table  toilet  all
Ours              0.80     1.00  0.85   0.50  0.45     0.85     0.75        0.85  0.95   1.00    0.80
Max Visibility    0.85     0.85  0.85   0.50  0.45     0.85     0.75        0.85  0.90   0.95    0.78
Furthest Away     0.65     0.85  0.75   0.55  0.25     0.85     0.65        0.50  1.00   0.85    0.69
Random Selection  0.60     0.80  0.75   0.50  0.45     0.90     0.70        0.65  0.90   0.90    0.72

To evaluate our view planning strategy, we use CAD models from the test set to create synthetic renderings of depth maps. We evaluate the accuracy by running our 3D ShapeNets model on the integrated depth maps of both the first view and the selected second view; a good view-planning strategy will result in better recognition accuracy. Note that next-best-view selection is always coupled with the recognition algorithm. We prepare three baseline methods for comparison: (1) random selection among the candidate views; (2) choosing the view with the highest new visibility (yellow voxels, NBV for reconstruction); (3) choosing the view farthest away from the previous view (based on camera center distance). In our experiment, we generate 8 view candidates randomly distributed on the sphere around the object, pointing to the region near the object center, and we randomly choose 200 test examples (20 per category) from our testing set. Table 3 reports the recognition accuracy of the different view planning strategies with the same 3D ShapeNets recognition model. We observe that our entropy-based method outperforms all other strategies.

7. Conclusion

To study 3D shape representation for objects, we propose a convolutional deep belief network to represent a geometric 3D shape as a probability distribution of binary variables on a 3D voxel grid. Our model can jointly recognize and reconstruct objects from a single-view 2.5D depth map (e.g. from popular RGB-D sensors). To train this 3D deep learning model, we construct ModelNet, a large-scale 3D CAD model dataset. Our model significantly outperforms existing approaches on a variety of recognition tasks, and it is also a promising approach for next-best-view planning. All source code and the data set are available at our project website.

Acknowledgment. This work is supported by gift funds from Intel Corporation and a Project X grant to the Princeton Vision Group, and a hardware donation from NVIDIA Corporation. Z.W. is also partially supported by a Hong Kong RGC Fellowship. We thank Thomas Funkhouser, Derek Hoiem, Alexei A. Efros, Andrew Owens, Antonio Torralba, Siddhartha Chaudhuri, and Szymon Rusinkiewicz for valuable discussion.

References

[1] N. Atanasov, B. Sankaran, J. Le Ny, T. Koletschka, G. J. Pappas, and K. Daniilidis. Hypothesis testing framework for active object detection. In ICRA, 2013.
[2] N. Atanasov, B. Sankaran, J. Le Ny, G. J. Pappas, and K. Daniilidis. Nonmyopic view planning for active object detection. arXiv preprint arXiv:1309.5401, 2013.
[3] M. Attene. A lightweight approach to repairing digitized polygon meshes. The Visual Computer, 2010.
[4] P. J. Besl and N. D. McKay. A method for registration of 3-D shapes. PAMI, 1992.
[5] I. Biederman. Recognition-by-components: a theory of human image understanding. Psychological Review, 1987.
[6] F. G. Callari and F. P. Ferrie. Active object recognition: Looking for differences. IJCV, 2001.
[7] S. Chaudhuri, E. Kalogerakis, L. Guibas, and V. Koltun. Probabilistic reasoning for assembly-based 3D modeling. In ACM Transactions on Graphics (TOG), 2011.
[8] D.-Y. Chen, X.-P. Tian, Y.-T. Shen, and M. Ouhyoung. On visual similarity based 3D model retrieval. In Computer Graphics Forum, 2003.
[9] J. Denzler and C. M. Brown. Information theoretic sensor data selection for active object recognition and state estimation. PAMI, 2002.
[10] S. M. A. Eslami, N. Heess, and J. Winn. The Shape Boltzmann Machine: a strong model of object shape. In CVPR, 2012.
[11] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. PAMI, 2010.
[12] T. Funkhouser, M. Kazhdan, P. Shilane, P. Min, W. Kiefer, A. Tal, S. Rusinkiewicz, and D. Dobkin. Modeling by example. In ACM Transactions on Graphics (TOG), 2004.
[13] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. In ECCV, 2014.
[14] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 2002.
[15] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
[16] Z. Jia, Y.-J. Chang, and T. Chen. Active view selection for object and pose recognition. In ICCV Workshops, 2009.
[17] E. Kalogerakis, S. Chaudhuri, D. Koller, and V. Koltun. A probabilistic model for component-based shape synthesis. ACM Transactions on Graphics (TOG), 2012.
[18] M. Kazhdan, T. Funkhouser, and S. Rusinkiewicz. Rotation invariant spherical harmonic representation of 3D shape descriptors. In SGP, 2003.
[19] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[20] H. Lee, C. Ekanadham, and A. Y. Ng. Sparse deep belief net model for visual area V2. In NIPS, 2007.
[21] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Unsupervised learning of hierarchical representations with convolutional deep belief networks. Communications of the ACM, 2011.
[22] J. L. Mundy. Object recognition in the geometric era: A retrospective. In Toward Category-Level Object Recognition, 2006.
[23] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[24] F. Rothganger, S. Lazebnik, C. Schmid, and J. Ponce. 3D object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints. IJCV, 2006.
[25] W. Scott, G. Roth, and J.-F. Rivest. View planning for automated 3D object reconstruction and inspection. ACM Computing Surveys, 2003.
[26] S. Shalom, A. Shamir, H. Zhang, and D. Cohen-Or. Cone carving for surface reconstruction. In ACM Transactions on Graphics (TOG), 2010.
[27] C.-H. Shen, H. Fu, K. Chen, and S.-M. Hu. Structure recovery by part assembly. ACM Transactions on Graphics (TOG), 2012.
[28] P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser. The Princeton Shape Benchmark. In Shape Modeling Applications, 2004.
[29] R. Socher, B. Huval, B. Bhat, C. D. Manning, and A. Y. Ng. Convolutional-recursive deep learning for 3D object classification. In NIPS, 2012.
[30] S. Song and J. Xiao. Sliding Shapes for 3D object detection in RGB-D images. In ECCV, 2014.
[31] J. Tang, S. Miller, A. Singh, and P. Abbeel. A textured object recognition pipeline for color and depth image data. In ICRA, 2012.
[32] T. Tieleman and G. Hinton. Using fast weights to improve persistent contrastive divergence. In ICML, 2009.
[33] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
