From Bag-of-Visual-Words to Bag-of-Visual-Phrases using n-Grams

Glauco V. Pedrosa and Agma J. M. Traina

Instituto de Ciências Matemáticas e de Computação - ICMC

Universidade de São Paulo - USP


Abstract—The Bag-of-Visual-Words has emerged as an effective modeling approach to represent local image features. It describes local image features by assigning each of them a visual word according to a visual dictionary. The image representation is given by the frequency of each visual word in the image, similar to the representation used for textual documents. In this paper, we present a novel approach that builds a high-level description using groups of words (phrases) to represent an image. We introduce the use of n-grams for image representation, based on the idea of "Bag-of-Visual-Phrases". In the field of computational linguistics, an n-gram is a phrase formed by a sequence of n consecutive words. By analogy, we represent an image by a combination of n consecutive visual words. We conducted experiments on three public benchmark databases of textures and nature scenes, and on two medical databases to demonstrate an area that can benefit from the proposed technique. Our proposed Bag-of-Visual-Phrases approach improved retrieval precision by up to 44% and the classification rate by up to 33% compared to the traditional Bag-of-Visual-Words, making it a valuable asset for content-based image retrieval and image classification.

Keywords-Image description; CBIR; Bag-of-features; SIFT


Image representation plays an essential role in image categorization and retrieval applications. Research in this area has advanced toward methods that capture the semantics of the image by extracting features that are perceptually efficient and compact [1]. This should be taken into consideration to bring a CBIR (Content-Based Image Retrieval) system closer to the users' expectations.

One of the state-of-the-art techniques for image representation is the Bag-of-Visual-Words [2], [3], [4], [5], [6], [7], [8], also known as Bag-of-Features or Bag-of-Keypoints. This approach describes an image using a dictionary composed of different visual words. Visual words are local image patterns that concentrate relevant semantic information about the image. As illustrated in Figure 1, after extracting the local image features, each feature is assigned to its nearest visual word according to a visual dictionary. The traditional image representation employed by the Bag-of-Visual-Words approach is the number of occurrences of each visual word in the image, by analogy with the Bag-of-Words representation used for textual information retrieval.

A more powerful description can be obtained by grouping words [9], [10], [11], [12], since this approach can encode the arrangement of the visual words in the image space. In fact, aggregating spatial information in visual words is a promising approach for image description, because the appearance of the visual words can change profoundly when they participate in relations. Spatial information provides a higher-level semantic characterization and leads to a more meaningful image representation.

The goal of this work is to represent an image taking into consideration the relationship between the visual words, instead of considering the image as a set of isolated words. In this paper, we introduce the idea of using n-grams for generating visual phrases. The n-gram is an efficient model already used in natural language processing to represent a textual document [13]. The words are modeled such that each n-gram is composed of n sequential words. For example, the 1-gram (unigram) representation is composed of isolated words, such as {intelligence, artificial, computer, vision, medical, systems}. By analogy, the 1-gram representation is the traditional Bag-of-Visual-Words approach. On the other hand, the 2-gram (bigram) representation is composed of sequences of two words, such as {artificial intelligence, computer vision, medical systems, artificial systems}, enriching the semantics and yielding a more complete representation. The core advantages of n-gram models are their simplicity and the ability to scale up the content representation simply by increasing n.
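As a minimal sketch of the textual n-gram idea described above, the following Python function slides a window of n words over a token list. The function name `ngrams` and the sample tokens are illustrative choices, not taken from the paper.

```python
def ngrams(words, n):
    """Return all contiguous n-word sequences (n-grams) from a list of words."""
    # The window starts at every position that still leaves room for n words.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Illustrative tokens only.
tokens = ["computer", "vision", "medical", "systems"]
unigrams = ngrams(tokens, 1)  # isolated words
bigrams = ngrams(tokens, 2)   # pairs of consecutive words
```

Increasing `n` yields longer phrases from the same token list, which is the scaling property the paragraph refers to.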

We performed representative evaluations on several different databases. We evaluated the proposed method on benchmark databases from the image classification and retrieval area, and also on medical databases, an application area that can benefit from the proposed technique, since it adds more semantics to the image description. Comparative results with respect to the traditional Bag-of-Visual-Words approach show that n-grams are a promising descriptor for achieving a more robust image characterization and bringing a CBIR system closer to users' expectations.

The rest of the paper is organized as follows. Section 2 gives the formal definitions and the background needed to follow the work, as well as the motivation for the proposed work. Section 3 explains the method to represent an image in n-gram visual words. Section 4 presents the experimental analysis and Section 5 gives the conclusions of this paper.

Figure 1. Scheme to represent an image in Bag-of-Visual-Words and Bag-of-Visual-Phrases.


The Bag-of-Visual-Words (BoVW) approach is a technique used to model local image features. These local features are described by an unordered set of keypoints, where each keypoint describes a representative local region of the image. The goal is to quantize these features using a visual dictionary. The main idea of using visual dictionaries is to consider that the visual patterns of an image are similar to the words of a textual document. Therefore, an image is composed of visual words just as a textual document is composed of textual words. Clustering is a common method in the literature for learning a visual dictionary. A dictionary can be built by clustering the local features detected in a set of training images from the database, as schematized in Figure 2. Formally, let $P = \{p_1, p_2, \ldots, p_z\}$ be the local features detected in a subset of the database images. The visual dictionary is given by a division of $P$ into $k$ distinct clusters $\pi_k = \{C_1, C_2, \ldots, C_k\}$, so that $C_1 \cup C_2 \cup \ldots \cup C_k = P$, $C_i \neq \emptyset$, and $C_i \cap C_j = \emptyset$ for $i \neq j$ and $i, j = 1 \ldots k$.

A visual word $w_i$ is the centroid of cluster $C_i$. When a new image arrives, its features are extracted and each is assigned to the nearest visual word $w_i$, for $i = 1 \ldots k$, where $k$ is the number of words in the visual dictionary. The image representation under the BoVW approach is simply the normalized histogram of the quantized visual words detected in the image.
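The quantization step can be sketched as follows, assuming the dictionary centroids are already available (for example from a clustering run that is not shown here). The function names, the toy 2-D descriptors and the sum-to-one normalization are assumptions made for illustration; real SIFT descriptors are 128-dimensional.

```python
import math
from collections import Counter


def nearest_word(feature, dictionary):
    """Index of the visual word (centroid) closest to a feature vector."""
    return min(range(len(dictionary)),
               key=lambda i: math.dist(feature, dictionary[i]))


def bovw_histogram(features, dictionary):
    """Histogram of visual-word occurrences, normalized to sum to 1 (assumed)."""
    if not features:
        return [0.0] * len(dictionary)
    counts = Counter(nearest_word(f, dictionary) for f in features)
    return [counts[i] / len(features) for i in range(len(dictionary))]


# Toy data: three centroids and four local features (hypothetical values).
dictionary = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
features = [(1.0, 0.5), (9.0, 9.5), (0.2, 0.1), (0.5, 9.0)]
hist = bovw_histogram(features, dictionary)
```

With this toy input, two features fall on the first word and one on each of the others, so the histogram has one bin per visual word regardless of how many keypoints the image contains.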

Spatial information is a very important feature for the characterization of images and objects, because the appearance of objects can change deeply when they participate in relations. One of the first works to encode geometric information with Bag-of-Visual-Words is the spatial pyramid [14], which splits the image into hierarchical cells and computes a bag of visual words for each cell, concatenating the results at the end. However, it is crucial that the image characterization does not depend on the placement of the image, because the features should be invariant to geometric transformations. Other works employ correlograms of visual words [15] and image splitting by linear and circular projections [16]. The method proposed in [17] encodes spatial-relationship information of visual words, using image space partitions to count the occurrences of each visual word in relation to the positions of the other visual words.

It seems plausible that grouping words might be applied successfully to enrich the BoVW representation [12], since a higher level of semantic information is reached. Groups of words have been shown to boost local feature matching [11]. Previous works employ phrases to model the co-occurrences of words in local neighborhoods, using methods that encode the spatial layouts with a grid-dependent size [18]. Besides that, a major problem of these approaches is that the number of phrases can grow exponentially with the number of words in a phrase. Thus, it is necessary to select a subset of the entire phrase set. Sophisticated mining or learning algorithms have been proposed for this selection [9], [10], but it may still be risky to discard a large portion of the phrases, because some of them may be representative for the images.

Figure 2. Process used to generate a visual dictionary. Initially, a keypoint detector is applied to a set of training images to detect representative local image points. The detected keypoints are represented by a descriptor that summarizes the information about the region around each keypoint. A dictionary can be built by clustering the detected keypoints.

Instead of using a dictionary formed by a huge list of phrases with different numbers of words, why not use a dictionary formed by phrases with a limited but representative number of words? A representation that uses phrases with a limited sequence of n words is called an n-gram. In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech [13]. An n-gram could be, for instance, any combination of letters; more generally, the items in question can be phonemes, syllables, letters, words or pairs, according to the application. How useful could the n-gram representation be for image description based on the idea of Bag-of-Visual-Phrases? The idea of using n-grams in vision is quite similar to a number of previous works that combine visual words with spatial information. For example, in [12] triplets of visual words are used. The work in [19] uses 2-grams ("doublets") from spatially neighboring word pairs. In [20], the authors use "higher order features" (i.e., BoVW + 2-grams, 3-grams, etc.), but instead of considering all possible n-grams, they perform feature selection to pick only the relevant n-grams. In [21], the authors propose joint clustering of nearest feature pairs. In this paper, we propose a slightly different measure of similarity considering n-grams and BoVW, and a simple and different procedure to extract the n-grams from the image compared to [22] and [23]. The details of the proposed technique are presented in the next section.


The proposed method describes an image as a Bag-of-Visual-Phrases, where the visual phrases are n-grams extracted from the image. In our model, the image representation is the frequency with which each n-gram appears in the image. In this section, we explain how to extract the n-grams from an image and how to represent an image using Bag-of-Visual-Phrases.

A. Extracting the n-grams from an image

In text mining, a unigram representation can be obtained by placing a small window over the text, such that we only look at one word at a time. In a similar way, a bigram can be thought of as a window that shows two words at a time and moves to the right, one word at a time, in a stepwise manner. The procedure is the same for extracting n-grams from a text with n greater than two. To carry this "window" analogy over to our image representation problem, we could say that all visual n-grams of an image can be generated by placing a region over each visual word, such that we only look at the n nearest visual words at a time.

Formally, let $P = \{p_1, p_2, \ldots, p_m\}$ be the local features detected in an image. Each local feature $p_i$ is represented by its coordinates $(x_i, y_i)$ in the image and is assigned to a visual word $w_j$. The n-Nearest Visual Words (nNVW) of a local feature $p_i$ are given by:

$$\mathrm{nNVW}(p_i, n) = \{\, P' \subseteq P \;\mid\; |P'| = n,\ \forall p_j \in P',\ \forall p_r \in [P - P'] : d(p_i, p_j) \le d(p_i, p_r) \,\} \tag{1}$$

where $d(p_i, p_j)$ is the Euclidean distance between the coordinates of $p_i$ and $p_j$:

$$d(p_i, p_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} \tag{2}$$

The next definition gives the main building block of our proposed technique.

Definition 1. The proposed image representation in n-grams is given by the occurrences of the n-Nearest Visual Words considering each local feature $p_i$ detected in an image $P$:

$$n\text{-gram}(P) = \{\mathrm{nNVW}(p_i, n)\}, \quad \text{for } i = 1 \ldots m \tag{3}$$

where $m$ is the number of keypoints detected in the image.

To illustrate, Figure 3 shows an example of extracting the 2-grams from an image. For each local point, we look at its closest neighbor point. There is no need to specify a threshold distance, since we simply take the nearest point. We consider the pair as a 2-gram phrase. In a similar way, to extract the n-grams from an image with n > 2, we just need to increase the number of nearest points.
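The extraction procedure can be sketched in a few lines of Python. This is one reading of the definition, under the assumption that each n-gram consists of the keypoint itself plus its n-1 nearest neighbors (so a 2-gram is the keypoint and its closest neighbor, as in the Figure 3 description). The function name `extract_ngrams`, the `(x, y, word_id)` tuple layout and the sample keypoints are hypothetical.

```python
import math


def extract_ngrams(keypoints, n):
    """Group every keypoint with its n-1 nearest keypoints (Euclidean distance).

    keypoints: list of (x, y, word_id) tuples.
    Returns one tuple of word ids per keypoint, with the keypoint's own word first.
    Tuples are shorter than n if the image has fewer than n keypoints.
    """
    grams = []
    for i, (xi, yi, wi) in enumerate(keypoints):
        others = [k for j, k in enumerate(keypoints) if j != i]
        # Order the remaining keypoints by their distance to keypoint i.
        others.sort(key=lambda k: math.dist((xi, yi), (k[0], k[1])))
        grams.append((wi,) + tuple(k[2] for k in others[:n - 1]))
    return grams


# Four keypoints with hypothetical positions and visual-word ids.
keypoints = [(0, 0, 1), (1, 0, 2), (5, 5, 3), (6, 5, 1)]
bigrams = extract_ngrams(keypoints, 2)
trigrams = extract_ngrams(keypoints, 3)
```

The brute-force neighbor search is quadratic in the number of keypoints; a spatial index would be the usual replacement for large images, but that is outside what the text describes.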

B. Modeling the image in Bag-of-Visual-Phrases using n-grams

After extracting the n-grams from an image, each n-gram is treated as a visual phrase. To model an image as a Bag-of-Visual-Phrases, the next step is to count the number of visual phrases according to a dictionary of visual phrases. This dictionary is given by all possible combinations of n-grams. However, the dictionary can become very large, since it grows as $k^n$ for a vocabulary of $k$ visual words. For example, considering 100 different visual words, the number of all possible 2-grams is $100^2 = 10{,}000$.

Figure 3. Example of the process used to extract 2-grams from an image. For each keypoint, we look at its closest neighbor point. To extract n-grams from an image with n > 2, we just need to increase the number of nearest points.

Figure 4. Representation of the image in Figure 3 by: (a) Bag-of-Visual-Words, (b) Bag-of-Visual-Phrases considering 2-gram phrases and the proposed dictionary.

To reduce the size of the dictionary of visual phrases, we do not consider n-grams with repeated visual words, that is, n-grams formed by n identical consecutive visual words. Additionally, we consider an inverted n-gram to be the same visual phrase. Inverted visual phrases are phrases with the same visual words, where the visual words appear in a different order. With these two restrictions, the dictionary of visual phrases can be reduced by 50% or more.
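The two restrictions can be illustrated with a short sketch. It assumes that "repeated" means all n words of the phrase are identical and that phrases are compared as unordered collections of word ids; both are interpretations of the text, and the function names and the sum-to-one normalization are illustrative.

```python
from collections import Counter
from itertools import combinations_with_replacement


def canonical(phrase):
    """Order-invariant form, so that inverted phrases share one dictionary entry."""
    return tuple(sorted(phrase))


def phrase_dictionary(num_words, n):
    """All unordered n-word phrases except those made of n identical words."""
    return [p for p in combinations_with_replacement(range(1, num_words + 1), n)
            if len(set(p)) > 1]


def bovp_histogram(ngrams, dictionary):
    """Frequency of each dictionary phrase among the n-grams, normalized to sum to 1."""
    counts = Counter(canonical(g) for g in ngrams)
    total = sum(counts[p] for p in dictionary)
    return [counts[p] / total if total else 0.0 for p in dictionary]


dictionary = phrase_dictionary(3, 2)  # words w1, w2, w3
# Hypothetical 2-grams from one image; (2, 2) is dropped as a repeated-word phrase.
hist = bovp_histogram([(1, 2), (2, 1), (3, 1), (1, 3), (2, 2)], dictionary)
```

For a 100-word vocabulary this construction gives 4,950 2-gram phrases instead of 10,000, which is consistent with the reduction of roughly one half mentioned in the text.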

To illustrate, the 2-gram phrases generated by a dictionary formed by the words $\{w_1, w_2, w_3\}$ are $\{\{w_1, w_2\}, \{w_1, w_3\}, \{w_2, w_3\}\}$. Figure 4 shows the representation of the image from Figure 3 considering the traditional Bag-of-Visual-Words and the proposed method of Bag-of-Visual-Phrases using 2-gram phrases.

C. Calculating the images' distances

To measure the dissimilarity (distance) between two images A and B, we use the distances between the Bag-of-Visual-Words histograms and between the Bag-of-Visual-Phrases histograms of both images.

Let $h_A$ and $H_A$ be the normalized histograms of Bag-of-Visual-Words and Bag-of-Visual-Phrases of image A, respectively, and $h_B$ and $H_B$ those of image B. The distance used to measure the dissimilarity between images A and B is given by:

$$\mathrm{Distance}(A, B) = \|h_A - h_B\|_1 + \|H_A - H_B\|_1 \tag{4}$$

where $\|\cdot\|_1$ is the L1 distance.

Two images A and B are considered similar when $\mathrm{Distance}(A, B) \to 0$.
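Equation (4) translates directly into code. The sketch below assumes both images were described with the same word dictionary and the same phrase dictionary, so that the histograms have matching lengths; the histogram values are made up for illustration.

```python
def l1(u, v):
    """L1 (city-block) distance between two equal-length histograms."""
    return sum(abs(a - b) for a, b in zip(u, v))


def image_distance(h_a, H_a, h_b, H_b):
    """Sum of the L1 distance on word histograms and on phrase histograms."""
    return l1(h_a, h_b) + l1(H_a, H_b)


# Hypothetical normalized histograms for images A and B.
h_a, H_a = [0.5, 0.25, 0.25], [0.5, 0.5, 0.0]
h_b, H_b = [0.25, 0.25, 0.5], [0.0, 0.5, 0.5]
d_ab = image_distance(h_a, H_a, h_b, H_b)
d_aa = image_distance(h_a, H_a, h_a, H_a)
```

Because each histogram sums to 1, each L1 term lies between 0 and 2, so the word and phrase terms contribute on the same scale.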


In this section, we report representative experimental results obtained to evaluate the effectiveness of the image representation technique proposed in this work. We compared our method with the traditional Bag-of-Visual-Words representation, which uses only the frequency of occurrence of the words in the image. The Bag-of-Visual-Words is, by analogy, the 1-gram representation.

We seek a vocabulary of visual words that is invariant to changes in viewpoint and illumination. For this, we used the SIFT descriptor [24], which is one of the most widely used descriptors for extracting local image features. SIFT descriptors, based on histograms of local orientation, have some tolerance to illumination change.

The experiments were conducted to evaluate the accuracy of the 2-gram and 3-gram representations compared to the 1-gram representation. The tests were performed on two different tasks: image retrieval and image classification.

A. Image retrieval evaluation

We performed an image retrieval evaluation on four different image databases, two public databases and two medical ones:

• Corel1000 database¹: consists of images of natural scenes. It is composed of 1000 images divided into 10 classes, 100 images in each class. Figure 5a shows an image for each class of this database;

• Lung database²: consists of CT images of the lung, composed of 234 images divided into 6 classes (39 images in each class), classified according to the lung findings: Emphysema, Honeycombing, Interlobular Septal, Healthy, Consolidation and Ground-glass. Figure 5b shows an image for each class of this database;

• Medical Image Exams database³: contains 2,200 medical images of X-ray and MRI, classified according to body part and type of cut, composed of 11 distinct classes with 200 images in each class. Figure 5c shows image samples from this database;

• Texture database [25]: includes surfaces composed of materials such as wood, marble and fur under varying viewpoints, scales and illumination conditions. This database consists of 1,000 images comprising 40 samples of 25 different textures. Figure 5d shows image samples from this database.

The comparative retrieval performance was evaluated using the mean Average Precision (mAP). Precision is the ratio of the number of relevant retrieved images to the total number of retrieved images. The Average Precision is the average of the Precision values at each ranking position where a relevant image is retrieved. Its maximum value is 1, which means that all relevant images were retrieved in the first positions of the ranking. The mAP is the mean of the Average Precision values obtained when each image is used as a query. The best value of mAP is 1.
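The metric can be sketched as follows, assuming each ranking covers the whole database so that every relevant image appears somewhere in the list. The function names and the boolean relevance lists are illustrative.

```python
def average_precision(relevance):
    """Average Precision for one ranked list.

    relevance[i] is True when the result at rank i + 1 is relevant to the query.
    """
    hits, precisions = 0, []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)  # Precision at this relevant position
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(rankings):
    """Mean of the Average Precision over all queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two hypothetical queries: relevant images first, then an interleaved ranking.
ap_best = average_precision([True, True, False, False])
ap_mixed = average_precision([True, False, True, False])
map_value = mean_average_precision([[True, True, False, False],
                                    [True, False, True, False]])
```

The first query reaches the maximum value of 1 because all relevant images occupy the top positions; the second is penalized for the irrelevant image at rank 2.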

Table I presents the average mAP values for the evaluated databases. The 2-gram representation presented the best results in three databases. In general, the 2-gram boosted the precision by 5% in these databases. The exception is the Corel1000 database, where the 3-gram presented the best result. In this database, the 3-gram representation increased the precision by 13% compared with the traditional 1-gram representation.

¹ Available at: https://www.wendangku.net/doc/4316772984.html,/docs/related/

² Provided by the Clinical Hospital of our university.

³ Provided by the Clinical Hospital of our university.

Figure 5. Sample images from the: (a) Corel1000 database, (b) Lung database, (c) Medical Image Exams database, (d) Texture database.

Considering each class individually, we can see where the results were more significant. Table III presents the mAP results for each class of the Lung database, and Tables II, IV and V summarize the results for representative classes of the other three databases, where the difference between the methods was considered significant. Figures 6a, 6b, 6c and 6d present the Precision values for each class of the evaluated databases.

For the class Emphysema in the Lung database and for the class Knee in the Medical Image Exams database, the 2-gram representation had a gain of up to 17% in Precision. For the class Africa in the Corel1000 database, the 2-gram representation improved the precision by 26%. Considering Class 1 of the Texture database, the 3-gram representation achieved a gain of 44% in precision.
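The percentages quoted here appear to be relative improvements over the 1-gram baseline rather than absolute differences in mAP. Assuming that reading, the Texture Class 1 figure can be checked against the tabulated values for Class 1 (0.314 for 1-gram and 0.452 for 3-gram):

```latex
\[
\text{gain} = \frac{\mathrm{mAP}_{3\text{-gram}} - \mathrm{mAP}_{1\text{-gram}}}{\mathrm{mAP}_{1\text{-gram}}}
            = \frac{0.452 - 0.314}{0.314} \approx 0.44
\]
```

The same computation on the Africa class (0.305 for 1-gram, 0.385 for 2-gram) gives approximately 0.26, which matches the 26% stated above.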


Figure 6. Precision values for each class of the evaluated databases (panels include the Lung, Medical Image Exams and Texture databases; axes: Class, Precision).

Table I

Database         Image representation
                 1-gram   2-gram   3-gram
Corel1000        0.299    0.332    0.340
Lung             0.365    0.385    0.366
Medical Exams





In general, when a class is complex, that is, when it has more visual details, the n-gram representation with a larger value of n tends to give better results. However, the value of n affects the dictionary size, so that more visual phrases are analyzed, and this can affect the retrieval performance. Considering the overall performance for all the evaluated databases, the 2-gram and 3-gram representations achieved a gain in Precision compared to the traditional 1-gram representation. These results demonstrate that the high-level image feature proposed in this work was able to improve the retrieval system, bringing a CBIR system closer to the users' expectations.

Table II

Class            Image representation
                 1-gram   2-gram   3-gram
2   Food         0.171    0.189    0.219
3   Horses       0.175    0.215    0.174
5   Flower       0.231    0.281    0.309
6   Africa       0.305    0.385    0.355
7   Elephant     0.306    0.356    0.349
9   Mountain     0.489    0.472    0.529
10





Table III

Class                     Image representation
                          1-gram   2-gram   3-gram
1   Emphysema             0.304    0.347    0.330
2   Honeycombing          0.333    0.376    0.330
3   Interlobular Septal   0.364    0.373    0.339
4   Healthy               0.371    0.375    0.369
5   Consolidation         0.391    0.396    0.398
6

Table IV

Class                     Image representation
                          1-gram   2-gram   3-gram
1   Chest                 0.237    0.259    0.249
2   Brain Axial           0.238    0.276    0.270
4   Hand                  0.257    0.258    0.252
5   Foot                  0.311    0.298    0.319
6   Brain Coronal         0.405    0.419    0.400
7   Breast                0.487    0.535    0.480
8   Knee                  0.527    0.616    0.591
10  Abdomen               0.560    0.648    0.640
11  Brain Sagittal

B. Image classification evaluation

We performed an image classification evaluation using the 15-scenes database [14]. This database (Fig. 7) is composed of 4485 images categorized in fifteen different scenes. Each category has 200 to 400 images, and the average image size is 300x250 pixels. The major sources of the pictures in the database include the COREL collection, personal photographs, and Google image search. This is one of the most complete scene category databases used in the literature thus far.

A multi-class classification was done with a support vector machine (SVM) trained using the one-versus-all rule: a classifier is learned to separate each class from the rest, and a test image is assigned the label of the classifier with the highest response.
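The one-versus-all decision rule itself is independent of how the binary classifiers are trained. The sketch below shows only that final step, assuming the per-class decision scores (for example, signed distances to each SVM hyperplane) have already been computed elsewhere; the function name, the class labels and the score values are all hypothetical.

```python
def one_vs_all_predict(scores_per_class):
    """Assign each test image to the class whose binary classifier responds most.

    scores_per_class: dict mapping a class label to a list of decision scores,
    one score per test image.
    """
    labels = list(scores_per_class)
    num_images = len(next(iter(scores_per_class.values())))
    return [max(labels, key=lambda c: scores_per_class[c][i])
            for i in range(num_images)]


# Made-up decision scores for three test images and three classes.
scores = {
    "kitchen":   [0.8, -0.3, -1.2],
    "MITforest": [-0.5, 1.1, -0.9],
    "other":     [-1.0, -0.2, 0.4],
}
predicted = one_vs_all_predict(scores)
```

Training the binary classifiers on the Bag-of-Visual-Words and Bag-of-Visual-Phrases histograms is left out, since the text does not specify the kernel or the parameters used.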

Table VI shows the classification rate for the experiments using 100 images per class for training and the rest for testing (the same setup as [14]). The 2-gram representation presented the best classification rate with 48%, followed by the 3-gram representation (43%). Compared with the 1-gram, the 2-gram representation had a gain of 11% in accuracy. This result demonstrates that the 2-gram representation encodes more discriminative features than the 1-gram representation.

Figure 7. Example images from the 15-scenes category database.

Table V

Class            Image representation
                 1-gram   2-gram   3-gram
1   Class 1      0.314    0.414    0.452
2   Class 2      0.427    0.452    0.427
4   Class 4      0.493    0.530    0.527
5   Class 5      0.499    0.527    0.558
6   Class 6      0.515    0.514    0.486
7   Class 7      0.546    0.591    0.534
9   Class 9      0.593    0.582    0.568
10  Class 10     0.599    0.619    0.622
11  Class 11     0.618    0.666    0.680
13  Class 13     0.655    0.703    0.681
15  Class 15     0.669    0.713    0.695
16  Class 16     0.704    0.774    0.715
18  Class 18     0.725    0.816    0.848
19  Class 19     0.745    0.785    0.720
22  Class 22     0.772    0.839    0.778
23  Class 23     0.823    0.887    0.911
24  Class 24     0.926    0.960    0.967
25  Class 25



Table VI

           Average Classification
1-gram     0.428
2-gram     0.476
3-gram


Analyzing each class individually, Table VII presents the classification rates for each class of this database. For three classes, the 3-gram presented the best results. The 2-gram had a gain of 33% in accuracy for the class Kitchen and the 3-gram had a gain of 20% for the class MITforest compared with the traditional 1-gram approach.

In this evaluation task (image classification), we can see the same behavior as in the previous task (image retrieval). The 2-gram and 3-gram representations presented a better overall performance than the 1-gram representation. These results indicate that the 2-gram and 3-gram representations add more information to the image feature description, making them a valuable asset for improving image analysis.


In this paper, we have introduced a novel modeling approach for representing images. The new approach represents an image by taking into consideration the relationship between its visual words. The proposed method is based on the idea of Bag-of-Visual-Phrases, which has a higher level of semantic characterization than the traditional Bag-of-Visual-Words. The image is represented as a collection of visual phrases, instead of as a set of isolated visual words.

Our proposed method uses a dictionary composed of visual phrases with a fixed number of words. This representation is analogous to the popular n-gram representation used for text.

We have conducted experiments on five different databases. Three of the databases are public and used as benchmarks in the image retrieval and classification community. The other two are composed of medical images, to demonstrate an area that can benefit from the proposed technique. The results have shown that bigram and trigram dictionaries are sufficient to boost the retrieval and classification accuracy.

Our proposed modeling approach enriches the Bag-of-Visual-Words representation, and the results obtained indicate that it can become a powerful and promising descriptor for image representation and can contribute to the field of content-based image retrieval and image classification.

Table VII






















The authors acknowledge the financial support of FAPESP, CNPq, CAPES and INCT INCod.


