
The BNR model: foundations and performance of a Bayesian network-based retrieval model

Luis M. de Campos *, Juan M. Fernández-Luna, Juan F. Huete

Departamento de Ciencias de la Computación e Inteligencia Artificial, E.T.S.I. Informática, Universidad de Granada, 38, 18071 Granada, Spain

Received 1 March 2003; accepted 1 July 2003

Abstract

This paper presents an information retrieval model based on the Bayesian network formalism. The topology of the network (representing the dependence relationships between terms and documents) as well as the quantitative knowledge (the probabilities encoding the strength of these relationships) will be mined from the document collection using automatic learning algorithms. The relevance of a document to a given query is obtained by means of an inference process through a complex network of dependences. A new inference technique, called propagation + evaluation, has been developed in order to obtain the exact probabilities of relevance in the whole network efficiently.
© 2003 Elsevier Inc. All rights reserved.

1. Introduction

Nowadays, the retrieval of information is becoming more and more important with the widespread use of the Internet in our everyday tasks. The field of information retrieval (IR) has been defined by Salton and McGill [22] as the

* Corresponding author. Tel.: +34-58-443-199/958-244019; fax: +34-58-243-317/958-243317. E-mail addresses: lci@decsai.ugr.es (L.M. de Campos), jmfluna@decsai.ugr.es (J.M. Fernández-Luna), jhg@decsai.ugr.es (J.F. Huete).

0888-613X/$ - see front matter © 2003 Elsevier Inc. All rights reserved.

doi:10.1016/j.ijar.2003.07.011

www.elsevier.com/locate/ijar

International Journal of Approximate Reasoning 34 (2003) 265–285


subject concerned with the representation, storage, organization, and accessing of information items.¹ In this paper, we mainly focus our attention on two main IR tasks: representing the information, and the way in which we access information items, i.e. identifying documents in a collection that are relevant to a particular information need formulated by means of a query.

We shall focus our research on the use of uncertain inference models for IR [3]. These models represent an extension of the classical probabilistic model [28], providing a framework for the integration of several sources of evidence. The use of these models is based on the fact that most tasks in this area may be described as uncertain processes [26]. The theoretical justification for these models is based on the 'probability ranking principle' [21], which states that the best overall retrieval effectiveness will be achieved when documents are ranked in decreasing order of their probability of relevance.

The concept of relevance in uncertain inference models is basically related to an inference process through a network of dependences using evidential reasoning techniques. The most promising approaches are those based on Bayesian networks [18]. Intuitively, as [16] puts it, ''Bayesian networks are complex diagrams that organize the body of knowledge in a given area by mapping out cause-and-effect relationships among key variables and encoding them with numbers that represent the extent to which one variable is likely to affect another''. The use of a general Bayesian network methodology as the basis for an IR system is difficult to tackle. The problem mainly appears because of the large number of variables involved and the computational effort needed both to determine the relationships between variables and to perform the inference processes.² Nevertheless, an increasing effort has been made in the research of uncertain inference models for IR [17,20,25]. These models adopt the following two main simplifying restrictions in order to solve the above efficiency problem:

R1. Fixed dependence relationships: the structure of the model, encoding the dependence relationships between variables, is fixed a priori, without considering any potential knowledge that might be mined from the collection.

R2. Simplified estimation of probabilities: in order to avoid the large space necessary to store all the probabilities relevant to the process, it is assumed that complex compound events have been assigned zero probability values. With this assignment, these events can be discarded when inference tasks are performed.

Using the restrictions above, the probability of relevance of a given document only depends on the set of terms used to formulate the query, and it can be computed without truly performing inference tasks, i.e. without propagating the evidences through the networks. In these cases, it is sufficient to consider the evaluation of a set of functions.

1 In this paper, we will only deal with documents or, in a broader sense, textual representations of any type of object, i.e. a research article, a book, a message in an electronic mail file, etc.

2 Note that these tasks are NP-hard [10] in the number of variables.

Our objective in this paper is to show that it is possible to relax these restrictions, and therefore to obtain a more expressive Bayesian network-based IR model. This relaxation involves considering new theoretical and practical questions: how to infer the set of relationships between variables from the collection, how to estimate and store all the needed probabilities efficiently, and finally, assuming that we have solutions for the previous problems, how to perform exact inference efficiently through the network.

Following these ideas, this paper is divided into the following sections: in Section 2, we introduce the Bayesian network background needed to understand the rest of the paper. Section 3 presents other models based on these graphical models. Section 4 explains the Bayesian Network Retrieval Model (BNRM) in detail: its topology and construction, the estimation of probability distributions, and the inference method. In Section 5, the results of an experimentation with this new model are presented. The performance of the model is also compared with the effectiveness of other models, such as the vector space model and the inference network model. Finally, Section 6 shows the conclusions of this work, as well as future work that we plan to carry out in order to improve the BNRM.

2. Preliminaries: Bayesian networks basics

In this section, we shall briefly introduce the concept of a Bayesian network [18], the basis for the model presented in this paper. We shall attempt to answer questions such as what it is for, how it is composed, how it can be constructed, and how it can be used.

In formal terms, a Bayesian network is a directed acyclic graph (DAG) (a graph with oriented links, called arcs, and with no cycles in it), in which the nodes represent random variables and the arcs show causality, relevance or dependency relationships between them.³ The variables and their relationships comprise the qualitative knowledge stored in a Bayesian network. A second type of knowledge, also stored in the DAG, is known as quantitative, since it establishes the strength of the relationships and is measured by means of probability distributions. Associated with each node there is a set of conditional probability distributions, one for each possible combination of values that its parents can take.

3 A dependence relationship between two variables, X and Y, implies a modification of the belief in X, given that the value taken by Y is known. An independence relationship means that the belief in X is not modified, given the knowledge on Y.

Formally, a Bayesian network can be considered an efficient representation of a joint probability distribution that takes into account the set of independence relationships represented in the graphical component of the model. In general terms, given a set of variables $\{X_1, \ldots, X_n\}$ and a Bayesian network $G$, the joint probability distribution in terms of local conditional probabilities is obtained as follows:

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \pi(X_i))$$

where $\pi(X_i)$ is any combination of the values of the parent set of $X_i$, $\Pi(X_i)$, in the graph. If $X_i$ has no parents, then the set $\Pi(X_i)$ is empty, and therefore $P(X_i \mid \pi(X_i))$ is just $P(X_i)$.
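As a minimal illustration of this factorization, the following sketch evaluates the joint probability of a full assignment by multiplying locally stored conditional probabilities. The three-node network, its parent sets and all probability values are invented for illustration:

```python
# Sketch: evaluating P(x1,...,xn) as the product of P(xi | parents(xi)).
# Each node maps a configuration of its parents (a tuple of booleans)
# to P(node = True | that configuration). All values are invented.
parents = {"A": (), "B": ("A",), "C": ("A", "B")}
cpt = {
    "A": {(): 0.3},
    "B": {(True,): 0.8, (False,): 0.1},
    "C": {(True, True): 0.9, (True, False): 0.5,
          (False, True): 0.4, (False, False): 0.05},
}

def joint_probability(assignment):
    """P(assignment) = product over i of P(X_i = x_i | parent values)."""
    prob = 1.0
    for node, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[node])
        p_true = cpt[node][parent_values]
        prob *= p_true if value else (1.0 - p_true)
    return prob

print(joint_probability({"A": True, "B": True, "C": False}))
# 0.3 * 0.8 * (1 - 0.9) = 0.024
```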

Once completed, a Bayesian network can be used to derive the posterior probability distribution of one or more variables once we have observed particular values for other variables in the network, or to update previous conclusions when new evidence reaches the system. Researchers have developed general inference algorithms that take advantage of the independences represented in the network. Although it is possible to find algorithms that perform inference tasks in a time that is linear in the number of variables, high computational complexity results when multiple pathways connect nodes in the graph. General inference has been proved to be NP-hard [10].

3. Related Bayesian network-based models

In this section we shall briefly describe the two main retrieval models based on Bayesian networks.

The first model was developed by Croft and Turtle [25]: the Inference Network Model, which is composed, in its simplified form, of two networks: the document and query networks. The first, fixed for a given collection, represents the document collection, and contains two kinds of nodes: the document nodes, representing the documents, and the concept nodes, symbolizing the index terms contained in the documents. The arcs go from each document node to each concept node used to index it. In addition, the query network is specific for each query. In the simplified form, there is a query node for each query representation used to express the information need. This query node has as parents those concepts (terms) used to formulate the query, representing the connection between the two networks. The query nodes are also the parents of an information need node that represents the user's generic information need.

When there is only one query representation, the information need node and the query node coincide. In the rest of the paper, we shall concentrate our attention on those models which use only one query representation. Fig. 1, on the left hand side, shows this simplified inference network. In order to complete it, it is necessary to assess the conditional probability distributions of the nodes in the graph. The proper specification of these probabilities allows the inference network to cover different IR strategies.

A document $D_j$ may be ranked with respect to a query $Q$ by measuring how much evidential support the observation of $D_j$ provides to the query $Q$. In order to obtain this ranking, a single document node $D_j$ is instantiated each time, and the probability that the information need is satisfied given that this document has been observed, $p(Q \mid D_j = \text{true})$, is computed.

A direct computation of these values is unfeasible for practical purposes. In order to solve this problem, the inference network takes advantage of a particular probability assessment. It is interesting to note that, using these probabilities and considering that in the inference process only one document is instantiated as relevant, the final probability $p(Q \mid D_j = \text{true})$ only depends on the set of terms indexing document $D_j$ that have been used to formulate the query.

We shall now present a second model based on BNs: the Belief Network Model [20]. This model has been designed to provide a Bayesian network-based approach capable of simulating the vector space, Boolean and probabilistic schemes. Like the inference network model, their network is composed of three types of nodes: document nodes, concept (term) nodes, and the query node. The arcs go from concept nodes to the document nodes where they occur, and from the concept nodes (appearing in the query) to the query node. This model is represented on the right hand side of Fig. 1. The ranking will be obtained by computing the probability $p(D_j = \text{relevant} \mid Q)$ for each document $D_j$.

Since a document can be indexed by hundreds of terms, a straight computation of this probability becomes unfeasible. Therefore, and like the inference network model, the probabilities are defined in such a way that the computation is reduced to a direct evaluation of a function. Also, depending on the model to be simulated, different probability assessments might be used. Using the probability assessment, in order to compute the final ranking of a document, the individual contribution of each term of the document which also belongs to the query is considered.

In short, both models use a fixed document subnetwork structure for a given collection and a degenerate probability assessment in order to compute the probabilities of interest without truly propagating the evidences through the networks. Since these models use the same dependence model to represent any document collection, they do not take into account the particular dependence relationships between variables (terms and/or documents) that can be mined from the document collection. In addition, the computed probabilities of relevance will only depend on the terms used to formulate the query and not upon other concepts that might be related (either directly or indirectly) to the query. In the following section, we shall present our model, which tries to reduce these problems.

4. Foundations of the Bayesian network retrieval model

Our objective is to obtain a Bayesian network representation of a given document collection. Particularizing the definition given in [16] to the field of IR, a Bayesian network will organize the knowledge that can be mined from a document collection by mapping out dependence relationships between terms and documents and encoding them with probabilities representing the extent to which they are related to each other. Once the network has been constructed, it shall be used to obtain the relevant documents for a given query.

With this definition in mind, in our approach we shall not include a query component (query nodes) as a proper part of the IR system, i.e. it will be a query-independent model. In our case, the query is considered as evidence that should be introduced into the system. This fact represents the first difference between our model and previous ones.

In order to present the BNRM, we shall first describe how we can determine the dependence relationships, i.e. the qualitative component; then we will present the assessment of the probability values, that is to say, the quantitative component; and finally, we shall consider how the inference process is carried out.

4.1. Structure of the model

Since our objective is to obtain a model able to incorporate the most important dependence relationships in the collection, a learning procedure must be performed. This process will give the final structure of the network as a result. In our model, we shall consider two sets of variables: terms ($\mathcal{T} = \{T_i, i = 1, \ldots, M\}$, with $M$ being the number of terms used to index the collection) and documents ($\mathcal{D} = \{D_j, j = 1, \ldots, N\}$, $N$ being the total number of documents). These variables are binary, taking values from the set {relevant, not-relevant}.⁴ In order to simplify the following expressions, we shall denote 'the term $T_i$ (or document $D_j$) is relevant' as $t_i$ ($d_j$), and 'the term $T_i$ (document $D_j$) is not relevant' as $\bar{t}_i$ ($\bar{d}_j$).

Given a document collection, and due to the large number of variables involved, mining all the dependence relationships is unfeasible. We propose that a hybrid approach be followed whereby both ''expert knowledge'' (using a set of general coherence criteria) and ''collection knowledge'' (mined from the documentary data) are taken into account.

4.1.1. Expert knowledge

This consists of a set of assumptions that the model must fulfill, and allows us to obtain a preliminary skeleton of the Bayesian network (limiting the automatic learning process). These assumptions are:

1. There is a strong relationship between a document and each of the terms by which it has been indexed. This principle is translated into the graph using links that connect each document node with all the term nodes representing the index terms associated to the corresponding document.

2. The relationships between documents are only present through the terms that index them. This assumption implies that there are no links joining document nodes to each other.

3. If we know the relevance or non-relevance of all the terms that occur in a document, $D_i$, our belief in its relevance is not affected by knowing that another document, $D_j$, or term, $T_k$, is relevant or not. This assumption implies that documents are conditionally independent given the terms by which they have been indexed. In the network, the links joining the document nodes and their corresponding term nodes will be directed from the latter to the former.

Taking these three assumptions into account, the structure of our model is similar to the Belief Network Model [20], except for the fact that we do not consider a query node. In this network, the terms are independent of each other. This point seems very restrictive because, in a collection, the terms are related in different ways. This restriction will be removed by considering collection-dependent knowledge.

4 We talk about the relevance of a term in the sense that the user believes that the term will appear in relevant documents (hence, he/she will explicitly use it when formulating a query). Similarly, a term is not relevant when users believe that the relevant documents do not contain it: they are not interested in documents containing this term.

4.1.2. Collection knowledge

Considering the above assumptions, the natural step for obtaining a more precise model is to incorporate the most important dependence relationships between the terms in the collection. Thus, we may distinguish two different layers of nodes: the term layer and the document layer. As we will see in Section 4.3, the separation into these two layers will allow inferences to be carried out efficiently.

In order to put this methodology into practice, we must use an automatic learning algorithm to build the term layer. The first task is to decide the underlying topology of the term layer. It is obvious that the more complex the topology, the more accurately the dependence and independence relationships will be reflected by it, although, at the same time, and considering the number of terms and documents involved, the learning and propagation algorithms will be very time-consuming. In order to tackle this problem, we propose that simpler graphs be used. In this case, some precision is lost because the independence and dependence relationships that they can represent are more restrictive.

In this paper we consider that term-to-term dependences will be represented by means of a polytree,⁵ because there is a set of very efficient learning [4,5,19] and propagation [18] algorithms running in a time proportional to the number of nodes, making the use of Bayesian networks in this context feasible. In particular, the term layer will be completed using a polytree learning algorithm which takes as input the inverted file of a collection (a data structure that stores, for each term, those documents where it occurs), and generates a polytree whose nodes (variables) are the terms.

The algorithm, which is explained in detail in [5], is composed of three main steps:

1. Computation of the degrees of dependency between all pairs of nodes.

2. Construction of the tree skeleton.

3. Orientation of the edges in the tree, finally making up a polytree.

Several remarks have to be made about these three parts. First, the measure used to establish the dependency between nodes (which is, in some sense, analogous to the functions usually employed in IR systems for measuring the similarity between the terms in the collection) is the following:

$$Dep(T_i, T_j \mid \emptyset) = \sum_{T_i, T_j} p(T_i, T_j)\, \ln \frac{p(T_i, T_j)}{p(T_i)\, p(T_j)} \qquad (1)$$

5 A graph in which there is no more than one directed path connecting each pair of nodes.

where the sum extends over the possible values that the variables $T_i$ and $T_j$ can take. This function is the Kullback–Leibler cross entropy (also called the expected mutual information measure), which measures the degree of dependency between two variables $T_i$ and $T_j$ (it is equal to zero if $T_i$ and $T_j$ are marginally independent, and the more dependent $T_i$ and $T_j$ are, the greater $Dep(T_i, T_j \mid \emptyset)$ is). The probabilities $p(T_i, T_j)$ are estimated from the inverted file by counting frequencies. Here, we use the marginal cross entropy, as opposed to the approach in which the marginal dependency of two terms is combined with the conditional dependencies of these two terms conditioned on the rest of the terms. The reason is that, due to the great number of terms in a collection, the computation of the conditional dependencies, although it has to be carried out only once, has proved extremely time-consuming and also requires a large amount of storage.
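As a sketch (with an invented inverted file), the marginal dependency of Eq. (1) can be estimated by counting document frequencies:

```python
import math

# Sketch: estimating Dep(Ti, Tj | ∅) (Eq. (1)) from an inverted file.
# The inverted file below (term -> set of document ids) is invented.
N = 6  # total number of documents
inverted = {"t1": {0, 1, 2, 3}, "t2": {2, 3, 4}}

def dep(ti, tj):
    """Kullback-Leibler cross entropy between two binary term variables."""
    di, dj = inverted[ti], inverted[tj]
    n11 = len(di & dj)            # documents containing both terms
    n10 = len(di) - n11           # only ti
    n01 = len(dj) - n11           # only tj
    n00 = N - n11 - n10 - n01     # neither
    total = 0.0
    for nij, ni, nj in [(n11, len(di), len(dj)),
                        (n10, len(di), N - len(dj)),
                        (n01, N - len(di), len(dj)),
                        (n00, N - len(di), N - len(dj))]:
        p_ij, p_i, p_j = nij / N, ni / N, nj / N
        if p_ij > 0:              # a zero count contributes nothing
            total += p_ij * math.log(p_ij / (p_i * p_j))
    return total

# Prints 0.0 here: t1 and t2 happen to be marginally independent
# in this toy data (p(t1,t2) = 1/3 = 2/3 * 1/2).
print(dep("t1", "t2"))
```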

The next step is the tree skeleton construction. If we consider the computed dependency values as link weights in a graph, this algorithm obtains a maximum weight spanning tree (MWST), i.e. a tree where the sum of the weights of its links is maximum. We used Prim's algorithm [2] to obtain the MWST.

Due to the great number of terms generally present in a collection, the dependency values are in general very low, and sometimes the algorithm has no good choice: the highest value among all the dependencies being considered is itself very low, yet the corresponding link is added to the tree. The problem lies in the fact that the two linked nodes are almost more independent than dependent, and therefore the model we are building loses accuracy with respect to the original one.

To solve this problem, once the algorithm has selected a new link $T_i$–$T_j$ to be added to the tree, it performs an independence test between $T_i$ and $T_j$; it then actually adds this link to the tree only if the independence test fails. In this way, we can obtain a non-connected tree, i.e. a forest, as the result of this step.
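The skeleton step can be sketched as follows. Note that a Kruskal-style construction with union-find is shown here for brevity, whereas the paper uses Prim's algorithm [2]; the numeric threshold stands in for the chi-square independence test and is an invented placeholder:

```python
# Sketch of the skeleton step: build a maximum weight spanning tree from
# pairwise Dep values, discarding links whose endpoints look independent.
INDEPENDENCE_THRESHOLD = 0.01  # placeholder for a chi-square test

def build_skeleton(terms, dep):
    """Return forest edges; dep maps frozenset({a, b}) -> Dep(a, b)."""
    parent = {t: t for t in terms}          # union-find for cycle detection

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]   # path compression
            t = parent[t]
        return t

    forest = []
    for pair, weight in sorted(dep.items(), key=lambda kv: kv[1],
                               reverse=True):
        a, b = tuple(pair)
        if find(a) == find(b):
            continue                        # would create a cycle
        if weight <= INDEPENDENCE_THRESHOLD:
            continue                        # terms look independent: skip
        parent[find(a)] = find(b)
        forest.append((a, b))
    return forest                           # possibly a forest, not one tree

deps = {frozenset({"t1", "t2"}): 0.31, frozenset({"t2", "t3"}): 0.002}
print(build_skeleton(["t1", "t2", "t3"], deps))
# Only the t1-t2 link survives (endpoint order may vary).
```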

Once the skeleton is built, the last part of the learning algorithm deals with the orientation of the tree, obtaining a polytree as a result. In a head-to-head pattern $T_i \rightarrow T_k \leftarrow T_j$, the instantiation of the head-to-head node $T_k$ should normally increase the degree of dependency between $T_i$ and $T_j$, whereas in a non-head-to-head pattern such as $T_i \leftarrow T_k \rightarrow T_j$, the instantiation of the middle node $T_k$ should produce the opposite effect, decreasing the degree of dependency between $T_i$ and $T_j$. So, we compare the degree of dependency between $T_i$ and $T_j$ after the instantiation of $T_k$, $Dep(T_i, T_j \mid T_k)$, with the degree of dependency between $T_i$ and $T_j$ before the instantiation of $T_k$, $Dep(T_i, T_j \mid \emptyset)$, and direct the edges toward $T_k$ if the former is greater than the latter. Finally, the algorithm directs the remaining edges without introducing new head-to-head connections. In our preliminary experiments, this strategy produced structures where several nodes had a great number of parents; this leads to very big probability tables and, as a consequence, causes problems of storage

and reliability (in the estimation of these tables). For that reason we have slightly restricted the rule that produces head-to-head connections, by including another condition in the antecedent: if we decide to include a head-to-head connection $T_i \rightarrow T_k \leftarrow T_j$, we want to be sure that the nodes $T_i$ and $T_j$ are not conditionally independent given $T_k$. So, we also test this condition, once again using a chi-square test of independence based on the value $Dep(T_i, T_j \mid T_k)$ (in this case with two degrees of freedom).

Once the polytree has been learned, the last step to finish the retrieval model construction is to join each term node with its corresponding document nodes. Fig. 2 shows an example of the final topology of the network.
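A sketch of this final assembly step, with an invented polytree and index:

```python
# Sketch: assembling the final topology. Term-to-term arcs come from the
# learned polytree; every document node receives one arc from each term
# that indexes it (assumption 3 above). The index below is invented.
polytree_arcs = [("t1", "t2")]                 # directed term-layer arcs
index = {"d1": {"t1", "t2"}, "d2": {"t2", "t3"}}

arcs = list(polytree_arcs)
for doc, terms in index.items():
    arcs += [(t, doc) for t in terms]          # arcs go term -> document
print(arcs)
```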

4.2. Estimating the quantitative information

Once the structure of the network has been created, the second step in completely specifying a Bayesian network is to estimate the strength of the relationships represented. This process implies estimating a set of conditional probability distributions. We have used several estimators [12]; the ones that perform best are the following.

4.2.1. Root term nodes

Given a root node representing the variable $T_i$, it will have to store the marginal probability of relevance, $p(t_i)$, and the probability of being non-relevant, $p(\bar{t}_i)$, defined by means of $p(t_i) = 1/M$ and $p(\bar{t}_i) = 1 - p(t_i)$, with $M$ being the number of terms in the collection.

4.2.2. Non-root term nodes

In this case, for each non-root term node $T_i$, with parents $\Pi(T_i)$, we need to estimate a set of conditional probability distributions $p(T_i \mid \pi(T_i))$, one for each possible combination of values, $\pi(T_i)$, that the parents of the node $T_i$ can take.

Given any set of terms $S = \{T_1, T_2, \ldots, T_k\}$, a configuration $C$ is defined as a vector $\langle t_1, t_2, \ldots, t_k \rangle$, where each of its elements corresponds to a value that each variable $T_i \in S$ can take. Therefore, $t_i = t_i$ if the $i$th variable is relevant, and $t_i = \bar{t}_i$ if $T_i$ is not relevant. For instance, for $S = \{T_1, T_2, T_3, T_4\}$, two possible configurations are $\langle t_1, t_2, \bar{t}_3, t_4 \rangle$ and $\langle t_1, \bar{t}_2, t_3, \bar{t}_4 \rangle$. Given a set of terms $S$ and a configuration $C$, we define $n(C)$ as the number of documents that contain all the terms that are included as relevant in the configuration, and do not contain those that are non-relevant in it.

The estimator is based on the Jaccard similarity measure [28]. Given two sets $X$ and $Y$, it computes the similarity between them as the quotient of the number of elements in their intersection and the cardinality of their union, i.e. $|X \cap Y| / |X \cup Y|$. This measure (also used by Savoy [23]) is adapted to our model using the following expression:

$$p(\bar{t}_i \mid \pi(T_i)) = \frac{n(\langle \bar{t}_i, \pi(T_i) \rangle)}{n(\langle \bar{t}_i \rangle) + n(\pi(T_i)) - n(\langle \bar{t}_i, \pi(T_i) \rangle)} \qquad (2)$$

In this formula, $p(\bar{t}_i \mid \pi(T_i))$ is initially estimated and $p(t_i \mid \pi(T_i))$ is later obtained by duality ($p(t_i \mid \pi(T_i)) = 1 - p(\bar{t}_i \mid \pi(T_i))$).
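A sketch of this estimator, counting $n(\cdot)$ directly over a toy collection (the documents and terms are invented):

```python
# Sketch of Eq. (2): p(not-ti | parent configuration) via the adapted
# Jaccard measure. Documents are represented as sets of terms.
docs = [{"t1", "t2"}, {"t1"}, {"t2", "t3"}, {"t3"}]

def n(config):
    """Number of documents containing every term marked relevant in
    `config` and none of the terms marked non-relevant.
    config: {term: True (relevant) / False (non-relevant)}."""
    return sum(1 for d in docs
               if all((t in d) == rel for t, rel in config.items()))

def p_not_relevant(ti, parent_config):
    """Eq. (2): n(<not-ti, pa>) / (n(<not-ti>) + n(pa) - n(<not-ti, pa>))."""
    joint = n({ti: False, **parent_config})
    return joint / (n({ti: False}) + n(parent_config) - joint)

# p(not-t1 | t2 relevant); p(t1 | t2) then follows by duality.
p_bar = p_not_relevant("t1", {"t2": True})
print(p_bar, 1.0 - p_bar)   # 1/3 and 2/3 on this toy data
```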

4.2.3. Document nodes

In this case, the probability $p(D_j \mid \Pi(D_j))$ must be estimated, i.e. the probability of a document node given the set of its parents (the nodes representing the terms by which it has been indexed).

The main problem to be faced in this task is that, if a document has been indexed by $m_j$ terms, and taking into account that each term is represented by a binary variable, the number of probability values to be estimated is $2^{m_j}$. Given that in a common-size collection the number of index terms per document may be around 100 or 200, the total number of possible combinations is huge, leading to several problems: the long time needed to estimate the probabilities, the low reliability of the estimation, the great amount of disk space required to store the distributions, and finally, the slowness of the propagation process that manages them. The existence of these four chained problems led us to consider an alternative way to estimate the probability matrices, and resulted in what we have called probability functions, also known as canonical models of multicausal interaction [18].

During the inference process, the probability functions compute the required conditional probabilities at the moment they are needed. In this way, the explicit representation of the probability matrix is substituted by an implicit one, avoiding most of the previously explained problems.

We have developed a new general canonical model: for any configuration $\pi(D_j)$ of $\Pi(D_j)$, we define the conditional probability of relevance of $D_j$ as follows:


$$p(d_j \mid \pi(D_j)) = \sum_{T_i \in R(\pi(D_j))} w_{ij} \qquad (3)$$

with $R(\pi(D_j))$ being the set of terms that are relevant in $\pi(D_j)$, and where the weights $w_{ij}$ have to verify that $0 \le w_{ij}$ and $\sum_{T_i \in D_j} w_{ij} \le 1$. So, the more relevant terms in $\pi(D_j)$, the greater the probability of relevance of $D_j$.
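A minimal sketch of this canonical model, with invented weights satisfying the constraints above:

```python
# Sketch of Eq. (3): p(dj | parent configuration) as the sum of the
# weights of the parents instantiated as relevant. Weights are invented
# and satisfy w_ij >= 0 with a sum over the document's terms <= 1.
weights = {"t1": 0.5, "t2": 0.3, "t3": 0.2}  # terms indexing document dj

def p_relevant(config):
    """config maps each parent term to True (relevant) / False."""
    return sum(w for t, w in weights.items() if config.get(t, False))

print(p_relevant({"t1": True, "t2": False, "t3": True}))  # 0.5 + 0.2 = 0.7
```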

4.3. The retrieval engine: inference in the Bayesian Network Retrieval Model

Once the Bayesian network has been built, it can be used to predict the values that certain variables take. This process is known as inference, and computes the probabilities of the different values that an unknown variable can take, given the values of the known variables or evidences.

Focusing on the BNRM, the query formulated by the user (or more specifically, the terms in the query) plays the role of a new piece of evidence provided to the system. The final aim is to obtain the probability of relevance of each document in the collection given a query. The terms from the query are instantiated as relevant in the network. This information is propagated toward the document nodes, finally obtaining $p(d_j \mid Q)$, $\forall D_j$. The documents are presented to the user sorted in decreasing order of their corresponding probabilities of relevance.

Taking into account the number of nodes in our Bayesian network and the fact that it contains cycles and nodes with a great number of parents, general purpose inference algorithms cannot be applied due to efficiency considerations, even for small document collections. Therefore, we had to look for a solution to carry out the inference in an acceptable time. Our proposal for solving this problem has been named Propagation + Evaluation, and consists of a two-stage propagation:

1. Exact propagation in the term layer, obtaining $p(t_i \mid Q)$, $\forall T_i$. Bearing in mind that the evidences will always be term nodes composing the query, we can use Pearl's exact propagation algorithm [18] in order to obtain the posterior probability of each term node. These probabilities can be computed exactly in polynomial time.

2. Evaluation of a probability function in the document nodes, computing $p(d_j \mid Q)$, $\forall D_j$, using the posterior probabilities obtained in the previous stage. With this evaluation, we are modifying the strength with which the terms influence the relevance of the documents.

Therefore, the computation of $p(d_j \mid Q)$ can be carried out as follows:

$$p(d_j \mid Q) = \sum_{i=1}^{m_j} w_{ij} \cdot p(t_i \mid Q) \qquad (4)$$

In the following theorem, we show the conditions under which we can put the two-stage propagation into practice, with total equivalence in results with respect to an exact propagation:
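The evaluation stage of Eq. (4) is just a weighted sum per document. The sketch below assumes the posterior term probabilities $p(t_i \mid Q)$ have already been obtained from the term-layer propagation (Pearl's algorithm is not reproduced here), and uses invented weights and probabilities:

```python
# Sketch of the evaluation stage (Eq. (4)): given posterior term
# probabilities from stage 1, score every document in the collection.
posterior = {"t1": 0.9, "t2": 0.5, "t3": 0.02}   # p(ti | Q) from stage 1
doc_weights = {                                   # w_ij per document
    "d1": {"t1": 0.6, "t2": 0.4},
    "d2": {"t2": 0.5, "t3": 0.5},
}

def evaluate(doc_weights, posterior):
    """p(dj | Q) = sum_i w_ij * p(ti | Q), for every document dj."""
    return {d: sum(w * posterior[t] for t, w in ws.items())
            for d, ws in doc_weights.items()}

scores = evaluate(doc_weights, posterior)
# Rank documents in decreasing order of probability of relevance.
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
# [('d1', 0.74), ('d2', 0.26)]
```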


Theorem 4.1. Given a set of evidences corresponding to the terms of a query $Q$, if the probability function used can be expressed as

$$p(d_j \mid \pi(D_j)) = \sum_{T_i \in R(\pi(D_j))} w_{ij}, \quad \forall j = 1, \ldots, N \qquad (5)$$

that is to say, as the sum of the weights of the relevant terms of a document, where $0 \le w_{ij}$, $\forall i = 1, \ldots, m_j$, $\sum_{T_i \in D_j} w_{ij} \le 1$, and $R(\pi(D_j))$ is the set of terms that are relevant in a configuration $\pi(D_j)$ of the parents of $D_j$, then the exact propagation in the term layer plus the evaluation of a probability function in each document (Eq. (4)) is equivalent to carrying out an exact propagation in the entire Bayesian network.

Proof. The posterior probability obtained by applying the exact inference process, $p(d_j \mid Q)$, can be expressed as

$$p(d_j \mid Q) = \sum_{\pi(D_j)} p(d_j \mid \pi(D_j), Q) \cdot p(\pi(D_j) \mid Q)$$

As the set of terms indexing a document makes the document and the evidences independent, then

$$p(d_j \mid Q) = \sum_{\pi(D_j)} p(d_j \mid \pi(D_j)) \cdot p(\pi(D_j) \mid Q)$$

Substituting in the previous expression the value of $p(d_j \mid \pi(D_j))$ given in Eq. (5), we obtain:

$$p(d_j \mid Q) = \sum_{\pi(D_j)} \left( \sum_{T_i \in R(\pi(D_j))} w_{ij} \right) \cdot p(\pi(D_j) \mid Q) \qquad (6)$$

The next step is to break down the previous expression into two parts. In the first, we include the configurations where the term $T_{m_j}$ is relevant, and in the second, those where it is not relevant. In order to make this explicit, we will use the notation $\langle \pi^*(D_j), t_{m_j} \rangle$, where $\pi^*(D_j)$ corresponds to the configuration $\langle t_1, t_2, \ldots, t_{m_j-1} \rangle$, i.e. $\pi(D_j)$ without the last variable, $T_{m_j}$.

$$p(d_j \mid Q) = \sum_{\langle \pi^*(D_j), t_{m_j} \rangle} \left( \sum_{T_i \in R\langle \pi^*(D_j), t_{m_j} \rangle} w_{ij} \cdot p(\pi(D_j) \mid Q) \right) + \sum_{\langle \pi^*(D_j), \bar{t}_{m_j} \rangle} \left( \sum_{T_i \in R\langle \pi^*(D_j), \bar{t}_{m_j} \rangle} w_{ij} \cdot p(\pi(D_j) \mid Q) \right) \qquad (7)$$

Considering that $\sum_{T_i \in R\langle \pi^*(D_j), t_{m_j} \rangle} w_{ij} \cdot p(\pi(D_j) \mid Q)$ is equal to

$$\sum_{T_i \in R\langle \pi^*(D_j) \rangle} w_{ij} \cdot p(\pi(D_j) \mid Q) + w_{m_j j} \cdot p(\pi(D_j) \mid Q)$$

and that $\sum_{T_i \in R\langle \pi^*(D_j), \bar{t}_{m_j} \rangle} w_{ij} \cdot p(\pi(D_j) \mid Q) = \sum_{T_i \in R\langle \pi^*(D_j) \rangle} w_{ij} \cdot p(\pi(D_j) \mid Q)$, and substituting them in expression (7), the posterior probability is

$$p(d_j \mid Q) = \sum_{\langle \pi^*(D_j), t_{m_j} \rangle} \left[ \sum_{T_i \in R\langle \pi^*(D_j) \rangle} w_{ij} \cdot p(\pi(D_j) \mid Q) + w_{m_j j} \cdot p(\pi(D_j) \mid Q) \right] + \sum_{\langle \pi^*(D_j), \bar{t}_{m_j} \rangle} \left[ \sum_{T_i \in R\langle \pi^*(D_j) \rangle} w_{ij} \cdot p(\pi(D_j) \mid Q) \right]$$

Notice how both sums over the configurations have $\sum_{T_i \in R\langle \pi^*(D_j) \rangle} w_{ij} \cdot p(\pi(D_j) \mid Q)$ in common, both when $T_{m_j}$ is taken into account and when it is disregarded. We can unify these two addends into only one by means of a marginalization operation over the variable $T_{m_j}$. Consequently, the variable $T_{m_j}$ is removed from the resultant addend.

Therefore, the posterior probability of the document will be

$$p(d_j \mid Q) = \sum_{\pi^*(D_j)} \sum_{T_i \in R(\pi^*(D_j))} w_{ij} \cdot p(\pi^*(D_j) \mid Q) + \sum_{\langle \pi^*(D_j), t_{m_j} \rangle} w_{m_j j} \, p(\pi(D_j) \mid Q)$$

Focusing our attention on the second addend in the previous expression:

$$\sum_{\langle \pi^*(D_j), t_{m_j} \rangle} w_{m_j j} \, p(\pi(D_j) \mid Q) = w_{m_j j} \cdot \sum_{\langle \pi^*(D_j), t_{m_j} \rangle} p(\pi(D_j) \mid Q)$$

which implies that we are considering all the possible configurations in $\pi^*(D_j)$, and therefore the final result is $w_{m_j j} \cdot p(t_{m_j} \mid Q)$. Note that $p(t_{m_j} \mid Q)$ has been obtained previously by applying the exact propagation process in the term layer. Therefore,

$$p(d_j \mid Q) = \sum_{\pi^*(D_j)} \sum_{T_i \in R(\pi^*(D_j))} w_{ij} \cdot p(\pi^*(D_j) \mid Q) + w_{m_j j} \cdot p(t_{m_j} \mid Q)$$

It should be noted that the first addend is completely analogous to the initial expression, Eq. (6), but with the term $T_{m_j}$ removed. We now repeat the process applied to this first addend to remove a new variable, $T_{m_j-1}$, and extract the addend $w_{m_j-1\,j} \cdot p(t_{m_j-1} \mid Q)$. By repeating the process until we have removed all the terms, we obtain the final expression of the probability of a document given all the evidences:

$$p(d_j \mid Q) = \sum_{i=1}^{m_j} w_{ij} \cdot p(t_i \mid Q)$$

We conclude that we can compute the probability $p(d_j \mid Q)$ exactly, $\forall D_j$, by running an exact propagation only in the term layer. □

4.4. Two modifications to the basic retrieval process in the BNRM

Until now, we have assumed that the presence of a term in a query implies its instantiation as relevant, assigning the same 'strength' to all of them. However, it might be interesting to highlight the importance of a term with respect to others that could be classified as secondary. This information is naturally used in the vector space model [22].

In the BNRM, those terms which occur more frequently in the query will have a greater influence in the propagation stage than those that appear fewer times. In order to put this idea into practice, we can clone these query terms in the network. Therefore, if the query frequency (qf) of a term $T_i$ were three, then two new fictitious nodes would be created in the network with the same information contained in the node $T_i$. These nodes would be used in the evaluation of each document indexed by $T_i$. In this case, the posterior probability can be computed (see [12]) as

$$p(d_j \mid Q) = \sum_{i=1}^{m_j} w_{ij} \cdot p(t_i \mid Q) \cdot [qf_i]$$

In this expression, the factor $[qf_i]$ will have the value 1 if the $i$th term is not in the query, and the corresponding $qf_i$ if it occurs, which is why it has been noted between brackets.

A second modification to the basic process set out in the previous subsection is the following: a high posterior probability of relevance after propagating might be due to a positive influence of the instantiated query terms on a document, or to the fact that the prior probability is high and the influence received from the query terms does not decrease the posterior probability of relevance of the document. The first case implies that the document is very relevant. In the second case, however, if documents are ranked according to their posterior probabilities, mistakes can be made, giving greater importance to a document whose belief has only slightly increased and thereby worsening the retrieval performance.

Therefore, the ranking can be generated by taking the difference $p(d_j \mid Q) - p(d_j)$, $\forall D_j$, into account. In this case, the important fact is the relative value (the increment of probability) and not the particular value of $p(d_j \mid Q)$.
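Both modifications slot directly into the evaluation stage. The following sketch combines them for illustration (the paper evaluates them separately); all numbers are invented:

```python
# Sketch of the two modifications: (a) weight each addend by the query
# frequency qf_i of the term (1 when the term is not in the query), and
# (b) rank by the increment p(dj | Q) - p(dj) instead of p(dj | Q) alone.
posterior = {"t1": 0.9, "t2": 0.5}    # p(ti | Q) after propagation
prior = {"t1": 0.1, "t2": 0.1}        # p(ti) with no evidence instantiated
weights = {"t1": 0.2, "t2": 0.4}      # w_ij for one document dj
qf = {"t1": 3}                        # t1 occurs three times in the query

def score(term_probs):
    # p(dj | .) = sum_i w_ij * p(ti | .) * [qf_i]
    return sum(w * term_probs[t] * qf.get(t, 1) for t, w in weights.items())

p_post = score(posterior)             # posterior score of dj
p_prior = score(prior)                # prior score of dj
print(p_post, p_post - p_prior)       # 0.74 and the increment 0.64
```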

5. Measuring the performance of the model: experiments and results

Our objective in this section is to measure the effectiveness of the retrieval. This evaluation can be carried out with different methods, but the main one is that based on recall and precision estimates [22,28]. The first measures the ability of the IR system to present all the relevant documents (recall = number of relevant documents retrieved / number of relevant documents). The second, precision, measures its ability to present only the relevant documents (precision = number of relevant documents retrieved / number of documents retrieved). For each experiment, we give the mean precision over the eleven standard recall points [22], A-11PTS.
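For a single query, the two measures can be computed as follows (a sketch with invented relevance judgments):

```python
# Sketch: recall and precision for one query. The relevance judgments
# and the retrieved list below are invented.
relevant = {"d1", "d3", "d7"}         # documents judged relevant
retrieved = ["d1", "d2", "d3", "d5"]  # ranked system output

hits = sum(1 for d in retrieved if d in relevant)
recall = hits / len(relevant)         # fraction of relevant docs found
precision = hits / len(retrieved)     # fraction of output that is relevant
print(recall, precision)              # 0.666..., 0.5
```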

In this section, we present the experimentation that we have carried out in order to determine the quality of the proposed model. We have applied the BNRM to five well-known test document collections: ADI, CACM, CISI, CRANFIELD and MEDLARS. The main characteristics of these collections with respect to the number of documents, terms and queries are shown in Table 1.

For each collection, our objective is to show the behavior of the proposed methodology and to compare the results obtained with other IR systems, such as SMART, based on the vector space model,⁶ and with other Bayesian network-based models such as the Inference Network Model.

All the collections have been indexed by SMART, not only for the experiments carried out by our system, but also for those run by SMART and the INM. Specifically, after removing stop words, a stemming process is run, reducing each word to its corresponding stem. The SMART weighting scheme with which the experiments have been run is ''ntc'', because it is the one with which this IRS obtains the best results on these five test collections.

The specific weights $w_{ij}$, for each document $D_j$ and each term $T_i \in D_j$, used by our model (see Eq. (3)) are:⁷

$$w_{ij} = \alpha^{-1}\, \frac{tf_{ij} \cdot idf_i^2}{\sqrt{\sum_{T_k \in D_j} tf_{kj} \cdot idf_k^2}} \qquad (8)$$

Table 1
Main features of the standard test collections

Collection    No. documents    No. terms    No. queries
ADI           82               828          35
CACM          3204             7562         64
CISI          1460             4985         112
CRANFIELD     1398             3857         225
MEDLARS       1033             7170         30

6 These values can also be considered as the ones obtained with the Belief Network Model when simulating the vector space model.

7 Other probability functions designed for the BNRM are shown in [12].


where $tf_{ij}$ is the frequency of the term $T_i$ in document $D_j$, $idf_i$ is the inverse document frequency of the term $T_i$ in the collection,⁸ and $\alpha$ is a normalizing constant (to ensure that $\sum_{T_i \in D_j} w_{ij} \le 1$, $\forall D_j \in \mathcal{D}$). This weight has been designed with a form similar to those used in the cosine similarity formula [22].
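A sketch of Eq. (8) with invented collection statistics. Taking 'lg' in footnote 8 as base-10 logarithm, and choosing $\alpha$ so that the weights of the document sum exactly to 1, are assumptions of this sketch:

```python
import math

# Sketch of Eq. (8): tf-idf based weights for the probability function.
N = 1000                                   # documents in the collection
doc_freq = {"t1": 10, "t2": 200}           # n_i: docs containing each term
tf = {"t1": 3, "t2": 1}                    # term frequencies inside dj

def idf(term):
    # Footnote 8: idf_i = lg(N / n_i); base 10 is assumed here.
    return math.log10(N / doc_freq[term])

raw = {t: tf[t] * idf(t) ** 2 for t in tf}
denom = math.sqrt(sum(raw.values()))
# alpha must guarantee sum_i w_ij <= 1; taking alpha equal to the sum of
# the unnormalized weights is one simple choice (a sketch assumption).
alpha = sum(v / denom for v in raw.values())
w = {t: (1 / alpha) * v / denom for t, v in raw.items()}
print(w, sum(w.values()))                  # the weights sum to 1 here
```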

First, we shall consider the effect of the structure of the term layer, i.e. the set of dependence relationships between terms, on the performance of the system. As mentioned in Section 4.1, we restrict the topology of this subnetwork to a polytree (mainly for reasons of efficiency). We shall therefore consider two different polytrees that have been obtained using the same learning algorithm: the first, where we consider the dependences between all the terms in the collection, $\mathcal{T}$; and the second, where we only consider the relationships in a subset of $\mathcal{T}$, $\mathcal{T}^*$, which only contains those terms that can be considered good discriminators between relevant and non-relevant documents. Thus, the terms in $\mathcal{T} \setminus \mathcal{T}^*$ will only be connected to the documents to which they belong. In order to obtain the $\mathcal{T}^*$ set, we consider a frequency-based approach: $T_i \in \mathcal{T}^*$ if the term has a document frequency in the interval $[5, N/10]$, with $N$ being the number of documents in the collection. A more detailed study of this problem can be found in [7].

Table 2 shows the average precision over the eleven standard recall points for all the queries of each collection obtained by SMART [22] and the Inference Network Model (INM) [24], together with the behavior of our model in each particular experiment.

With respect to the INM, we have built our own implementation, and used the configuration parameters proposed by Turtle in [24]:

$$p(t_i \mid d_j = \text{true}) = 0.4 + 0.6 \cdot tf_{ij} \cdot idf_i \quad \text{and} \quad p(t_i \mid \text{all parents false}) = 0.3 \qquad (9)$$

Taking these results into account, we can conclude that the proposed methodology shows good performance, comparable with SMART and the INM. The best results, however, were obtained by considering particular combinations of the parameters: on the one hand, if we fix the structure of the term layer and study the effect of including (or not) the 'qf' factor in the evaluation of the probability function, the behavior is quite homogeneous and clearly collection-dependent (CACM and CISI support the use of the frequency of query terms, whereas CRANFIELD and MEDLARS do not). On the other hand, when fixing the inclusion of the 'qf' factor, we generally obtain better results when considering all the terms in the collection. If we do not include the 'qf' factor, it seems convenient to consider the reduced model.

8 $idf_i = \lg(N/n_i)$, where $N$ is the number of documents in the collection, and $n_i$ is the number of documents that contain the $i$th term.

Our last experiment attempted to discern which method is better for generating the document ranking: sorting the documents according to their probability of relevance, $p(d_j \mid Q)$, or by means of the difference between the posterior and prior probabilities, $p(d_j \mid Q) - p(d_j)$. In this case, and in order to reduce the number of experiments, we carried out this last test using all the terms in the collection ($\mathcal{T}$). We also considered the best results obtained in each case: the use of the 'qf' factor for ADI, CACM and CISI, and CRANFIELD and MEDLARS without it. The results are presented in the last column of Table 2. Again, it seems that we have a collection-dependent behavior: CRANFIELD and MEDLARS perform better when considering the difference of probabilities, and the opposite is true for the other collections.

6. Conclusions and future work

In this paper, we have presented an IR model based on Bayesian networks. The topology of the network (qualitative information) that supports the model has been specified, as well as the different probability distributions stored in the nodes (quantitative information). Finally, we have provided it with an inference mechanism to retrieve documents.

IR is a very complex problem, due not only to the intrinsic uncertainty related to many aspects of the field, but also to the size of the problem in terms of the number of documents and terms. We have therefore had to develop several techniques able to confront the complexity of the problems considered:

• Regarding the learning problem, we propose that the structure be restricted to a type of simplified network (polytrees) in order to reach acceptable learning and, above all, acceptable inference times. We have also used a specific algorithm which combines useful methodologies from other existing algorithms and incorporates particular features.

• Estimating the quantitative information also presents a very important problem: the huge number of parents that document nodes have in the network. Consequently, this makes any attempt to estimate and later store the probability distributions impossible. It was for this reason that we developed the probability functions, which allow the probability matrices to be used implicitly.

Table 2
Experiments with the BNRM with and without 'qf'

Exp.    SMART   INM     T*, not 'qf'  T* and 'qf'  T, not 'qf'  T and 'qf'  P(d|Q)−P(d)
ADI     0.4706  0.4612  0.4632        0.4605       0.4130       0.4613      0.4581
CACM    0.3768  0.3974  0.3692        0.3983       0.3759       0.4046      0.3996
CISI    0.2459  0.2498  0.2104        0.2454       0.2007       0.2301      0.2299
CRAN.   0.4294  0.4367  0.4395        0.4101       0.4314       0.4116      0.4421
MED     0.5446  0.5534  0.6180        0.5764       0.6200       0.5792      0.6407

• The last problem is the inference in our model, because the common methods of propagation in Bayesian networks showed themselves to be totally unable to deal with the large size of the networks. We therefore designed an inference technique, Propagation + Evaluation, completely adapted to our topologies. With this mechanism, we obtain some benefits from the specific topology of our network as well as from the probability functions. Consequently, we have been able to put exact inference in a globally complex network into practice.

The performance of the BNR model clearly depends on the collection as well as on the values of the different parameters. For instance, the inclusion of the frequency of the terms of a query in the evaluation of the probability function is good for three collections and not so good for the rest. The same situation arises, when the document ranking is generated, with the use of the difference between probabilities instead of only the posterior probability.

The experimentation that we have carried out has attempted to determine the suitability of our new model for document retrieval. In this way, we were able to clearly observe our model's behavior without being distracted from our main objective by any other element (for instance, the management of larger collections such as those included in TREC). Of course, our next target will be to test our model with real-size collections.

With respect to this last matter, the application of this model to very large document collections will require some modifications. On the one hand, considering in a first stage the learning of the polytree with all the terms, since this task must be carried out only once, the learning time is not so important. Despite this fact, using the idea presented in Section 5, by which only a subset of the dependences among terms in the collection would be represented in the polytree, the time required to learn this type of graph can be reduced. A more developed selection method than the one presented in this paper is shown in [9], where, using a clustering algorithm, terms are divided into two sets: those classified as 'good' and those labeled as 'bad' from the point of view of retrieval. The size of the first class is usually very small, allowing fast learning and, subsequently, also fast propagation at retrieval time.

On the other hand, once the polytree has been built, we could apply the techniques presented in [8] to reduce the propagation time. These two approximation methods, modifications of Pearl's propagation algorithm [18], try to save time by not performing unnecessary inference steps. The reduction in time is considerable, with the additional advantage that the loss in terms of retrieval effectiveness is almost null. Putting these techniques into practice, the size of the polytree could be very large.

Another future line of research that we are considering is the study of new mechanisms to represent the dependence relationships between terms and/or documents. We are also interested in the development of a method to automatically set up the best values for each parameter of the BNR model. This will be put into practice by analyzing the main characteristics of several collections (idf, document length, term lengths in the inverted file, and so on) and attempting to obtain common patterns between them.

Acknowledgements

This work has been supported by the Spanish MCYT and FIS, under Projects TIC2000-1351 and PI021147, respectively.

References

[2] N. Christofides, Graph Theory: An Algorithmic Approach, Academic Press, New York, 1975.
[3] F. Crestani, M. Lalmas, C.J. van Rijsbergen, I. Campbell, Is this document relevant?... probably. A survey of probabilistic models in information retrieval, ACM Computing Surveys 30 (4) (1998) 528–552.
[4] L.M. de Campos, Independency relationships and learning algorithms for singly connected networks, Journal of Experimental and Theoretical Artificial Intelligence 10 (4) (1998) 511–549.
[5] L.M. de Campos, J.M. Fernández, J.F. Huete, Query expansion in information retrieval systems using a Bayesian network-based thesaurus, in: Proceedings of the 14th Uncertainty in Artificial Intelligence Conference, 1998, pp. 53–60.
[7] L.M. de Campos, J.M. Fernández, J.F. Huete, Reducing term to term relationships in an extended Bayesian network retrieval model, in: Proceedings of the 11th Information Processing and Management of Uncertainty in Knowledge-based Systems Conference, 2002, pp. 543–552.
[8] L.M. de Campos, J.M. Fernández, J.F. Huete, Reducing term to term relationships in an extended Bayesian network retrieval model, in: Proceedings of the First Workshop on Probabilistic Graphical Models, 2002, pp. 35–44.
[9] L.M. de Campos, J.M. Fernández, J.F. Huete, Improving the efficiency of the Bayesian network retrieval model by reducing the relationships among terms, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, in press.
[10] G.F. Cooper, Probabilistic inference using belief networks is NP-hard, Artificial Intelligence 42 (1990) 393–405.
[12] J.M. Fernández-Luna, Modelos de recuperación de información basados en redes de creencia, Ph.D. thesis, Universidad de Granada, 2001.
[16] L. Helm, Improbable inspiration, Los Angeles Times, 1996.
[17] M. Indrawan, D. Ghazfan, B. Srinivasan, Using Bayesian networks as retrieval engines, in: Proceedings of the Sixth Text Retrieval Conference, 1996, pp. 437–444.
[18] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Mateo, CA, 1988.
[19] G. Rebane, J. Pearl, The recovery of causal polytrees from statistical data, in: Uncertainty in Artificial Intelligence, North-Holland, Amsterdam, 1989, pp. 175–182.
[20] B.A. Ribeiro-Neto, R.R. Muntz, A belief network model for IR, in: Proceedings of the 19th ACM SIGIR Conference, 1996, pp. 253–260.
