文档库 最新最全的文档下载
当前位置:文档库 › An Evolutionary Multiobjective Approach for Community Discovery in Dynamic Networks

An Evolutionary Multiobjective Approach for Community Discovery in Dynamic Networks

1 An Evolutionary Multiobjective Approach for Community Discovery in Dynamic Networks

Francesco Folino,Clara Pizzuti

Abstract—The discovery of evolving communities in dynamic networks is an important research topic that poses challenging tasks.Evolutionary clustering is a recent framework for clustering dynamic networks that introduces the concept of temporal smoothness inside the community structure detection method.Evolutionary-based clustering approaches try to maximize cluster accuracy with respect to incoming data of the current time step,and minimize clustering drift from one time step to the successive one.In order to optimize both these two competing objectives,an input parameter that controls the preference degree of a user towards either the snapshot quality or the temporal quality is needed.

In this paper the detection of communities with temporal smoothness is formulated as a multiobjective problem and a method based on genetic algorithms is proposed.The main advantage of the algorithm is that it automatically provides a solution representing the best trade-off between the accuracy of the clustering obtained,and the deviation from one time step to the successive.Experiments on synthetic data sets show the very good performance of the method when compared with state-of-

the-art approaches.

Index Terms—Evolutionary clustering,complex networks,dynamic networks,community discovery.

!

1I NTRODUCTION

The adaptability of networks to represent many real world complex systems,including those undergoing dynamic shifts of their structure,is generating a growing interest in the study of their topological https://www.wendangku.net/doc/f66387069.html,works are modeled as graphs,where nodes represent individual ob-jects,and edges represent interactions among these ob-jects.Individuals in a network interact with each other and exchange information by forming communities.The detection of community structure,i.e.the organization of nodes into groups having many connections inside the same cluster and relatively sparse connections between vertices of different communities,is a fundamental research topic in the study of complex networks.An important standpoint to analyze in networks is their dynamic behavior,i.e.the evolutions they go through over time.

Dynamic networks,in fact,capture the modi?cations of interconnections over time,allowing to trace the changes of network structure at different time steps.Many approaches have been proposed for the analysis and temporal evolution of dynamic networks[1],[2],[3],[4],[5],[6],[7],[8],[9], [10],[11],[12].Some of these methods[2],[5],[10],[3] employ the concept of evolutionary clustering,introduced by Chakrabarti et al.in[13],to catch the evolution of clusters in temporal data.

Evolutionary clustering groups data coming at different ?F.Folino is with National Research Council of Italy(CNR),Institute for High Performance Computing and networking(ICAR),Via Pietro Bucci,41C,87036Rende(CS),Italy,Email:{ffolino}@https://www.wendangku.net/doc/f66387069.html,r.it ?Clara Pizzuti is with National Research Council of Italy(CNR), Institute for High Performance Computing and networking (ICAR),Via Pietro Bucci,41C,87036Rende(CS),Italy,Email: {pizzuti}@https://www.wendangku.net/doc/f66387069.html,r.it time steps to produce a sequence of clusterings by in-troducing a framework called temporal smoothness.This framework assumes that abrupt changes of clustering in a short time period are not desirable,thus it smooths each community over time.Smoothness is realized by trading-off between two different criteria.The?rst,called snapshot quality,is that the clustering should re?ect as accurately as possible the data coming during the current time step.The second,called temporal cost,is that each clustering should not dramatically shift from one time step to the next one. In this paper a multiobjective approach,named DYN-MOGA(DYNamic MultiObjective Genetic Algorithms),to discover communities in dynamic networks by employing genetic algorithms[14],is proposed.The detection of community structure with temporal smoothness,in fact,can be formulated as a multiobjective optimization problem.The ?rst objective is the maximization of the snapshot quality, that measures how well the clustering found represents the data at the current time.The second objective is the minimization of the temporal cost,that measures the distance between two clusterings at consecutive time steps. In order to maximize the snapshot quality to measure the goodness of the division in communities of a network, different quality measures,used in the literature to capture the intuition of network community[15],are considered.In particular,we?x the attention on the concept of modularity, introduced by Newman and Girvan in[16].

To minimize the temporal cost we compute the Nor-malized Mutual Information(NMI for short),a well known entropy measure in information theory that measures the similarity of two clusterings,between the community structure obtained at the current time step with that obtained at the previous one[17].

DYNMOGA exploits the bene?ts of these two functions

Digital Object Indentifier 10.1109/TKDE.2013.1311041-4347/13/$31.00 ? 2013 IEEE

2

and discovers the communities in the network by selectively exploring the search space,without the need to know in advance the exact number of groups.This number is automatically determined by simultaneously optimizing the objectives.

Experiments on synthetic and real life networks show the capability of the multiobjective genetic approach to correctly detect communities with results competitive w.r.t. the state-of-the-art approaches.

It is worth to note that,though multiobjective evolution-ary algorithms have been proposed for partitioning static graphs[18],[19],[20],and for data clustering[21],their use for dynamic networks,however,has not been explored very much[22].

The main contributions of the paper can be summarized as follows:

?The detection of community structure in dynamic networks is formalized as a multiobjective optimiza-tion problem.The?rst objective searches for highly modular structures at the current time step,the second objective tries to minimize the differences between the community structure at the current time step and that obtained at the previous time step.

?The approach proposed can be considered as a general framework for evolutionary clustering.In fact,it is suf-?cient to change one of the two objective functions(or both)to implement and test different quality functions for analyzing dynamic networks.

?The method is parameter free.It does not need neither the parameter introduced by Chi et al.[2]to give different weight to the snapshot and temporal costs, nor the number of clusters to?nd.The former param-eter is automatically determined by the multiobjective method,while the number of communities to?nd at each time step is determined by the genetic represen-tation employed.

The paper is organized as follows.In Section2the concept of dynamic network is de?ned and the evolutionary cluster-ing problem is formalized.Section3gives a review of the evolutionary clustering approaches.Section4formulates the community detection problem in dynamic networks as a multiobjective optimization problem.Section5describes the method,the genetic representation adopted and the variation operators used.In section6the results of the method on synthetic and real life networks are presented, and a comparison with the approaches of[5]and[3]is reported.Section7,?nally,concludes the paper.

2P ROBLEM F ORMULATION

Let{1,...,T}be a?nite set of time steps and V= {1,...,n}be a set of individuals or objects.A static network N t at time t can be modeled as a graph G t= (V t,E t)where V t is a set of objects,called nodes or vertices,and E t is a set of links,called edges,that connect two elements of V t at time t.Thus G t is the graph representing a snapshot of the network N t at time t. V t?V is a subset of individuals V observed at time t.An edge(u t,v t)∈E t if individuals u and v have interacted at time t.

A community(also called cluster or module)in a static network N t is a group of vertices V t i?V t having a high density of edges inside the group,and a low density of edges with the remaining nodes V t/V t i.Let C t denote the sub-graph representing a community.

A clustering,or community structure,CR t= {C t1,...C t k}of a network N t at time t is a partitioning of G t in groups of nodes such that for each couple of communities C t i and C t j∈CR t,V t i∩V t j=?.

A dynamic network is a sequence N={N1,...,N T}of static networks,where each N t is a snapshot of individuals and connections among these individuals at time t.

3R ELATED WORK

Analyzing networks and their evolution is recently receiv-ing increasing interest from researchers[7],[23],[24].In fact,the representation of many complex systems through a static graph,even when the temporal dimension describing the varying interconnections among nodes is available,does not allow to study the network dynamics and the changes it incurs over time.Kumar et al.[4]studied the evolution of network properties of two large blogosphere networks, and tried to classify members of each network into groups and their changes.Sun et al.[9]introduced GraphScope, an ef?cient parameter-free method,based on the Minimum Description Length principle,to discover communities in evolving graphs.Asur et al.[1]have characterized the evolution of communities by de?ning critical events that occur in dynamic graphs.Tang et al.[25]studied multimode networks and introduced a spectral clustering framework to discover communities and?nd out how they evolve.There are several other methods for detecting communities in dynamic networks,not mentioned here.For a recent survey see[26].Many of the proposed approaches divide the community detection step from the temporal analysis of the network,i.e.?rst the communities are extracted,and then the structural differences over time are analyzed in order to determine the correspondences among communities in two consecutive time steps.

A different methodology,known as evolutionary cluster-ing,has been proposed by Chakrabarti et al.in[13].Since our method has been inspired by the evolutionary clustering framework,in the following lines a more detailed descrip-tion of evolutionary clustering approaches is reported. Chakrabarti et al.in[13]observe that,often,changes of connections in short time periods could be caused by noise.Thus,though clustering mainly depends on the current object connections,in many dynamic application domains,abrupt drifts from the most recent history should not be expected.At each time step a new clustering must be produced by simultaneously optimizing two con?icting criteria.The?rst is that the clustering should re?ect,as accurately as possible,the data coming during the current time step.The second is that each clustering should not shift dramatically from one time step to the successive.

3

To satisfy this last property a framework,called temporal

smoothness,is de?ned.This framework assumes that abrupt

change of clustering in a short time period is not desirable, thus it smooths each community over time.For smoothing,

a cost function composed by two sub-costs,the snapshot

cost(SC)and the temporal cost(T C),is de?ned.The snapshot cost SC measures how well a community structure CR t represents the data at time t.The temporal cost T C measures how similar the community structure CR t is to the

previous clustering CR t?1.As pointed out by the authors,

the clustering algorithm must trade-off the bene?t of main-taining a consistent clustering over time(temporal cost) with the cost of deviating from an accurate representation of the current data(snapshot cost)[13].The framework of evolutionary clustering is apt in those situations in which the clustering result is frequently and regularly consumed by a user.Thus mild changes are preferred over dramatic shifts because in such a way the user is not required to learn a new data segmentation.Evolutionary clustering will provide a smooth view of the transition for successive time steps.In this setting it is possible to associate clusters within the historical context,and thus to trace their evolution. Though evolutionary clustering is robust to noise,it does not allow that the number of communities varies over time, neither that new communities may appear,nor that existing communities may dissolve.

Chakrabarti et al.proposed the framework for generic

data clustering,and applied it to two well known clustering

methods,k-means and agglomerative hierarchical cluster-ing,to deal with evolving data.Thus they introduced the cost function for generic data objects.A specialized version of this function in the context of dynamic networks has been ?rst introduced by Chi et al.[2]and adopted also by Lin et al.[5],and Kim and Han[3].Chi et al.[2]de?ne the cost function as follows:

cost=α·SC+(1?α)·T C(1) whereαis an input parameter used by the user to empha-size one of the two objectives.Whenα=1the approach returns the clustering without temporal smoothing.When α=0,however,the same clustering of the previous time step is produced,i.e.CR t=CR t?1.Thus a value between 0and1is used to control the preference degree of each sub-cost.The authors proposed two evolutionary spectral clustering algorithms by using the normalized cut concept of Shi and Malik[27].The two methods differ in how the temporal cost T C is measured.

An evolutionary-based framework for analyzing com-munities and evolutions in dynamic networks,named F acetNet,has been proposed by Lin et al.[5].The framework employs a stochastic block model for gener-ating communities,and a probabilistic model based on the Dirichlet distribution to catch community evolutions.The authors de?ne the snapshot cost by using the KL-divergence between the observed node similarity matrix at time t and an approximate community structure computed by using a mixture model.F acetNet thus,at each iteration,updates the values of the approximate structure in order to decrease the cost function.Convergence to a local optimal solution is guaranteed by the monotonic decrease of the cost function.

F acetNet discovers communities that maximize the?t to the observed data and the temporal evolution.

Kim and Han[3]proposed a particle-and-density based clustering method based on the evolutionary approach of Chakrabarti et al.[13],extended to deal with a variable number of communities between different time steps.The method introduces the concept of nano-community and l-clique-by-clique(l-KK)to discover a variable number of communities that can evolve,form,and dissolve.A nano-community captures the evolution of a dynamic network over time at particle level.A community is modeled as a dense subset of nano-communities and l-KK.A biclique is a complete bipartite graph such that two nodes are connected if and only if they are in different partites.Being complete, each node in a partite is connected with all the nodes in the other partite.An l-clique-by-clique is an extension of a biclique to a number l of bicliques.A cost embedding technique to allow temporal smoothing and a density-based clustering method to?nd local clusters by optimizing the clustering modularity are proposed.Furthermore,to map communities between consecutive time periods,the mutual information concept among the clusterings obtained at time t?1and t is employed,by purifying the link distributions between consecutive networks in order to maximize the mutual information.The mapping can determine the stage of each community as evolving,forming,splitting.It is worth to note that the approach does not take into account neither the possibility of a community to split into multiple communities,nor that multiple communities merge into only one.

The described methods present two main limitations.The ?rst regards the number of clusters to?nd,the second is relative to the valueαto choose in order to apply temporal smoothness.The approach of Chi et al.allows for a varying number k of communities between successive time steps,however the value of k must be given as input parameter.Some heuristics are argued by the authors in order to determine its optimal value.F acetNet assumes a?xed number k of communities over time,though the authors suggest that,to detect the best community number at time t,the algorithms could be executed for a range of different k values,and that giving the best value of modularity could be chosen.Kim and Han’s approach does not need the number of communities,however,since their approach is density-based,they must set the density parameters.Another problem is the value to use for the temporal smoothness.All the approaches must give it as input parameter in order to control the preference degree of a user with respect to either the snapshot quality or the temporal quality.The two quality functions,however,are competing.In fact,optimizing one produces a degradation of the other.

In the following we propose a parameter-free method that automatically determines both the number k of communi-ties,and the optimalαvalue.

4

4M ULTIOBJECTIVE C LUSTERING

Given a static network N t,a multiobjective evolutionary clustering problem(Ω,F1,F2,...,F h)can be de?ned as min F i(CR t),i=1,...,h subject to CR t∈ΩwhereΩ={CR t1,...,CR t k}is the set of feasible cluster-

ings of N t at time stamp t,and F={F1,F2,...,F h}is a set of h single criterion functions.Each F i:Ω→R is a different objective function that determines the feasibility of the clustering obtained.Since F is a vector of competing objectives that must be simultaneously optimized,there is not one unique solution to the problem,but a set of solutions are found through the use of Pareto optimality theory[28].Given two solutions CR1and CR2∈Ω, solution CR1is said to dominate solution CR2,denoted as CR1?CR2,if and only if

?i:F i(CR1)≤F i(CR2)∧?i s.t.F i(CR1)

The vector F maps the solution space into the objective function space.When the nondominated solutions are plot-ted in the objective space,they are called the Pareto front. Thus the Pareto front represents the compromise solutions satisfying all the objectives as best as possible.

In the last few years many efforts have been devoted to the application of evolutionary computation to the de-velopment of multiobjective optimization algorithms.Evo-lutionary algorithms,in fact,proved very successful to solve multiobjective optimization problems because of the population-based nature of the approach that allows the generation of several elements of the Pareto set in a single run[29],[14].

In the next section we propose a multiobjective evolu-tionary community detection approach that tries to optimize both the snapshot cost and the temporal cost without the need to?x the control parameterα.The solutions contained on the Pareto front of the multiobjective optimization problem will represent the best compromise satisfying both the snapshot and temporal costs.

It is worth to note that the word evolutionary has a different meaning with respect to the context in which it is used.For Chakrabarti et al.in[13]the term evolutionary is intended as temporal evolution.In the context of mul-tiobjective optimization it means evolutionary algorithms implementing the concept of Darwinian biological evolu-tion[30].

5T HE DYNMOGA ALGORITHM

The MultiObjective Genetic Algorithm(MOGA)we used is the Nondominated Sorting Genetic Algorithm(NSGA-II) proposed by Srinivas and Deb in[31]and implemented in the Genetic Algorithm and Direct Search Toolbox of MATLAB.NSGA-II builds a population of competing individuals and ranks them on the basis of nondominance (for a detailed description of the approach see[14]).In order to employ NSGA-II,DYNMOGA has been adapted with a customized population type that suitably repre-sents a partitioning of a network and endowed with two complementary objectives.In the following,we?rst give a brief introduction to Genetic Algorithms,and than the objective functions selected,the genetic encoding adopted, and the modi?ed variation operators used to work with this encoding are described.

Genetic Algorithms(GAs)are a class of adaptive general-purpose search techniques inspired by natural evolution proposed by Holland[32]in the early1970s as computer programs that simulate the evolution process in nature.A standard Genetic Algorithm evolves a constant-size popu-lation of individuals(called chromosomes)by using the genetic operators of reproduction,crossover and mutation. Each chromosome is composed by a number of genes and represents a candidate solution to a given problem.An individual is associated with a?tness value that re?ects how good it is,with respect to the other solutions in the popula-tion.Reproduction operator copies elements of the current population into the next generation with a selected strategy. Crossover generates two new chromosomes by crossing two elements of the population.Mutation randomly alters genes of individuals.

Objective Functions:As described in the previous section,we are interested in optimizing the cost function (formula(1))composed by the two competing objectives, the snapshot cost SC and the temporal cost T C.Since SC measures how well a community structure C t represents the data at time t,we need an objective function that maximizes the number of connections inside each community while minimizing the number of links between the communities. To this end we employ four quality measures,well known in the community detection?eld[15],that formalize the intuitive concept of community.

Let CR t={C t1,...C t k}be a clustering of a network G t=(V t,E t)at time t with n nodes and m edges,C t a cluster having n s nodes and m s edges,m s(u)={v| v∈C t}be the number of nodes in C t connected with u,c s={(u,v)|u∈C t,v/∈C t}be the number of edges on the boundary of C t,l s the total number of edges joining vertices inside the module C t s,and d s the sum of the degrees of the nodes of C t s.

The scores we consider are modularity[16], conductance[33],normalized cut[27],and community score[22].Their de?nition is as follows:

Q=

k

s=1

[l s

m

?(d s

2m

)2](2)

In modularity Q[16]the?rst term of each summand of is the fraction of edges inside a community,while the second one is the expected value of the fraction of edges that would be in the network if edges fall at random without regard

5

Fig.1.A network of7nodes partitioned in two communities{1,2,3,4}

and{5,6,7},and the corresponding locus-based representation.

to the community structure.Values approaching1indicate

strong community structure.

CO=

k

s=1

c s

2m s+c s(3)

Conductance CO[33]measures the fraction of edges

pointing outside the clustering.

NC=

k

s=1

c s

2m s+c s

+c s

2(m?m s)+c s(4)

Normalized Cut NC[27]measures the fraction of total

edge connections to all the nodes in the graph.

CS=

k

s=1

(

u∈C t

(m s

(v)

n s

)2)×

2m s

n s

(5)

Community Score CS[22]measures the fraction of internal

edges of each cluster per nodes.

The second objective must minimize the temporal cost

T C,thus we need a metric to measure how similar the

community structure CR t is to the previous clustering

CR t?1.To this end we employ the Normalized Mutual

Information,a well known entropy measure in information

theory[17].Given two partitionings A={A1,...,A a}

and B={B1,...,B b}of a network in communities,let C

be the confusion matrix whose element C ij is the number

of nodes of the community A i∈A that are also in the

community B j∈B.The normalized mutual information

NMI(A,B)is de?ned as:

NMI(A,B)=

?2

c A

i=1

c B

j=1

C ij log(C ij N/C i.C.j)

c A

i=1

C i.log(C i./N)+

c B

j=1

C.j log(C.j/N)

(6)

where c A(c B)is the number of groups in the partitioning

A(B),C i.(C.j)is the sum of the elements of C in row

i(column j),and N is the number of nodes.If A=B,

NMI(A,B)=1.If A and B are completely different,

NMI(A,B)=0.Thus our second objective at a generic

time step t is to maximize NMI(CR t,CR t?1).

Genetic representation:Our clustering algorithm uses

the locus-based adjacency representation proposed in[34],

adopted in[21]for data clustering and in[20]for com-

munity detection in static networks.In this graph-based

representation,an individual of the population consists of

n genes g1,...,g n,where n is the number of nodes.Each

gene can assume a value j in the range{1,...,n}.Genes

represent nodes of the graph G=(V,E)modeling a

network N,and a value j assigned to the i-th gene is

interpreted as a link between the nodes i and j of V.This

means that in the clustering solution found,i and j will be

in the same cluster.A decoding step,however,is necessary

to identify all the components of the corresponding graph.

The nodes participating in the same component are assigned

to one cluster.A main advantage of this representation is

that the number k of clusters is automatically determined by

the number of components contained in an individual and

determined by the decoding step.Figure1shows a network

partition and the corresponding encoded genotype.

Input:Given a dynamic network N={N1,...,N T},the sequence

of graphs G={G1,...,G T}modeling it,and the number T

of time steps.

Output:A clustering for each network N i of N.

Method:Perform the following steps:

1Generate an initial clustering CR1={C11,...C1k}of the

network N1without smoothing by optimizing only the?rst

objective;

2for t=2to T

3Create a population of random individuals whose

length equals the number n=|V t|of nodes of G t;

4while termination condition is not satis?ed do

5Decode each individual I={g1,...,g n}of the

population to generate the partitioning

CR t={C t1,...,C t k}of the graph G t in k

connected components;

6Evaluate the two?tness values of the translated

individuals;

7Assign a rank to each individual and sort them

according to nondomination rank;

8Create a new population of offspring by applying the

variation operators;

9Combine the parents and offspring into a new pool

and partition it into fronts;

10Select points on the lower front(with lower rank)and

apply the variation operators on them to create

the next population;

11end while

12return the solution CR t={C t1,...C t k}of the

Pareto front having the maximum modularity value;

13end for

Fig.2.The pseudo-code of the DYNMOGA algorithm.

Initialization:A population of random individuals is

generated such that for each node i,the value of g i is

randomly chosen among one of its neighboring nodes j.

This means that the edge(i,j)exists.

Uniform Crossover:Given two parents,a random binary

mask is created.Uniform crossover(see Table1)then

selects the genes where the mask is a0from the?rst

parent,and the genes where the mask is a1from the second

parent,and combines the genes to form the child.The child

at each position i contains a value j coming from one of

the two parents.Thus uniform crossover maintains node

connections in the child individual since the edge(i,j)

exists.

TABLE1

Example of uniform crossover.

Parent1:4322656

Parent2:3315476

Mask:0110011

O?spring4312676

6

Fig. 3.Normalized mutual information for different combinations of crossover and mutation rates for sysnthetic dataset#1.

Mutation:The mutation operator,similarly to initializa-tion,for each node i randomly changes the value of g i with one of the neighbors of i.This mutation guarantees the generation of a mutated child in which each node is linked only with one of its neighbors.

The pseudo-code of DYNMOGA is reported in Figure 2.Given a dynamic network N={N1,...,N T}and the sequence of graphs G={G1,...,G T}modeling it, DYNMOGA?nds a partitioning of the network N1by running the genetic algorithm that optimizes only the?rst objective.For a given number of time steps,the multiob-jective genetic algorithm creates a population of random individuals whose length is the number of nodes of the current graph G t.Then,for a?xed number of generations, it decodes the individuals to generate the partitioning at time step t,evaluates the objective values,assigns a rank to each individual according to Pareto dominance and sorts them.A new population is generated by applying the specialized variation operators described above.Parents and offspring are then combined,and the new pool is partitioned into fronts.The individuals with the lower rank are selected and variation operators are applied on them to create the new population.At the end of each time step DYNMOGA returns a set of solutions,i.e.all those contained in the Pareto front.Each of these solutions corresponds to a different trade-off between the two objectives and thus to diverse partitioning of the network consisting of various number of clusters.A criterion should be established to automatically select one solution with respect to another.To this end,we choose the partitioning having the highest value of modularity.This choice is motivated by the fact that, since the Pareto front already selected the nondominated solutions that satisfy the snapshot and the temporal cost at the best,a solution presenting a better community structure is preferable.

Computational Complexity:DYNMOGA uses the Non-dominated Sorting Genetic Algorithm(NSGA-II)proposed by Deb et al.in[36].NSGA-II builds a population of competing individuals and ranks them on the basis of non-dominance.In[37]Jensen proposed an ef?cient algorithm for nondominated sorting that can be applied for NSGA-II. Jensen showed that the run-time complexity of the NSGA-II algorithm is O(gp log h?1p),where g is the number of generations,p is the population size,and h is the number of objective functions.Since DYNMOGA optimizes two objectives,its complexity will be O(gp log p).At each generation,however,genetic operators must be ex-ecuted.In particular,crossover needs O(n)time,mutation O(1)time,while?tness computation is composed of three terms:decoding of an individual,modularity and NMI computation.Decoding can be ef?ciently performed by using the disjoint set algorithm described in[35],that represents sets by rooted trees,where each set corresponds to a community.With this data structure the decoding step requires O(n log n)time[35].

To compute modularity we need to consider,for each node i its d i neighbors,thus the time complexity is O(m), where m is the number of edges.As regards normalized mutual information,it has been shown[38]that it can be ef?ciently computed in O(n)time.Fitness computation can thus be realized in O(n log n)+O(m)+O(n) time.Therefore,the overall complexity of DYNMOGA is O((gp log p)×(n log n+m)).

In the next section we show that DYNMOGA is able to ?nd meaningful network structure for both synthetic and real life data sets.

6E XPERIMENTAL R ESULTS

In this section we study the effectiveness of our approach by?rst employing modularity as objective function that optimizes the snapshot cost,and compare the results ob-tained by DYNMOGA w.r.t.the algorithm of Lin et al.[5] and Kim and Han[3]on synthetic networks for which the partitioning in communities is known.In such a case we show that our multiobjective genetic algorithm successfully detects the network structure and it is very competitive with respect to the other approaches.In Section6.7we then show the behavior of the method when conductance,normalized cut and community score are applied,and report the results obtained by each objective.Finally,we apply our method on two real-world networks.

The DYNMOGA algorithm has been written in MAT-LAB,using both the Genetic Algorithms and Direct Search 2toolboxes.Parameter setting is a challenging research problem in evolutionary algorithms.In[39]the problem has been deeply investigated and the authors proved that, though it is possible to?nd good parameter values for a set of problems,general tuning that allows for good perfor-mance on a wide range of problems is dif?cult.As regards DYNMOGA,we employed a trial-and-error procedure on one of the synthetic dataset described below(dataset#1)by computing the normalized mutual information for different combinations of crossover and mutation rates.Figure3 shows the obtained values.It can be observed that they do not present high variation,thus we set crossover rate=0.8 and mutation rate=0.2,since,in general,high crossover rate and low mutation rate are suggested in the literature. Furthermore,we?xed elite reproduction=10%of the population size,roulette selection function,population size =100,and number of generations100.

6.1Evaluation measures

In order to compare DYNMOGA and the other approaches on the synthetic data sets,two validation measures,the

7

normalized mutual information(NMI),already described in the previous section,and the error rate are employed.The error rate,as reported in[5],is computed by considering an n×k indicator matrix Z storing,for each of the n nodes, the community membership to one of the k communities obtained by the algorithm,and a similar indicator matrix G built for the ground truth.The error rate is then de?ned as the norm||ZZ T?GG T||,which measures the distance between the community structures represented by Z and G.

6.2Synthetic dataset#1

The?rst dataset we consider is the benchmark adopted by Lin et al.in[5].This data set is generated by the authors analogously to the classical benchmark proposed by Girvan and Newman in[40].The network consists of128nodes divided into four communities of32nodes each.Every node has a?xed average degree avgDegree,and shares a number z of links with the nodes not belonging to its community.Increasing z augments the noise level of the network.Two different values of avgDegree and z have been considered.When avgDegree=16a less dense network is generated,while with avgDegree=20we have a denser network.As regards z,a value z=5generates a more distinct community structure,while?xing z=6,the community structure becomes less clear.Edges are placed with higher probability between a pair of nodes in the same community,and with lower probability between nodes in different communities.These probabilities depend on the value of z.In order to introduce dynamics in G,nC%of nodes are moved among communities.To this purpose,two different cases have been considered.In the former,the community structure presents soft changes,thus the10% of nodes are randomly selected from each community,and randomly assigned to the other three communities.In the second case the changes are more substantial,thus the30% of nodes change their community at each time step.We considered20time steps and the values averaged over50 experiments are reported.

Figures4and5show the error rate and the normalized mutual information obtained by DYNMOGA and F acetNet for the different networks built with the parameter values described.The results for F acetNet have been obtained forα=0.8,value chosen by the authors in their paper[5]. In particular,Figure4reports the error rates when(a)z= 5and percentage of node changes nC=10%,(b)z=5 and percentage of node changes nC=30%,(c)z=6and nC=10%,(d)z=6and nC=30%.Figure5reports the normalized mutual information for the same con?guration. These two?gures point out that DYNMOGA obtains lower error rate for all the4synthetic networks,at each time step.As regards the NMI,when the fuzziness is low and the community structure changes are mild(Figure5(a)),the NMI of the two methods are comparable.However,when the fuzziness increases,thus the networks present more dramatic changes,DYNMOGA is able to better detect the true community structure.This behavior is clear in Figures 5(c)and(d).

F acetNet needs that the parameterαbe?xed by the user.Lin et al.[5]point out that automatically?nding the

best value ofαhas not been investigated,and that setting a value pushes the algorithm to prefer a result more biased

towards either snapshot quality or temporal smoothness.

Figure6shows the error rate obtained by F acetNet when the average degree has been?xed to20,for increasing

values of the parameterα,compared with that obtained by DYNMOGA.Figure6(a)shows the results for z=5and

nC=30%,while Figure6(b)for z=6and nC=30%.

In the former case,DYNMOGA outperforms F acetNet whenα≤0.7.In the latter case,the errors obtained by

DYNMOGA are always lower,independently of the value

ofα,except for slight differences at three time steps for α≥0.8.The?gure points out that the choice ofαis not an easy task because it in?uences the quality of the results.As already emphasized,DYNMOGA is able to automatically

determine the best tradeoff between the two competing

objectives.

6.3Synthetic dataset#2

The second synthetic dataset has been generated by taking into account some main events that may characterize the

evolution of dynamic networks[7],[41],[1],[42].To this end we assumed four types of events,as introduced by

Greene et al.in[42].The events are the following:?Birth and death:from the second time step on,10%of new communities are created by removing nodes from

other existing communities,and randomly removing 10%of the existing communities.

?Expansion and contraction:10%of communities are randomly selected and expanded or contracted by25% of their size.When expanded,the new nodes are chosen at random from the other communities.?Intermittent communities:10%of communities from the?rst time step are hide.

?Merging and splitting:at each time step,10%of com-munities are split,10%of communities are chosen, and couples of communities are merged.

We generated four synthetic data sets for the four dif-ferent types of events,for20time steps.The parameters to the generator have been set such that each network is constituted by1000nodes having mean degree of15 and maximum degree50,number of communities between 20and50,and mixing parameter(percentage of edges between communities)0.2.Figures7and8depict the error rate and the normalized mutual information on the four different data sets.The?gures clearly show that DYNMOGA outperforms F acetNet on all these four types of networks.In particular,it is worth to note that the error rate of DYNMOGA is the same or higher than that obtained by F acetNet at the?rst time step.However,from the second time step on,the error rate of DYNMOGA is sensibly lower,and the NMI value is notably higher.Note that that the behavior of F acetNet is rather different with respect to the type of event that can occur.In particular,in the case of birth/death of communities(Figure8(a))its NMI

(a)

(b)

(c)

(d)

Fig.4.

Error rate on synthetic dataset #1when avgDegree =16:(a)z =5and nC =10%,(b)z =5and nC =30%,(c)z =6and nC =10%,(d)z =6and nC

=30%.

(a)(b)

(c)

(d)

Fig.5.

NMI on synthetic dataset #1when avgDegree =16:(a)z =5and nC =10%,(b)z =5and nC =30%,(c)z =6and nC =10%,(d)z =6and nC =30%.

(a)

(b)

Fig.6.Error rate on synthetic dataset #1when avgDegree =20:(a)z =5and nC =30%,(b)z =6and nC =30%,and αvarying between 0.1and 0.9.F N 0.i ,i=1,...9,stands for FacetNet result with input α=0.i

(a)(b)(c)(d)

Fig.7.

Error rate on synthetic dataset #2:(a)birth and death,(b)expansion and contraction,(c)intermittent communities,(d)merging and splitting.

drastically decreases from 0.99to 0.4over the 20time steps,while DYNMOGA maintains an NMI value near to 1.As regards the other three events,the NMI obtained by F acetNet diminishes in a more soft way,reaching a value not less than 0.80,while DYNMOGA does not obtain values lower than 0.92.

As already reported,the results for F acetNet have been obtained by ?xing α=0.8.In order to check the in?uence of different values of αalso on this kind of data set,Figure 9shows the NMI values obtained when

αassumes the values 0,0.4,0.6,and 1,respectively,in case of expansion and contraction events.When α=0the NMI obtained by F acetNet drastically decreases,since the cost function biases the method to ?nd solutions more similar to the previous time step,completely disregarding the current snapshot.The opposite behavior is obtained when α=1.Although in this case F acetNet has to ?nd the community structure that best ?ts the current snapshot,independently from the previous time step,it obtains NMI values between 0.97and 0.83,which are lower than that

(a)

(b)

(c)(d)

Fig.8.Normalized mutual information on synthetic dataset#2:(a)birth and death,(b)expansion and contraction,(c)intermittent communities,(d)merging and

splitting.

(a)α=0

(b)α=0.

4(c)α=0.

6

(d)α=1

Fig.9.NMIs on Derek-Green’s expansion and contraction datasets while varyingα.

(a)

(b)(c)

Fig.10.Error(a)and Normalized Mutual Information(b),and degree

distribution(c)in log-log scale on power law synthetic dataset(merging and

splitting).

obtained by DYNMOGA(above0.98from the second time

step on).

6.4Power-law networks

Lancichinetti et al.[43]proposed a new class of bench-

marks(LFR benchmark)that extend the Girvan and New-

man benchmark by introducing power law degree distri-

butions and different community size.In[44]it has been

experimented that many community detection algorithms

perform well on the Girvan and Newman benchmark,but

give poor results on the LFR benchmark.Although the

network generator of Greene et al.in[42]is based on

binary network generation tools proposed in[44],it does

not allow to?x the exponent of the power law distribution

from which assigning degrees to nodes.Thus,we generated

an LFR benchmark constituted by1000nodes,average

node degree20,maximum node degree50,exponent of

degree distribution-2,community size distribution-1,and

mixing parameter0.3.The generated networks have only

one node with maximum degree50,while almost70%

of nodes have degree lower than the average degree20.

In order to introduce dynamics,for5time steps,10%of

communities were chosen in a random way,and repeatedly

they were split(steps2,4)and merged(steps3,5).

Figure10(a)and Figure10(b)show error and nor-

malized mutual information obtained by F acetNet and

DYNMOGA.We can notice that,after the?rst time step,

DYNMOGA is able to better recover the true evolving

community structure,and it works well also on power law

networks.Figure10(c)reports the degree distribution of

networks.The?gure clearly shows the typical long tail of

power law distribution.

6.5Synthetic dataset#3

In this section we compare DYNMOGA with the method of

Kim and Han[3].It is worth to point out that the results

reported for this latter approach are those appearing in[3],

and provided by the authors.The comparison has been

performed on two kinds of data sets.The?rst one is a

dynamic network of a?xed number of communities(named

SYN-FIX).The second one is a dynamic network with a

variable number of communities(named SYN-V AR).

SYN-FIX is similar to the synthetic dataset#1.The

network consists of128nodes divided into four communi-

ties of32nodes each.Every node has an average degree

of16and shares a number z of links with the other

nodes of the network.In order to introduce dynamics,3

nodes are randomly selected from each community in G t?1

and randomly assigned to the other three communities.

SYN-V AR is obtained by modifying the generation method

of SYN-FIX to introduce the forming and dissolving of

communities and the attaching and detaching of nodes.

The initial networks contains256nodes,divided in4

communities of64nodes each.10consecutive networks are

generated by choosing8nodes from each community and

generating a new community with these32nodes.This is

done for5timestamps,then the nodes return to the original

communities.Thus,the number of communities for the10

timestamps is4,5,6,7,8,8,7,6,5,4.The average degree

(a)(b)

(c)(d)

Fig.11.Normalized mutual information on synthetic dataset #3:(a)SYN-FIX when z =3,(b)SYN-FIX when z =5,(c)SYN-VAR when z =3,(d)SYN-VAR when z =5.

of each node in a cluster is set to the half of the size of this cluster.Furthermore,at each time step 16nodes are randomly deleted and 16new nodes are added to the network.

We generated 10different networks for 10timestamps,run DYNMOGA on them,and computed the Normalized Mutual Information to measure the similarity between the true partitions and the detected ones.

Figure 11shows the average normalized mutual infor-mation over the 10timestamps for SYN-FIX when z =3(Figure 11(a))and z =5(Figure 11(b)),for SYN-V AR when z =3(Figure 11(c))and z =5(Figure 11(d)).

The ?gure shows the signi?cantly better results obtained by DYNMOGA with respect to Kim and Han’s algorithm.In fact,for SYN-FIX and SYN-V AR,when z =3,DYNMOGA obtains a value which is almost always 1,while the values obtained by Kim and Han’s algorithm are around 0.9for SYN-FIX and aroud 0.7for SYN-V AR.The differences,however,are much more remarkable when z =5.In this case DYNMOGA obtains values above 0.9for all the timestamps,while Kim and Han method fails to uncover the community structure.In fact,it returns values of normalized mutual information between 0.1and

0.2.

(a)

(b)

Fig.12.Error and Normalized Mutual Information on synthetic dataset #2(merging and splitting),single objective versus multiobjective.

6.6Pareto front solution selection

A characteristic of multiobjective optimization is the gen-eration of a set of solutions.Thus,a single solution out of this set must be selected.There has been a lot of re-search in this decision making problem,and many different approaches have been proposed [29].As already stated,DYNMOGA prefers,among the Pareto front solutions,that having community structure with the higher modularity value,since this concept has been recognized as the most

suitable to interpret the intuitive idea of community.It is worth to note that,the optimization of the two objectives employed by DYNMOGA ,i.e.modularity and normalized mutual information,does not produce the same results of optimizing the single objective of modularity.In fact,in the former case,the Pareto front contains those non-dominated solutions that try to optimize both the snapshot cost (modularity)and the temporal cost (NMI),which is different than optimizing only the snapshot cost (modularity alone).In order to emphasize that optimizing two objectives gives better results than optimizing only modularity,Figure 12compares Error and NMI values obtained by DYNMOGA for the synthetic dataset #2(merging and splitting),with those returned when DYNMOGA is executed by ?xing the temporal cost at zero for all the time steps,thus completely disregarding NMI.The ?gure clearly points out the better performance of the multiobjective approach with respect to the single objective one.

(a)

(b)

Fig.13.Normalized mutual information on synthetic dataset #2with different objective functions:(a)expansion and contraction,(b)merging and splitting.

6.7Changing objective function to compute snapshot cost

Results presented so far have been obtained by employing modularity as snapshot cost function.As already pointed out,DYNMOGA can be considered as a general framework for evolutionary clustering.In fact,it is suf?cient to change one of the two objective functions (or both),to implement and test different approaches for analyzing dynamic net-works.In this section we consider the other three quality measures for community detection introduced in Section 5,and compare the normalized mutual information values computed by using these scores.

11

NMI has been computed for the synthetic dataset #2,in particular expansion and contraction,and merging and splitting networks.The results on the other networks are analogous,thus we do not report them for lack of space.Figure 13shows that modularity overcomes all the other three objectives,though community score works quite well and is comparable with it.

It is worth to notice that it has been shown that the opti-mization of modularity has a resolution limit that depends on the total size of the network and the interconnections of the modules [45].This limit implies the important drawback that,searching for partitioning of maximum modularity,may lead to solutions in which important structures at small scales could not be discovered.However,to overcome this problem,Granell et al.[46]introduced a resolution control parameter γin the modularity formulation.The new

formula being Q R = k s =1[l s m ?γ(d s 2m )2

].When γ=1the

original formulation is obtained,while for increasing values of γ,smaller groups of nodes can be found.Thus,despite the criticisms regarding resolution limit problem,the use of modularity as snapshot cost seems the best choice.

(a)g

=50(b)g

=100(c)g =200

Fig.14.

(a)Running time of DYNMOGA in seconds against number

of nodes varying as {128,256,512,1024,2048,4096}with corresponding number of edges {1938,4018,8184,16158,33026,65256},on the synthetic dataset #1,z=5,nC=10%,for different combinations of population size p and number of generations g .

6.8Scalability Analysis

One of the main criticisms in using Genetic Algorithms,compared with traditional optimization algorithms,is the high execution time required to generate a solution.The major limitation of evolutionary algorithms is,in fact,the repeated ?tness function evaluation that,for complex prob-lems could often be prohibitive.The problem is exacerbated when large populations of individuals are used,and,in particular for the multiobjective approach.In our method ?tness evaluation is rather ef?cient,thus the main problem comes from the network size.

To evaluate the scalability of the method we used the synthetic dataset #1,avgDegree=16,z=5,nC=10%,with number of nodes increasing as {128,256,512,1024,2048,4096},and corresponding number of edges {1938,4018,8184,16158,33026,65256},population size p ,and number of generations g varying in the interval [50,100,200].

Figure 14reports the time requirements for one time step,for the different combinations of p and g .The ?gures show that the scaling of DYNMOGA is lower than quadratic in the number of nodes.

We now want to study the in?uence of population size and number of generations on the accuracy of the method.

(a)

(b)

Fig.15.Error variation (a)and execution times (b)for different combina-tions of population size p and number of generations g on synthetic dataset #2:merging and splitting,n=1000,m=7728.

It is worth to point out that population size p can be viewed as a measure of the parallel search level a GA supports,since p different solutions are considered at the same time to ?nd the local optimum.The more complex the problem to solve,the larger the population to use because the chance of ?nding optimal solutions augments with increasing p .Because of the computation requirements,the choice must balance the quality of the solution versus the computation times.A similar reasoning applies to the number of generations.The longer the algorithm works,the ?tter the solution it can obtain.However,after a number of generations,that depends on the problem,continuing the execution does not provide any improvement of the local optimum obtained so far because the algorithm gets stuck in that local optimum.Figure 15(a)shows how the error varies for different combinations of population size p and number of generations g on synthetic dataset #2(merging and splitting).In particular both p and g assume values in the range [50,100,200].The ?gure points out that the error can drastically drop down from around 9500with p =50and g =50to around 600if p =200and g =200.However the running time (Figure 15(b))sensibly increases.Thus,if the computing resources are limited,the choice of p and g should be done by considering the trade-off between execution time and desired accuracy.

It is worth to point out that,though Genetic Algorithms are naturally suited to be implemented on parallel architec-tures [47],the capabilities of the approach to deal with very large networks could be degraded because of the demanding computing requirements.6.9

Real-life data sets

In this section we apply our method to two real-life dynamic networks:Cell Phone Calls and Enron mail .Cell Phone Calls:the former data set comes from the VAST 2008mini challenge 3:Cell Phone Calls 1and consists of cell call phone records among the members of the ?ctitious Paraiso movement,covering a period of ten days in June 2006.These records have been used to build a network where each node corresponds to a unique cell phone,and an edge between two cell phones is created when a phone call between the two cell phones occurs.For each edge,the day and time of the phone call are reported.The number of cell phones is 400.Thus ten weighted

1.https://www.wendangku.net/doc/f66387069.html,/hcil/V ASTchallenge08/

12

TABLE2

Cell dataset statistics:T is the number of time steps(10days on June2006), M the modularity,|C|the number of communities,|V|the number of nodes, |E|the number of edges,|E|?the number of distinct undirected edges(i.e.

number of different phone calls between two members),Z the average degree,CC the Watts-Strogatz clustering coef?cient,and B the

betweenness of network.

T M|C||V||E||E|?Z CC B

10.664032370987525 2.62500.03280.2992

20.656135373964499 2.49500.01920.2532

30.658730374953509 2.54500.01370.2594

40.6540313741013514 2.57000.01790.2289

50.662632373991508 2.54000.01600.2895

60.665125373963512 2.56000.02070.2246

70.657133367936498 2.49000.01180.2356

80.6329363651005511 2.55500.02030.3127

90.653834374982518 2.59000.02900.2991

100.6467323841040530 2.65000.00950.2253 networks can be built by considering the phone calls for each of the ten days.Each edge is labelled with the number of phone calls occurred in that day between two members of Paraiso movement.Five persons are considered the most important in the network,Ferdinando Catalano(node201) and his brother Estaban(node6),David Vidro(node2),and his two brothers Jorge and Juan(nodes3and4).These?ve core members changed their cell phone numbers between days7and8,therefore,for the last three days their node number changed from201,6,2,3,4to301,307,310,361, 398,respectively.

Some statistical informations regarding the network are reported in Table2.Since the true community structure is not known,we followed the same approach of Lin et al.

[5].We?rst considered the overall network and computed the community structure by applying only the?rst step of our method.The resulting network division had an average modularity value of0.52and an average number of communities equal to25.

After that,the clustering obtained on the overall network was considered as the ground truth division,and both the error rate and the normalized mutual information have been computed by executing DYNMOGA on the ten networks. The values obtained are reported in Figure16.It can be observed that DYNMOGA?nds communities that constitute a balance between the snapshot and the temporal costs. In fact,the similarity among the groups obtained for the overall network and those computed for each time step is around0.5(Figure16(a)).It is worth to note that,for each time step,both the modularity values and the number of communities found are higher than those computed on the overall network.In fact we obtained a number of communities between30and36for the different time steps, with a modularity value varying from0.64to0.66(see Table2).This means that the execution of the method on the dynamic network provides a more structured division and allows for a deeper analysis of the network. Finally,in Figure17,the communities of the?ve most important members,described above,at the?rst time step are visualized,along with the evolution of their communi-ties at time steps7and8.The three?gures clearly point out that these?ve persons had a central role until the

(a)

NMI

(b)Error

Fig.16.Normalized mutual information and error rate of the CELL dataset.

(a)Time step1

(b)Time step7

(c)Time step8

https://www.wendangku.net/doc/f66387069.html,munity snapshots on CELL data set at different time steps. seventh day,directly communicating with almost all the other members of the same group.At the8th day,instead, this peculiarity is lost,as expected and obtained by other studies relative to this network[48].

Enron mail data set:the second real dataset is an email collection from a US enterprise with potential anomalous email communications spanning over a time range of about

13

(a)NMI(b)Error

Fig.18.Normalized mutual information and error rate for the Enron dataset.

TABLE3

Enron dataset statistics:T is the number of time steps(2001monthly partitioning),M the modularity,|C|the number of communities,|V|the number of nodes,|E|the number of edges,|E|?the number of distinct undirected edges(i.e.number of different mails between two employees),Z the average degree,CC the Watts-Strogatz clustering coef?cient,and B the

betweenness of network.

T M|C||V||E||E|?Z CC B

10.636811961070180 2.38410.44150.0850

20.68037931559204 2.70200.54060.1002

30.655012971844218 2.88740.46460.0984

40.6571121081869257 3.40400.44300.1462

50.5699151251919292 3.86750.49930.4384

60.6943101201001231 3.05960.39590.1389

70.6530101091325252 3.33770.48820.1833

80.535691312270396 5.24500.47830.4331

90.6324101283152361 4.78150.49500.3070

100.53021313586935757.61590.49570.3494

110.570791276276469 6.21190.50620.1915

120.615281132146325 4.30460.45190.2467

3years(i.e.,from1999to2002).The original dataset2 contains around517,431emails from151users distributed in3500folders.For our tests,we considered a cleaned version of this corpus3described in[49],and containing a subset of252,759emails from151employees.We further reduced the size to about50,000messages by considering only emails exchanged among Enron’s employees.

In order to perform a monthly-based analysis we concen-trated on the year2001since it encompasses the maximum number of emails(i.e.more than33,000).We split it in 12subsets,one for each month,by following the same approach used for the CELL dataset.The number of communities obtained on the overall network has been six. Figure18shows the NMI and Error for the12time steps. Also in this case,the dynamic approach provides a?ner division of the network,as pointed out in Table3,where some statistical informations regarding the network for each of the12time steps are reported.

7C ONCLUSIONS

A multiobjective method based on genetic algorithms for detecting communities in dynamic networks has been pre-sented.The algorithm,at each time step,provides the so-lution representing the best trade-off between the accuracy of the clustering obtained with respect to the data at the current time step,and the drift from one time step to the successive.The proposed approach can be considered

2.available at https://www.wendangku.net/doc/f66387069.html,/~enron/

3.available at ftp://https://www.wendangku.net/doc/f66387069.html,/sims/philpot/data/enron-mysqldump.sql.gz as a general framework for evolutionary clustering since changing one of the two objective functions(or both)allows to implement and test different criteria for the analysis of dynamic networks.Experimental results on several kinds of synthetic data sets showed the good performance of our approach compared with other state-of-the-art methods.

It is worth to point out that genetic algorithms,com-pared with traditional optimization algorithms,require high execution time to generate a solution.Experiments showed that increasing population size p and number of generations g produces a positively in?uence on the accuracy of the method.However the running time sensibly increases. Thus,if the computing resources are limited,the choice of p and g should be done by considering the trade-off between execution time and desired accuracy.

8A CKOWLEDGMENTS

We wish to thank Yu-Ru Lin for providing us the F acetNet software,including the generation of the synthetic data set #1,Derek Green for the software generator of the synthetic dataset#2,and Min-Soo Kim for the synthetic data set generator,and the results obtained by his method on these synthetic networks.

R EFERENCES

[1]S.Asur,S.Parthasarathy,and D.Ucar,“An event-based framework

for characterizing the evolutionary behavior of interaction graphs,”

ACM Transactions on Knowledge Discovery from Data,vol.3,no.4, p.Paper16,2009.

[2]Y.Chi,X.Song,D.Zhou,K.Hino,and B.Tseng,“On evolutionary

spectral clustering,”ACM Transactions on Knowledge Discovery from Data,vol.3,no.4,Article17,2009.

[3]M.Kim and J.Han,“A particle-and-density based evolutionary clus-

tering method for dynamic networks,”in Proc.of the International Conference on Very Large Data Bases(VLDB’09),2009,pp.–. [4]R.Kumar,J.Novak,and A.Tomkins,“Structure and evolution of

online social networks,”in Proc.Int.Conf.on Knowledge Discovery and Data Mining(KDD’06),2006,pp.611–717.

[5]Y.-R.Lin,S.Zhu,H.Sundaram,and B.L.Tseng,“Analyzing

communities and their evolutions in dynamic social networks,”ACM Transactions on Knowledge Discovery from Data,vol.3,no.2, Article18,2009.

[6]L.Leskovec,J.Kleinberg,and C.Faloutsos,“Graphs over time:

densi?cation laws,shrinking diameters and possible explanations,”

in Proc.Int.Conf.on Knowledge Discovery and Data Mining (KDD’05),2005,pp.177–187.

[7]G.Palla, A.Barabasi,and T.Vicsek,“Quantifying social group

evolution,”Nature,no.466,2007.

[8]M.Spiliopoulou,I.Ntoutsi,Y.Theodoridis,and R.Schult,“Monic-

modeling and monitoring cluster transitions,”in Proc.Int.Conf.on Know.Discovery and Data Mining(KDD’06),2006,pp.706–711.

[9]J.Sun,C.Faloutsos,S.Papadimitriou,and P.Yu,“Graphscope:

parameter-free of large time evolving-graphs,”in Proc.International Conference on Knowledge Discovery and Data Mining(KDD’05), 2005,pp.687–696.

[10]L.Tang,H.Liu,J.Zhang,and Z.Nazeri,“Community evolution in

dynamic multi-mode networks,”in Proc.International Conference on Knowledge Discovery and Data Mining(KDD’07),2007,pp.

677–685.

[11]T.Xu,Z.Zhang,P.S.Yu,and B.Long,“Evolutionary clustering by

hierarchical dirichlet process with hidden markov state,”in Proc.of the8th IEEE International Conference on Data Mining(ICDM’08), 2008,pp.658–667.

[12]——,“Dirichlet process based evolutionary clustering,”in Proc.of

the8th IEEE International Conference on Data Mining(ICDM’08), 2008,pp.648–657.

14

[13] D.Chakrabarti,R.Kumar,and A.Tomkins,“Evolutionary cluster-

ing,”in Proc.of the12th ACM Int.Conf.on Knowledge Discovery and Data Mining(KDD’06),2006,pp.554–560.

[14]K.Deb,Multi-Objective Optimization using Evolutionary Algo-

rithms.John Wiley&Sons,Ltd,Chichester,England,2001. [15]J.Leskovec,https://www.wendangku.net/doc/f66387069.html,ng,and M.Mahoney,“Empirical comparison of

algorithms for network community detection,”in Proc.Int.World Wide Web Conference(WWW2010),2010,pp.631–640.

[16]M. E.J.Newman and M.Girvan,“Finding and evaluating

community structure in networks,”Physical Review E,vol.69,p.

026113,2004.[Online].Available:https://www.wendangku.net/doc/f66387069.html,/abstract?

id=oai:https://www.wendangku.net/doc/f66387069.html,:cond-mat/0308217

[17]L.Danon,A.Daz-Guilera,J.Duch,and A.Arenas,“Comparing

community structure identi?cation,”Journal of Statistical Mechan-ics,vol.P09008,2005.

[18] D.Datta,J.Figuera,C.Fonseca,and F.Tavares-Pereira,“Graph

partitioning through a multi-objective evolutionary algorithm:A preliminary study,”in Proc.of the Genetic and Evolutionary Com-putation Conference(GECCO’08),2008,pp.625–632.

[19]G.N.Demir, A.S.Uyar,and S.G.˙Oˇg˙u d˙u c˙u,“Multiobjective

evolutionary clustering of web user sessions:a case study in web page recommendation,”Soft Computing,vol.14,no.6,pp.579–597,2010.

[20] C.Pizzuti,“A multiobjective genetic algorithm to?nd communities

in complex networks,”IEEE Transactions on Evolutionary Compu-tation,vol.16,no.3,pp.418–430,2012.

[21]J.Handle and J.Knowles,“An evolutionary approach to multiobjec-

tive clustering,”IEEE Transactions on Evolutionary Computation, vol.11,no.1,pp.56–76,2007.

[22] F.Folino and C.Pizzuti,“A multiobjective and evolutionary clus-

tering method for dynamic networks,”in Proc.of the International Conference on Advances in Social Networks Analysis and Mining (ASONAM10),,2010,pp.256–263.

[23]P.J.Mucha,T.Richardson,K.Macon,M.A.Porter,and J.Onnela,

“Community structure in time-dependent,multiscale,and multiplex networks,”Science,vol.328,pp.876–878,2010.

[24]P.Holme and J.Saramaki,“Temporal networks,”2011.[Online].

Available:https://www.wendangku.net/doc/f66387069.html,/pdf/1108.1780.pdf

[25]L.Tang,H.Liu,J.Zhang,and Z.Nazeri,“Identifying evolving

groups in dynamic multimode networks,”IEEE Transactions on Knowledge and Data Engineering,vol.24,no.1,pp.72–85,2007.

[26]T.Y.Berger-Wolf,C.Tantipathananandh,and D.Kempe,“Dynamic

community identi?cation,”in in P.S.Yu et al.(eds.)Link Mining: Models,algorithms,and Applications,2010,pp.307–336.

[27]J.Shi and J.Malik,“Normalized cuts and image segmentation,”

IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.22,no.8,pp.888–905,2000.

[28]M.Ehrgott,Multicriteria Optimization.Springer,Berlin,2nd

edition,2005.

[29] C.A.C.Coello,https://www.wendangku.net/doc/f66387069.html,mont,and D.A.V.Veldhuizen,Evolu-

tionary Algorithms for Solving Multi-Objective Problems.Springer, 2007.

[30] D.Goldberg,Genetic Algorithms in Search,Optimization,and

Machine Learning.Addison-Wesley,1989.

[31]N.Srinivas and K.Deb,“Multiobjective optimization using non-

dominated sorting in genetic algorithms,”Evolutionary Computation, vol.2,no.3,pp.221–248,1994.

[32]J.H.Holland,Adaptation in Natural and Arti?cial Systems.Univ.

of Michigan Press,Ann Harbor Mich.,1975.

[33]R.Kannan,S.Vempala,and A.Vetta,“On clustering:Good,bad and

spectral,”Journal of the ACM,vol.51,no.3,pp.497–515,2004.

[34]Y.Park and M.Song,“A genetic algorithm for clustering problems,”

in Proc.of3rd Annual Conf.on Genetic Algorithms,1989,pp.2–9.

[35]T.H.Cormen,C.E.Leiserson,R.L.Rivest,and C.Stein,Introduc-

tion to Algorithms-Second Edition.Mit Press,2007.

[36]K.Deb,A.Pratap,S.Agarwal,and T.Meyarivan,“A fast and elitist

multi-objective genetic algorithm:NSGA-II,”IEEE Transactions on Evolutionary Computation,vol.6,no.2,pp.182–197,2002. [37]M.T.Jensen,“Reducing the run-time complexity of multiobjective

eas:The nsga-ii and other algorithms,”IEEE Transactions on Evo-lutionary Computation,vol.7,no.5,pp.503–515,2003.

[38]L.Can-Tao and H.Bao-Gang,“Mutual information based on renyi’s

entropy feature selection,”in Proc.of the IEEE International Confer-ence on Intelligent Computing and Intelligent Systems(ICIS2009), 2009,pp.816–820.[39]S.K.Smit and A.E.Eiben,“Parameter tuning of evolutionary

algorithms:Generalist vs.specialist,”in Applications of Evolutionary Computation,Springer,2010,pp.542–551.

[40]M.Girvan and M.E.J.Newman,“Community structure in social

and biological networks,”in Proc.National.Academy of Science.

USA99,2002,pp.7821–7826.

[41] C.Tantipathananandh,T.Y.Berger-Wolf,and D.Kempe,“A frame-

work for community identi?cation in dynamic social networks,”in Proc.of the13th ACM SIGKDD Int.Conf.on Knowledge Discovery and Data Mining(KDD’07),2007,pp.217–226.

[42] D.Greene, D.Doyle,and P.Cunningham,“Tracking the evolu-

tion of communities in dynamic social networks,”in International Conference on Advances in Social network Analysis and Mining (ASONAM’10),2010,pp.176–183.

[43] https://www.wendangku.net/doc/f66387069.html,ncichinetti,S.Fortunato,and F.Radicchi,“Benchmark graphs

for testing community detection algorithms,”Physical Review E, vol.78,no.046110,2008.

[44] https://www.wendangku.net/doc/f66387069.html,ncichinetti and S.Fortunato,“Community detection algorithms:

a comparative analysis,”Phys.Rev.E,vol.80,no.056117,2009.

[45]S.Fortunato and M.Barth′e lemy,“Resolution limit in community

detection,”Proc.National Academy of Science,USA,vol.104, no.36,2007.

[46] C.Granell,S.G′o mez,and A.Arenas,“Unsupervised clustering

analisys:A multiscale complex network approach,”Journal of Bifurcation and Chaos,vol.22,no.7,p.1230023,2012.

[47]M.Tomassini,Parallel and Distributed Evolutionary Algorithms:A

Review.in Evolutionary Algorithms in Engineering and Computer Science,J.Wiley and Sons,Chichester et al.eds.,1999.

[48] A.Perer,“Using socialaction to uncover structure in social networks

over time,”in Proc.of IEEE Symposium on Visual Analytics Science and Technology(VAST’08),2008,pp.213–214.

[49]J.Shetty and J.Adibi,“The enron email dataset database

schema and brief statistical report,”2004.[Online].Available: https://www.wendangku.net/doc/f66387069.html,/~

enron/

Francesco Folino is currently researcher at

the Institute of High Performance Computing

and Networks(ICAR-CNR)of the National

Research Council of Italy.He graduated in

Computer Science Engineering in2003,and

he holds a Ph.D.in Computer Science Engi-

neering from the University of Calabria(Italy)

in2006.His research interests mainly con-

cern the?elds of Information Systems,Data

Mining,Process Mining and Social Network

Analysis.He is involved in several projects at ICAR-CNR concerning applications of data mining and knowledge discovery

techniques.

Clara Pizzuti received the Laurea degree

in Mathematics from the University of Cal-

abria,Italy.She is a senior researcher at

the Institute of High Performance Computing

and Networking(ICAR)of the Italian National

Research Council(CNR).Since1995she is

also a contract professor in the department

of Computer Science at the University of Cal-

abria.In the past,she worked in the research

division of a software company on deductive

databases,advanced logic based systems, and abduction.She has published more than seventy papers in conference proceedings and journals.Her research interests include knowledge discovery in databases,data mining,data streams,bioin-formatics,e-health,social network analysis,evolutionary computa-tion,genetic algorithms,and genetic programming.She is serving as program committee member of international conferences,and as reviewer for several international journals.

相关文档
相关文档 最新文档