文档库 最新最全的文档下载
当前位置:文档库 › An Information Diffusion-Based Recommendation Framework for Micro-Blogging-1

An Information Diffusion-Based Recommendation Framework for Micro-Blogging-1

A N I NFORMATION D IFFUSION BASED R ECOMMENDATION

F RAMEWORK FOR M ICRO-BLOGGING

Jiesi Cheng (Corresponding Author) Department of Management Information Systems University of Arizona, Tucson, Arizona

chengj@https://www.wendangku.net/doc/6f311703.html,

Daning Hu

Department of Information Systems

City University of Hong Kong, Kowloon

Hong Kong SAR, China

daninghu@https://www.wendangku.net/doc/6f311703.html,.hk

Aaron Sun

Department of Management Information Systems University of Arizona, Tucson, Arizona

asun@https://www.wendangku.net/doc/6f311703.html,

Daniel Zeng Department of Management Information Systems University of Arizona, Tucson, Arizona The Key Lab of Complex Systems and Intelligence Science, Institute of Automation, Chinese

Academy of Sciences, China

zeng@https://www.wendangku.net/doc/6f311703.html,

Abstract

Micro-blogging is increasingly extending its role from a daily chatting tool into a critical platform for individuals and organizations to seek and share real-time news updates during emergencies. However, seeking and extracting useful information from micro-blogging sites poses significant challenges due to the volume of the traffic and the presence of a large body of irrelevant personal messages and spams. In this paper, we propose a novel recommendation framework to overcome this problem. By analyzing information diffusion patterns among a large set of micro-blogs who play the role of emergency news providers, our approach could select a small subset as recommended emergency news feeds for regular users. We have evaluated our diffusion-based recommendation framework on Twitter during the early outbreak of H1N1 Flu. The evaluation results show that our method results in more balanced and comprehensive recommendations compared to benchmark approaches.

Keywords: micro-blogging, recommender system, information diffusion

1

Introduction

Micro-blogging – a new paradigm of Web-based and mobile application – is experiencing rapid growth and gaining explosive popularity worldwide. Compared to traditional blogging, micro-blogging allows a more instant and flexible form of communication. Micro-blogging sites typically restrict the length of posted messages. These messages can be published and received via a wide variety of means, including the Web, text messaging, instant messaging, and other third-party applications. Such a flexible and broad-based architecture significantly lowers the threshold for participation, and encourages users' frequent updates. Consequently, micro-blogging is now widely adopted by the public to share/seek real-time information, especially during emergency events. For example, at the early stage of the recent H1N1 Flu (Swine Flu) outbreak, the volume of H1N1 Flu-related messages on Twitter - one of the most popular micro-blogging sites - has increased 1500 times over four days (Apr 24 ~ 27, 2009), and accounted for nearly 2% of all Twitter traffic in that time period (Nielsen_News 2009). Meanwhile, a large number of people turned to Twitter searching for latest updates of the outbreak, causing the keyword “Swine Flu” listed as the “top trending topic” on Twitter Search consistently.

On the other hand, the exponentially-expanded micro-blogging community also results in a tremendously large and constantly updated information stream repository, making it increasingly difficult for users to find contents of interest. During emergencies, seeking newsworthy and timely information can be even harder due to the explosion in the volume of micro-blog postings. How to facilitate users to find relevant and timely information efficiently? The first intuition would be providing a search function. The current real-time search available in Twitter and some major search engines, such as

2

Google and Bing, let users input a query, and return the latest updates which contain the search keywords. The most advanced real-time search is able to return updates posted seconds before the search is performed (Singhal 2009). Nevertheless, the typical search results simply rank updates in reverse chronological order. Therefore, the quality of the search results fluctuates with the timing of the search action. A better approach is leveraging the social network feature in micro-blog communities. In Twitter, if a user u follows user v, all v’s updates will be displayed on u’s home page. In other words, with user u’s subscription of user v’s micro-blog, all v’s updates are instantly “pushed” to u. The “feed subscription” allows users to receive the latest updates instantly. Our observation is that there exist numerous micro-bloggers who play the role of “news reporters” during emergency events, by posting instant news stories on their micro-blogs. We empirically observe that these “news reporters” operate in social settings: they re-broadcast and refer to news stories from one another, maintaining strong interlinking to facilitating rapid diffusions of news stories. Our intuition is, if we could understand how these “news reporters” capture news stories during their diffusion processes, we could effectively measure the importance of each reporter from various diffusion perspectives (e.g. the number of diffusions captured and/or the average time needed for the capture) and make recommendations. Therefore, in this paper, we formulate the task of navigating micro-bloggers to their desired information as a recommendation problem. As such, instead of letting users actively perform searches, we aim to identify a small number of quality “news reporters” and recommend them to information seekers as emergency news feeds. Such a task is distinctly different from standard content-based and link-based recommendation investigated in the blogging domain with the primary task of “finding blog articles of interest that are not viewed yet” for users. (1) User interest in the blogging context could be extracted from user history. However, in an emergency context, information seekers in micro-blogging communities are not

3

necessarily the contributors of the emergency events. (2) Information seekers expect to find timely information from micro-blogging communities, especially for emergency events. With consideration of the unique challenges, a novel information diffusion-based framework is proposed to deal with the unique characteristics and requirements of micro-blogging recommendation. Specifically, we have developed diffusion-based metrics to evaluate micro-bloggers, formulated the recommendation problem into a multi-objective optimization problem, and subsequently proposed a diffusion-based candidate selection algorithm to recommend quality micro-bloggers during time-critical events. The purpose of the recommendation is not letting information seekers to read the retrospective updates. Instead, we aim to enable the users to receive future relevant tweets posted by the recommended micro-bloggers immediately after the updates are posted.

The proposed recommendation framework could be implemented as a function on Twitter, and/or a third-party Web service which uses Twitter API for retrospective tweets and social network structure. On Twitter, a snapshot of the most mentioned topics (trending topics) in the micro-blogging community is displayed in the sidebar of a user’s homepage. The snapshot is generated based on Twitter’s proprietary algorithm (Cheong and Lee 2009), and could be modified with user preferences on trending topics of the minute, day, week, etc. The main focus of our study is on how to leverage diffusion patterns of micro-blogging posts to facilitate information seeking, thus the front end user interface design is out of scope of the discussion. In practice, our proposed micro-blog recommendation works as follows: when clicking on a topic of interest from the trending topic, a user navigates to a list of recommended micro-bloggers, by following whom the user could receive timely updates on the particular topic. We

4

consider taking trending topics as the set of keywords for the recommendation. Compared to letting users input search terms freely, the proposed implementation has the following advantages: (1) Users do not need to know the topic of interest prior to receiving the recommendation. This is particularly important for bursty events. It is not practical to expect users to input search terms for new events they have never heard of. Hence leveraging the list of trending topics to navigate users for the recommendation is a more natural design. (2) The diffusion detection process with the list of terms in the trending topics is independent from user’s request. Hence it could be carried out offline.

(3) With the trending topics identified from tweets timeline, the system is prevented from dealing with potential ambiguity introduced by synonym or typo in user input. (4) Though the proposed implementation only provides recommendations for trending topics, it is reasonable to expect it could handle a majority of users’ interests due to the power-law distribution of user interest frequently observed in online communities. The main purpose of the proposed recommendation framework is to enable information seekers to find relevant and timely information efficiently, especially when there are large amount of discussion related to the topic of interest in the micro-blogging community.

The rest of this paper is organized as follows. We begin with reviewing major micro-blogging applications and relevant recommendation techniques in a blogging context in Section 2. In Section 3, we propose a diffusion-based micro-blogging recommendation framework that utilizes information diffusion patterns. We then present an empirical study to illustrate the potential usefulness and practical value of this diffusion-based recommendation method in Section 4. Finally, contributions and future directions are discussed in Section 5.

5

Literature Review

2.1 Micro-Blogging

After its launch in 2006, Twitter has become the largest and most well-known micro-blogging platform. As such, Twitter is an ideal candidate site for our study. Twitter allows users to send up to 140-character text-based posts (tweets) to a network of followers via a variety of means. By default, tweets are public so that users can follow and read each other's posts without permission. Early studies in this area have focused on understanding the prevalent usage and structural patterns of micro-blogging. In Java et al. (2007), the authors studied the topological and geographical properties of Twitter's social network, and summarized different user intentions of using Twitter, such as daily chatting and information sharing. Also focusing on the social networking aspects, Krishnamurthy et al. (2008) characterized distinct classes of Twitter users and their behaviors, including “broadcasters” (e.g. online radio stations and media outlets), “acquaintances” (users who exhibit reciprocal relationships), and “miscreants” (e.g. spammers).

Recent studies have shifted the attention to some novel applications of micro-blogging. For instance, Jansen et al. (2009) studied Twitter as a platform of online word of mouth branding. They analyzed more than ten thousand micro-blog posts containing branding information, and claimed that micro-blogging could play an important role in designing marketing strategies and campaigns. In Ehrlich and Shami (2010) and Zhang et al. (2010), the adoption and use of micro-blogging in the workplace – enterprise micro-blogging – were discussed. By analyzing users’ posting activities and reading behaviors, it was found that enterprise micro-blogging could facilitate conversation and mutual assistance. Such user-to-user exchanges and collaborations via Twitter were

6

also identified in a public setting (Honeycutt and Herring 2009), in which the authors explored the potential of using Twitter as a collaboration tool. The rich textual data that are freely available from Twitter also attract interest from the text mining community. In O' Connor et al. (2010), the authors applied sentiment analysis technique to extract public opinions and attitudes from a large body of tweets. The results were compared with opinions derived from standard polling and survey data, which highlighted the promise of using Twitter as a substitute or supplement for traditional polling. Similar, but simpler, techniques were used in Jansen et al. (2009) to understand user opinion fluctuations towards a particular brand.

Another important application of micro-blogging that is of our interest is its widespread adoption and use during mass crisis and emergency events. Though traditional official and media communication channels remain in place, Web-based social media, such as search-term surveillance (Ginsberg et al. 2009), micro-blogging, and online social networks have emerged as alternative forms of rapid dissemination of information (Brownstein et al. 2009). Apart from the H1N1-flu example presented above, micro-blogging has been widely used for status updates and live news reports in emergency occasions such as Southern California wildfires in 2007 (Sutton et al. 2008), Mumbai terrorists attack in 2008 (Caulfield and Karmali 2008), H1N1 Flu outbreak in 2009 (Ostrow 2009), Icelandic volcano eruption in 2010 (Nigam 2010), and etc. Such emergency usages of micro-blogging have received increasing attention from academic researchers. Hughes, Starbird, and Palen (Hughes and Palen 2009; Starbird and Palen 2010) were among the first researchers who studied this phenomenon. In their studies, usage patterns of Twitter surrounding emergency events were observed and compared with regular use. It was noted that information propagation was more likely to happen in

7

emergency occasions than in regular situations. In Hughes and Palen (2009) and Starbird and Palen (2010), the authors took advantage of the popularity of Twitter, and monitored incoming tweets for detecting crisis events such as earthquakes and epidemic outbreaks. These applications clearly indicate a role transition of micro-blogging from a daily chatting tool into a valuable information sharing platform during emergencies. However, as mentioned earlier, the explosion in the volume of messages can pose a significant challenge for finding noteworthy information in a timely manner. The occurrence of emergency compounds this problem when a considerable amount of unplanned messages arrive in a short period of time. As such, we propose to use a recommender system to alleviate this problem of information overload. In the next subsection, we discuss our research motivations, starting with reviewing relevant literature on blog recommendation.

2.2 Blog Recommendation

To our knowledge, this paper presents the first study on micro-blog recommendation and there has been limited published work in this area. The closest related work to ours is the blog recommender system that has been extensively studied in the literature. In this subsection, we review previous studies related to blog recommendation services only. For a comprehensive review of the recommender system, especially its application in the e-commerce domain, interested readers can refer to Herlocker et al. (2004) and Schafer et al. (1999) .

There exist two major types of blog recommendation techniques, content-based and link-based recommendation (Abbassi and Mirrokni 2007). The core of the content-based

8

recommendation is to recommend an item (e.g. a blog article or a blogger) to the reader based upon the degree of match between the content description of the item and user interest. A majority of blog recommendation methods can be grouped into this category. The simplest approach is to pre-label blogs to facilitate understanding and categorization. For example, Technorati (https://www.wendangku.net/doc/6f311703.html,) fetches blog posts that are associated with user-defined tags. Articles under the same tag-based category are then presented to interested readers. Another common approach is to represent a blog article as a term-frequency vector, and a scoring system is used to calculate the distance between this article and user interest, which is also represented as a vector. In Arguello et al. (2008), different document representation models were developed for recommending blogs in response to a user query. Li et al. (2009a) developed an incremental vector-space clustering method to identify new topics from the incoming stream of blog articles. The article which best represented a given topic was then selected and recommended to the reader. Note that for blogs annotated by descriptive tags, the contents can be directly characterized using vectors of tags (Hayes et al. 2007). Blog articles can also be transformed into a tree-like or graph-like hierarchical ontology. Ontology is defined as a formal specification of a shared conceptualization consisting of entities, attributes, and relationships. In Nakatsuji et al. (2006), the authors used Web Ontology Language to extract user-interest ontology from blog articles. An ontology-based similarity measurement was then applied to cluster bloggers whose interests were alike. Personalized recommendations could be made for a reader to find like-minded bloggers. A similar study was carried out in El-Arini et al. (2009) for achieving a different recommendation objective. The authors characterized the blog postings by various semantic features, such as name entities, topics identified from the corpus, and their high-level relations. A set of blogs were then selected and recommended which covered the most features (best coverage.)

9

At the other end of the spectrum, link-based recommendation is implemented on the basis of explicit or implicit network structures extracted from the blogging community. Explicit connections among blogs include hyperlinks from one blog to another and mentions/comments of other blogs in blog entries. Implicit connections, on the contrary, do not physically exist. Instead, they are inferred artificial links which supplement the explicit connections. Such connections are usually weighted, with the weight indicating the likelihood for two blogs to be connected in a certain manner. In the literature, various structural features can be used to estimate the common factors between individual bloggers or blog articles. Abbassi and Mirrokni (2007) measured the similarity among blogs using the eigenvalues of the adjacency matrix of an explicit blog graph. A spectral clustering method was later applied to partition relevant blogs and recommend. In Kritikopoulos et al. (2006), the authors developed a modified version of PageRank to rank nodes on the blog graph by their ranking scores. To address the sparsity problem of the hyperlink-based graph, a denser graph was created by incorporating artificial weighted links that denoted the similarity among bloggers into the original graph. Chau and Xu (2007) proposed a semi-automated approach which consists of a set of Web mining and network analysis techniques to identify influential opinion leaders in hate groups. Although the link structure is generally useful for blog recommendations, we found that only a limited number of studies are purely link-based. More frequently, structural features are used in combination with content features to improve the outcomes, which leads to a hybrid form of recommendation. In Hsu et al. (2006), the authors proposed such a hybrid approach by using both the link structure and interests declared by bloggers. Another hybrid method reported in Li and Chen (2009b) considered both link and content information as well. In addition, a third dimension – the trust among bloggers – was also taken into account to enhance the reliability of the recommender system.

10

As we have discussed before, micro-blogging differs significantly from regular blogging in its extensive use during emergency situations. As such, our research faces distinctive challenges. In a blogging context, the primary task of recommendation is to “find blog articles of interest that are not viewed yet” for users. However, this consideration is not applicable in a micro-blogging context especially during emergencies because (a) the lightweight design of micro-blogging tends to generate an overwhelming volume of messages which are inefficient to process (Kristina 2009). In addition, a substantial proportion of these messages are related to personal conversations which have little value to the general public, such as one's own fears of the epidemic when mentioning H1N1 Flu. (b) Another noteworthy phenomenon in micro-blogging is that there exist numerous “news reporters” who regularly post latest news stories on their micro-blogs, mostly on a voluntary basis. These “reporters” provide important information filtering and amplification services and can be effectively leveraged for recommendation. We believe that it is more convenient to recommend these “news reporters” to ordinary users as live news feeds, rather than recommending individual postings. In the next section, we will present our diffusion-based recommendation framework.

An Information Diffusion Based Recommendation Framework

3.1 Information Diffusion and Diffusion-Based Valuation Criteria

Information diffusion through online social networks has recently become an active research topic. In blogging communities, the propagation of information from one blog to the next is frequently observed, as a result of low-cost information sharing and publishing. Gruhl et al. (2004) examined the information propagation pattern from 11,000 blog sites at two different levels: individual-level diffusion among blog entries and community-level diffusion among blogspaces. A cascade model was then adapted to

11

characterize individual behaviors in different stages of diffusions. Adar and Adamic (2005) analyzed the internal link structure of blogspace to track the flow of information among blog entries. In particular, by addressing the problem of “infection inference” proposed by the authors, this study focused not only on utilizing the explicit link structure, but also inferring implicit routes of infection/diffusion. In the end, a diffusion tree can be built to visualize the likely routes of transmission for a specific diffusion. Such diffusion patterns are also prevalent among micro-blogs. Lerman and Ghosh (2010) studied the diffusion of news stories on Twitter, which was compared with the diffusion occurred on Digg in terms of the diffusion rate and scope. It was noticed that the diffusion on Twitter generally maintained a consistent rate and could penetrate farther than on Digg. In Yang and Counts (2010), the authors studied Twitter users' interaction behaviors. One important finding of the study was that the majority of the interactions were one-way rather than reciprocal. This finding was also supported in a recent study (Starbird et al. 2010), which claimed that news stories were more likely to distribute from media outlets and traditional service organizations, then spread into the population.

Our observations on Twitter also confirmed such uni-directional information flows. We noticed that during the outbreak of H1N1 Flu, first-hand news stories normally originated from a limited number of professional news agencies (e.g., BBC and Reuters) and public health organizations (e.g. CDC Emergency), though exceptions exist. These stories then propagated across the community through the process of reposting (retweeting) or commenting. Figure 1 illustrates such a news story diffusion example corresponding to the tweets collected during the early outbreak of H1N1 Flu (listed in Table 1.) A clear story diffusion can be traced from the source “CDCEmergency” to other Flu news

12

reporters (as indicated by their account IDs) within 36 hours after the story was first posted by the source. During a news story's diffusion process, any micro-blog that has posted this story is called a participant who “captures” the story. Now, if we raise the following question as to recommending no more than micro-blogs as emergency news feeds ( is an exogenous parameter which is reasonably small to avoid information overload) these diffusion patterns could be helpful because they reveal how each micro-blog participated in the past diffusion processes. Intuitively, a micro-blog is more likely to be favored and recommended during emergencies if it captures news stories of interest more accurately and rapidly. More specifically, we use the following measures to quantify various aspects of this valuation process.

(1) Story Coverage (SC). Multiple news stories regarding one broad topic could

simultaneously spread. For example, when the H1N1 Flu outbreak occurred, “CDCEmergency” reminded the public that they would not get infected from eating pork, while “ForbesNews” was concerned about the Flu's impact on financial markets. The SC metric measures the set of stories of interest captured by micro-blogs under study, and stronger recommendation is given to micro-

bloggers that have wider coverage with other conditions being identical.

(2) Reading Effort (RE). Certain micro-blogs can be crowded with messages.

However, too many messages compromise readability and raise the cognitive effort to filter out irrelevant contents. The RE measurement indicates the set of messages one has to read after subscribing to the selected micro-blogs.

(3) Delay Time (DT). The postings of one news story on micro-blogs are time-

stamped. The delay time of equals the time passed from the first appearance of in the community until is captured by one of the selected micro-blogs, usually measured by hours. In our application setting, a delayed capture of is certainly undesirable.

13

Figure 1: A Story Diffusion Example

Table 1: Story Diffusion among Micro-blogs

Timestamp Micro-blog Tweet Content

4/28/2009 19:24 CDCEmergency CDC reminds you that you can NOT get swine flu from

eating pork. http://bit.ly/16YpY1 #swineflu

4/28/2009 19:31 BirdFluGov RT @CDCemergency CDC reminds you that you can

NOT get swine flu from eating pork. http://bit.ly/16YpY1

#swineflu

4/28/2009 19:32 swine_flu RT @CDCemergency CDC reminds you that you can

NOT get swine flu from eating pork. http://bit.ly/16YpY1

#swineflu

4/28/2009 20:40 SwineFlu RT CDCemergency: CDC reminds you that you can NOT

get swine flu from eating pork. http://bit.ly/16YpY1

#swineflu http://ow.ly/4kKe

4/29/2009 0:50 MSNHealth RT @cdcemergency CDC reminds you that you can NOT

get swine flu from eating pork. http://bit.ly/16YpY1

#swineflu

4/29/2009 5:25 N1H1CDC CDC reminds you that you can NOT get swine flu from

eating pork. http://bit.ly/16YpY1 #swineflu

4/29/2009 7:35 PublicHealth @cdcemergency CDC reminds you that you can NOT get

swine flu from eating pork. http://bit.ly/16YpY1 #swineflu

4/29/2009 9:02 ForceHealth RT @CDCemergency CDC reminds you that you can

NOT get swine flu from eating pork. http://bit.ly/16YpY1

#swineflu

4/29/2009 14:44 kcnews RT @CDCemergency CDC reminds you that you can

NOT get swine flu from eating pork. http://bit.ly/16YpY1

#swineflu

4/30/2009 2:35 H1N1CDC CDC reminds you that you can NOT get swine flu from

eating pork. http://bit.ly/16YpY1

14

Obviously there are other criteria that can also evaluate the quality of a recommendation, but in this study we focus exclusively on these three basic measures. The above discussions are illustrated in Figure 2 where rounded rectangles represent micro-blogs, and solid/open shapes represent news stories relevant/irrelevant to the user interest. Directed links indicate the flow of a news story, and each link is associated with a numeric label representing the units of time spent. We define a diffusion path as the route through which a news story flows from its source to other micro-blogs. In Figure 2, three diffusion paths co-exist with each corresponding to a news story. In terms of recommending micro-blogs, many combinations turn out to have the largest coverage, such as the sets 1, 2, 3 , { 3, 5 , and 7 . However, when considering other measurements, each combination has its own advantages and limitations. 1, 2, 3 is relatively weak in RE but competitive in DT; { 3, 5 and 7 have a better RE score but spend longer DT in capturing each story. Such examples show that it is possible to customize the combination of micro-blogs and achieve various recommendation objectives.

Figure 2: An Example of Diffusion Path

15

16

3.2 The Recommendation Framework

Inspired by the work of Krause et.al. (2008) on outbreak detection in water distribution networks, we have developed a diffusion-based recommendation approach. In general, our approach is to design a set function b , and associate each recommendation set I with a real number b I , the benefit score. Given user preferences, we intend to maximize this benefit score and recommend the user an optimized result set. To this end, we have quantitatively assessed each candidate micro-blog from the above three diffusion-related aspects. Suppose that we can identify a set of major news stories S during a certain period and subsequently reconstruct their diffusion paths, we denote the raw observations for micro-blog m as SC m , RE m , and DT m . As described in the previous section, SC m is a subset of S whose member stories are captured by m ; RE m is the set of posts one has to receive and read from m in order to discover SC m ; and DT m indicates a set of delay time corresponding to each story in . In case story is not captured, we let DT m ∞. Based on these raw observations, we assign a score vector b m b SC m ,b RE m ,b DT m to micro-blog m, which quantifies the benefit incurred by following m based on the diffusion histories of S. The components of the score vector are set functions b X · ,X SC,RE,DT that can transform the raw observation sets into real numbers. More concretely, b SC m equals the number of stories covered by m , denoted by |SC m |. This set function takes a simplifying assumption that all diffusion stories are of equal importance. In practice, we have also developed an improved set function by assigning an importance weight w to story . The importance weight w could be defined as the number of micro-blogs in the community who capture , which can be estimated from the constructed diffusion paths. Based on this weighted function, b SC m equals ∑w SC , and a micro-blog gains a higher benefit score for capturing more

17

“important” stories. For the second component of the score vector, we define b RE m |RE m |, where |RE m | is the size of set RE m . Note that RE takes a negative value such that smaller |RE m | can lead to a higher benefit. The last component of the score vector transforms the DT m set as follows: for any , we assume that the score of DT drops exponentially with increased delay time, determined by b DT, m e β· DT where β is a positive scalar. The cumulative score for the entire story set is then defined as the sum of individual scores, b DT m ∑e β· DT S . Note that if none of the stories is captured by m , we consider no benefits can be obtained from following m so m is removed from the candidate set.

The score vector can be also associated with multiple micro-blogs to measure their aggregated benefits. Given a set of micro-blogs M , SC M and RE M can be expressed as SC m and RE m respectively. For any s S , the delay time of by subscribing M is the minimum delay time of subscribing m M , denoted as min M DT m . DT M can now be readily written as min M DT m S .

Finally, a score vector b M

b SC M ,b RE M ,b DT M can be built based on these raw measurements.

3.3 Multicriterion Optimization

We now formulate the micro-blog recommendation problem in an optimization framework. Given the diffusion story set and the micro-blog set M , we aim to identify a subset of at most micro-blogs I M and |I| that optimizes the benefit score of b I . In this optimization problem, we intend to simultaneously optimize multiple objectives.

However, the three objectives might be in conflict, and the situation can arise that two

recommendations I and J are incomparable, e.g. b SC I b SC J , but b DT I b DT J . In this case, we use the Pareto-optimality (Boyd and Vandenberghe 2004). A

recommended set I is called Pareto-optimal if there does not exist another

recommendation J such that b X J b X I for all measurements X SC,RE,DT , and b Y J b Y I for at least one measurement Y SC,RE,DT . One common approach for

finding such Pareto-optimal sets is scalarization (Boyd and Vandenberghe 2004; Hayes

et al. 2007). By choosing weights λSC,λRE andλDT, we can optimize an objective

. Any solution that function b M ∑λX b X M

X SC,RE,DT as an alternative to b M

, and by adjusting exogenous optimizes b M is guaranteed to be Pareto-optimal to b M

parameterλX, our recommendation can take full consideration of all three aspects, but

also allowing varying degree of emphasis depending on user preferences1. In practice,

some users might expect to receive as many timely posts as possible to get informed on

everything new about the topic of interest without caring too much on being flooded by

too many posts, while some others expect to read the minimum number of posts to

capture the overall story of the interested events. These different preferences could be

operationalized as optimization problems with different weightsλX on the three

objectives. In the next subsection, we prove that this scalarized objective function is

submodular. In general, submodular function optimization is NP-hard (Khuller et al.

1999). We then propose a greedy algorithm as an effective heuristic solution.

1 In practice, we apply a Z-transformation to score vectors so that the mean and variance (mean = 0 and standard deviation = 1) of scores are equivalent across different categories.

18

3.4 A Heuristic Greedy Algorithm

Consider an arbitrary function f that maps subsets of a finite ground set U to real numbers. We call that f is submodular if it shows a “diminishing returns” property: “the marginal gain from adding an element to a set A is at least as high as the marginal gain from adding the same element to a superset of A (Kempe et al. 2003).” Formally, this submodular property can be expressed as: f A v f A f B v f B for all elements v and all pairs of sets A B. For submodular objective functions, greedy algorithm is frequently used for obtaining a bounded approximation guarantee (Nemhauser et al. 1978). The optimization objective in our recommendation problem satisfies the submodular property as well, which can be proved as follows.

Theorem 1: the scalarized objective function is submodular.

Theorem 1.1: the set function · is submodular.

Theorem 1.2: the set function · is submodular.

Theorem 1.3: the set function · is submodular.

Proof. See the appendix.

Intuitively, in the micro-blogging context, the submodular property can be understood as: reading a micro-blog after we have only read a couple of blogs provides more information (and other benefits) than reading it after we have read more micro-blogs.

The submodular property can be utilized to develop a greedy hill-climbing algorithm which approximates the optimum of the problem to within a factor of 1 1/ (where is the base of the natural logarithm) (Nemhauser et al. 1978). We have followed this approach and developed a greedy algorithm, illustrated in Algorithm 1. This algorithm

19

starts with the empty recommendation set, and repeatedly adds a micro-blog to the set that maximizes the benefit score. The algorithm stops once micro-blogs are selected or the incremental benefit is less than a predefined small value . In the next Section, we evaluate this diffusion-based recommendation method using a recent emergency case on Twitter.

1 Function: Greedy ( , , , , , , )

2 , Δ ∞, 0;

3 While \ :| | andΔ do

4 ∑

, ,

5

B

6 Δ ∑

, ,

7 ∑

, ,

8 Return

Algorithm 1: A Greedy Algorithm

An Empirical Study

4.1 The H1N1 Flu Dataset

We have collected data from https://www.wendangku.net/doc/6f311703.html, using its API from May 10 to May 16, 2009 during the early outbreak of H1N1 flu. We used keywords “swine flu” and “h1n1” to search Twitter every 15 minutes throughout the week. Each time Twitter search provided a maximum of 1,500 real-time messages ranked by their published time, and we have identified 1,034 unique accounts who had mentioned either keyword for more than 5 times during that week. We then continued to retrieve each user's all available tweets (up to 3,200 historical tweets.) In our data set, for a majority of users, 3,200 tweets are more than adequate to cover their two months histories, which means that we were able to collect these users' near-complete tweets since the outbreak of H1N1 Flu (late April, 2009.) In the end, a total of 1,308,800 tweets from the 1,034 candidate accounts were

20

相关文档