
Procedia Computer Science 17 (2013) 789 - 796

Information Technology and Quantitative Management (ITQM2013)

Mining Explainable User Interests from Scalable User Behavior Data

Li Jun a,*, Zhang Peng b

a State Grid Energy Research Institute, Beijing, 100052, China
b Institute of Information Engineering, Chinese Academy of Sciences, Beijing, 100093, China

Abstract

Capturing user interests from big user behavior data is critical for online advertising. Based on the user interests, advertisers can significantly reduce their advertising cost by delivering the most relevant ads for the user. The state-of-the-art user Behavior Targeting (BT) models treat user behaviors as documents, and thus use topic models to extract their interests. A limitation of these methods is that user behaviors are usually described as unexplainable hidden topics, which cannot be directly used to guide online advertising. To this end, we propose in this paper a systematic User Interest Distribution Mining (UIDM for short) Framework to extract explainable user interests from big user behavior data. In the solution, we first use the Probabilistic Latent Semantic Analysis (PLSA) to discover the relationship between users and their behaviors, which can be described as hidden topics. Then, we construct a mapping matrix between the hidden topics and user interests by manually labeling a feature entity matrix. Experiments on real-world data sets demonstrate the performance of the proposed method.

© 2013 The Authors. Published by Elsevier B.V.

Selection and peer-review under responsibility of the organizers of the 2013 International Conference on Information Technology and Quantitative Management

Keywords: Behavior Targeting; Probabilistic Latent Semantic Analysis; User Category

1. INTRODUCTION

Today, online advertisers are able to collect Internet user behavior data from the pages users visit, the links they click, the searches they make and the things they interact with. These big behavior data motivate advertisers to develop data mining models that identify potential buyers among the huge population of Internet users. As a result, they can deliver their ads only to potential buyers, instead of to all Internet users, which greatly reduces their advertising expense.

* Corresponding author. E-mail address: lijun@software.ict.ac.cn


doi:10.1016/j.procs.2013.05.101

Behavioral Targeting (BT) has become a popular research area in online advertising. It aims to discover user interests from big user behavior data. Existing BT approaches for capturing user interests can be categorized into two types: classification and clustering of big user behavior data. The former labels user behavior data to train a user interest classification model, while the latter simply groups users into different interest groups. Both have their own shortcomings: classification models need a large portion of manually labeled training examples, while clustering models cannot accurately assign a user who has multiple interests.

Recently, topic models have been widely used to solve BT tasks. These models treat user behaviors as documents, and group user interests into hidden topics. The merit of topic models is that they eliminate the ambiguity of interests by semantically clustering similar interests into the same group. The limitation is that the hidden topics are often unexplainable, and thus cannot be directly used to guide online advertising. This limitation motivates us to develop an explainable user interest model on top of topic models.

Example 1. In an online advertising system, we collect the most recent behavior data of three users u1, u2, u3, as shown in Table 1. Each user's behavior consists of her search keywords in search engines (denoted as "2" in Table 1) and the web links she clicked (denoted as "1"). We can observe that user u1 recently searched three keywords "cannon", "samsung" and "camera", and clicked the link "www.cannon.com". The second user u2 searched "jordan" and "all-star", and clicked the link "sports.sina.com/nba". We then treat each user as a document, with each behavior decomposed into a couple of words in the document. This way, topic models can be used to learn the hidden topics "T" behind these behavior data.

Table 1. The behavior data of the three users in Example 1

userid | Recent behavior data                           | Interest distribution (target)
u1     | cannon:2 www.cannon.com:1 samsung:2 camera:2   | Digital:0.8 Sports:0.05 Movie:0.15
u2     | jordan:2 sports.sina.com/nba:1 all-star:2      | Digital:0.05 Sports:0.7 Movie:0.25
u3     | comedy:2 movie:2 www.youtube.com:1 jordan:2    | Digital:0.05 Sports:0.1 Movie:0.85

However, the hidden topics "T" are usually unexplainable. Thus, we want to map these topics into explainable user interest categories. For simplicity, we assume there are only three user interest categories in this problem, Y = {Digital, Sports, Movie}. The aim is therefore to obtain user interest categories (i.e., the target interest distribution matrix), as shown in the last column of Table 1, through which we can easily explain the three users' interests. For example, user u1 is most likely to buy digital equipment such as a camera, and thus we can recommend camera ads to her.
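Before any modeling, the behavior strings in Table 1 can be decomposed into bag-of-words counts. A minimal sketch, assuming the "entity:number" format shown in the table and treating the numeric suffix as a weight (the helper name `to_bag_of_words` is illustrative, not from the paper):

```python
def to_bag_of_words(behaviors):
    """Turn a behavior string like "cannon:2 www.cannon.com:1" into a
    {entity: weight} bag-of-words, mirroring the format of Table 1."""
    bag = {}
    for token in behaviors.split():
        # rpartition handles entities that themselves contain ":" safely
        entity, _, weight = token.rpartition(":")
        bag[entity] = bag.get(entity, 0) + int(weight)
    return bag
```

Each user's bag then serves as one "document" for the topic model.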

In this paper, we present a practical User Interest Distribution Mining (UIDM for short) framework to extract explainable user interests from big user behavior data. Technically, the model has the following two steps: (1) it first treats the user-behavior matrix as a document-word matrix, and uses Probabilistic Latent Semantic Analysis (PLSA) to mine the relationship between users and their behaviors, described as hidden topics; (2) it builds a mapping matrix M between hidden topics and user interests based on a small portion of manually labeled feature entities. Experiments on real behavior data sets demonstrate the performance of the method.

The rest of the paper is structured as follows. Section 2 systematically gives the solution. Section 3 reports experimental results. We conclude the paper in Section 4.

2. SOLUTION

We first introduce the notations used in this paper. Consider an Internet user u whose behavior data is a set of vectors X = {x_1, x_2, ..., x_n}, where each vector x_i (1 ≤ i ≤ n) is a bag of text words representing a query extracted from the search engine or a URL clicked by her. Let Y = (y_1(u), y_2(u), ..., y_K(u)) be the distribution of her interests, where each y_i(u) is her preference on the i-th interest category. The learning task can be described as learning the mapping from the behavior data X to the interest distribution Y.

2.1. The procedure of UIDM

Figure 1 shows the procedure of the UIDM method. In the first step, we use the bag-of-words method to extract features from the behavior data, such as search keywords and clicked links. Each user can thus be represented as a document, and we can apply the latent semantic analysis method PLSA to model user interests. However, as the hidden topics extracted by PLSA are mixtures of unexplainable feature entities that cannot directly supervise online advertising, we build a mapping matrix between topics and interest categories. The learning task is thereby extended to a new one: how to map the user-topic matrix to the user-interest matrix. In the second step, we build this mapping, which requires labeling each hidden topic with explainable user interests. For example, in Figure 1, we build the classification distribution matrix C of all the feature items.


Fig. 1. An illustration of the UIDM method. Two steps are incorporated. The first step maps high-dimension behavior data into low-dimension hidden topics in the latent semantic space. As the hidden topics are unexplainable, the second step converts the user-topic matrix to the user-interest matrix.

2.2. Feature Selection

Generally, user behavior data is heterogeneous and sparse. A data-driven approach is to use granular events as features, such as page views and search queries. The dimensionality of search queries can be unbounded. Common approaches to feature selection evaluate terms according to their ability to distinguish a given user from all other users, and have been widely used in text categorization and clustering.

Here we use a frequency-based feature selection method. It first counts the frequency of each entity over all online users, and then selects the most frequent entities into the feature space. A feature entity is the unique identifier of the current event (e.g., a URL or a query); it sits one level above a feature, as a feature is identified by the pair (feature type, entity). In this work, we consider two types of entities: URL and search.

Thus the output of feature selection is two dictionaries, one per entity type.
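The frequency-based selection above can be sketched as follows; the `events` schema and the function name are assumptions for illustration, not the paper's code:

```python
from collections import Counter

def select_features(events, top_k=10000):
    """Frequency-based feature selection sketch.

    `events` is an iterable of (user_id, entity_type, entity) triples,
    with entity_type in {"url", "search"} (an illustrative schema).
    Returns one dictionary per entity type, mapping entity -> feature id,
    keeping only the top_k most frequent entities of each type."""
    counts = {"url": Counter(), "search": Counter()}
    for _user, etype, entity in events:
        counts[etype][entity] += 1
    return {
        etype: {entity: idx
                for idx, (entity, _cnt) in enumerate(counter.most_common(top_k))}
        for etype, counter in counts.items()
    }
```

The two returned dictionaries are exactly the URL and search vocabularies fed into the later PLSA step.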

2.3. Latent Semantic Analysis

In this section, we introduce the semantic analysis algorithm for the user historical actions (queries and clicks). The aim is to find the latent relationship between users and their behaviors. Since we treat each query as a term entity, each user can be represented by a bag-of-words.

Formally, given a collection of users u_i ∈ U = {u_1, u_2, ..., u_n}, when using the PLSA method for clustering, we first define a latent topic model that associates an unobserved latent variable z_k ∈ Z = {z_1, z_2, ..., z_l} with each occurrence of a behavior t in a user u. Suppose t_j ∈ T = {t_1, t_2, ..., t_m} is a feature entity, where T represents the vocabulary of all features used by all users. We use T_{u_i} as the set of all feature entities in u_i. This way, we have Eq. (1),

T = ∪_{i=1}^{n} T_{u_i}    (1)

Then, we define the co-occurrence matrix N = [cnt(u_i, t_j)], where cnt(u_i, t_j) is the number of times behavior t_j was performed by u_i. To semantically analyze a user's purchasing preference, the latent topic z_k ∈ Z = {z_1, z_2, ..., z_l} is used to build the relationship between users and their behaviors. From the user's perspective, topics are strongly related to interests.

From the generative-model point of view, the construction can be done in three steps:

1. select a user u with probability P(u);

2. choose a latent topic z with probability P(z\u);

3. generate a behavior t with probability P(t\z).
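The three-step generative process above can be sketched as ancestral sampling; the dictionary-based schema and toy distributions are illustrative assumptions, not the paper's implementation:

```python
import random

def generate_behavior(p_u, p_z_given_u, p_t_given_z, rng=None):
    """Sample one (user, topic, behavior) triple from the generative
    process: P(u), then P(z|u), then P(t|z). Distributions are dicts
    mapping outcome -> probability (an illustrative representation)."""
    rng = rng or random.Random(0)

    def draw(dist):
        r, acc = rng.random(), 0.0
        for item, p in dist.items():
            acc += p
            if r <= acc:
                return item
        return item  # guard against floating-point rounding

    u = draw(p_u)               # 1. select a user u with probability P(u)
    z = draw(p_z_given_u[u])    # 2. choose a latent topic z with P(z|u)
    t = draw(p_t_given_z[z])    # 3. generate a behavior t with P(t|z)
    return u, z, t
```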

Assume the observations (t, u) of behavior features and users are generated independently, and that the behavior t and the user u are conditionally independent given the latent topic z. The joint probability P(u, t) of behavior t co-occurring with user u can then be calculated as:

P(u, t) = P(u) Σ_z P(z | u) P(t | z)    (2)

Based on the Bayes' rule, the above equation can be further rewritten as:

P(u, t) = Σ_z P(z) P(u | z) P(t | z)    (3)

Eq. (3) is the symmetric formulation, where u and t are both generated from the latent class z in similar ways (using the conditional probabilities P(u|z) and P(t|z)). Eq. (2) is the asymmetric formulation, where, for each user u, a latent class is chosen conditionally on the user according to P(z|u), and a feature entity is then generated from that class according to P(t|z). The graphical model representation is shown in Figure 2.


Fig. 2. An illustration of modeling the user behavior in a graphical model.

If we treat a user as a document and a feature entity as a term, estimating the relationship between users and behaviors with PLSA becomes the same problem as in text mining. The best parameters P(t|z), P(u|z) and P(z) can then be determined by maximizing the user-behavior log-likelihood function:

Max = Σ_{i=1}^{n} Σ_{j=1}^{m} cnt(u_i, t_j) log P(u_i, t_j)    (4)

Substituting Eq. (3) into Eq. (4), we have,

Max = Σ_{i=1}^{n} Σ_{j=1}^{m} cnt(u_i, t_j) log Σ_{k=1}^{l} P(z_k) P(u_i | z_k) P(t_j | z_k)    (5)

In order to solve the above problem, we use the Expectation Maximization (EM) approach, which iterates between two steps:

1. Expectation step (E-step). Based on the current estimates of the parameters, compute the posterior probabilities P(z_k | u_i, t_j) of the latent variable.

2. Maximization step (M-step). Maximize the expected complete-data log-likelihood E[L_c], updating P(z_k), P(u_i | z_k) and P(t_j | z_k).
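The EM iteration above can be sketched in plain Python. This is a didactic reconstruction under the symmetric formulation of Eq. (3), not the authors' implementation:

```python
import random

def plsa_em(counts, users, terms, n_topics, n_iter=50, seed=0):
    """Minimal PLSA via EM. counts maps (user, term) -> cnt(u_i, t_j).
    Returns P(z) as a list, and P(u|z), P(t|z) as lists of dicts."""
    rng = random.Random(seed)

    def norm(d):
        s = sum(d.values())
        return {k: v / s for k, v in d.items()}

    # random initialization of the parameters
    p_z = [1.0 / n_topics] * n_topics
    p_u_z = [norm({u: rng.random() + 1e-3 for u in users}) for _ in range(n_topics)]
    p_t_z = [norm({t: rng.random() + 1e-3 for t in terms}) for _ in range(n_topics)]

    for _ in range(n_iter):
        # E-step: posterior P(z_k | u_i, t_j) for each observed pair
        post = {}
        for (u, t) in counts:
            joint = [p_z[k] * p_u_z[k][u] * p_t_z[k][t] for k in range(n_topics)]
            s = sum(joint)
            post[(u, t)] = [j / s for j in joint]
        # M-step: re-estimate P(z), P(u|z), P(t|z) from expected counts
        new_u = [dict.fromkeys(users, 1e-12) for _ in range(n_topics)]
        new_t = [dict.fromkeys(terms, 1e-12) for _ in range(n_topics)]
        totals = [1e-12] * n_topics
        for (u, t), c in counts.items():
            for k in range(n_topics):
                w = c * post[(u, t)][k]
                new_u[k][u] += w
                new_t[k][t] += w
                totals[k] += w
        p_u_z = [norm(d) for d in new_u]
        p_t_z = [norm(d) for d in new_t]
        p_z = [tot / sum(totals) for tot in totals]
    return p_z, p_u_z, p_t_z
```

The asymmetric quantities P(z|u) used later can be recovered from these parameters via Bayes' rule.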

Finally, our UIDM method evaluates the probability P(z_k | u_i) between each user u_i and each topic z_k through the EM procedure above. The topics are unexplainable and cannot explicitly denote interests, so the second step builds a bridge between topics and interests. Since the topic matrix (T in Fig. 1) and the user-topic distribution matrix (D in Fig. 1) can be evaluated by the PLSA algorithm on the user-behavior data sets, the original problem can be converted into a new one: how to obtain the final user-interest matrix.

To solve this problem, note that the number of labeled feature entities is limited compared to the total feature space. We therefore propose to use SVM classifiers to predict the interests of all feature entities, and then integrate the predicted results with the labeled features to obtain the feature-interest matrix C shown in Fig. 1. Based on the matrices C and T, we multiply them to get the topic-interest matrix M, from which the final user-interest distribution matrix Y is computed.
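The final mapping can be illustrated as two matrix products: the topic-feature matrix times the feature-interest matrix C gives the topic-interest matrix M, and the user-topic matrix D times M gives the user-interest matrix Y. All numbers below are made up for illustration:

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# T: topic-feature matrix, P(t|z) from PLSA (toy numbers)
T = [[0.7, 0.3, 0.0],
     [0.0, 0.2, 0.8]]
# C: feature-interest matrix from labels plus SVM predictions (toy numbers)
C = [[1.0, 0.0],
     [0.5, 0.5],
     [0.0, 1.0]]
M = matmul(T, C)   # topic-interest mapping matrix
# D: user-topic distribution matrix, P(z|u) from PLSA (toy numbers)
D = [[0.9, 0.1]]
Y = matmul(D, M)   # final user-interest distribution
```

Each row of Y is one user's explainable interest distribution over the categories.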

3. EXPERIMENTS

In this section, we use real-world Internet user search sessions recorded by a popular commercial search engine to empirically validate the effectiveness of the proposed UIDM method. All experiments were conducted on two Linux servers with Intel Xeon E5620 (2.40GHz) ×16 CPUs and 24GB memory. All the source code and test data can be downloaded from http://streammining.org.

3.1. Data Set

A one-day behavior log data set was collected from a commercial search engine. Specifically, the log data contain users' URL clicks, page visits and search queries. A data preprocessing step was applied to remove users that have more than 100 clicks per hour.

Table 2. The interest-category system

Interest category | Number of labeled features
finance           | 8976
sports            | 1644
healthy           | 1738

Table 2 shows the interest-category system. Meanwhile, we labeled 16,860 URLs for classifier training. In addition, we also labeled queries using reference information from the search engine.

3.2. Benchmark methods and Measures

We implemented two benchmark methods for comparison.

(1) A UIDM method with all feature entities (UIDM): In our experiments, we use base SVM classifiers to predict the interests of all feature entities, and integrate the predicted results with all labeled feature entities to build the topic-interest matrix M as shown in Figure 1.

(2) Label-based Categories Statistics (LCS for short): In this framework, the interest evaluation is based on the labeled hosts and queries. For every behavior of user u, if the behavior is labeled, we directly update the interest distribution matrix by adding the labeled vector.

The accuracy of our UIDM method is measured by R_pv and R_uv, which are defined as follows:

R_pv(c, w) = cnt_pv(w, c) / sum_pv(w)    (6)

R_uv(c, w) = cnt_uv(w, c) / sum_uv(w)    (7)

where c is the corresponding category, w is the test web site, sum_pv(w) is the number of page views on w, and sum_uv(w) is the number of users who visited w. For each web site, the larger R(c, w) is, the more accurate the result is.
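As a sketch, the user-visit measure R_uv for one site can be computed by simple counting; the `visits` schema is an assumption for illustration, not the paper's evaluation code:

```python
def r_uv(visits, category, site):
    """Fraction of users visiting `site` whose predicted interest matches
    `category`. `visits` is a list of (user, site, predicted_category)
    triples (an illustrative schema)."""
    users_on_site = {u for u, s, _c in visits if s == site}
    matching = {u for u, s, c in visits if s == site and c == category}
    return len(matching) / len(users_on_site) if users_on_site else 0.0
```

The page-view measure R_pv is analogous, counting page views instead of distinct users.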

On the other hand, we use P_avg to evaluate the overall performance of the different mining frameworks, which is defined as follows:

P_avg = (1/n) Σ_{i=1}^{n} p(k, i)    (8)

where n is the size of the label set and k is the number of categories.

3.3. Experimental results

We compare the two methods under different parameter settings of topics.

The number of topics N: To study the impact of N on the final prediction accuracy, we use the R-measure, which integrates both interest and behavior, to evaluate the UIDM method. Figure 3 shows the averaged R_pv and R_uv under different numbers of topics on our data set. We observe that in most cases, 13 topics yield the best results. In addition, to reduce memory space, only the results of the top three web sites are provided. The x-axis of the figure stands for the number of topics, and the y-axis stands for precision. To sum up, in terms of the R-measure, UIDM always performs better than the benchmark methods on our advertising data.

Fig. 3. The result of topic vs. R-Measure

4. Conclusion

Behavior Targeting (BT) plays an important role in online advertising. The state-of-the-art BT approaches use topic models for user interest mining. These models, albeit effective, are often unexplainable, and thus cannot be directly applied. In this paper, we propose a systematic User Interest Distribution Mining (UIDM) framework to accurately predict long-term interests from big user behavior data. UIDM extends the Probabilistic Latent Semantic Analysis (PLSA) model for better understanding and explanation of user interests. An interesting future research direction is to parallelize the PLSA model using the Map-Reduce procedure to handle big user behavior data.
