I Introduction
As the number of online resources and Web documents continues to increase, the need for better organizational structures that guide readers towards the information they seek increases. Hierarchies and taxonomies are invaluable tools for this purpose. Taxonomies are widely used in libraries via the Library of Congress System or the Dewey Decimal System, and hierarchies were a fixture of the early World Wide Web; perhaps the most famous example being the Yahoo search engine, which originally was a taxonomiccollection of hyperlinks organized by topic. These systems were developed because their effectiveness at topical organization and their logarithmic depth allowed users to quickly find the relevant documents for which they were searching.
Unfortunately, taxonomy curation of documents, articles, books, etc. is mostly a manual process, which is only possible when the number of curated documents is relatively small. This process becomes increasingly impractical as the number of documents grows to Webscale, and has motivated research towards the automatic inference of taxonomies (Adams et al., 2010; Blei et al., 2003; Ho et al., 2012; Chambers et al., 2010; Chang and Blei, 2010; Mimno et al., 2007).
Most document repositories contain linkages between the documents creating a documentgraph. These links provide, among other things, proper context to the terms and topics in each document. Documentgraphs are especially common in nonfiction and scientific literature, where citations are viewed as interdocument links. Similarly, the World Wide Web (WWW) can be considered to be a single, very large documentgraph, where Web pages represent documents and hyperlinks link documents. Web sites on the WWW could also be considered document graphs because a Web site is simply a subgraph of the WWW. Web site subgraphs, in particular, are a collection of documents with a specific and purposeful organizational structure, which are often carefully designed to guide the user from the entry page, i.e., a homepage, to progressively more specific Web pages.
Similarly, scientific literature can be categorized into a hierarchy of increasingly specific scientific topics by their citation links, and encyclopedia articles can be categorized into a hierarchy of increasingly specific articles by their cross references. Thus, we assert that most documentgraphs, or, more generally, information networks, contain hidden node hierarchies.
i.1 Taxonomies versus Hierarchies
In this paper we draw specific distinctions between a hierarchy and a taxonomy
. A taxonomy is defined to be a classification of objects into increasingly finer granularities, where each nonleaf node is a conceptual combination of its children. A biological taxonomy is a canonical example of this definition because a classified species, say
homo sapiens (i.e., humans), can only be placed at a leaf in the taxonomy; the inner nodes, e.g., primate, mammal, animal, do not declare new species, rather they are conceptual agglomerations of species. Furthermore, each species is described by its path through the taxonomy. For example, homo sapiens, can be more generally described as primates, mammals and as animals (among others).A hierarchy, on the other hand, is an arrangement of objects where some objects are considered to be above, below or at the same level as others. This necessarily means that objects of a hierarchy live at the internal nodes. For example, a business, government or military chain of command is a hierarchy because the president is a specific object that is above the generals, which are above the captains, and so on. In this case, highlevel nodes like president are not agglomerations of captains just as the CEO is not an aggregation of managers or clerks.
i.2 Hierarchies of Documents
In the similar studies, “document hierarchies” were not actually hierarchies of documents in the literal sense. For example, Hierarchical LDA (hLDA) (Blei et al., 2004, 2010), TopicBlock (Ho et al., 2012), and the Tree Structured Stick Breaking (TSSB) model (Adams et al., 2010) learn a conceptual taxonomy in which the nonleaf topics are a combination of words and does not represent a real document in the corpus; the hierarchical Pachinko allocation model (hPAM) (Mimno et al., 2007) constructs a treelike conceptual taxonomy like hLDA, but where each topic can have multiple parents.
In these related models, only the leaves contained the actual, literal documents. Contrary to our perspective, the internal nodes of existing models contain ephemeral wordtopic distributions, rather than actual documents. See Figure 3 for a brief comparison of model outputs. The HDTM model introduced in this paper requires that inner nodes, which in previous work are made of ephemeral distributions, be literal documents. This requirement asserts that some documents are more general than others. Here we explore this assertion through examples and a review of similar assertions made in previous research.
i.2.1 Web sites as Document Hierarchies
A Web site can be viewed as a directed graph with Web pages as vertices and hyperlinks as directed edges between Web pages – excluding intersite hyperlinks. In most cases, designating the Web site entry page as the root allows for a Web site to be viewed as a rooted directed graph. Web site creators and curators purposefully organize the hyperlinks between documents in a topically meaningful manner. As a result, Web documents further away from the root document typically contain more specific topics than Web documents graphically close to the root document.
For example, the Web site at the University of Notre Dame, shown in Figure 1 contains a root Web document (the entry page), and dozens of children Web documents. Even with a very small subset of documents and edges, the corresponding Web graph can be quite complicated and messy. A breadth first traversal of the Web graph starting with the root node is a simple way to distill a document hierarchy from the Web graph. Unfortunately, a fixed breadthfirst hierarchy cannot account for many of the intricacies of real world Web graphs.
For explanation purposes, let’s assume there are four types of hyperlink edges in a Web site: (1) parenttochild links, (2) upward links, (3) shortcuts, and (4) crosstopic links. Parenttochild links direct the user from one Web page to a more topically specific Web page; e.g., a hyperlink from engineering.nd.edu to cse.nd.edu is a parenttochild hyperlink because computer science is topically more specific than engineering. Upward links are hyperlinks that reference a more general document; e.g., there may exist a hyperlink from engineering.nd.edu to /thearts because the engineering college would like to reference artistic happenings at the university. Shortcut links are hyperlinks that skip from very general Web documents to very specific Web documents as a way of featuring some specific topic; e.g., if a computer science professor wins a prestigious award or grant, the professor’s Web page may be linked to from the news section of the root Web page. Cross topic links are hyperlinks that move across topical subtrees; e.g., the college of science may reference some working relationship with the college of engineering by creating a hyperlink between the two Web pages.
Because the goal is to infer the document hierarchy, the HDTM model is, in a sense, trying to find parenttochild links. In the event that there is more than one parenttochild link to a particular Web page, the goal is to find the best topicallyrelevant parent for each Web document in the inferred hierarchy (except the root).
Web researchers and practitioners have used hyperlink structures to organize Web documents for many years. The PageRank and HITS algorithms are two of the most famous examples of information propagation through links. PageRank, for instance, uses the model of a random Web surfer (i.e. random walker), who randomly follows hyperlinks over the Web; a current measure of a Web page’s authority corresponds to the probability that a random surfer lands upon that Web page – the PageRank score. In the HDTM model, PageRank’s notion of authority loosely corresponds to topical generality, that is, Web pages with a high random surfer probability are likely to be topically more general than others.
i.2.2 Term propagation in Document Graphs
The documentgraph structure is also used to enrich the document description by adding features to improve retrieval performance. Some of the intuition behind these previous works are helpful in framing the generative model.
A limitation of the random walker model is that it only looks at the graphical structure of the network. The word distributions found in each document are clearly an important factor to consider when generating document hierarchies. Previous work by Song, et al. (Song et al., 2004) and Qin, et al. (Qin et al., 2005) show that a given Web page can be enriched by propagating information from its children. Their relevance propagation model modifies the language distribution of a Web page to be a mixture of itself and its children according to the formula:
(1) 
where is the frequency of term in a document before propagation, is the frequency of term in document after propagation, is a child page of in the sitemap , and is a parameter to control the mixing factor of the children. This propagation algorithm assumes that the sitemap, , is constructed ahead of time.
For the purposes of Web information retrieval language models are often used to normalize and smooth word distributions. For illustration purposes, we apply a Dirichlet prior smoothing function (Zhai and Lafferty, 2004) to smooth the term distribution where the from above is used in place of the usual from the original Dirichlet prior smoothing function yielding:
(2) 
where is the distribution over all terms in , is the smoothing parameter, and the length is modified by the propagation algorithm to be .
As a result of the upward propagation function, the root document (Web site entry page) will contain all of the words from all of the Web pages in the Web site with different nonzero probabilities. The most probable words are those that occur most frequently and most generally across all documents, and are thus propagated the most.
hLDA  HDTM  


notre  computer  computer 
dame  engineering  engineering 
engineering  science  science 
university  notre  notre 
science  dame  dame 
computer  department  department 

As a small, preliminary example, Table 1 shows the top six most probable words in the top document (via text propagation) and in root topics of hLDA and HDTM of the computer science department’s Web site at the University of Notre Dame^{1}^{1}1http://cse.nd.edu. This small example reinforces the intuition that certain Web sites have a hidden hierarchical topical structure.
In the previous term propagation work, Web site sitemaps were constructed ahead of time using URL heuristics or manually. The goal of HDTM is to learn the document hierarchy automatically and in conjunction with the topical hierarchy.
i.3 Other hierarchies
Documents from many different collections exist in hidden hierarchies. While technically a Web site, Wikipedia documents and categories form a unique document graph. Wikipedia categories are especially interesting because they provide a type of ontology wherein categories have more specific subcategories and more general parentcategories. Most Wikipedia articles are are represented by at least one category description; this allows for users to drill down to relevant articles in a very few number of clicks by browsing the category graph.
A partial example of the Wikipedia category graph is shown in Figure 2. This figure illustrates how a document graph can be construed into a document hierarchy. Specifically, the Wikipedia community has handcrafted a category hierarchy, represented by colored circles and edges, on top of the articlegraph, represented by grey squares and grey edges. Although the category graph is imperfect and incomplete, Wikipedia browsers can get a sense of the granularity, topic and fit simply by viewing the article’s placement in the category hierarchy. With the Wikipedia construction in mind, our goal can be loosely interpreted as automatically inferring the Wikipedia category hierarchy from the Wikipedia article graph; this inference is tried and evaluated in Section IV.
Bibliographical networks may also be hierarchically structured. In a bibliographic network, papers or authors (wherein each author could be a collection of documents) are represented by nodes and each citation is represented by an edge in the graph.
Apart from Web and citation graphs, bioinformatics networks, for example protein networks, can also be hierarchically organized for protein fold prediction (Lin et al., 2013). The nodes in such network are proteins and two proteins are connected if they have structural or evolutionary relationships (Murzin et al., 1995).
i.4 Challenges and Contributions
The goal of this work is to construct node hierarchies from an information network using the node features (e.g., text) and internode edges. For these purposes a node which appears above, below or at the same level as another node refers to the conceptual granularity of the nodes. In other words, given an information network with an explicitly identified root, such as a Web site homepage, we aim to learn a nodehierarchy that best captures the conceptual hierarchy of the documentgraph. This problem poses three technical challenges:

DocumentTopic Inference. In document hierarchies the parent documents consist of topics that are more general than their children. This requires that the parent documents are viewed as a mixture of the topics contained within its children, and children documents should topically fit underneath their selected parent. In this paper we introduce the Hierarchical DocumentTopic Model (HDTM) which generates a coursetofine representation of the text information, wherein highlevel documents live near the top of the hierarchy, and lowlevel, more specific documents live at or near the leaves.

Selecting document placement. Placement of a document within the hierarchy drives the topic mixing during inference. Because links between edges hint at the context of the relationship between documents, the document placement in the inferred hierarchy is constrained by their edges within the original documentgraph. In other words, if an edge exists in the final, inferred hierarchy, then it must also exist in original documentgraph (not vice versa). Unlike existing models, such as hLDA (Blei et al., 2004, 2010), that select topic paths using the nested Chinese Restaurant Process (nCRP), HDTM performs document placement based on a stochastic process resembling random walks with restart (RWR) over the original documentgraph. The use of a stochastic process over the documentgraph frees the algorithm from rigid parameters; perhaps most importantly the adoption of the RWR stochastic process instead of nCRP allows documents to live at nonleaf nodes, and frees the algorithm from the depth parameter of hLDA.

Analysis at Web sitescale. In many documentgraph collections, the number of edges grows quadratically with the number of nodes. This limits the scalability of many topic diffusion algorithms (Nallapati et al., 2011; Furukawa et al., 2008). An important sideeffect of the RWR process is the ability to adapt the HDTM model inference algorithm into a distributed graph processing system that is capable of processing billionnode graphs.
The remainder of this paper is organized as follows: After reviewing the related work we introduce the HDTM model and show how inference is performed. Next we show how the inference can be adapted to a large scale graph processing system in order to run on Webscale data sets. The experiments section describes a myriad of tests that were run on various data sets and the quantitative and qualitative evaluations that were performed. We conclude with a examples of realworld use cases, discuss avenues for future research, and provide a link to HDTM’s source code.
Ii Related Work
Initial efforts in hierarchical clustering used greedy heuristics such as singlelink or completelink agglomoration rules to infer dendrograms
(Willett, 1988), in which the root node is split into a series of branches that terminate with a single document at each leaf. Ho, et al., point out that manuallycurated Web hierarchies like the Open Directory Project^{2}^{2}2http://www.dmoz.org are typically flatter and contain fewer inner nodes than agglomerative clustering techniques produce (Ho et al., 2012). Other hierarchical clustering algorithms include topdown processes which iteratively partition the data (Zhao et al., 2005), incremental methods like CobWeb (Fisher, 1987), Classit (Gennari et al., 1989), and other algorithms optimized for hierarchical text clustering.The processes that typically defines most hierarchical clustering algorithms can be made to fit in a probabilistic setting that build bottomup hierarchies based on Bayesian hypothesis testing (Heller and Ghahramani, 2005). On the other hand, a lot of recent work uses Bayesian generative models to find the most likely explanation of observed text and links. The first of these hierarchical generative models was hierarchical latent Dirichlet allocation (hLDA). In hLDA each document sits at a leaf in a tree of fixed depth as illustrated in Figure 3(a). Note that all nonleave nodes in Figure 3(a) are conceptual topics containing word distribution instead of a document. Each document is represented by a mixture of multinomials along the path through the taxonomy from the document to the root. Documents are placed at their respective leaf nodes stochasically using the nested Chinese restaurant process (nCRP) along side an LDAstyle word sampling process.
NCRP is a recursive version of the standard Chinese Restaurant Process (CRP), which progresses according to the following analogy: An empty Chinese restaurant has an infinite number of tables, and each table has an infinite number of chairs. When the first customer arrives he sits in the first chair at the first table with probability of 1. The second customer can then chose to sit at an occupied table with probability of or sit at a new, unoccupied table with probability of , where is the current customer, is the number of customers currently sitting at table , and is a parameter that defines the affinity to sit at a previously occupied table.
The nested version of the CRP extends the original analogy as follows: At each table in the Chinese restaurant are cards with the name of another Chinese restaurant. When a customer sits at a given table, he reads the card, gets up and goes to that restaurant, where he is reseated according to the CRP. Each customer visits restaurants until he is finally seated and is able to eat. This process creates a tree with a depth of and a width determined by the parameter. This process has also been called the Chinese Restaurant Franchise because of this analogy (Blei et al., 2003).
Adams, et al. proposed a hierarchical topic model called tree structured stick breaking (TSSB), illustrated in Figure 3(c), wherein documents can live at internal nodes, rather than exclusively at leaf nodes(Adams et al., 2010)
. However, this process involves chaining together conjugate priors which makes inference more complicated, and it also does not make use of link data.
Other work along this line include hierarchical labeled LDA (hLLDA) by Petinot et al. (Petinot et al., 2011) hLLDA, as well as fixed structure LDA (fsLDA) by Reisinger and Pasca (Reisinger and Paşca, 2009) which modify hLDA by fixing the hierarchical structure and learning hierarchical topic distributions. The hierarchical pachinko allocation model(hPAM), shown in Figure 3(b), produces a directed acyclic graph (DAG) of a fixed depth allowing for each internal (nondocument) node to be represented a mixture of more abstract, i.e., higher level, topics (Mimno et al., 2007).
In networkonly data, community discovery is the process of finding selfsimilar group, or clusters. The SHRINK algorithm creates hierarchical clusters by identifying tightlyknit communities and by finding disparate clusters by looking for hubs and other heuristics (Huang et al., 2010). Clauset, et al, discover dendrograms by Monte Carlo sampling (Clauset et al., 2008); however, dendrograms poorly represent the manually curated hierarchies and taxonomies that we are pursuing.
Stochastic block models (SBM) are an alternative line of network clustering research that partitions nodes into communities in order to generatively infer link probabilities (Holland et al., 1983). Several extensions to the original SBM have since been proposed (for a survey see (Goldenberg et al., 2009)). One downside to blockmodel processes is that they assign probabilities to every possible edge requiring complexity in every sampling iteration. Furthermore, SBM methods typically are not concerned with topical/conceptual properties of the nodes.
Because HDTM merges document text and interdocument links into a single model, we assume that the words and their latent topics fit within the link structure of the graph, and that the graph structure explains topical relationships between interlinked documents. Topic Modeling with Network Structure (TMN) is similar in this regard because it regularizes a statistical topic model with a harmonic regularizer based on the graph structure in the data; the result is that topic proportions of linked documents are similar to each other (Mei et al., 2008). However, hierarchical information is not discovered nor can be easily inferred from this model.
Topicsensitive PageRank combines document topics with the PageRank algorithm, by arguing that the PageRank score of a document ought to be influenced by its topical connection to the referring document (Haveliwala, 2002). Like TMN model, Topicsenstive PageRank does not construct any type of information network hierarchy.
Other work on generative models that combine text and links include: a probabilistic model for document connectivity (Cohn and Hofmann, 2000), the LinkPLSALDA and PairwiseLinkLDA methods (Nallapati et al., 2008), the Latent Topic Model for Hypertext (LTHM) method (Gruber et al., 2008), role discovery in social networks (McCallum et al., 2005), the authortopicmodel (RosenZvi et al., 2004)
, and others. The above models operate by encoding link probability as a discrete random variable or a Bernoulli trial that is parameterized by the topics of the documents. The relational topic model (RTM) builds links between topics, where observed links are given a very high likelihood
(Chang and Blei, 2010). The TopicBlock model combines the nonparametric hLDA and stochastic block models (Holland et al., 1983) to generate document taxonomies from text and links (Ho et al., 2012); however, TopicBlock, like hLDA, does not permit documents to reside at nonleaf nodes of the resulting tree.To apply topic modeling algorithms on webscale data, several parallel algorithms have been introduced. Newman, et al proposed an exact distributed Gibbs sampling algorithm as well as an approximate distributed Gibbs sampling algorithm that uses local Gibbs sampling and global synchronization (Newman et al., 2007). Smyth, et al introduced an asynchronous distributed algorithm that was capable of learning LDAstyle topics (Smyth et al., 2009); and Ahmed, et al recently released Yahoo LDA, which is a scalable approximate inference framework on largescale streaming data (Ahmed et al., 2012). These parallel approaches use the conventional, procedural programming paradigms, and as a result cannot guarantee statistically sound samples from their Gibbs iterations. Although conventional parallel and distributed algorithms indeed divide the document set into smaller groups, the maximum number of subgroups is subject to the number of processors. In contrast, the vertexprogramming paradigm, e.g., Pregel, GraphLab and GraphX (Low et al., 2012; Xin et al., 2013), can distribute sampling operations at a much finer granularity by treating each graphnode as an independent computing unit.
Although the distributed variants of topic inference have made significant contributions in large scale topic models on the one hand, and large scale graph processing on the other hand, we are unaware of any parallel algorithm capable of joining these two subjects to infer topical hierarchies on large scale information networks.
In contrast to the previous work, HDTM builds a hierarchy of documents from text and interdocument links. In this model, each node in the hierarchy contains a single document, and the hierarchy’s width and depth is not fixed. The distributed version of proposed algorithm has the ability to handle graph with millions of nodes and billions of tokens.
Iii Hierarchical Document Topic Model
The problem of inferring the document hierarchy is a learning problem akin to finding the single, best parent for each documentnode. Unlike previous algorithms, which discover latent topic taxonomies, the hierarchical documenttopic model (HDTM) finds hidden hierarchies by selecting edges in the document graph. This section presents a detailed description of the model.
Beginning with a document graph of documents and edges . Each document is a collection of words, where a word is an item in a vocabulary. The basic assumption of HDTM and similar models is that each document can be generated by probabilistically mixing words from among topics. Distributions over topics are represented by , which is a multinomial variable with an associated set of distributions over words , where
is a Dirichlet hyperparameter. Documentspecific mixing proportions are denoted by the vector
. ParametricBayes topic models also include a parameter that denotes the number of topics, wherein is one of possible values and is a D vector. HDTM, and other nonparametric Bayesian models, do not require a parameter as input. Instead, in HDTM there exist topics, one for each graph node, and each document is a mixture of the topics on the path between itself and the root document.In the original LDA model, a single document mixture distribution is . The process for generating a document is (1) choose a of topic proportions from a distribution , where is a Dirichlet distribution; (2) sample words from the mixture distribution for the chosen in step 1.
HLDA is an extension of LDA in which the topics are situated in a taxonomy of fixed depth . The hierarchy is generated by the nested Chinese restaurant process (nCRP) which represents as an dimensional vector, defining an level path through from root to document. Because of the nCRP process, every document lives at a leaf and the words in each document are a mixture of the topicwords on the path from it to the root.
iii.1 Random Walks with Restart
The nCRP stochastic process could not be used to infer document hierarchies because the nCRP process forces documents to the leaves in the tree. HDTM replaces nCRP with random walk with restart (RWR) (which is also known as Personalized PageRank (PPR)) (Bahmani et al., 2010). In contrast, random walk with teleportation (aka PageRank) random walks by selecting a random starting point, and, with probability , the walker randomly walks to a new, connected location or chooses to jump to a random location with probability , where is called the jumping probability^{3}^{3}3Most related works denote the jumping probability as , however, this would be ambiguous with the Dirichlet hyperparameter ..
In HDTM, the root node is fixed, either as the entry page of a Web site, by some other heuristic or manually. Therefore, for the purposes of hierarchy inference, the random walker is forced to start and restart at the root node.
Say we wish to find the RWRprobability between some node and some target node . We model this by a random walker visiting document at time . In the next time step, the walker chooses a document from among ’s outgoing neighbors in the hierarchy uniformly at random. In other words, at time , the walker lands at node with probability , where is the outdegree of some document . If at any time, there exists an edge to , i.e, an edge between the current node and the target node in the original graph , then we record the probability of that new path possibility for later sampling. Alg. 1 describes this process algorithmically. This procedure allows for new paths from the root to be probabilistically generated based on the current hierarchy effectively allowing for documents to migrate up, down and through the hierarchy during sampling.
iii.2 Generating document paths
Because a document hierarchy is a tree, each documentnode can only have one parent. Selecting a path for a document in the graph is akin to selecting a parent (and grandparents, etc.) from in the document graph . HDTM creates and samples from a probability distribution over each documents’ parent, where the probability of document being the parent of is defined as:
(3) 
where is the walkers current position at time , is the depth of in , and is the outdegree of in the hierarchy . In other words, the probability of landing at is the product of the emission probabilities from each document in the path through from to .
This random walker function assigns higher probabilities to parents that are at a shallower depth than those at deeper positions. This is in line with the intuition that flatter hierarchies are easier for human understanding than deep hierarchies (Ho et al., 2012). Simply put, the restart probability controls how much resistance there is to placing a document at successive depths.
Algorithmically, HDTM infers document hierarchies by drawing paths from the to the document . Thus, the documents are drawn from the following generative process:

Each document is assigned a topic Dir():

For each document :

Draw a path RWR()

Draw an dim topic proportion vector from Dir(), where len.

For each word :

Choose topic Mult().

Choose word Mult(), where is the topic in the th position in .


In this generative process hierarchical nodes represent documents and topics, where internal nodes contain the shared terminology of its descendants.
The illustration in Figure 4 shows the two potential outputs from a threenode graph of the Wikipedia articles: Computer Science, Data Mining and Machine Learning
. Clearly, data mining and machine learning could both be regarded as children of computer science, but there is indeed a wikilink from
Data Mining to Machine learning in the graph. There are two key ideas conveyed in this illustration.The first is that the hierarchy on the left picks edges so as to ignore the Computer Science to Machine Learning wikilink (denoted as a thingrey line), whereas the hierarchy on the right picks edges so as to ignore the Data Mining to Machine Learning wikilink. Thus, in this toy example, there are two possible hierarchy outcomes presented: the one on the left and the one on the right in Figure 4.
The second idea conveyed in this illustration is that the wordtopic distributions (in colored circles) are stored along with documents’ words (in black squares) at the same position in the hierarchy. The word distribution within each topic is constrained such that a document may only propagate terms upwards through the currently selected/sampled hierarchical topology. Thus, it is impossible, in the hierarchy on the left, for the topic distribution at Machine Learning to contain a word unique to Computer Science or Data Mining. It is similarly impossible, in the hierarchy on the right, for the topic distribution in Data Mining to contain a word unique to Machine Learning even though this is indeed possible in the hierarchical topology on the left.
Like in earlier models, there is statistical pressure to have more general terms in topics towards the root of the hierarchy. This is because every path in the hierarchy includes the root node and there are more paths through nodes at higher levels than through nodes at lower levels. Moving down the tree, the topics, and therefore the documents, become more specific.
Hyperparameters also play an important role in the shape and character of the hierarchy. The parameter affects the smoothing on topic distributions, and the parameter affects the smoothing on word distributions. The parameter is perhaps the most important parameter because it affects the depth of the hierarchy. Specifically, if is set to be large (e.g., ) then resulting hierarchy is shallow. Low values (e.g., ) may result in deep hierarchies, because there is a smaller probabilistic penalty for each step that the random walker takes.
iii.3 Inference
Exact inference on this model is intractable, so an approximation technique for posterior inference is used. The Gibbs sampling algorithm is ideal in this situation because it simultaneously allows exploration of topic distributions and potential graphical hierarchies.
The variables needed by the Gibbs sampler are: , the th word in document ; , the assignment of the th word in document ; and , the topic corresponding to document at the th level. The and variables are integrated out forming a collapsed Gibbs sampler.
The sampling is performed in two parts: (1) given the current level allocations of each word sample the path , (2) given the current state of the hierarchy, sample . In other words, we use the topic distributions to inform the path selections that make up the hierarchy, and the hierarchy topology to inform the topic distributions.
iii.3.1 Sampling document paths
The first Gibbs sampling step is to draw a path from each document to the root through the graph. The sampling distribution for a path is
(4) 
where is the count of terms in document , and are the words without document
. This equation is an expression of Bayes’ theorem where the first term represents the probability of data given some choice of path from the root, and the second term represents the probability of selecting some path.
Specifically, the second term represents the probability of drawing the path to document at depth from the RWR process. Recall that each node has an emission probability of , and a restart probability of . The probability is defined recursively:
(5) 
In other words, the probability of reaching is equal to the probability of a random walker with restart probability being at document at time .
The first term in Eq. 4 is the probability of a given word based on the current path and topic assignment :
(6) 
where counts the elements of an array that satisfy the given condition, and is the maximum depth of the current hierarchy state. The expression counts: (ii) , i.e., the number of words that do not appear in , for each (i) , i.e., the number of paths to the current document except those where the path length is . The expression counts: (iii) , i.e., the number of words such that, (ii) , i.e., the words appear in document and are situated at the end of a path of length , where (i) , i.e., is one of the topics in . is the size of the vocabulary. Eq. 6 is adapted from the standard ratio of normalizing constants for the Dirichlet distribution (Blei et al., 2010).
iii.3.2 Sampling word levels
Given the current state of all the variables, the word sampler must first pick an assignment for word in document . The sampling distribution of is
(7) 
where and . The first term is a distribution over word assignments:
(8) 
which is the smoothed frequency of seeing word in the topic at level in the path .
The second term is the distribution over levels
(9) 
where is the number of elements in the vector which satisfy the given condition. Eq. 9 abuses notation so that the product from to combines terms representing nodes at the th level in the path down to the parent of , and the second set of terms represents document at level . The symbol in Eq. 9 refers to terms representing all ancestors of a particular node, and refers to the ancestors of a node including itself.
iii.4 Distributed HDTM
A common complaint among data science practitioners is that graphical models, especially, nonparametric Bayesian graphical models, do not perform well at scale. With this in mind, we also implemented the HDTM inference algorithm in the scalable, distributed vertexprogramming paradigm.
The mechanism behind Gibbs sampling, and other Markov Chain Monte Carlo methods, requires sequential sampling steps, and execution of each step depends on the results of the previous step making Gibbs samplers, and MCMC method in general, difficult to parallelize. Approximate Distributed LDA (ADLDA) is one attempt to find approximate, distributed solutions to the serial inference problem by dividing documents into
parts where is the number of processors and initializes the topic distribution globally. Then, for every Gibbs iteration, each processor samples of the dataset using the from last Gibbs sampling iteration. When all processors are finished, a global synchronization is performed and is updated (Newman et al., 2007).Following the distribution lessons from ADLDA we sought to also implement a scalable, distributed version of HDTM. However, a major difference between the LDA/hLDA and HDTM is that hLDA uses the nCRP stochastic process to assign terms to topics, while HDTM samples paths from a graph of documents using graphbased random walk with restart method. The process of random walks over the network topology combined with the term sampling process described above is a good candidate for the vertexprogramming paradigm using frameworks like Pregel (Malewicz et al., 2010) or GraphLab (Low et al., 2012).
iii.4.1 Vertex Programming
Although MapReduce is a widely used, general purpose parallel scheme that can easily deal with scalable data, it is not optimized for iterative
computational tasks such as statistical inference or logistic regression
(Zaharia et al., 2010). This is because MapReduce materializes all intermediate results and to disk in order to tolerate task failures. Mapreduce, therefore, has relatively high I/O costs compared to other designs that keep data in memory across iterations (Lee et al., 2012).Apart from MapReduce, another scalable solution is to build a custom distributed system using message passing interface (MPI). Custom approaches are usually closer to optimal because developers can tune the code based on their own needs and minimize the unnecessary overhead. However, the drawbacks are also significant: because MPI is a barebone communication specification developers need to write their own code for job dispatching, load balancing, and dealing with node failure (Zou et al., 2013).
After evaluating aforementioned approaches, we decided to use an emerging computational paradigm called vertex programming. Vertex programming aims to improve the performance of graph/network computing by automatically distributing inmemory computation and with vertexcentric scheduling. Unlike the MapReduce paradigm, which writes every intermediate result to disk, vertex programming keeps data uptodate, inmemory and reduces I/O overhead by only materializing data to disk through periodic checkpoints. Vertexcentric scheduling views every graphvertex as an elementary computing unit and uses MPI (or some other message passing system) to transfer data over graphedges. Typically, vertexprograms are rather easy to implement, can be distributed easily, and are much more computationally efficient than conventional, procedural programming when working with iterative computational tasks (McCune et al., 2015).
Several vertex programming frameworks, including SparkGraphX, HAMA, GraphLab, GraphLab Create and GraphLab 2.0 (PowerGraph), were evaluated, and the PowerGraph framework was ultimately chosen. The decision was based, in part, from insurmountable troubles experienced during several months of implementation attempts on Spark’s GraphX framework. Ultimately, we concluded that C++ and MPI based PowerGraph was faster, more scalable and had a much smaller memory footprint than other frameworks.
iii.4.2 Distributed Inference Algorithm
The distributed HDTM inference algorithm is similar to procedural HDTM. We do not detail the entire distributed HDTM inference algorithm in this paper; however, the source code is referenced in the Section V. To fit HDTM to the vertexprogramming model, changes in sampling sequence and attention to global synchronization were required. Firstly, random walk with restart must be executed during each Gibbs iteration so that every visited node can have a random walk probability. Next, every node gathers , , and separately and decides their new path according to Eq 4 and shown in Alg 2. After a path is sampled, each node will pick sample assignments for each word across all documents/nodes in parallel according to Eq 7. A global synchronization step is required so that each document/node can update the number of words and the topic assignments globally; fortunately, because the topics and words are sampled according to the information provided in the path from root to the each document (document at level ), it suffices to update the nodes on the path from to instead of an actual global update to all nodes in . This vertexprogramming based pathglobal update function is shown in Alg 3. Furthermore, this update is executed during the synchronization barrier, which is builtin to most vertexprogramming frameworks, is highly optimized, and does not lock the global system any more than the synchronization barrier already does.
Iv Experimental Results
This section describes the method and results for evaluating the HDTM model. A quantitative and qualitative analysis of the hierarchical documenttopic model’s ability to learn accurate and interpretable hierarchies of document graphs is shown. The main evaluations explore the empirical likelihood of the data and a very large case study wherein human judges are asked to evaluate the constructed hierarchies.
Wikipedia (cat)  Wikipedia (article)  CompSci Web site  Bib. Network  



documents  609  1,957,268  1,078  4,713 
tokens  5,570,868  1,316,879,537  771,309  43,345 
links  2,014  44,673,134  63,052  8,485 
vocabulary  146,624  4,225,765  15,101  3,908 

iv.1 Data
HDTM is evaluated on four corpora: the Wikipedia category graph, the Wikipedia document graph, a computer science Web site from the University of Illinois, and a bibliographic network of all CIKM and SIGIR conferences.
The Wikipedia data set has been used several times in the past for topic modeling purpose. However, because of the computational resources needed to infer topic models, prior studies have severely constricted the dataset size. Gruber et al., crawled 105 pages starting with the article on the NIPS conference finding 799 links (Gruber et al., 2008). Ho et al. performed a larger evaluation of their TopicBlock model using 14,675 document with 152,674 links; however, they truncated each article to only the first 100 terms and limited the vocabulary to the 10,000 most popular words (Ho et al., 2012).
The Wikipedia category data set is a crawl of the category graph of Wikipedia, beginning at the category Computing as shown in Figure 2. In Wikipedia each category has a collection of articles and a set of links to other categories; however, categories don’t typically have text associated with them, so the text of each article associated with a particular category is associated to the category’s text. For example, the category Internet includes articles: Internet, Hyperlink, World Wide Web, etc. In total the crawled category graph consisted of 609 categories with text from 6,745 articles. The category graph is rather sparse with only 2,014 edges between categories, but has vocabulary size of 146,624 with 5,570,868 total tokens. We did not perform any text preprocessing procedure, including stop word removal and stemming, in any of the experiments due to empirical findings that these models are robust to the presents of stop words, etc (Mccallum et al., 2009). Such settings can also explore HDTM’s robustness in presence of a large and noisy corpus.
A computer science department Web site from the University of Illinois UrbanaChampaign was chosen as the second data set because it represents a rooted Web graph with familiar topics. By inferring the document hierarchy, the goal is to find the organizational structure of the computer science department. The intuition is that Web sites reflect the business organization of the underlying entity; thus it is expected that subtrees consisting of courses, faculty, news, research areas, etc. are found at high levels, and specific Web pages are found at lower levels in the hierarchy. Web site was crawled starting at the entry page and captured 1,078 Web pages and 63,052 hyperlinks. In total there were 15,101 unique terms from 771,309 tokens.
The bibliographic network consists of documents and titles from 4,713 articles from the SIGIR and CIKM conferences. There exist 3,908 terms across 43,345 tokens in the document collection. In this collection, links include citations between papers within the CIKM and SIGIR conferences. Due to different citation styles, vendor abbreviations, and ambiguous names, without human interference most bibliographical metadata extraction algorithms can only extract a portion of correct data (Giles et al., 1998; Ley, 2002; Tang et al., 2008). Here we choose to use the most complete citation data set available from by the authors of the ArnetMiner project (Tang et al., 2008), however the citation graph is not guaranteed to be complete. A SIGIR 1998 paper by Ponte and Croft (Ponte and Croft, 1998) was chosen to be the root document because, in the data set, it had the most incollection citations.
To construct a larger data set with finer granularity, the English Wikipedia article graph is used. In Wikipedia, every Wikiarticle has two different link types: 1) category links, which point from the article to its categories, and 2) Wikilinks, which point from the article to other Wikiarticles. The category links are important because they hint at some encoded hierarchical structure that was codified on top of Wikipedia after the articlegraph was largely developed. For example, the Wikiarticle Barack Obama has a category link to Harvard Law School alumni and Wikilinks to Harvard Law School and Dreams from My Father among others. In later experiments we infer a document hierarchy from Wikipedia and compare it to the crowdencoded Wikipedia category hierarchy. Our Wikipedia snapshot has 4,606,884 documents with 1,878,158,318 tokens, 44,739,242 links, and a vocabulary of 6,064,216 words. Because the document graph is not a single connected component, we picked the largest component in that graph as data set. This giant component contains 1,957,268 documents and 44,673,134 document to document links; 42.49% of nodes and 70% of edges are in this component. Additionally, 1,316,879,537 terms with 4,225,765 unique tokens appear in this component, which is approximately 70% of the total number of terms. As a graphpreprocessing step, we replaced all edges pointing to redirection pages with edges that point directly to the actual Wikiarticle.
iv.2 Summary of the Sampling Algorithm
Using either the traditional inference model from Sec. III.3 or the highthroughput distributed sampling algorithm from Sec. III.4 we basically sample a full hierarchy by sampling paths for each node except the root and the words in the document . Given the state of the sampler at some time , i.e., and , we iteratively sample each variable conditioned on the others as illustrated in III.2.
The conditional distribution of the latent variables in HDTM model the given documentnetwork. After running the Markov chain a sufficient number of times we approach the stationary distribution; the process of approaching the stationary distribution is called the “burnin” period. After the burnin we collect samples at a selected interval, i.e., the sampling lag.
The collected “samples” from the Markov chain are full hierarchies constructed from the selection of a path for each node and a word for each document
. Therefore each sampled hierarchy contains one estimation about position of each document in the hierarchy and the position of each word in a document. For a given sampled hierarchy, we can assess the goodness of the hierarchy by measuring the log probability of that hierarchy and the observed words conditioned on the hyperparameters:
Using this log likelihood function it is possible to pick the most sampled hierarchy that maximizes the log likelihood as the final result. Later, we will relax this assumption and use a more robust measurement that computes the mode of the sampled hierarchies and determines the certainty of the result.
iv.3 Quantitative Analysis
HDTM has some distinct qualities that make apples to apples comparison difficult. Because HDTM is the first model to generate document hierarchies based on graphs, there is nothing to directly compare against. However, some of the models in the related work perform similar tasks, and so comparisons were performed when applicable.
The related models typically perform quantitative evaluation by measuring the log likelihood on held out data or by performing some other task like link prediction. Log likelihood analysis looks at the goodness of fit on held out data. Unfortunately, the creation of a “hold out” data set is not possible, because each document, especially documents on the first or second level document, is very important to the resulting hierarchy. Removing certain documents might even cause the graph to separate, which would make hierarchy inference impossible. Instead, for the first quantitative evaluation, we compare the highest log likelihood generated from each model’s sampler.
Quantitative experiments were performed on many of the aforementioned algorithms including: HLDA (Blei et al., 2004, 2010), TopicBlock (Ho et al., 2012), TSSB (Adams et al., 2010), and fsLDA (Reisinger and Paşca, 2009). The fixed structure in fsLDA is determined by a breadth first iteration over the document graph because URL heuristics used in the original paper do not exist in most of our data sets. The depth of HLDA and TopicBlock is 4.
In all cases, a Gibbs sampler was run for 5,000 iterations; with a burnin of 2,000 and a sampling lag of 20. Figure 5(a) shows the log likelihood for each sample on the computer science Web site; the other data sets showed similarlyshaped results. Interestingly, Figure 5(b) shows that higher likelihood values are strongly correlated with hierarchies of deeper average depth.
The Gibbs sampling algorithm was run on HDTM for various values of , and Figure 6 shows the best cumulative log likelihood for each of the tested values of . We observe that HDTM with achieved the best likelihood score. Likelihood scores decreased steadily for increasing values, and HDTM with achieved the worst likelihood score.
CompSci Web site  Wikipedia (Cat)  Bib. Network  



HDTM  1.8570  148.071  0.4758 
HDTM  9.2412  148.166  0.5183 
HLDA  8.5306  50.6732  8.5448 
TopicBlock  0.2404  2.9827  0.4192 
TSSB  0.5689  0.0336  0.4655 
fsLDA  48.9149  149.622  0.6602 

Table 3 shows the results of the different algorithms on the three smaller data sets. The TopicBlock and TSSB clearly infer models with the best likelihood. The remaining algorithms, including HDTM, have mixed results. Although these log likelihood results initially seem as if HDTM performs poorly, we demonstrate in Section IV.4 that these likelihood results are expected and preferred.
iv.3.1 Largescale Analysis
Recall that the goal of HDTM is to generate document hierarchies from some original documentgraph. Thus, a positive result should somehow demonstrate that HDTM demonstrates hierarchies that are similar to the hierarchy that a human (or many humans) would create given the same document graph. With this in mind we performed a large scale experiment on the full English Wikipedia article graph. Also recall that the Wikipedia category graph establishes a hierarchy that was humanbuilt on top of Wikipedia that maintains a pseudohierarchical classification of almost every document. The structure of categories is described to be a tree rooted by Main topic classifications or Fundamental categories; however, we find that 87.01% of categories have an indegree larger than 1 indicating the actual category is more complex. We find that although the Wikipedia category graph mostly resembles a topdown (coarse to fine) hierarchy, it is technically an undirected loopy graph. As for Wikiarticles, they almost always belong to more than one category because the idea or object described in each article can rightfully have multiple classification perspectives. For example, the Wikiarticle Computer Science belongs to a category of the same name, but it also belongs to Computer Engineering and Electrical Engineering among others.
For a large scale evaluation our task is to infer a hierarchy from the full Wikipedia article graph and see how well it matches the category graph. Unfortunately, it is difficult to measure similarity between category graph and HDTMhierarchies using conventional evaluation methods like the log likelihood. Consider, for example, the category University of Notre Dame faculty, which contains Wikiarticles of faculty members from vastly different topical fields. In such cases, the bond is described primarily by the links rather than the content; nevertheless, the log likelihood score judges models based on the goodness of fit of words. Thus it is unreasonable to measure hierarchy quality by likelihood scores alone; indeed, the hierarchy with the best possible likelihood score may be very different from actual, human created hierarchy. In order to perform a more thorough evaluation we should also compare the the graph structure.
Graph topologies can be compared in many different ways. Among many options we chose DeltaCon (Faloutsos et al., 2013) to measure the similarity between HDTM’s hierarchy and the Wikipedia category hierarchy three reasons: 1) DeltaCon can calculate the similarity between two graphs with nodes that partially overlap, as observed in the messy Wikipedia category graph, 2) the underlying metric in DeltaCon is the affinity score calculated by fast belief propagation (FABP), which models the connectivity of nodes, and 3) DeltaCon can process millions of nodes in only a few hours. The DeltaCon metric returns a similarity score between the two compared graphs ranging between 0 and 1 inclusive where 0 is completely dissimilar and 1 is completely similar.
Although comparing with hLDA and its variants is important, all existing hierarchical topic models are incapable of processing on data sets at this scale. We attempted to use the hLDA implementation in the Mallet package (McCallum, 2002), but 130 hours was needed in order to compute a single Gibbs sampling iteration. Fortunately, we were able to use Graphlab’s implementation of LDA which is based on models developed by Ahmed et al. (Ahmed et al., 2012) and Smyth et al. (Smyth et al., 2009). Although LDA does not directly generate a graph, we are able to compare it with the category hierarchy by connecting the documents within a single LDAcluster to a madeup root node to simulate a hierarchy. The large scale models were inferred using 400 Gibbs sampling iterations with a burnin of 100 and a lag of 10.
In addition to HDTM and LDA, we also use DeltaCon to compare the original Wikipedia article graph and a randomly generated articlehierarchy as baselines. In the case of Wikipedia article graph, we expect the Wikipedia article graph (the full graph, not a hierarchy) to perform reasonably well because the category hierarchy is largely built on top of the article graph topology and they currently coexist and coevolve. In the case of the randomly generated article hierarchy, we randomly select a single parent for each node in the original article graph.
DeltaCon  



HDTM (=Science, )  0.046852 
HDTM (=Science, )  0.046851 
HDTM (=Barack Obama, )  0.046857 
HDTM (=Barack Obama, )  0.046856 
Random Article Hierarchy  0.046700 
Article Graph  0.044208 
LDA ()  0.026770 
LDA ()  0.037949 
Table 4 shows that the hierarchy inferred by HDTM is indeed most similar to the Wikipedia category graph, followed by the random article hierarchy. Recall that DeltaCon looks at graph topology similarity, thus the HDTM hierarchy is expected to more topologically similar to another hierarchy (even if random) than the article graph. These results demonstrate that HDTM can identify and preserve critical topological features when inferring a hierarchy from the graph. As usual, an increase in the number of LDA topics increases the goodness of fit score (in this case measured by DeltaCon instead of likelihood). Interestingly, the parameter and the selection of different roots did not significantly influence DeltaCon score. To understand why this is, recall that different rootnodes and values result in different classification perspectives, but the different perspectives are still subsets of the same category graph with similar topological properties. Hence they should have similar scores using the DeltaCon metric.
Recall that the HDTM, in its most basic form, generates a hierarchy by picking the best parent for each node/document (except the root). In the iterative Gibbs sampling process it is an almost certainty that a given node will pick different parents during different iterations. For example, say node has two parents and , it is possible that during iterations 1–5 node samples node as its parent, but then in iterations 6–20 node samples node to be its parent. Two questions come to mind: 1) which parent should be ultimately picked for node in the final output graph? and 2) what can the distribution of samples say about the certainty of our inferred graph?
In classic Gibbs sampling, maximum a posteriori estimation is determined by the the mode of the samples. For some node
, this translates to choosing the parent that appeared in the most samples as the final parent; in the above case node would chose node as its parent because node was sampled 15 times compared to only 5 samples of node . To answer the second question, and therefore model the certainty of an inferred hierarchy, we calculate the number of times some node picks its final parent over the total number of samples normalized by the number of possible choices, i.e., parents, node has. This results in a certainty score for node :(10) 
where represents the total number of samples, is the number of times the final parent is sampled, and is the indegree of node representing the total number of parents that node could pick from. In the certainty score the outside fraction is used to measure the normalized difference between the raw probability and probability of random guess. Applying this function to the above example we would calculate that the certainty score of node would be .
Figure 7 shows the probability density functions for the certainties in HDTMs parent sampling process. In the figures on both left and right we find that the probability densities appear to be polynomial distributions with interesting plateaus that end just before the 50%, 66%, and 75% certainty scores corresponding to the nearcertain (but not perfectlycertain) scores for nodes with indegree values of 2, 3, and 4 respectively. In these results, and others not displayed in this paper, we find that when the root is a general document like Science, then changing does not affect the certainty distribution. Given a relatively specific root like Barack Obama, larger values increase certainty overall.
As discussed earlier, measuring topical similarity in hierarchies can be precarious because, in many cases, documents that are correctly situated under a topically unrelated parent may still have a strong contextual association with the parent document that outweighs the topical/languageoriented similarity as determined by the parameter. As an example, consider the Wikiarticles Honest Leadership and Open Government Act and Alexi Giannoulias: even though these two articles are topically dissimilar, HDTM parameterized with a high value is likely to place both articles as children of the hierarchy’s root Barack Obama because the high values weigh the topological link as more important that the documents’ inferred topicality. If, on the other hand, HDTM was parameterized with a very low value, then Alexi Giannoulias is more likely to be situated with other state senators, and the Honest Leadership and Open Government Act is more likely to be situated with other legislation.
To quantitatively evaluate the topical and topological fitness of hierarchy together, we compare the similarily among the sets of parents and children in the inferred hierarchy against sets of Wikiarticles and their Wikipedia categories. Because we are comparing sets of items we use the Jaccard coefficient for this task. Specifically, the Jaccard coefficient is calculated by
(11) 
for each document that belongs to a set of categories , and where is the union of the corresponding categories of the nodes in ’s ancestors . The Jaccard coefficient is useful here because it gives an quantitative measure describing how well the inferred hierarchical structure matches with the humanannotated categorical structure of Wikipedia.
Figure 8 shows the relationship between Jaccard coefficients of the hierarchies. For illustration clarity, in this figure all Jaccard coefficients are binned into 10 equal intervals. We observe a clear correlation between the node’s certainty function and its Jaccard coefficient. This means that decisions that HDTM is most certain about tend to be the most similar to the humancreated hierarchies.
Recall that HDTM uses the parameter to balance the weight of topological links and topical similarities. Hence when specific nodes like Barack Obama are chosen as the root, higher will reduce Jaccard coefficient for high certainty nodes. This is because higher values favor topology rather than content during the path and parent selection process.
These results also underscore the importance of the selection of the root node. When provided a root node, which can be any node in the document graph, the inferred hierarchy is created from the perspective of that root. The hierarchy rooted at Barack Obama would therefore organize, say, congressional legislation differently than a hierarchy generated by the Wikiarticle on Science. Yet, as we’ve seen, the graph similarity metrics views the drastically different documenttopic hierarchies as topologically similar.
iv.4 Discussion
In order to properly understand the results captured in Table 3 from above, recall that log likelihood is a metric on the fit of the observations on the configuration of the model. The original work on LDA (Blei et al., 2003) found that likelihood increases as the number of topics increases. Along those lines, Chang, et al. demonstrated that more fine grained topics, which appear in models with a larger number of topics have a lower interpretability, despite having higher likelihood scores (Chang et al., 2009). Simply put, there exists a negative correlation between likelihood scores and human interpretability in LDA and similar topic models.
Applying those lessons to our experiments, recall that HDTM has as many topics as there are documents, and nonroot document topics are mixtures of the topics on the path to the root. Also recall that HLDA, TopicBlock and TSSB all generate a large number of latent topics. In HLDA and TopicBlock, there are infinitely many topics/tables in the nCRP; and practically speaking, the number of topics in the final model is much larger than the number of documents (conditioned on the parameter). In TSSB, the topic generation is said to be an interleaving of two stick breaking processes; practically, this generates even larger topic hierarchies. The fsLDA algorithm has as many topics as there are in hLDA, however, the fsLDA hierarchy is not redrawn during Gibbs iterations to fit the word distributions resulting in a lower likelihood. Simply put, the number of topics in HDTM and fsLDA hPAM, hLDA and TopicBlock TSSB.
Similarly, Figure 5(b) shows that deeper hierarchies have higher likelihood scores. This is because long documenttoroot paths, found in deep hierarchies, are able to provide a more fine grained fit for the words in the document resulting in a higher likelihood.
Therefore, the better likelihood values of HLDA, TopicBlock and TSSB are due to the larger number of topics that these models infer and not necessarily due to better model outputs. In this case, a better way to evaluate model accuracy is by some external task or by manually judging the coherence of the topics.
iv.5 Qualitative Analysis
To measure the coherence of the inferred groupings, the word intrusion task developed by Chang et al Chang et al. (2009) is slightly modified to create the document intrusion task. In this task, a human subject is presented with a randomly ordered set of eight document titles. The task for the human judge is to find the intruder, that is, the judge is asked to find which document is out of place or does not belong. If the set of documents without the intruder document all make sense together, then the human judge should easily be able to find the intruder. For example, given a set of computer science documents with titles {systems, networking, databases, graphics, Alan Turing}, most people, even noncomputer scientists, would pick Alan Turing as the intruder because the remaining words make sense together – they are all computer science disciplines.
For the set {systems, networking, RAM, Minesweeper, Alan Turing}, identifying a single intruder is more difficult. Human judges, when forced to make a choice, will choose an intruder at random, indicating that the grouping has poor coherence.
To construct a set of document titles to present to the human judge, a grouping from the hierarchy is selected at random, and select seven documents are further selected at random from the grouping. If the there are fewer than 7 documents available in the selected grouping, then we select all of the documents available; groupings of size less than 4 are thrown out. In addition to these documents, an intruder document is selected at random from among the entire collection of documents minus the documents in the test group. Titles are then shuffled and presented to the human judges.
iv.5.1 Comparison Models
Extracting document groupings for evaluation is slightly different for each model. HDTM and fsLDA store a document at each node in the hierarchy. A grouping is selected by first picking a document at random, and then choosing its siblings. TopicBlock and HLDA store documents at the leaves of the taxonomy, which often include several documents. A grouping is selected from these models by first picking a document at random, and then choosing the other documents in the leaftopic.
The hierarchies that the TSSB model constructed allowed multiple documents to live at inner nodes. Attempts to evaluate groupings on inner nodes with more than 4 documents were unsuccessful. Nodes with 4 or more siblings were also difficult to find because the hierarchies that were generated were too sparse to find practical groupings. Thus human judges with TSSB groupings could not be found.
Each documentgraph collection had different types of labels presented to the judges. The CompSci web site collection was labeled by the Web Page title and URL; the Wikipedia collection was labeled by the category title as shown in Figure 9; the bibliography network was labeled by the title of the paper.
iv.5.2 Analyzing human judgments
The intruder detection tasks described above were offered on Amazon Mechanical Turk. No specialized training is expected of the judges. 50 tasks were created for each data set and model combination; each user was presented with 5 tasks at a time at a cost of $0.07 per task. Each task was evaluated by 15 separate judges. In order to measure the trustworthiness of a judge, 5 easy tasks were selected, i.e., groupings with clear intruders, and goldstandard answers were created. Judges who did not answer 80% of the goldstandard answers correctly are thrown out and not paid. In total the solicitation attracted 31,494 judgments, across 14 models of 50 tasks each. Of these, 13,165 judgments were found to be from trustworthy judges.
The model precision is measured based on how well the intruders were detected by the judges. Specifically, if the intruder word is from model and task , and is the intruder selected by the human judge on task in model then
(12) 
where is the indicator function and is the number of judges. The model precision is basically the fraction of judges agreeing with the model.
Figure 10 shows boxplots of the precision for the four models on three corpora. In most cases, HDTM performs the best. As in Chang et al. (2009)
, the likelihood scores do not necessarily correspond to human judgments. Paired, twotailed ttests of statistical significants (
) performed between HDTM and and the other models are represented by and in Figure 10 respectively.The bibliography network data had relatively low precision scores. This is almost certainly because it was more difficult for the judges, who were probably not computer scientists, to differentiate between the topics in research paper titles. Figure 11 shows a small portion of the document hierarchy for the bibliographic network data set constructed with HDTM . The root document has 20 children in the hierarchy despite having 145 incollection links. The remaining 120 documents live deeper in the hierarchy because HDTM has determined that they are too specific to warrant a first level position, and have a better fit in one of the subtrees.
Recall that each document is associated with the topics from itself to the root, where the root is a single, general topic. The seven most probable terms at the root level are also shown adjacent to the root’s title in Figure 11. These terms, like in HLDA and TopicBlock, are terms that are general to the entire collection. Similar sets of words exist at each node in the hierarchy, but are not shown in this illustration to maintain clarity.
iv.6 Reproducibility
HDTM source code, analysis code and the scripts which generated the results found in this paper can be downloaded from https://github.com/nddsg/HDTM/releases/tag/kais. The Wikipedia, Web site and bibliographical data are all publicly available and free to download and are replicated in our own data repository.
V Conclusions
Hierarchical documenttopic model (HDTM), is a Bayesian generative model that creates document and topic hierarchies from rooted document graphs. The initial hypothesis was that document graphs, such as Web sites, Wikipedia and bibliographic networks contain a hidden hierarchy. Unlike most previous work, HDTM allows documents to live at nonleaf nodes in the hierarchy, which requires the random walk with restart path sampling technique. An interesting sideeffect of the random walker adaptation is that the path sampling step, Eq. 4, is much faster and easier to scale than the nCRP because RWR only creates a sampling distribution for the parents of a document, whereas the nCRP process creates a sampling distribution over all possible paths in the taxonomy.
Several quantitative experiments comparing HDTM with related models were performed. However, the results show that likelihood scores are a poor indicator of hierarchy interpretability, especially when the number of topics are different between comparison models. A large qualitative case study was also performed which showed that the cohesiveness of the document groupings generated by HDTM were statistically better than many of the comparison models despite the poor likelihood scores.
Acknowledgements.
This work is sponsored by an AFOSR grant FA95501510003, and a John Templeton Foundation grant FP053369M.References
 Adams et al. (2010) R. P. Adams, Z. Ghahramani, and M. I. Jordan, in NIPS (2010) pp. 19–27.
 Blei et al. (2003) D. M. Blei, A. Y. Ng, and M. I. Jordan, The Journal of Machine Learning Research 3, 993 (2003).
 Ho et al. (2012) Q. Ho, J. Eisenstein, and E. P. Xing, in WWW (New York, New York, USA, 2012) p. 739.
 Chambers et al. (2010) A. Chambers, P. Smyth, and M. Steyvers, in NIPS (2010) pp. 334–342.
 Chang and Blei (2010) J. Chang and D. M. Blei, Annals of Applied Statistics 4, 121 (2010).
 Mimno et al. (2007) D. Mimno, W. Li, and A. McCallum, in ICML (New York, New York, USA, 2007) pp. 633–640.
 Blei et al. (2004) D. M. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum, in NIPS (2004) pp. 17–24.
 Blei et al. (2010) D. M. Blei, T. L. Griffiths, and M. I. Jordan, Journal of the ACM 57, 7 (2010).
 Song et al. (2004) R. Song, J.R. Wen, S. Shi, G. Xin, T.Y. Liu, T. Qin, X. Zheng, J. Zhang, G.R. Xue, and W.Y. Ma, in TREC (2004).
 Qin et al. (2005) T. Qin, T.Y. Liu, X.D. Zhang, Z. Chen, and W.Y. Ma, in SIGIR (New York, New York, USA, 2005) p. 408.
 Zhai and Lafferty (2004) C. Zhai and J. Lafferty, ACM TOIS 22, 179 (2004).
 (12) http://cse.nd.edu.
 Lin et al. (2013) C. Lin, Y. Zou, J. Qin, X. Liu, Y. Jiang, C. Ke, and Q. Zou, PloS one 8, e56499 (2013).
 Murzin et al. (1995) A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, Journal of molecular biology 247, 536 (1995).
 Nallapati et al. (2011) R. Nallapati, D. A. McFarland, and C. D. Manning, in AISTATS, Vol. 15 (2011) pp. 543–551.
 Furukawa et al. (2008) T. Furukawa, Y. Matsuo, I. Ohmukai, K. Uchiyama, and M. Ishizuka, in ICWSM (2008).
 Willett (1988) P. Willett, Information Processing & Management 24, 577 (1988).
 (18) http://www.dmoz.org.
 Zhao et al. (2005) Y. Zhao, G. Karypis, and U. Fayyad, Data Mining and Knowledge Discovery 10, 141 (2005).
 Fisher (1987) D. H. Fisher, Machine Learning 2, 139 (1987).
 Gennari et al. (1989) J. H. Gennari, P. Langley, and D. Fisher, Artificial Intelligence 40, 11 (1989).
 Heller and Ghahramani (2005) K. A. Heller and Z. Ghahramani, in ICML (New York, New York, USA, 2005) pp. 297–304.
 Petinot et al. (2011) Y. Petinot, K. McKeown, and K. Thadani, in ACL (2011) pp. 670–675.
 Reisinger and Paşca (2009) J. Reisinger and M. Paşca, ACL , 620 (2009).
 Huang et al. (2010) J. Huang, H. Sun, J. Han, H. Deng, Y. Sun, and Y. Liu, in CIKM (New York, New York, USA, 2010) p. 219.
 Clauset et al. (2008) A. Clauset, C. Moore, and M. E. J. Newman, Nature 453, 98 (2008).
 Holland et al. (1983) P. W. Holland, K. B. Laskey, and S. Leinhardt, Social Networks 5, 109 (1983).
 Goldenberg et al. (2009) A. Goldenberg, A. X. Zheng, S. E. Fienberg, and E. M. Airoldi, Foundations and Trends in Machine Learning Foundations and Trends® in Machine Learning, 2, 129 (2009).
 Mei et al. (2008) Q. Mei, D. Cai, D. Zhang, and C. Zhai, in WWW (New York, New York, USA, 2008) pp. 101–110.
 Haveliwala (2002) T. H. Haveliwala, in WWW (New York, New York, USA, 2002) pp. 517–526.
 Cohn and Hofmann (2000) D. A. Cohn and T. Hofmann, in NIPS (2000) pp. 430–436.
 Nallapati et al. (2008) R. M. Nallapati, A. Ahmed, E. P. Xing, and W. W. Cohen, in SIGKDD (New York, New York, USA, 2008) pp. 542–550.
 Gruber et al. (2008) A. Gruber, M. RosenZvi, and Y. Weiss, in UAI (2008) pp. 230–239.
 McCallum et al. (2005) A. McCallum, A. CorradaEmmanuel, and X. Wang, in IJCAI (2005) pp. 786–791.
 RosenZvi et al. (2004) M. RosenZvi, T. L. Griffiths, M. Steyvers, and P. Smyth, in UAI (2004) pp. 487–494.
 Newman et al. (2007) D. Newman, P. Smyth, M. Welling, and A. U. Asuncion, in Advances in neural information processing systems (2007) pp. 1081–1088.
 Smyth et al. (2009) P. Smyth, M. Welling, and A. U. Asuncion, in Advances in Neural Information Processing Systems (2009) pp. 81–88.
 Ahmed et al. (2012) A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola, in Proceedings of the fifth ACM international conference on Web search and data mining (ACM, 2012) pp. 123–132.
 Low et al. (2012) Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, in Proceedings of the VLDB Endowment ( VLDB Endowment, 2012).
 Xin et al. (2013) R. S. Xin, J. E. Gonzalez, M. J. Franklin, and I. Stoica, GRADES ’13: First International Workshop on Graph Data Management Experiences and Systems (2013).
 Bahmani et al. (2010) B. Bahmani, A. Chowdhury, and A. Goel, PVLDB 4, 173 (2010).
 (42) Most related works denote the jumping probability as , however, this would be ambiguous with the Dirichlet hyperparameter .
 Malewicz et al. (2010) G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (ACM, 2010) pp. 135–146.
 Zaharia et al. (2010) M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, in Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, Vol. 10 (2010) p. 10.
 Lee et al. (2012) K.H. Lee, Y.J. Lee, H. Choi, Y. D. Chung, and B. Moon, SIGMOD Record 40, 11 (2012).
 Zou et al. (2013) Q. Zou, X.B. Li, W.R. Jiang, Z.Y. Lin, G.L. Li, and K. Chen, Briefings in bioinformatics , bbs088 (2013).
 McCune et al. (2015) R. R. McCune, T. Weninger, and G. Madey, ACM Computing Surveys (2015).
 Mccallum et al. (2009) A. Mccallum, D. M. Mimno, and H. M. Wallach, in Advances in neural information processing systems (2009) pp. 1973–1981.
 Giles et al. (1998) C. L. Giles, K. D. Bollacker, and S. Lawrence, in Proceedings of the third ACM conference on Digital libraries (ACM, 1998) pp. 89–98.
 Ley (2002) M. Ley, in String Processing and Information Retrieval (Springer, 2002) pp. 1–10.
 Tang et al. (2008) J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, in SIGKDD (New York, New York, USA, 2008) p. 990.
 Ponte and Croft (1998) J. M. Ponte and W. B. Croft, in SIGIR (New York, New York, USA, 1998) pp. 275–281.
 Faloutsos et al. (2013) C. Faloutsos, D. Koutra, and J. T. Vogelstein, SDM , 162 (2013).
 McCallum (2002) A. K. McCallum, (2002) (2002).
 Chang et al. (2009) J. Chang, S. Gerrish, C. Wang, J. L. Boydgraber, and D. M. Blei, in Advances in neural information processing systems (2009) pp. 288–296.
Comments
There are no comments yet.