Mach Learn

DOI 10.1007/s10994-014-5457-9

Asymptotic analysis of estimators on multi-label data

Andreas P. Streich • Joachim M. Buhmann

Received: 20 November 2011 / Accepted: 4 June 2014

© The Author(s) 2014. This article is published with open access at Springerlink.com

Abstract Multi-label classification extends the standard multi-class classification paradigm by dropping the assumption that classes have to be mutually exclusive, i.e., the same data item might belong to more than one class. Multi-label classification has many important applications, e.g. in signal processing, medicine, biology and information security, but the analysis and understanding of the inference methods based on data with multiple labels are still underdeveloped. In this paper, we formulate a general generative process for multi-label data, i.e. we associate each label (or class) with a source. To generate multi-label data items, the emissions of all sources in the label set are combined. In the training phase, only the probability distributions of these (single label) sources need to be learned. Inference on multi-label data requires solving an inverse problem; models of the data generation process therefore require additional assumptions to guarantee well-posedness of the inference procedure. Similarly, in the prediction (test) phase, the distributions of all single-label sources in the label set are combined using the combination function to determine the probability of a label set. We formally describe several previously presented inference methods and introduce a novel, general-purpose approach, where the combination function is determined based on the data and/or on a priori knowledge of the data generation mechanism. This framework includes cross-training and new source training (also named label power set method) as special cases. We derive an asymptotic theory for estimators based on multi-label data and investigate the consistency and efficiency of estimators obtained by several state-of-the-art inference techniques. Several experiments confirm these findings and emphasize the importance of a sufficiently complex generative model for real-world applications.

Editors: Grigorios Tsoumakas, Min-Ling Zhang, and Zhi-Hua Zhou.

A. P. Streich (B)

Science and Technology Group, Phonak AG, Laubisrütistrasse 28, 8712 Stäfa, Switzerland e-mail: andreas.streich@alumni.ethz.ch

J. M. Buhmann

Department of Computer Science, ETH Zurich, Universitätstrasse 6, 8092 Zurich, Switzerland e-mail: jbuhmann@inf.ethz.ch

Published online: 09 July 2014


Keywords Generative model • Asymptotic analysis • Multi-label classification • Consistency

1 Introduction

Multi-labelled data are encountered in classification of acoustic and visual scenes (Boutell et al. 2004), in text categorization (Joachims 1998; McCallum 1999), in medical diagnosis (Kawai and Takahashi 2009) and other application areas. For the classification of acoustic scenes, consider for example the well-known Cocktail-Party problem (Arons 1992), where several signals are mixed together and the objective is to detect the original signal. For a more detailed overview, we refer to Tsoumakas et al. (2010) and Zhang et al. (2013).

1.1 Prior art in multi-label learning and classification

In spite of its growing significance and attention, the theoretical analysis of multi-label classification is still in its infancy with limited literature. Some recent publications, however, show an interest to gain a fundamental insight into the problem of classifying multi-label data. Most attention is thereby attributed to correlations in the label sets. Using error-correcting output codes for multi-label classification (Dietterich and Bakiri 1995) has been proposed very early to "correct" invalid (i.e. improbable) label sets. The principle of maximum entropy is employed in Zhu et al. (2005) to capture correlations in the label set. The assumption of small label sets is exploited in the framework of compressed sensing by Hsu et al. (2009). Conditional random fields are used in Ghamrawi and McCallum (2005) to parameterize label co-occurrences. Instead of independent dichotomies, a series of classifiers is built in Read et al. (2009), where a classifier gets the output of all preceding classifiers in the chain as additional input. A probabilistic version thereof is presented in Dembczynski et al. (2010).

Two important gaps in the theory of multi-label classification have attracted the attention of the community in recent years: first, most research programs primarily focus on the label set, while an interpretation of how multi-label data arise is missing in the vast majority of the cases. Deconvolution problems (Streich 2010) define a special case of inference from multi-label data, as discussed in Chap. 2. In-depth analysis of the asymptotic behaviour of the estimators has been presented in Masry (1991, 1993). Secondly, although a large number of quality measures has been presented, the understanding of how these are related to each other is underdeveloped. Dembczynski et al. (2012) analyse the interrelation between some of the most commonly used performance metrics. A theoretical analysis of the Bayes consistency of learning algorithms with respect to different loss functions is presented in Gao and Zhou (2013).

This contribution mainly addresses the issue of how multi-label data are generated, i.e., we propose a generative model for multi-label data. A datum is composed of emissions by multiple sources. The emitting sources are indicated by the label set. These emissions are combined by a problem specific combination function like the linear superposition principle in optics or acoustics. The combination function specifies a core model assumption in the data generation process. Each source generates data items according to a source specific probability distribution. This point of view, as the reader should note, points in a direction that is orthogonal to the previously mentioned literature on label correlation: extra knowledge on the distribution of the label sets can coherently be represented by a prior over the label sets.

Furthermore, we assume that the sources are described by parametric distributions (see footnote 1). In this setting, the accuracy of the parameter estimators is a fundamental value to assess the quality of an inference scheme. This measure is of central interest in asymptotic theory, which investigates the distribution of a summary statistic in the asymptotic limit (Brazzale et al. 2007). Asymptotic analysis of parametric models has become an essential tool in statistics, as the exact distributions of the quantities of interest cannot be measured in most settings. In the first place, asymptotic analysis is used to check whether an estimation method is consistent, i.e. whether the obtained estimators converge to the correct parameter values if the number of data items available for inference goes to infinity. Furthermore, asymptotic theory provides approximate answers where exact ones are not available, namely in the case of data sets of finite size. Asymptotic analysis describes for example how efficiently an inference method uses the given data for parameter estimation (Liang and Jordan 2008).

Consistent inference schemes are essential for generative classifiers, and a more efficient inference scheme yields more precise classification results than a less efficient one, given the same training data. More specifically, the expected error of a classifier converges to the Bayes error for maximum a posteriori classification, if the estimated parameters converge to the true parameter values (Devroye et al. 1996). In this paper, we first review the state-of-the-art asymptotic theory for estimators based on single-label data. We then extend the asymptotic analysis to inference on multi-label data and prove statements about the identifiability of parameters and the asymptotic distribution of their estimators in this demanding setting.

1.2 Advantages of generative models

Generative models define only one approach to machine learning problems. For classification, discriminative models directly estimate the posterior distributions of class labels given data and, thereby, they avoid an explicit estimate of class specific likelihood distributions. A further reduction in complexity is obtained by discriminant functions, which map a data item directly to a set of classes or clusters (Hastie et al. 1993).

Generative models are the most demanding of all alternatives. If the only goal is to classify data in an easy setting, designing and inferring the complete generative model might be a wasteful use of resources and demand excessive amounts of data. However, particularly in demanding scenarios, there exist well-founded reasons for generative models (Bishop 2007):

Generative description of data Even though this may be considered as stating the obvious, we emphasize that assumptions on the generative process underlying the observed data naturally enter into a generative model. Incorporating such prior knowledge into discriminative models typically proves significantly more difficult.

Interpretability The nature of multi-source data is best understood by studying how such data are generated. In most applications, the sources in the generative model come with a clear semantic meaning. Determining their parameters is thus not only an intermediate step to the final goal of classification, but an important piece of information on the structure of the data. Consider the cocktail party problem, where several speech and noise sources are superposed to the speech of the dialogue partner. Identifying the sources which generate the perceived signal is a demanding problem. The final goal, however, might go even further and consist of finding out what your dialogue partner said. A generative model for the sources present in the current acoustic situation enables us to determine the most likely emission of each source given the complete signal. This approach, referred to as model-based source separation (Hershey et al. 2010), critically depends on a reliable source model.

Reject option and outlier detection Given a generative model, we can also determine the probability of a particular data item. Samples with a low probability are called outliers. Their generation is not confidently represented by the generative model, and no reliable assignment of a data item to a set of sources is possible. Furthermore, outlier detection might be helpful in the overall system in which the machine learning application is integrated: outliers may be caused by a defective measurement device or by fraud.

Since these advantages of generative models are prevalent in the considered applications, we restrict ourselves to generative methods when comparing our approaches with existing techniques.

1 This supposition significantly simplifies the subsequent calculations; it is, however, not essential for the approach proposed here.

1.3 A generative understanding of multi-label data

When defining a generative model, a distribution for each source has to be defined. To do so, one usually employs a parametric distribution, possibly based on prior knowledge or a study of the distribution of the data with a particular label. In the multi-label setting, the combination function is a further key component of the generative model. This function defines the semantics of the multi-label: while each single-labelled observation item is understood as a sample from a probability distribution identified by its label, multi-label observations are understood as a combination of the emissions of all sources in the label set. The combination function describes how the individual source emissions are combined to the observed data. Choosing an appropriate combination function is essential for successful inference and prediction. As we demonstrate in this paper, an inappropriate combination function might lead to inconsistent parameter estimators and worse label predictions, both compared to a simplistic approach where multi-label data items are ignored. Conversely, choosing the right combination function will allow us to extract more information from the training data, thus yielding more precise parameter estimators and superior classification accuracy.

The prominence of the combination function in the generative model naturally raises the question how this combination function can be determined. Specifying the combination function can be a challenging task when applying the deconvolutive method for multi-label classification. However, in our previous work, we achieved the insight that the combination function can typically be determined based on the data and prior knowledge, i.e. expertise in the field. For example in role mining, the disjunction of Boolean data is the natural choice (see Streich et al. 2009 for details), while the addition of (supposedly) Gaussian emissions is widely used in the classification of sounds (Streich and Buhmann 2008).

2 A generative model for multi-label data

We now present the generative process that we assume to have produced the observed data. Such generative models are widely found for single-label classification and clustering, but have not yet been formulated in a general form for multi-label data.

2.1 Label sets and source emissions

Let $K$ denote the number of sources, and $N$ the number of data items. We assume that the systematic regularities of the observed data are generated by a set $\mathcal{K} = \{1, \ldots, K\}$ of $K$ sources. Furthermore, we assume that all sources have the same sample space $\Omega$. Each source $k \in \mathcal{K}$ emits samples $\Xi_k \in \Omega$ according to a given parametric probability distribution $P(\Xi_k \mid \theta_k)$, where $\theta_k$ is the parameter tuple of source $k$. Realizations of the random variables $\Xi_k$ are denoted by $\xi_k$. Note that both the parameters $\theta_k$ and the emission $\Xi_k$ can be vectors. In this case, $\theta_{k,1}, \theta_{k,2}, \ldots$ and $\Xi_{k,1}, \Xi_{k,2}, \ldots$ denote different components of these vectors, respectively. Emissions of different sources are assumed to be independent of each other. The tuple of all source emissions is denoted by $\Xi := (\Xi_1, \ldots, \Xi_K)$, its probability distribution is given by $P(\Xi \mid \theta) = \prod_{k=1}^{K} P(\Xi_k \mid \theta_k)$. The tuple of the parameters of all $K$ sources is denoted by $\theta := (\theta_1, \ldots, \theta_K)$.

Given an observation $X = x$, the source set $\mathcal{L} = \{\lambda_1, \ldots, \lambda_M\} \subseteq \mathcal{K}$ denotes the set of all sources involved in generating $X$. The set of all possible label sets is denoted by $\mathbb{L}$. If $\mathcal{L} = \{\lambda\}$, i.e. $|\mathcal{L}| = 1$, $X$ is called a single-label data item, and $X$ is assumed to be a sample from source $\lambda$. On the other hand, if $|\mathcal{L}| > 1$, $X$ is called a multi-label data item and is understood as a combination of the emissions of all sources in the label set $\mathcal{L}$. This combination is formalized by the combination function $c_\kappa : \Omega^K \times \mathbb{L} \to \Omega$, where $\kappa$ is a set of parameters the combination function might depend on. Note that the combination function only depends on emissions of sources in the label set and is independent of any other emissions.

Fig. 1 The generative model A for an observation $X$ with source set $\mathcal{L}$. An independent sample $\Xi_k$ is drawn from each source $k$ according to the distribution $P(\Xi_k \mid \theta_k)$. The source set $\mathcal{L}$ is sampled from the source set distribution $P(\mathcal{L})$. These samples are then combined to the observation by the combination function $c_\kappa(\Xi, \mathcal{L})$. Note that the observation $X$ only depends on emissions from sources contained in the source set $\mathcal{L}$

The generative process A for a data item, as illustrated in Fig. 1, consists of the following three steps:

(1) Draw a label set $\mathcal{L}$ from the distribution $P(\mathcal{L})$.

(2) For each $k \in \mathcal{K}$, draw an independent sample $\Xi_k \sim P(\Xi_k \mid \theta_k)$ from source $k$. Set $\Xi := (\Xi_1, \ldots, \Xi_K)$.

(3) Combine the source samples to the observation $X = c_\kappa(\Xi, \mathcal{L})$.
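The three steps above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not prescribed by the model: the sources are taken to be univariate Gaussians, the combination function is the plain sum of the emissions in the label set, and all numerical values as well as the names `sample_item`, `means`, `stds`, `label_sets` and `label_probs` are hypothetical.

```python
import numpy as np

def sample_item(rng, means, stds, label_sets, label_probs):
    """Draw one data item D = (X, L) following steps (1)-(3)."""
    # (1) Draw a label set L from P(L).
    idx = rng.choice(len(label_sets), p=label_probs)
    label_set = label_sets[idx]
    # (2) Draw an independent emission from every source (assumed Gaussian).
    emissions = rng.normal(loc=means, scale=stds)
    # (3) Combine only the emissions of sources in L (assumed: their sum).
    x = sum(emissions[k] for k in label_set)
    return x, label_set

rng = np.random.default_rng(0)
means = np.array([-2.0, 3.0])      # hypothetical source means
stds = np.array([1.0, 1.0])        # hypothetical source standard deviations
label_sets = [(0,), (1,), (0, 1)]  # sources are indexed from 0 in the code
label_probs = [0.4, 0.4, 0.2]      # hypothetical P(L)
data = [sample_item(rng, means, stds, label_sets, label_probs) for _ in range(5)]
```

Emissions are drawn for all sources, but only those in the sampled label set enter the observation, which mirrors the remark that the combination function ignores sources outside the label set.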

2.2 The combination function

The combination function models how emissions of one or several sources are combined to the structure component of the observation $X$. Often, the combination function reflects a priori knowledge of the data generation process like the linear superposition law of electrodynamics and acoustics or disjunctions in role mining. For source sets of cardinality one, i.e. for single-label data, the combination function chooses the emission of the corresponding source: $c_\kappa(\Xi, \{\lambda\}) = \Xi_\lambda$.

For source sets with more than one source, the combination function can be either deterministic or stochastic. Examples for deterministic combination functions are the (weighted) sum and the Boolean OR operation. In this case, the value of $X$ is completely determined by $\Xi$ and $\mathcal{L}$. In terms of probability distribution, a deterministic combination function corresponds to a point mass at $X = c_\kappa(\Xi, \mathcal{L})$:

$$P(X \mid \Xi, \mathcal{L}) = \mathbb{1}_{\{X = c_\kappa(\Xi, \mathcal{L})\}}.$$

Stochastic combination functions allow us to formulate e.g. the well-known mixture discriminant analysis as a multi-label problem (Streich 2010). However, stochastic combination functions render inference more complex, since a description of the stochastic behaviour of the function has to be learned in addition to the parameters of the source distributions. In the considered applications, deterministic combination functions suffice to model the assumed generative process. For this reason, we will not further discuss probabilistic combination functions in this paper.

2.3 Probability distribution for structured data

Given the assumed generative process A, the probability of an observation $X$ for source set $\mathcal{L}$ and parameters $\theta$ amounts to

$$P(X \mid \mathcal{L}, \theta) = \int P(X \mid \Xi, \mathcal{L}) \, dP(\Xi \mid \theta).$$

We refer to $P(X \mid \mathcal{L}, \theta)$ as the proxy distribution of observations with source set $\mathcal{L}$. Note that in the presented interpretation of multi-label data, the distributions $P(X \mid \mathcal{L}, \theta)$ for all source sets $\mathcal{L}$ are derived from the single source distribution.

For a full generative model, we introduce $\pi_{\mathcal{L}} := P(\mathcal{L})$ as the probability of source set $\mathcal{L}$. The overall probability of a data item $D = (X, \mathcal{L})$ is thus

$$P(X, \mathcal{L} \mid \theta) = P(\mathcal{L}) \int \cdots \int P(X \mid \Xi, \mathcal{L}) \, dP(\Xi_1 \mid \theta_1) \cdots dP(\Xi_K \mid \theta_K). \tag{1}$$

Several samples from the generative process are assumed to be independent and identically distributed (i.i.d.). The probability of $N$ observations $\mathbf{X} = (X_1, \ldots, X_N)$ with source sets $\mathbf{L} = (\mathcal{L}_1, \ldots, \mathcal{L}_N)$ is thus $P(\mathbf{X}, \mathbf{L} \mid \theta) = \prod_{n=1}^{N} P(X_n, \mathcal{L}_n \mid \theta)$. The assumption of i.i.d. data items allows us a substantial simplification of the model but is not a requirement for the assumed generative model.
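For one special case the integral in Eq. 1 has a closed form, which makes the data likelihood easy to sketch. The code below assumes univariate Gaussian sources combined by addition, so that the proxy distribution of a label set is again Gaussian with summed means and variances. This choice, the function names `log_joint` and `log_likelihood`, and all numbers are illustrative assumptions rather than part of the general model.

```python
import math

def log_joint(x, label_set, means, variances, label_probs):
    """log P(X = x, L | theta) for Gaussian sources combined by addition."""
    # Sum of independent Gaussians: means and variances add up.
    mu = sum(means[k] for k in label_set)
    var = sum(variances[k] for k in label_set)
    log_proxy = -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
    return math.log(label_probs[label_set]) + log_proxy

def log_likelihood(data, means, variances, label_probs):
    """Log-likelihood of i.i.d. data items (x, label_set)."""
    return sum(log_joint(x, ls, means, variances, label_probs) for x, ls in data)

means = [-2.0, 3.0]                                   # hypothetical source means
variances = [1.0, 1.0]                                # hypothetical source variances
label_probs = {(0,): 0.4, (1,): 0.4, (0, 1): 0.2}     # hypothetical P(L)
data = [(-1.8, (0,)), (3.1, (1,)), (1.2, (0, 1))]
total = log_likelihood(data, means, variances, label_probs)
```

Because the data items are i.i.d., the product over items turns into a sum of per-item log-probabilities.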

To give an example of our generative model, we re-formulate the model used in McCallum (1999) in the terminology of this contribution. Omitting the mixture weights of individual classes within the label set (denoted by $\lambda$ in the original contribution) and understanding a single document as a collection of $W$ words, the probability of a single document is $P(X) = P(\mathcal{L}) \prod_{w=1}^{W} \sum_{\lambda \in \mathcal{L}} P(X_w \mid \lambda)$. Comparing with the assumed data likelihood (Eq. 1), we find that the combination function is the juxtaposition, i.e. every word emitted by a source during the generative process will be found in the document.

A similar word-based mixture model for multi-label text classification is presented in Ueda and Saito (2006). Rosen-Zvi et al. (2004) introduce the author-topic model, a generative model for documents that combines the mixture model over words with Latent Dirichlet Allocation (Blei et al. 2003) to include authorship information: each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. A document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. An additional dependency on the recipient is introduced in McCallum et al. (2005) in order to predict people's roles from email communications. Yano et al. (2009) use the topic model to predict the response to political blogs. We are not aware of any generative approaches to multi-label classification in domains other than text categorization.

2.4 Quality measures for multi-label classification

The quality measure mathematically formulates the evaluation criteria for the machine learning task at hand. A whole series of measures has been defined (Tsoumakas and Katakis 2007) to cover different requirements of multi-label classification. Commonly used are average precision, coverage, Hamming loss, one-error and ranking loss (Schapire and Singer 2000; Zhang and Zhou 2006) as well as accuracy, precision, recall and F-score (Godbole and Sarawagi 2004; Qi et al. 2007). We will focus on the balanced error rate (BER) (adapted from single-label classification) and precision, recall and F-score (inspired by information retrieval).

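The BER is the ratio of incorrectly classified samples per label set, averaged (with equal weight) over all label sets:

$$\mathrm{BER}(\hat{\mathbf{L}}, \mathbf{L}) := \frac{1}{|\mathbb{L}|} \sum_{\mathcal{L} \in \mathbb{L}} \frac{\sum_{n} \mathbb{1}_{\{\hat{\mathcal{L}}_n \neq \mathcal{L}\}} \, \mathbb{1}_{\{\mathcal{L}_n = \mathcal{L}\}}}{\sum_{n} \mathbb{1}_{\{\mathcal{L}_n = \mathcal{L}\}}},$$

where $\mathcal{L}_n$ is the true and $\hat{\mathcal{L}}_n$ the predicted label set of data item $n$.

While the BER considers the entire label set, precision and recall are calculated first per label. We first calculate the true positives $tp_k = \sum_{n=1}^{N} \mathbb{1}_{\{k \in \hat{\mathcal{L}}_n\}} \mathbb{1}_{\{k \in \mathcal{L}_n\}}$, false positives $fp_k = \sum_{n=1}^{N} \mathbb{1}_{\{k \in \hat{\mathcal{L}}_n\}} \mathbb{1}_{\{k \notin \mathcal{L}_n\}}$ and false negatives $fn_k = \sum_{n=1}^{N} \mathbb{1}_{\{k \notin \hat{\mathcal{L}}_n\}} \mathbb{1}_{\{k \in \mathcal{L}_n\}}$ for each class $k$. The precision $prec_k$ of class $k$ is the number of data items correctly identified as belonging to $k$, divided by the number of all data items identified as belonging to $k$. The recall $rec_k$ for a class $k$ is the number of instances correctly recognized as belonging to this class, divided by the number of instances which belong to class $k$:

$$prec_k := \frac{tp_k}{tp_k + fp_k} \qquad rec_k := \frac{tp_k}{tp_k + fn_k}$$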

Good performance with respect to either precision or recall alone can be obtained by either very conservatively assigning data items to classes (leading to typically small label sets and a high precision, but a low recall) or by attributing labels in a very generous way (yielding high recall, but low precision). The F-score Fk, defined as the harmonic mean of precision and recall, finds a balance between the two measures:

$$F_k := \frac{2 \cdot rec_k \cdot prec_k}{rec_k + prec_k}$$

Precision, recall and the F-score are determined individually for each base label k. We report the average over all labels k (macro averaging). All these measures take values between 0 (worst) and 1 (best). The error rate and the BER are quality measures computed on an entire data set. Their values also range from 0 to 1, but here 0 is best.
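The measures of this section can be computed as in the following sketch. It rests on two assumptions that the text leaves open: the BER is averaged over the label sets that actually occur among the true label sets (a label set without any instance has an undefined error ratio), and precision, recall or F-score with a zero denominator are set to 0. The function names are hypothetical.

```python
def balanced_error_rate(true_sets, pred_sets):
    """Error ratio per true label set, averaged with equal weight."""
    true_sets = [frozenset(s) for s in true_sets]
    pred_sets = [frozenset(s) for s in pred_sets]
    rates = []
    for ls in set(true_sets):  # assumption: only label sets that occur
        idx = [n for n, t in enumerate(true_sets) if t == ls]
        errors = sum(pred_sets[n] != ls for n in idx)
        rates.append(errors / len(idx))
    return sum(rates) / len(rates)

def macro_scores(true_sets, pred_sets, labels):
    """Macro-averaged precision, recall and F-score over base labels."""
    precs, recs, fs = [], [], []
    for k in labels:
        pairs = list(zip(true_sets, pred_sets))
        tp = sum(k in p and k in t for t, p in pairs)
        fp = sum(k in p and k not in t for t, p in pairs)
        fn = sum(k not in p and k in t for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0  # assumption: 0 if undefined
        rec = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
        precs.append(prec)
        recs.append(rec)
        fs.append(f)
    n = len(labels)
    return sum(precs) / n, sum(recs) / n, sum(fs) / n

true_sets = [{1}, {2}, {1, 2}, {1}]
pred_sets = [{1}, {1, 2}, {1, 2}, {2}]
ber = balanced_error_rate(true_sets, pred_sets)
prec, rec, f_score = macro_scores(true_sets, pred_sets, labels=[1, 2])
```

In this toy example the label set {1} is misclassified once out of two times, {2} always and {1, 2} never, so the three per-set error ratios 0.5, 1 and 0 are averaged with equal weight.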

Table 1 Overview over the probability distributions used in this paper

| Symbol | Meaning |
|---|---|
| $P_{\theta_k}(\Xi_k)$ | True distribution of the emissions of source $k$, given $\theta_k$ |
| $P_{\theta}(\Xi)$ | True joint distribution of the emissions of all sources |
| $P_{\mathcal{L},\theta}(X)$ | True distribution of the observations $X$ with label set $\mathcal{L}$ |
| $P^{\mathcal{M}}_{\mathcal{L},\theta}(X)$ | Distribution of the observation $X$ with label set $\mathcal{L}$, as assumed by method $\mathcal{M}$, and given parameters $\theta$ |
| $P_{\mathcal{L},\mathbf{D}}(X)$ | Empirical distribution of an observation $X$ with label set $\mathcal{L}$ in the data set $\mathbf{D}$ |
| $P_{\pi}(\mathcal{L})$ | True distribution of the label sets |
| $P_{\mathbf{D}}(\mathcal{L})$ | Empirical distribution of the label sets in $\mathbf{D}$ |
| $P_{\theta}(D)$ | True distribution of data item $D$ |
| $P^{\mathcal{M}}_{\theta}(D)$ | Distribution of data item $D$ as assumed by method $\mathcal{M}$ |
| $P_{\mathbf{D}}(D)$ | Empirical distribution of a data item $D$ in the data set $\mathbf{D}$ |
| $P^{\mathcal{M}}_{D,\theta_k}(\Xi_k)$ | Conditional distribution of the emission $\Xi_k$ of source $k$ given $D$ and $\theta_k$, as assumed by inference method $\mathcal{M}$ |
| $P^{\mathcal{M}}_{D,\theta}(\Xi)$ | Conditional distribution of the source emissions $\Xi$ given $D$ and $\theta$, as assumed by inference method $\mathcal{M}$ |

A data item $D = (X, \mathcal{L})$ is an observation $X$ along with its label set $\mathcal{L}$

Besides the quality criteria on the classification output, the accuracy of the parameter estimator compares the estimated source parameters with the true source parameters. This model-based criterion thus assesses the obtained solution of the essential inference problem in generative classification. However, a direct comparison between true and estimated parameters is typically only possible for experiments with synthetically generated data. The possibility to directly assess the inference quality and the extensive control over the experimental setting are actually the main reasons why, in this paper, we focus on experiments with synthetic data. We measure the accuracy of the parameter estimation by the mean square error (MSE), defined as the average squared distance between the true parameter $\theta$ and its estimator $\hat{\theta}$:

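$$\mathrm{MSE}(\hat{\theta}, \theta) := \frac{1}{K} \sum_{k=1}^{K} E_{\hat{\theta}_k}\left[ \|\theta_{k,\cdot} - \hat{\theta}_{\pi(k),\cdot}\|^2 \right]$$

The MSE can be decomposed as follows:

$$\mathrm{MSE}(\hat{\theta}, \theta) = \frac{1}{K} \sum_{k=1}^{K} \left( E_{\hat{\theta}_k}\left[ \|\theta_{k,\cdot} - \hat{\theta}_{\pi(k),\cdot}\| \right]^2 + V_{\hat{\theta}_k}[\hat{\theta}_k] \right)$$

The first term $E_{\hat{\theta}_k}[\|\theta_{k,\cdot} - \hat{\theta}_{\pi(k),\cdot}\|]$ is the expected deviation of the estimator $\hat{\theta}_{\pi(k),\cdot}$ from the true value $\theta_{k,\cdot}$, called the bias of the estimator. The second term $V_{\hat{\theta}_k}[\hat{\theta}_k]$ indicates the variance of the estimator over different data sets. We will rely on this bias-variance decomposition when computing the asymptotic distribution of the mean-squared error of the estimators. In the experiments, we will report the root mean square error (RMS).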
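The bias-variance decomposition can be checked numerically for a single scalar parameter. The sketch below is an illustration only: it assumes one Gaussian source with unit variance, uses a deliberately shrunken (and therefore biased) mean estimator, and approximates the expectation over data sets by simulating many of them. All names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.0                 # hypothetical true mean of one Gaussian source
n_items, n_datasets = 20, 5000   # items per data set, number of simulated data sets

# One estimate per simulated data set; the factor 0.9 introduces a bias.
samples = rng.normal(theta_true, 1.0, size=(n_datasets, n_items))
estimates = 0.9 * samples.mean(axis=1)

mse = np.mean((estimates - theta_true) ** 2)
bias = np.mean(estimates) - theta_true
variance = np.var(estimates)     # variance over data sets (ddof=0)
rms = np.sqrt(mse)

# Squared bias plus variance reproduces the MSE.
print(mse, bias ** 2 + variance, rms)
```

With the empirical mean and the empirical variance (computed with `ddof=0`), the identity holds exactly for the simulated estimates, not just approximately.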

3 Preliminaries

This section introduces the preliminaries needed to study the asymptotic behaviour of the estimators obtained by different inference methods. Since this paper uses an elaborate notation, the probability distributions used are summarized in Table 1.

3.1 Exponential family distributions

In the following, we assume that the source distributions are members of the exponential family (Wainwright and Jordan 2008). This assumption implies that the distribution $P_{\theta_k}(\Xi_k)$ of source $k$ admits a density $p_{\theta_k}(\xi_k)$ in the following form:

$$p_{\theta_k}(\xi_k) = \exp\left( \langle \theta_k, \phi(\xi_k) \rangle - A(\theta_k) \right). \tag{3}$$

Here $\theta_k$ are the natural parameters, $\phi(\xi_k)$ are the sufficient statistics of the sample $\xi_k$ of source $k$, and $A(\theta_k) := \log\left( \int \exp\left( \langle \theta_k, \phi(\xi_k) \rangle \right) d\xi_k \right)$ is the log-partition function. The expression $\langle \theta_k, \phi(\xi_k) \rangle := \sum_{s=1}^{S} \theta_{k,s} \cdot (\phi(\xi_k))_s$ denotes the inner product between the natural parameters $\theta_k$ and the sufficient statistics $\phi(\xi_k)$. The number $S$ is called the dimensionality of the exponential family. $\theta_{k,s}$ is the $s$th dimension of the parameter vector of source $k$, and $(\phi(\xi_k))_s$ is the $s$th dimension of the sufficient statistics. The ($S$-dimensional) parameter space of the distribution is denoted by $\Theta$. The class of exponential family distributions contains many of the widely used probability distributions: the Bernoulli, Poisson and the $\chi^2$ distribution are one-dimensional exponential family distributions; the Gamma, Beta and normal distribution are examples of two-dimensional exponential family distributions.

The joint distribution of the independent sources is $P_\theta(\Xi) = \prod_{k=1}^{K} P_{\theta_k}(\Xi_k)$, with the density function $p_\theta(\xi) = \prod_{k=1}^{K} p_{\theta_k}(\xi_k)$. To shorten the notation, we define the vectorial sufficient statistic $\phi(\xi) := (\phi(\xi_1), \ldots, \phi(\xi_K))^T$, the parameter vector $\theta := (\theta_1, \ldots, \theta_K)^T$ and the cumulative log-partition function $A(\theta) := \sum_{k=1}^{K} A(\theta_k)$. Using the parameter vector $\theta$ and the emission vector $\xi$, the density function $p_\theta$ of the source emissions is $p_\theta(\xi) = \prod_{k=1}^{K} p_{\theta_k}(\xi_k) = \exp\left( \langle \theta, \phi(\xi) \rangle - A(\theta) \right)$.

Exponential family distributions have the property that the derivatives of the log-partition function with respect to the parameter vector $\theta$ are moments of the sufficient statistics $\phi(\cdot)$. Namely, the first and second derivative of $A(\cdot)$ are the expectation and the covariance matrix of the sufficient statistics:

$$\nabla_\theta A(\theta) = E_{\Xi \sim P_\theta}[\phi(\Xi)] \qquad \nabla^2_\theta A(\theta) = V_{\Xi \sim P_\theta}[\phi(\Xi)] \tag{4}$$

where $E_{X \sim P}[X]$ and $V_{X \sim P}[X]$ denote the expectation value and the covariance matrix of a random variable $X$ sampled from distribution $P$. In all statements in this paper, we assume that all considered variances are finite.
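Eq. 4 can be verified numerically for a one-dimensional member of the family. The sketch below uses the Bernoulli distribution with sufficient statistic $\phi(x) = x$, for which the log-partition function is $A(\theta) = \log(1 + e^{\theta})$, and compares finite-difference derivatives of $A$ with the mean $p$ and the variance $p(1-p)$. The step size `h`, the value of `theta` and the helper names are arbitrary choices made for this illustration.

```python
import math

def log_partition(theta):
    """A(theta) = log(1 + exp(theta)) for a Bernoulli source with phi(x) = x."""
    return math.log1p(math.exp(theta))

def numeric_derivatives(f, theta, h=1e-4):
    """Central finite differences for the first and second derivative."""
    first = (f(theta + h) - f(theta - h)) / (2 * h)
    second = (f(theta + h) - 2 * f(theta) + f(theta - h)) / h ** 2
    return first, second

theta = 0.7                             # hypothetical natural parameter
p = 1.0 / (1.0 + math.exp(-theta))      # success probability, E[phi(X)]
grad_A, hess_A = numeric_derivatives(log_partition, theta)

print(grad_A, p)             # first derivative vs. mean of phi
print(hess_A, p * (1 - p))   # second derivative vs. variance of phi
```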

3.2 Identifiability

The representation of exponential family distributions in Eq. 3 may not be unique, e.g. if the sufficient statistics $\phi(\xi_k)$ are mutually dependent. In this case, the dimensionality $S$ of the exponential family distribution can be reduced. Unless this is done, the parameters $\theta_k$ are unidentifiable: there exist at least two different parameter values $\theta_k^{(1)} \neq \theta_k^{(2)}$ which imply the same probability distribution $p_{\theta_k^{(1)}} = p_{\theta_k^{(2)}}$. These two parameter values cannot be distinguished based on observations and are therefore called unidentifiable (Lehmann and Casella 1998).

Definition 1 (Identifiability) Let $\mathcal{P} = \{p_\theta : \theta \in \Theta\}$ be a parametric statistical model with parameter space $\Theta$. $\mathcal{P}$ is called identifiable if the mapping $\theta \mapsto p_\theta$ is one-to-one:

$$p_{\theta^{(1)}} = p_{\theta^{(2)}} \Longrightarrow \theta^{(1)} = \theta^{(2)} \quad \text{for all } \theta^{(1)}, \theta^{(2)} \in \Theta.$$

Identifiability of the model in the sense that the mapping $\theta \mapsto p_\theta$ can be inverted is equivalent to being able to learn the true parameters of the model if an infinite number of samples from the model can be observed (Lehmann and Casella 1998).

In all concrete learning problems, identifiability is always conditioned on the data. Obviously, if there are no observations from a particular source (class), the likelihood of the data is independent of the parameter values of the never-occurring source. The parameters of the particular source are thus unidentifiable.
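A standard illustration of an unidentifiable representation, chosen here as an assumed example and not taken from the paper, is a categorical distribution over three outcomes with one indicator statistic per outcome. The three indicators always sum to one, so they are mutually dependent, and adding the same constant to all natural parameters leaves the distribution unchanged.

```python
import numpy as np

def categorical_probs(theta):
    """Outcome probabilities exp(theta_s - A(theta)) with one indicator per outcome."""
    theta = np.asarray(theta, dtype=float)
    log_partition = np.log(np.sum(np.exp(theta)))
    return np.exp(theta - log_partition)

theta_1 = np.array([0.2, -1.0, 0.5])   # hypothetical natural parameters
theta_2 = theta_1 + 3.0                # a different parameter vector

# Both parameter vectors imply the same distribution.
print(categorical_probs(theta_1))
print(categorical_probs(theta_2))
```

Dropping one indicator, i.e. reducing the dimensionality from three to two, removes this ambiguity.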

3.3 M- and Z-estimators

A popular method to determine the estimators $\hat{\theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_K)$ for a generative model based on independent and identically-distributed (i.i.d.) data items $\mathbf{D} = (D_1, \ldots, D_N)$ is to maximize a criterion function $\theta \mapsto M_N(\theta) = \frac{1}{N} \sum_{n=1}^{N} m_\theta(D_n)$, where $m_\theta : \mathcal{D} \to \mathbb{R}$ are known functions. An estimator $\hat{\theta} = \arg\max_\theta M_N(\theta)$ maximizing $M_N(\theta)$ is called an M-estimator, where M stands for maximization.

For continuously differentiable criterion functions, the maximizing value is often determined by setting the derivative with respect to $\theta$ equal to zero. With $\psi_\theta(D) := \nabla_\theta m_\theta(D)$, this yields an equation of the type $\Psi_N(\theta) = \frac{1}{N} \sum_{n=1}^{N} \psi_\theta(D_n)$, and the parameter $\hat{\theta}$ is then determined such that $\Psi_N(\hat{\theta}) = 0$. This type of estimator is called Z-estimator, with Z standing for zero.

Maximum-likelihood estimators Maximum likelihood estimators are M-estimators with the criterion function $m_\theta(D) := \ell(\theta; D)$. The corresponding Z-estimator, which we will use in this paper, is obtained by computing the derivative of the log-likelihood with respect to the parameter vector $\theta$, called the score:

$$\psi_\theta(D) = \nabla_\theta \ell(\theta; D). \tag{5}$$
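A Z-estimator can be computed by any root-finding method applied to the averaged score. The sketch below assumes a single Bernoulli source in natural parameterization, whose log-likelihood is $\theta x - \log(1 + e^{\theta})$ and whose score is therefore $x - 1/(1 + e^{-\theta})$, and solves $\Psi_N(\theta) = 0$ with Newton's method. The choice of Newton's method, the function names and the simulated data are assumptions made for illustration.

```python
import numpy as np

def score_mean(theta, x):
    """Psi_N(theta): averaged score of a Bernoulli source with natural parameter theta."""
    return np.mean(x) - 1.0 / (1.0 + np.exp(-theta))

def z_estimate(x, theta=0.0, tol=1e-10, max_iter=50):
    """Find the zero of Psi_N by Newton's method."""
    for _ in range(max_iter):
        psi = score_mean(theta, x)
        if abs(psi) < tol:
            break
        p = 1.0 / (1.0 + np.exp(-theta))
        # Newton step: the derivative of Psi_N with respect to theta is -p(1-p).
        theta += psi / (p * (1.0 - p))
    return theta

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1000)   # hypothetical data from a single source
theta_hat = z_estimate(x)
```

For this model the zero is also available in closed form as the log-odds of the sample mean, which gives a simple check of the iteration.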

Convergence Assume that there exists an asymptotic criterion function $\theta \mapsto \Psi(\theta)$ such that the sequence of criterion functions converges in probability to a fixed limit: $\Psi_N(\theta) \xrightarrow{P} \Psi(\theta)$ for every $\theta$. Convergence can only be obtained if there is a unique zero $\theta_0$ of $\Psi(\cdot)$ and if only parameters $\theta$ close to $\theta_0$ yield a value of $\Psi(\theta)$ close to zero. Thus, $\theta_0$ has to be a well-separated zero of $\Psi(\cdot)$ (van der Vaart 1998):

Theorem 1 Let $\Psi_N$ be random vector-valued functions and let $\Psi$ be a fixed vector-valued function of $\theta$ such that for every $\epsilon > 0$

$$\sup_{\theta \in \Theta} \|\Psi_N(\theta) - \Psi(\theta)\| \xrightarrow{P} 0, \qquad \inf_{\theta : d(\theta, \theta_0) \geq \epsilon} \|\Psi(\theta)\| > 0 = \|\Psi(\theta_0)\|.$$

Then any sequence of estimators $\hat{\theta}_N$ such that $\Psi_N(\hat{\theta}_N) = o_P(1)$ converges in probability to $\theta_0$.

The notation $o_P(1)$ denotes a sequence of random vectors that converges to 0 in probability, and $d(\hat{\theta}, \theta_0)$ indicates the Euclidean distance between the estimator $\hat{\theta}$ and the true value $\theta_0$. The second condition implies that $\Psi(\cdot)$ has no zero outside a neighborhood of size $\epsilon$ around $\theta_0$, i.e. $\theta_0$ is its only zero. As $\Psi(\cdot)$ is defined as the derivative of the log-likelihood function (Eq. 5), this criterion is equivalent to a concave likelihood function over the whole parameter space $\Theta$. If the likelihood function is not concave, there may be several (local) optima, and convergence to the global maximizer $\theta_0$ cannot be guaranteed.

Asymptotic normality Given consistency, the question arises how the estimators $\hat{\theta}_N$ are distributed around the asymptotic limit $\theta_0$. Assuming the criterion function $\theta \mapsto \psi_\theta(D)$ to be twice continuously differentiable, $\Psi_N(\hat{\theta}_N)$ can be expanded through a Taylor series around $\theta_0$. Then, using the central limit theorem, $\hat{\theta}_N$ is found to be normally distributed around $\theta_0$ (van der Vaart 1998). Defining $v^{\otimes} := v v^T$, we get the following theorem (all expectation values w.r.t. the true distribution of the data items $D$):

Theorem 2 Assume that $E_D[\psi_{\theta_0}(D)^{\otimes}] < \infty$ and that the map $\theta \mapsto E_D[\psi_\theta(D)]$ is differentiable at a zero $\theta_0$ with non-singular derivative matrix. Then, the sequence $\sqrt{N} \cdot (\hat{\theta}_N - \theta_0)$ is asymptotically normal: $\sqrt{N} \cdot (\hat{\theta}_N - \theta_0) \rightsquigarrow \mathcal{N}(0, \Sigma)$ with asymptotic variance

$$\Sigma = \left( E_D[\nabla_\theta \psi_{\theta_0}(D)] \right)^{-1} \cdot E_D\left[ (\psi_{\theta_0}(D))^{\otimes} \right] \cdot \left( \left( E_D[\nabla_\theta \psi_{\theta_0}(D)] \right)^{-1} \right)^T. \tag{6}$$

3.4 Maximum-likelihood estimation on single-label data

To estimate parameters on single-label data, a data set $\mathbf{D} = \{(X_n, \lambda_n)\}$, $n = 1, \ldots, N$, with $\lambda_n \in \{1, \ldots, K\}$ for all $n$, is separated according to the class label, so that one gets $K$ sets $\mathcal{X}_1, \ldots, \mathcal{X}_K$, where $\mathcal{X}_k := \{X_n \mid (X_n, \lambda_n) \in \mathbf{D}, \lambda_n = k\}$ contains all observations with label $k$. All samples in $\mathcal{X}_k$ are assumed to be i.i.d. random variables distributed according to $P(X \mid \theta_k)$. It is assumed that the samples in $\mathcal{X}_k$ do not provide any information about $\theta_{k'}$ if $k \neq k'$, i.e. parameters for the different classes are functionally independent of each other (Duda et al. 2000). Therefore, we obtain $K$ independent parameter estimation problems, each with criterion function $\Psi_{N_k}(\theta_k) = \frac{1}{N_k} \sum_{X \in \mathcal{X}_k} \psi_{\theta_k}((X, k))$, where $N_k := |\mathcal{X}_k|$. The parameter estimator $\hat{\theta}_k$ is then determined such that $\Psi_{N_k}(\hat{\theta}_k) = 0$. More specifically for maximum-likelihood estimation of parameters of exponential family distributions (Eq. 3), the criterion function $\psi_{\theta_k}(D) = \nabla_\theta \ell(\theta; D)$ (Eq. 5) for a data item $D = (X, \{k\})$ becomes $\psi_{\theta_k}(D) = \phi(X) - E_{\Xi_k \sim P_{\theta_k}}[\phi(\Xi_k)]$. Choosing $\hat{\theta}_k$ such that the criterion function $\Psi_{N_k}(\theta_k)$ is zero means changing the model parameter such that the average value of the sufficient statistics of the observations coincides with the expected sufficient statistics:

$$\Psi_{N_k}(\theta_k) = \frac{1}{N_k} \sum_{X \in \mathcal{X}_k} \phi(X) - E_{\Xi_k \sim P_{\theta_k}}[\phi(\Xi_k)]. \tag{7}$$

Hence, maximum-likelihood estimators in exponential families are moment estimators (Wainwright and Jordan 2008). The theorems of consistency and asymptotic normality are directly applicable.
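The moment-matching view of Eq. 7 can be sketched in a few lines of Python. The example is only an illustration under assumptions not made above: the sources are taken to be univariate Gaussians with sufficient statistics $\phi(x) = (x, x^2)^T$, and the function name `fit_single_label_gaussians` and the synthetic data are hypothetical.

```python
import random
from collections import defaultdict


def fit_single_label_gaussians(data):
    """Moment matching per class for single-label data.

    data: iterable of (x, label) pairs with a single class label each.
    Returns {label: (mean, variance)}; the Gaussian family is an assumption.
    """
    groups = defaultdict(list)
    for x, label in data:
        groups[label].append(x)  # split the data set by class label
    estimates = {}
    for label, xs in groups.items():
        n = len(xs)
        m1 = sum(xs) / n                 # empirical average of phi_1(x) = x
        m2 = sum(x * x for x in xs) / n  # empirical average of phi_2(x) = x^2
        # Matching both moments yields the ML estimates of mean and variance.
        estimates[label] = (m1, m2 - m1 * m1)
    return estimates


# Synthetic single-label data from two illustrative sources.
rng = random.Random(0)
sample = [(rng.gauss(-3.5, 1.0), 1) for _ in range(5000)]
sample += [(rng.gauss(3.5, 1.0), 2) for _ in range(5000)]
estimates = fit_single_label_gaussians(sample)
```

Each class is fitted from its own observations only, which mirrors the $K$ independent estimation problems described above.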

With the same formalism, it becomes clear why the inference problems for different classes are independent: assume an observation $X$ with label $k$ is given. Under the assumption of the generative model, the label $k$ states that $X$ is a sample from source $P_{\theta_k}$. Trying to derive information about the parameter $\theta_{k'}$ of a second source $k' \neq k$ from $X$, we would differentiate $P_{\theta_k}$ with respect to $\theta_{k'}$ to get the score function. Since $P_{\theta_k}$ is independent of $\theta_{k'}$, this derivative is zero, and the data item $(X, k)$ does not contribute to the criterion function $\Psi_{N_{k'}}(\theta_{k'})$ (Eq. 7) for $\theta_{k'}$.

Fisher information For inference in a parametric model with a consistent estimator $\hat\theta_k \to \theta_k$, the Fisher information $\mathcal{I}$ (Fisher 1925) is defined as the second moment of the score function. Since the parameter estimator $\hat\theta$ is chosen such that the average of the score function is zero, the second moment of the score function corresponds to its variance:

$$\mathcal{I}_{\mathcal{X}_k}(\theta_k) := \mathbb{E}_{X \sim P_{\theta_k^G}}\left[\psi_{\theta_k}(X)^{\otimes}\right] = \mathbb{V}_{X \sim P_{\theta_k^G}}[\phi(X)], \tag{8}$$

where the expectation is taken with respect to the true distribution $P_{\theta_k^G}$. The Fisher information thus indicates to what extent the score function depends on the parameter. The larger this dependency is, the more the observed data depends on the parameter value, and the more accurately this parameter value can be determined for a given set of training data. According to the Cramér-Rao bound (Rao 1945; Cramér 1946, 1999), the reciprocal of the Fisher information is a lower bound on the variance of any unbiased estimator of a deterministic parameter. An estimator $\hat\theta_k$ is called efficient if $\mathbb{V}_{X \sim P_{\theta_k^G}}[\hat\theta_k] = (\mathcal{I}_{\mathcal{X}_k}(\theta_k))^{-1}$.

4 Asymptotic distribution of multi-label estimators

We now extend the asymptotic analysis to estimators based on multi-label data. We restrict ourselves to maximum likelihood estimators for the parameters of exponential family distributions. As we are mainly interested in comparing different ways to learn from data, we also assume the parametric form of the distribution to be known.

4.1 From observations to source emissions

In single-label inference problems, each observation provides a sample of a source indicated by the label, as discussed in Sect. 3.4. In the case of inference based on multi-label data, the situation is more involved, since the source emissions cannot be observed directly. The relation between the source emissions and the observations is formalized by the combination function (see Sect. 2) describing the observation $X$ based on an emission vector $\Xi$ and the label set $\mathcal{L}$.

To perform inference, we have to determine which emission vector $\Xi$ has produced the observed $X$. To solve this inverse problem, an inference method relies on additional constraints besides assuming the parametric form of the distribution, namely on the combination function. These design assumptions, made implicitly or explicitly, enable the inference scheme to derive information about the distribution of the source emissions given an observation.

In this analysis, we focus on differences in the assumed combination function. $P^{\mathcal{M}}(X | \Xi, \mathcal{L})$ denotes the probabilistic representation of the combination function: it specifies the probability distribution of an observation $X$ given the emission vector $\Xi$ and the label set $\mathcal{L}$, as assumed by method $\mathcal{M}$. We formally describe several techniques along with the analysis of their estimators in Sect. 5. It is worth mentioning that for single-label data, all estimation techniques considered in this work are equal and yield consistent and efficient parameter estimators, as they agree on the combination function for single-label data: the identity function is the only reasonable choice in this case.

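The probability distribution of $X$ given the label set $\mathcal{L}$, the parameters $\theta$ and the combination function assumed by method $\mathcal{M}$ is computed by marginalizing $\Xi$ out of the joint distribution of $\Xi$ and $X$:

$$P^{\mathcal{M}}_{\mathcal{L},\theta}(X) := P^{\mathcal{M}}(X | \mathcal{L}, \theta) = \int P^{\mathcal{M}}(X | \Xi, \mathcal{L}) \, dP(\Xi | \theta)$$

For the probability of a data item $D = (X, \mathcal{L})$ given the parameters $\theta$ and under the assumptions made by model $\mathcal{M}$, we have

$$P^{\mathcal{M}}_{\theta}(D) := P^{\mathcal{M}}(X, \mathcal{L} | \theta) = \pi_{\mathcal{L}} \cdot \int P^{\mathcal{M}}(X | \Xi, \mathcal{L}) \, p(\Xi | \theta) \, d\Xi. \tag{9}$$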

Estimating the probability of the label set $\mathcal{L}$, $\pi_{\mathcal{L}}$, is a standard problem of estimating the parameters of a categorical distribution. According to the law of large numbers, the empirical frequency of occurrence converges to the true probability for each label set. Therefore, we do not further investigate this estimation problem and assume that the true value of $\pi_{\mathcal{L}}$ can be determined for all $\mathcal{L} \in \mathbb{L}$.

The probability of a particular emission vector $\Xi$ given a data item $D$ and the parameters $\theta$ is computed using Bayes' theorem:

$$P^{\mathcal{M}}_{D,\theta}(\Xi) := P^{\mathcal{M}}(\Xi | X, \mathcal{L}, \theta) = \frac{P^{\mathcal{M}}(X | \Xi, \mathcal{L}) \cdot P(\Xi | \theta)}{P^{\mathcal{M}}(X | \mathcal{L}, \theta)} \tag{10}$$

The dependency of $P^{\mathcal{M}}_{D,\theta}(\Xi)$ on the parameter vector $\theta$ indicates that the estimation of the contributions of a source may depend on the parameters of a different source. When solving clustering problems, we also find cross-dependencies between parameters of different classes. However, these dependencies are due to the fact that the class assignments are not known but are probabilistically estimated. If the true class labels were known, the dependencies would disappear. In the context of multi-label classification, however, the mutual dependencies persist even when the true labels (called label set in our context) are known.

The distribution $P^{\mathcal{M}}(\Xi | D, \theta)$ describes the essential difference between inference methods for multi-label data. For an inference method $\mathcal{M}$ which assumes that an observation $X$ is a sample from each source contained in the label set $\mathcal{L}$, $P^{\mathcal{M}}(\Xi_k | D, \theta)$ is a point mass (Dirac mass) at $X$. In the above example of the sum of Gaussian emissions, $P^{\mathcal{M}}(\Xi | D, \theta)$ has a continuous density function.
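For the sum of two independent Gaussian emissions, the conditional distribution of Eq. 10 is available in closed form by standard Gaussian conditioning. The following sketch computes it for the first source; the function name and argument names are hypothetical, and the restriction to two univariate sources with label set {1, 2} is an assumption made for illustration.

```python
def posterior_of_first_emission(x, mu1, var1, mu2, var2):
    """Mean and variance of emission 1 given x = emission_1 + emission_2.

    Assumes independent Gaussian emissions with means mu1, mu2 and
    variances var1, var2 (an illustrative special case of Eq. 10).
    """
    total_var = var1 + var2
    gain = var1 / total_var             # share of the residual assigned to source 1
    mean = mu1 + gain * (x - mu1 - mu2)
    var = var1 * var2 / total_var       # does not depend on the observation x
    return mean, var


# An observation at the expected value of the sum leaves the mean unchanged.
posterior_at_zero = posterior_of_first_emission(0.0, -3.5, 1.0, 3.5, 1.0)
posterior_at_one = posterior_of_first_emission(1.0, -3.5, 1.0, 3.5, 1.0)
```

The posterior mean of the first emission depends on the mean of the second source, which is the cross-dependency between source parameters discussed above.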

4.2 Conditions for identifiability

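As in the standard scenario of learning from single-label data, parameter inference is only possible if the parameters are identifiable. Conversely, parameters are unidentifiable if $\theta^{(1)} \neq \theta^{(2)}$, but $P_{\theta^{(1)}} = P_{\theta^{(2)}}$. For our setting as specified in Eq. 9, this is the case if

$$\sum_{n=1}^{N} \log\left(\pi_{\mathcal{L}_n} \int P^{\mathcal{M}}(X_n | \Xi, \mathcal{L}_n) \, p(\Xi | \theta^{(1)}) \, d\Xi\right) = \sum_{n=1}^{N} \log\left(\pi_{\mathcal{L}_n} \int P^{\mathcal{M}}(X_n | \Xi, \mathcal{L}_n) \, p(\Xi | \theta^{(2)}) \, d\Xi\right)$$

but $\theta^{(1)} \neq \theta^{(2)}$. The following situations imply such a scenario: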

- A particular source $k$ never occurs in the label set, formally $|\{\mathcal{L} \in \mathbb{L} \mid k \in \mathcal{L}\}| = 0$ or $\pi_{\mathcal{L}} = 0 \ \forall \mathcal{L} \in \mathbb{L} : \mathcal{L} \ni k$. This excess parameterization is the trivial case: one cannot infer the parameters of a source without observing emissions from that source. In such a case, the probability of the observed data (Eq. 9) is invariant to the parameters $\theta_k$ of source $k$.

- The combination function ignores all (!) emissions of a particular source $k$. Thus, under the assumptions of the inference method $\mathcal{M}$, the emission $\Xi_k$ of source $k$ never has an influence on the observation. Hence, the combination function does not depend on $\Xi_k$. If this independence holds for all $\mathcal{L}$, information on the source parameters $\theta_k$ cannot be obtained from the data.

- The data available for inference does not support distinguishing different parameters of a pair of sources. Assume e.g. that source 2 only occurs together with source 1, i.e. for all $n$ with $2 \in \mathcal{L}_n$, we also have $1 \in \mathcal{L}_n$. Unless the combination function is such that information can be derived about the emissions of the two sources 1 and 2 for at least some of the data items, there is a set of parameters $\theta_1$ and $\theta_2$ for the two sources that yields the same likelihood.

If the distribution of a particular source is unidentifiable, the chosen representation is problematic for the data at hand and might e.g. contain redundancies, such as a source (class) which is never observed. More specifically, in the first two cases, there does not exist any empirical evidence for the existence of a source which is either never observed or has no influence on the data. In the last case, one might doubt if the two classes 1 and 2 are really separate entities, or whether it might be more reasonable to merge them into a single class. Conversely, the absence of all three situations above is a necessary (but not sufficient!) condition for parameter identifiability in the model.
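The first and third situations can be screened for directly on the observed label sets. The sketch below is a hypothetical helper, not part of the analysis above: it flags sources that never occur and sources that only occur together with another source. The second flag is only a warning, since a suitable combination function may still allow the parameters to be separated.

```python
def identifiability_warnings(label_sets, sources):
    """Screen observed label sets for two of the situations listed above.

    label_sets: iterable of label sets, one per data item.
    sources: iterable of all source indices in the model.
    """
    label_sets = [frozenset(ls) for ls in label_sets]
    warnings = []
    for k in sources:
        containing = [ls for ls in label_sets if k in ls]
        if not containing:
            warnings.append(f"source {k} never occurs in any label set")
            continue
        # Sources that are present in every data item that contains k.
        companions = frozenset.intersection(*containing) - {k}
        for j in sorted(companions):
            warnings.append(f"source {k} only occurs together with source {j}")
    return warnings


found = identifiability_warnings([{1}, {1, 2}], sources=[1, 2, 3])
```

In this toy input, source 2 is only observed jointly with source 1, and source 3 is never observed.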

4.3 Maximum likelihood estimation on multi-label data

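Based on the probability of a data item $D$ given the parameter vector $\theta$ under the assumptions of the inference method $\mathcal{M}$ (Eq. 9) and using a uniform prior over the parameters, the log-likelihood of a parameter $\theta$ given a data item $D = (X, \mathcal{L})$ is given by $\ell^{\mathcal{M}}(\theta; D) = \log(P^{\mathcal{M}}(X, \mathcal{L} | \theta))$. Using the particular properties of exponential family distributions (Eq. 4), the score function is

$$\psi^{\mathcal{M}}_{\theta}(D) = \nabla\ell^{\mathcal{M}}(\theta; D) = \mathbb{E}_{\Xi \sim P^{\mathcal{M}}_{D,\theta}}[\phi(\Xi)] - \nabla A(\theta) \tag{11}$$

$$= \mathbb{E}_{\Xi \sim P^{\mathcal{M}}_{D,\theta}}[\phi(\Xi)] - \mathbb{E}_{\Xi \sim P_{\theta}}[\phi(\Xi)]. \tag{12}$$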

Comparing with the score function obtained in the single-label case (Eq. 7), the difference in the first term becomes apparent. While the first term is the sufficient statistic of the observation in the previous case, we now find the expected value of the sufficient statistic of the emissions, conditioned on $D = (X, \mathcal{L})$. This formulation contains the single-label setting as a special case: given the single-label observation $X$ with label $k$, we are sure that the $k$th source has emitted $X$, i.e. $\Xi_k = X$. In the more general case of multi-label data, several emission vectors $\Xi$ might have produced the observed $X$. The distribution of these emission vectors (given $D$ and $\theta$) is given by Eq. 10. The expectation of the sufficient statistics of the emissions with respect to this distribution now plays the role of the sufficient statistic of the observation in the single-label case.

As in the single-label case, we assume that several emissions are independent given their sources (conditional independence). The likelihood and the criterion function for a data set $\mathbf{D} = (D_1, \ldots, D_N)$ thus factorize:

$$\Psi^{\mathcal{M}}_N(\theta) = \frac{1}{N}\sum_{n=1}^{N} \psi^{\mathcal{M}}_{\theta}(D_n) \tag{13}$$

In the following, we study Z-estimators $\hat\theta^{\mathcal{M}}_N$ obtained by setting $\Psi^{\mathcal{M}}_N(\hat\theta^{\mathcal{M}}_N) = 0$. We analyse the asymptotic behaviour of the criterion function $\Psi^{\mathcal{M}}_N$ and derive conditions for consistent estimators as well as their convergence rates.

4.4 Asymptotic behaviour of the estimation equation

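We analyse the criterion function in Eq. 13. The $N$ observations used to estimate $\Psi^{\mathcal{M}}_N(\theta)$ originate from a mixture distribution specified by the label sets. Using the i.i.d. assumption and defining $\mathbf{D}_{\mathcal{L}} := \{(X', \mathcal{L}') \in \mathbf{D} \mid \mathcal{L}' = \mathcal{L}\}$, we derive

$$\Psi^{\mathcal{M}}_N(\theta) = \frac{1}{N}\sum_{D \in \mathbf{D}} \psi^{\mathcal{M}}_{\theta}(D) = \frac{1}{N}\sum_{\mathcal{L} \in \mathbb{L}} |\mathbf{D}_{\mathcal{L}}| \cdot \frac{1}{|\mathbf{D}_{\mathcal{L}}|}\sum_{D \in \mathbf{D}_{\mathcal{L}}} \psi^{\mathcal{M}}_{\theta}(D) \tag{14}$$

Denote by $P_{\mathcal{L},\mathbf{D}}$ the empirical distribution of observations with label set $\mathcal{L}$. Then,

$$\frac{1}{N_{\mathcal{L}}}\sum_{D \in \mathbf{D}_{\mathcal{L}}} \psi^{\mathcal{M}}_{\theta}(D) = \mathbb{E}_{X \sim P_{\mathcal{L},\mathbf{D}}}\left[\psi^{\mathcal{M}}_{\theta}((X, \mathcal{L}))\right] \quad \text{with } N_{\mathcal{L}} := |\mathbf{D}_{\mathcal{L}}|$$

is an average of independent, identically distributed random variables. By the law of large numbers, this empirical average converges to the true average as the number of data items, $N_{\mathcal{L}}$, goes to infinity:

$$\mathbb{E}_{X \sim P_{\mathcal{L},\mathbf{D}}}\left[\psi^{\mathcal{M}}_{\theta}((X, \mathcal{L}))\right] \to \mathbb{E}_{X \sim P_{\mathcal{L},\theta^G}}\left[\psi^{\mathcal{M}}_{\theta}((X, \mathcal{L}))\right]. \tag{15}$$

Furthermore, define $\hat\pi_{\mathcal{L}} := N_{\mathcal{L}}/N$. Again by the law of large numbers, we get $\hat\pi_{\mathcal{L}} \to \pi_{\mathcal{L}}$. Inserting (15) into (14), we derive

$$\Psi^{\mathcal{M}}_N(\theta) = \sum_{\mathcal{L} \in \mathbb{L}} \hat\pi_{\mathcal{L}} \, \mathbb{E}_{X \sim P_{\mathcal{L},\mathbf{D}}}\left[\psi^{\mathcal{M}}_{\theta}((X, \mathcal{L}))\right] \to \sum_{\mathcal{L} \in \mathbb{L}} \pi_{\mathcal{L}} \, \mathbb{E}_{X \sim P_{\mathcal{L},\theta^G}}\left[\psi^{\mathcal{M}}_{\theta}((X, \mathcal{L}))\right] \tag{16}$$

Inserting the value of the score function (Eq. 12) into Eq. 16 yields

$$\Psi^{\mathcal{M}}_N(\theta) \to \mathbb{E}_{D \sim P_{\theta^G}}\left[\mathbb{E}_{\Xi \sim P^{\mathcal{M}}_{D,\theta}}[\phi(\Xi)]\right] - \mathbb{E}_{\Xi \sim P_{\theta}}[\phi(\Xi)] \tag{17}$$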

This expression shows that the maximum likelihood estimator is a moment estimator also for inference based on multi-label data. However, the source emissions cannot be observed directly, and the expected value of their sufficient statistic substitutes for this missing information. The average is taken with respect to the distribution of the source emissions assumed by the inference method $\mathcal{M}$.

4.5 Conditions for consistent estimators

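Estimators are characterized by properties like consistency and efficiency. The following theorem specifies conditions under which the estimator $\hat\theta^{\mathcal{M}}_N$ is consistent.

Theorem 3 (Consistency of estimators.) Assume the inference method $\mathcal{M}$ uses the true conditional distribution of the source emissions $\Xi$ given data items, i.e. for all data items $D = (X, \mathcal{L})$, $P^{\mathcal{M}}(\Xi | (X, \mathcal{L}), \theta) = P^{G}(\Xi | (X, \mathcal{L}), \theta)$, and that $P^{\mathcal{M}}(X | \mathcal{L}, \theta)$ is concave. Then the estimator $\hat\theta$ determined as a zero of $\Psi^{\mathcal{M}}_N(\theta)$ (Eq. 17) is consistent.

Proof The true parameter of the generative process, denoted by $\theta^G$, is a zero of $\Psi^{G}(\theta)$, the criterion function derived from the true generative model. According to Theorem 1, $\sup_{\theta \in \Theta} \|\Psi^{\mathcal{M}}_N(\theta) - \Psi^{G}(\theta)\| \to 0$ is a necessary condition for consistency of $\hat\theta^{\mathcal{M}}_N$. Inserting the criterion function (Eq. 17) yields the condition

$$\left\| \mathbb{E}_{D \sim P_{\theta^G}}\left[\mathbb{E}_{\Xi \sim P^{\mathcal{M}}_{D,\theta}}[\phi(\Xi)]\right] - \mathbb{E}_{D \sim P_{\theta^G}}\left[\mathbb{E}_{\Xi \sim P^{G}_{D,\theta}}[\phi(\Xi)]\right] \right\| = 0. \tag{18}$$

Splitting the generative process for the data items $D \sim P_{\theta^G}$ into a separate generation of the label set $\mathcal{L}$ and an observation $X$, $\mathcal{L} \sim P_{\pi}$, $X \sim P_{\mathcal{L},\theta^G}$, Eq. 18 is fulfilled if

$$\sum_{\mathcal{L} \in \mathbb{L}} \pi_{\mathcal{L}} \, \mathbb{E}_{X \sim P_{\mathcal{L},\theta^G}}\left[ \left\| \mathbb{E}_{\Xi \sim P^{\mathcal{M}}_{(X,\mathcal{L}),\theta}}[\phi(\Xi)] - \mathbb{E}_{\Xi \sim P^{G}_{(X,\mathcal{L}),\theta}}[\phi(\Xi)] \right\| \right] = 0. \tag{19}$$

Using the assumption that $P^{\mathcal{M}}_{(X,\mathcal{L}),\theta} = P^{G}_{(X,\mathcal{L}),\theta}$ for all data items $D = (X, \mathcal{L})$, this condition is trivially fulfilled. □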

Differences between $P^{\mathcal{M}}_{D^s,\theta}$ and $P^{G}_{D^s,\theta}$ for some data items $D^s = (X^s, \mathcal{L}^s)$, on the other hand, have no effect on the consistency of the result if either the probability of $D^s$ is zero, or if the expected value of the sufficient statistics is identical for the two different parameter vectors. The first situation implies that either the label set $\mathcal{L}^s$ never occurs in any data item, or the observation $X^s$ never occurs with label set $\mathcal{L}^s$. The second situation implies that the parameters are unidentifiable. Hence, we formulate the stronger conjecture that if an inference procedure yields inconsistent estimators on data with a particular label set, its overall parameter estimators are inconsistent. This implies, in particular, that inconsistencies on two (or more) label sets cannot compensate each other to yield an estimator which is consistent on the entire data set.

As we show in Sect. 5, ignoring all multi-label data yields consistent estimators. However, discarding a possibly large part of the data is not efficient, which motivates the quest for more advanced inference techniques to retrieve information about the source parameters from multi-label data.

4.6 Efficiency of parameter estimation

Given that an estimator $\hat\theta$ is consistent, the next question of interest concerns the rate at which the deviation from the true parameter value converges to zero. This rate is given by the asymptotic variance of the estimator (Eq. 6). We will compute the asymptotic variance specifically for maximum likelihood estimators in order to compare different inference techniques which yield consistent estimators in terms of how efficiently they use the provided data set for inference.

Fisher information The Fisher information is introduced to measure the information content of a data item for the parameters of the source that is assumed to have generated the data. In multi-label classification, the definition of the Fisher information (Eq. 8) has to be extended, as the source emissions are only indirectly observed:

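Definition 2 (Fisher information of multi-label data) The Fisher information $\mathcal{I}_{\mathcal{L}}$ measures the amount of information a data item $D = (X, \mathcal{L})$ with label set $\mathcal{L}$ contains about the parameter vector $\theta$:

$$\mathcal{I}_{\mathcal{L}} := \mathbb{V}_{\Xi \sim P_{\theta}}[\phi(\Xi)] - \mathbb{E}_{X \sim P_{\mathcal{L},\theta}}\left[\mathbb{V}_{\Xi \sim P^{\mathcal{M}}_{D,\theta}}[\phi(\Xi)]\right] \tag{20}$$

The term $\mathbb{V}_{\Xi \sim P^{\mathcal{M}}_{D,\theta}}[\phi(\Xi)]$ measures the uncertainty about the emission vector $\Xi$, given a data item $D$. This term vanishes if and only if the data item $D$ completely determines the source emission(s) of all involved sources. In the other extreme case where the data item $D$ does not reveal any information about the source emissions, this term is equal to $\mathbb{V}_{\Xi \sim P_{\theta}}[\phi(\Xi)]$, and the Fisher information vanishes.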
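As a worked illustration of Eq. 20, consider a data item with label set $\{1, 2\}$ whose observation is the sum of two independent Gaussian emissions. The sketch rests on a simplifying assumption that is not made in the text: the variances $\sigma_1^2, \sigma_2^2$ are treated as known, so that the sufficient statistic reduces to $\phi(\Xi) = (\Xi_1, \Xi_2)^T$ and the information refers to the natural parameters $\mu_k/\sigma_k^2$.

```latex
\begin{align*}
\mathbb{V}_{\Xi \sim P_{\theta}}[\phi(\Xi)]
  &= \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}, \\
\mathbb{V}_{\Xi \sim P^{\mathcal{M}}_{D,\theta}}[\phi(\Xi)]
  &= \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}
     \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}
     \quad \text{(independent of } X\text{)}, \\
\mathcal{I}_{\{1,2\}}
  &= \frac{1}{\sigma_1^2 + \sigma_2^2}
     \begin{pmatrix} \sigma_1^4 & \sigma_1^2 \sigma_2^2 \\
                     \sigma_1^2 \sigma_2^2 & \sigma_2^4 \end{pmatrix}.
\end{align*}
```

The resulting matrix has rank one: under these assumptions a data item with label set $\{1, 2\}$ carries information only about one direction in the parameter space, namely the one corresponding to the mean of the sum.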

Asymptotic variance We now determine the asymptotic variance of an estimator.

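Theorem 4 (Asymptotic variance.) Denote by $P^{\mathcal{M}}_{D,\theta}(\Xi)$ the distribution of the emission vector $\Xi$ given the data item $D$ and the parameters $\theta$, under the assumptions made by the inference method $\mathcal{M}$. Furthermore, let $\mathcal{I}_{\mathcal{L}}$ denote the Fisher information of data with label set $\mathcal{L}$. Then, the asymptotic variance of the maximum likelihood estimator $\hat\theta$ is given by

$$\Sigma = \left(\mathbb{E}_{\mathcal{L}}[\mathcal{I}_{\mathcal{L}}]\right)^{-1} \cdot \left(\mathbb{V}_{D}\left[\mathbb{E}_{\Xi \sim P^{\mathcal{M}}_{D,\theta}}[\phi(\Xi)]\right]\right) \cdot \left(\mathbb{E}_{\mathcal{L}}[\mathcal{I}_{\mathcal{L}}]\right)^{-T}, \tag{21}$$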

where all expectations and variances are computed with respect to the true distribution.

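Proof We derive the asymptotic variance based on Theorem 2 on asymptotic normality of Z-estimators. The first and last factor in Eq. 6 are the derivative of the criterion function $\psi^{\mathcal{M}}_{\theta}(D)$ (Eq. 11):

$$\nabla\psi^{\mathcal{M}}_{\theta}(D) = \nabla^2\ell^{\mathcal{M}}(\theta; D) = \frac{\nabla^2 P^{\mathcal{M}}_{\theta}(D)}{P^{\mathcal{M}}_{\theta}(D)} - \left(\frac{\nabla P^{\mathcal{M}}_{\theta}(D)}{P^{\mathcal{M}}_{\theta}(D)}\right)^{\otimes},$$

where $v^{\otimes}$ denotes the outer product of vector $v$. The particular properties of the exponential family distributions imply

$$\frac{\nabla^2 P^{\mathcal{M}}_{\theta}(D)}{P^{\mathcal{M}}_{\theta}(D)} = \left(\mathbb{E}_{\Xi \sim P^{\mathcal{M}}_{D,\theta}}[\phi(\Xi)] - \mathbb{E}_{\Xi \sim P_{\theta}}[\phi(\Xi)]\right)^{\otimes} + \mathbb{V}_{\Xi \sim P^{\mathcal{M}}_{D,\theta}}[\phi(\Xi)] - \mathbb{V}_{\Xi \sim P_{\theta}}[\phi(\Xi)].$$

With $\nabla P^{\mathcal{M}}_{\theta}(D) / P^{\mathcal{M}}_{\theta}(D) = \psi^{\mathcal{M}}_{\theta}(D)$ and using Eq. 12, we get

$$\nabla\psi^{\mathcal{M}}_{\theta}(D) = \mathbb{V}_{\Xi \sim P^{\mathcal{M}}_{D,\theta}}[\phi(\Xi)] - \mathbb{V}_{\Xi \sim P_{\theta}}[\phi(\Xi)].$$

The expected Fisher information matrix over all label sets results from computing the expectation over the data items $D$:

$$\mathbb{E}_{D \sim P_{\theta^G}}\left[\nabla\psi^{\mathcal{M}}_{\theta}(D)\right] = \mathbb{E}_{D \sim P_{\theta^G}}\left[\mathbb{V}_{\Xi \sim P^{\mathcal{M}}_{D,\theta}}[\phi(\Xi)]\right] - \mathbb{V}_{\Xi \sim P_{\theta}}[\phi(\Xi)] = -\mathbb{E}_{\mathcal{L}}[\mathcal{I}_{\mathcal{L}}].$$

The sign is immaterial, as this factor enters Eq. 6 twice. For the middle term of Eq. 6, we have

$$\mathbb{E}_{D \sim P_{\theta^G}}\left[(\psi^{\mathcal{M}}_{\theta}(D))^{\otimes}\right] = \mathbb{V}_{D \sim P_{\theta^G}}\left[\mathbb{E}_{\Xi \sim P^{\mathcal{M}}_{D,\theta}}[\phi(\Xi)]\right] + \left(\mathbb{E}_{D \sim P_{\theta^G}}\left[\mathbb{E}_{\Xi \sim P^{\mathcal{M}}_{D,\theta}}[\phi(\Xi)]\right] - \mathbb{E}_{\Xi \sim P_{\theta}}[\phi(\Xi)]\right)^{\otimes}.$$

The condition for $\hat\theta$ given in Eq. 17 implies

$$\mathbb{E}_{D \sim P_{\theta^G}}\left[(\psi^{\mathcal{M}}_{\theta}(D))^{\otimes}\right] = \mathbb{V}_{D \sim P_{\theta^G}}\left[\mathbb{E}_{\Xi \sim P^{\mathcal{M}}_{D,\theta}}[\phi(\Xi)]\right]. \tag{22}$$

Using Eq. 6, we derive the expression for the asymptotic variance of the estimator $\hat\theta$ stated in the theorem. □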

According to this result, the asymptotic variance of the estimator is determined by two factors. We analyse them in the following two subsections and afterwards derive some well-known results for special cases.

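(A) Bias-variance decomposition We define the expectation-deviance for label set $\mathcal{L}$ as the difference between the expected value of the sufficient statistics under the distribution assumed by method $\mathcal{M}$, given observations with label set $\mathcal{L}$, and the expected value of the sufficient statistic given all data items:

$$\Delta E^{\mathcal{M}}_{\mathcal{L}} := \mathbb{E}_{X \sim P_{\mathcal{L},\theta^G}}\left[\mathbb{E}_{\Xi \sim P^{\mathcal{M}}_{(X,\mathcal{L}),\theta}}[\phi(\Xi)]\right] - \mathbb{E}_{D' \sim P_{\theta^G}}\left[\mathbb{E}_{\Xi \sim P^{\mathcal{M}}_{D',\theta}}[\phi(\Xi)]\right] \tag{23}$$

The middle factor (Eq. 22) of the estimator variance is the variance in the expectation values of the sufficient statistics of $\Xi$. Using $\mathbb{E}_X[X^2] = \mathbb{E}_X[X]^2 + \mathbb{V}_X[X]$ and splitting $D = (X, \mathcal{L})$ into the observation $X$ and the label set $\mathcal{L}$, it can be decomposed as

$$\mathbb{V}_{D \sim P_{\theta^G}}\left[\mathbb{E}_{\Xi \sim P^{\mathcal{M}}_{D,\theta}}[\phi(\Xi)]\right] = \mathbb{E}_{\mathcal{L}}\left[(\Delta E^{\mathcal{M}}_{\mathcal{L}})^{\otimes}\right] + \mathbb{E}_{\mathcal{L}}\left[\mathbb{V}_{X \sim P_{\mathcal{L},\theta^G}}\left[\mathbb{E}_{\Xi \sim P^{\mathcal{M}}_{(X,\mathcal{L}),\theta}}[\phi(\Xi)]\right]\right]. \tag{24}$$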

Two independent effects thus cause a high variance of the estimator:

(1) The expected value of the sufficient statistics of the source emissions based on observations with a particular label L deviates from the true parameter value. Note that this effect can be present even if the estimator is consistent: these deviations of sufficient statistics conditioned on a particular label set might cancel out each other when averaging over all label sets and thus yield a consistent estimator. However, an estimator obtained by such a procedure has a higher variance than an estimator which is obtained by a procedure which yields consistent estimators also conditioned on every label set.

(2) The expected value of the sufficient statistics of the source emissions given the observation X varies with X. This contribution is typically large for one-against-all methods (Rifkin and Klautau 2004).

Note that for inference methods which fulfil the conditions of Theorem 3, we have $\Delta E^{\mathcal{M}}_{\mathcal{L}} = 0$. Methods which yield consistent estimators on any label set are thus not only provably consistent, but also yield parameters with less variation.

(B) Special cases The above result reduces to well-known formulas for some special cases of single-label assignments.

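Variance of estimators on single-label data If estimation is based on single-label data, i.e. $D = (X, \mathcal{L})$ and $\mathcal{L} = \{\lambda\}$, the source emissions are fully determined by the available data, as the observations are considered to be direct emissions of the respective source:

$$P^{\mathcal{M}}_{D,\theta}(\Xi) = \prod_{k=1}^{K} P^{\mathcal{M}}_{D,\theta_k}(\Xi_k), \quad \text{with } P^{\mathcal{M}}_{D,\theta_k}(\Xi_k) = \begin{cases} 1_{\{\Xi_k = X\}} & \text{if } k = \lambda \\ P(\Xi_k | \theta_k) & \text{otherwise} \end{cases}$$

The estimation procedure is thus independent for every source $k$. Furthermore, we have $\mathbb{E}_{\Xi_k \sim P^{\mathcal{M}}_{D,\theta_k}}[\phi(\Xi_k)] = \phi(X)$ and $\mathbb{V}_{\Xi_k \sim P^{\mathcal{M}}_{D,\theta_k}}[\phi(\Xi_k)] = 0$. Hence, $\Sigma$ is a diagonal matrix. Its diagonal element $\Sigma_{kk}$ follows from Eq. 6, with the middle factor for source $k$ given by

$$\mathbb{V}_{X \sim P_{\theta_k^G}}[\phi(X)] + \left(\mathbb{E}_{X \sim P_{\theta_k^G}}[\phi(X)] - \mathbb{E}_{\Xi_k \sim P_{\theta_k}}[\phi(\Xi_k)]\right)^{\otimes}.$$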

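Variance of consistent estimators Consistent estimators are characterized by the expression $\mathbb{E}_{D \sim P_{\theta^G}}\left[\mathbb{E}_{\Xi \sim P^{\mathcal{M}}_{D,\theta}}[\phi(\Xi)]\right] = \mathbb{E}_{\Xi \sim P_{\theta}}[\phi(\Xi)]$ and thus

$$\Sigma = \left(\mathbb{E}_{\mathcal{L}}[\mathcal{I}_{\mathcal{L}}]\right)^{-1} \cdot \mathbb{V}_{D \sim P_{\theta^G}}\left[\mathbb{E}_{\Xi \sim P^{\mathcal{M}}_{D,\theta}}[\phi(\Xi)]\right] \cdot \left(\mathbb{E}_{\mathcal{L}}[\mathcal{I}_{\mathcal{L}}]\right)^{-1}.$$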

Variance of consistent estimators on single-label data Combining the two aforementioned conditions, we derive

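$$\Sigma_{\lambda\lambda} = \mathbb{V}_{\Xi_\lambda \sim P_{\theta_\lambda}}[\phi(\Xi_\lambda)]^{-1} = \left(\mathcal{I}_{\mathcal{X}_\lambda}(\theta_\lambda)\right)^{-1}, \tag{25}$$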

which corresponds to the well-known result for single-label data (Eq. 8).

5 Asymptotic analysis of multi-label inference methods

In this section, we formally describe several techniques for inference based on multi-label data and apply the results obtained in Sect. 4 to study the asymptotic behaviour of estimators obtained with these methods.

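5.1 Ignore training ($\mathcal{M}_{\text{ignore}}$)

The ignore training is probably the simplest, but also the most limited way of treating multi-label data: data items which belong to more than one class are simply ignored (Boutell et al. 2004), i.e. the estimation of source parameters is uniquely based on single-label data. The overall probability of an emission vector $\Xi$ given the data item $D$ thus factorizes:

$$P^{\text{ignore}}_{D,\theta}(\Xi) = \prod_{k=1}^{K} P^{\text{ignore}}_{D,\theta,k}(\Xi_k) \tag{26}$$

Each of the factors $P^{\text{ignore}}_{D,\theta,k}(\Xi_k)$, representing the probability distribution of source $k$, only depends on the parameter $\theta_k$, i.e. we have $P^{\text{ignore}}_{D,\theta,k}(\Xi_k) = P^{\text{ignore}}_{D,\theta_k}(\Xi_k)$ for all $k = 1, \ldots, K$. A data item $D = (X, \mathcal{L})$ exclusively provides information about source $k$ if $\mathcal{L} = \{k\}$. In the case $\mathcal{L} \neq \{k\}$, the probability distribution of emissions $\Xi_k$, $P^{\text{ignore}}_{D,\theta_k}(\Xi_k)$, is invariant to the data item $D$:

$$P^{\text{ignore}}_{D,\theta_k}(\Xi_k) = \begin{cases} 1_{\{\Xi_k = X\}} & \text{if } \mathcal{L} = \{k\} \\ P_{\theta_k}(\Xi_k) & \text{otherwise} \end{cases} \tag{27}$$

Observing a multi-label data item does not change the assumed probability distribution of any of the classes, as these data items are discarded by $\mathcal{M}_{\text{ignore}}$. From Eqs. 26 and 27, we obtain the following criterion function given a data item $D$:

$$\psi^{\text{ignore}}_{\theta}(D) = \sum_{k=1}^{K} \psi^{\text{ignore}}_{\theta_k}(D), \quad \psi^{\text{ignore}}_{\theta_k}(D) = \begin{cases} \phi(X) - \mathbb{E}_{\Xi_k \sim P_{\theta_k}}[\phi(\Xi_k)] & \text{if } \mathcal{L} = \{k\} \\ 0 & \text{otherwise} \end{cases} \tag{28}$$

The estimator $\hat\theta^{\text{ignore}}_N$ is consistent and normally distributed:

Lemma 1 The estimator $\hat\theta^{\text{ignore}}_N$ determined as a zero of $\Psi^{\text{ignore}}_N(\theta)$ as defined in Eqs. 13 and 28 is distributed according to $\sqrt{N} \cdot (\hat\theta^{\text{ignore}}_N - \theta^G) \rightsquigarrow \mathcal{N}(0, \Sigma^{\text{ignore}})$. The covariance matrix $\Sigma^{\text{ignore}}$ is given by $\Sigma^{\text{ignore}} = \operatorname{diag}(\Sigma^{\text{ignore}}_{11}, \ldots, \Sigma^{\text{ignore}}_{KK})$, with $\Sigma^{\text{ignore}}_{kk} = \mathbb{V}_{X \sim P_{\theta_k^G}}\left[\psi_{\theta_k^G}((X, \{k\}))\right]^{-1}$.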

This statement follows directly from Theorem 2 about the asymptotic distribution of estimators based on single-label data. A formal proof is given in Sect. 1 in the appendix.

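5.2 New source training ($\mathcal{M}_{\text{new}}$)

New source training defines new meta-classes for each label set such that every data item belongs to a single class (in terms of these meta-labels) (Boutell et al. 2004). Doing so, the number of parameters to be inferred is heavily increased as compared to the generative process. We define the number of possible label sets as $L := |\mathbb{L}|$ and assume an arbitrary, but fixed, ordering of the possible label sets. Let $\mathcal{L}[l]$ be the $l$th label set in this ordering. Then, we have: $P^{\text{new}}_{D,\theta}(\Xi) = \prod_{l=1}^{L} P^{\text{new}}_{D,\theta,l}(\Xi_l)$. As for $\mathcal{M}_{\text{ignore}}$, each of the factors represents the probability distribution of one of the sources given the data item $D$. Hence

$$P^{\text{new}}_{D,\theta,l}(\Xi_l) = P^{\text{new}}_{D,\theta_l}(\Xi_l) = \begin{cases} 1_{\{\Xi_l = X\}} & \text{if } \mathcal{L} = \mathcal{L}[l] \\ P_{\theta_l}(\Xi_l) & \text{otherwise} \end{cases} \tag{29}$$

For the criterion function on a data item $D = (X, \mathcal{L})$, we thus have

$$\psi^{\text{new}}_{\theta}(D) = \sum_{l=1}^{L} \psi^{\text{new}}_{\theta_l}(D), \quad \psi^{\text{new}}_{\theta_l}(D) = \begin{cases} \phi(X) - \mathbb{E}_{\Xi_l \sim P_{\theta_l}}[\phi(\Xi_l)] & \text{if } \mathcal{L} = \mathcal{L}[l] \\ 0 & \text{otherwise} \end{cases}$$

The estimator $\hat\theta^{\text{new}}_N$ is consistent and normally distributed:

Lemma 2 The estimator $\hat\theta^{\text{new}}_N$ obtained as a zero of the criterion function $\Psi^{\text{new}}_N(\theta)$ is asymptotically distributed as $\sqrt{N} \cdot (\hat\theta^{\text{new}}_N - \theta^G) \rightsquigarrow \mathcal{N}(0, \Sigma^{\text{new}})$. The covariance matrix is block-diagonal: $\Sigma^{\text{new}} = \operatorname{diag}(\Sigma^{\text{new}}_{11}, \ldots, \Sigma^{\text{new}}_{LL})$, with the diagonal elements given by

$$\Sigma^{\text{new}}_{ll} = \mathbb{V}_{X \sim P_{\mathcal{L}[l],\theta^G}}[\phi(X)]^{-1}.$$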

Again, this corresponds to the result obtained for consistent single-label inference techniques in Eq. 25. The main drawback of this method is that there are typically not enough training data available to reliably estimate a parameter set for each label set. Furthermore, it is not possible to assign a new data item to a label set which is not seen in the training data.

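5.3 Cross-training ($\mathcal{M}_{\text{cross}}$)

Cross-training (Boutell et al. 2004) takes each sample $X$ which belongs to class $k$ as an emission of class $k$, independent of other labels the data item has. The probability of $\Xi$ thus factorizes into a product over the probabilities of the different source emissions:

$$P^{\text{cross}}_{D,\theta}(\Xi) = \prod_{k=1}^{K} P^{\text{cross}}_{D,\theta,k}(\Xi_k) \tag{30}$$

As all sources are assumed to be independent of each other, we have for all $k$

$$P^{\text{cross}}_{D,\theta,k}(\Xi_k) = P^{\text{cross}}_{D,\theta_k}(\Xi_k), \quad P^{\text{cross}}_{D,\theta_k}(\Xi_k) = \begin{cases} 1_{\{\Xi_k = X\}} & \text{if } k \in \mathcal{L} \\ P_{\theta_k}(\Xi_k) & \text{otherwise} \end{cases}$$

Again, $P^{\text{cross}}_{D,\theta_k}(\Xi_k) = P_{\theta_k}(\Xi_k)$ in the case $k \notin \mathcal{L}$ means that $X$ does not provide any information about the assumed $P_{\theta_k}$, i.e. the estimated distribution is unchanged. For the criterion function, we have

$$\psi^{\text{cross}}_{\theta}(D) = \sum_{k=1}^{K} \psi^{\text{cross}}_{\theta_k}(D), \quad \psi^{\text{cross}}_{\theta_k}(D) = \begin{cases} \phi(X) - \mathbb{E}_{\Xi_k \sim P_{\theta_k}}[\phi(\Xi_k)] & \text{if } k \in \mathcal{L} \\ 0 & \text{otherwise} \end{cases}$$

The parameters obtained by $\mathcal{M}_{\text{cross}}$ are not consistent:

Lemma 3 The estimator $\hat\theta^{\text{cross}}_N$ obtained as a zero of the criterion function $\Psi^{\text{cross}}_N(\theta)$ is inconsistent if the training data set contains at least one multi-label data item.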

The inconsistency is due to the fact that multi-label data items are used to estimate the parameters of all sources the data item belongs to without considering the influence of the other sources. The bias of the estimator grows if the fraction of multi-label data used for the estimation increases. A formal proof is given in the appendix (Sect. 1).
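The bias of cross-training can be made visible with a small simulation. The sketch below is illustrative only: it assumes two unit-variance Gaussian sources with means $\pm a$, addition as combination function, and label-set probabilities 0.4, 0.4 and 0.2 for {1}, {2} and {1, 2}. The function `simulate_means` is a hypothetical helper that estimates only the source means.

```python
import random


def simulate_means(n_items, a=3.5, probs=(0.4, 0.4, 0.2), seed=0):
    """Mean estimates of ignore training and cross-training on synthetic data."""
    rng = random.Random(seed)
    label_sets = [(1,), (2,), (1, 2)]
    sums = {"ignore": {1: 0.0, 2: 0.0}, "cross": {1: 0.0, 2: 0.0}}
    counts = {"ignore": {1: 0, 2: 0}, "cross": {1: 0, 2: 0}}
    for _ in range(n_items):
        labels = rng.choices(label_sets, weights=probs)[0]
        emissions = {1: rng.gauss(-a, 1.0), 2: rng.gauss(a, 1.0)}
        x = sum(emissions[k] for k in labels)  # combination by addition
        for k in labels:
            # Cross-training treats x as an emission of every source in the label set.
            sums["cross"][k] += x
            counts["cross"][k] += 1
            # Ignore training only uses single-label data items.
            if len(labels) == 1:
                sums["ignore"][k] += x
                counts["ignore"][k] += 1
    return {m: {k: sums[m][k] / counts[m][k] for k in (1, 2)} for m in sums}


means = simulate_means(20000)
```

Under these assumptions the cross-training estimate of the first mean concentrates around $-a \cdot \pi_1 / (\pi_1 + \pi_{12})$ rather than $-a$, while the ignore estimate concentrates around $-a$.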

5.4 Deconvolutive training ($\mathcal{M}_{\text{deconv}}$)

The deconvolutive training method estimates the distribution of the source emissions given a data item. Modelling the generative process, the distribution of an observation $X$ given the emission vector $\Xi$ and the label set $\mathcal{L}$ is

$$P^{\text{deconv}}(X | \Xi, \mathcal{L}) = 1_{\{X = c^{\text{deconv}}(\Xi, \mathcal{L})\}}$$

Integrating out the source emissions, we obtain the probability of an observation $X$ as $P^{\text{deconv}}(X | \mathcal{L}, \theta) = \int P(X | \Xi, \mathcal{L}) \, dP(\Xi | \theta)$. Using Bayes' theorem and the above notation, we have:

$$P^{\text{deconv}}(\Xi | D, \theta) = \frac{P^{\text{deconv}}(X | \Xi, \mathcal{L}) \cdot P^{\text{deconv}}(\Xi | \theta)}{P^{\text{deconv}}(X | \mathcal{L}, \theta)} \tag{33}$$

If the true combination function is provided to the method, or the method can correctly estimate this function, then $P^{\text{deconv}}(\Xi | D, \theta)$ corresponds to the true conditional distribution. The target function is defined by

$$\psi^{\text{deconv}}_{\theta}(D) = \mathbb{E}_{\Xi \sim P^{\text{deconv}}_{D,\theta}}[\phi(\Xi)] - \mathbb{E}_{\Xi \sim P_{\theta}}[\phi(\Xi)] \tag{34}$$

Unlike in the methods presented before, the combination function $c(\cdot, \cdot)$ in $\mathcal{M}_{\text{deconv}}$ influences the assumed distribution of emissions $\Xi$, $P^{\text{deconv}}_{D,\theta}(\Xi)$. For this reason, it is not possible to describe the distribution of the estimators obtained by this method in general. However, given the identifiability conditions discussed in Sect. 3.2, the parameter estimators converge to their true values.
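For a concrete impression of how a zero of the criterion function in Eq. 34 can be found, the following sketch iterates the moment condition as a fixed-point update. It is a minimal illustration under assumptions that go beyond the text: two univariate Gaussian sources with known variances, addition as combination function, label sets restricted to {1}, {2} and {1, 2}, and only the means being estimated. The function name `deconvolutive_means` is hypothetical.

```python
import random


def deconvolutive_means(data, var1=1.0, var2=1.0, iterations=100):
    """Fixed-point iteration for the source means under an additive model.

    data: list of (x, labels), labels being (1,), (2,) or (1, 2).
    Variances are assumed known; only the means are estimated.
    """
    mu = {1: 0.0, 2: 0.0}
    var = {1: var1, 2: var2}
    total_var = var1 + var2
    for _ in range(iterations):
        total = {1: 0.0, 2: 0.0}
        count = {1: 0, 2: 0}
        for x, labels in data:
            if len(labels) == 1:
                k = labels[0]
                total[k] += x  # the emission is observed directly
                count[k] += 1
            else:
                # Expected emissions given x = emission_1 + emission_2,
                # evaluated at the current parameter estimates.
                residual = x - mu[1] - mu[2]
                for k in (1, 2):
                    total[k] += mu[k] + var[k] / total_var * residual
                    count[k] += 1
        mu = {k: total[k] / count[k] for k in (1, 2)}
    return mu[1], mu[2]


# Synthetic data with illustrative label-set probabilities 0.4, 0.4, 0.2.
rng = random.Random(1)
data = []
for _ in range(6000):
    labels = rng.choices([(1,), (2,), (1, 2)], weights=(0.4, 0.4, 0.2))[0]
    e = {1: rng.gauss(-3.5, 1.0), 2: rng.gauss(3.5, 1.0)}
    data.append((sum(e[k] for k in labels), labels))
mu1_hat, mu2_hat = deconvolutive_means(data)
```

Each update replaces the unobserved emissions of multi-label items by their conditional expectation, so multi-label data contributes to both sources without being treated as a direct emission of either.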

6 Addition of Gaussian-distributed emissions

Multi-label Gaussian sources allow us to study the influence of addition as a link function. We consider the case of two univariate Gaussian distributions with sample space $\mathbb{R}$. The probability density function is $p(\xi) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(\xi - \mu)^2}{2\sigma^2}\right)$. Mean and standard deviation of the $k$th source are denoted by $\mu_k$ and $\sigma_k$, respectively, for $k = 1, 2$.

6.1 Theoretical investigation

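Rearranging terms in order to write the Gaussian distribution as a member of the exponential family (Eq. 3), we derive

$$\theta_k = \left(\frac{\mu_k}{\sigma_k^2}, -\frac{1}{2\sigma_k^2}\right)^T, \quad \phi(x) = (x, x^2)^T, \quad A(\theta_k) = -\frac{\theta_{k,1}^2}{4\theta_{k,2}} - \frac{1}{2}\ln(-2\theta_{k,2}).$$

The natural parameters $\theta$ are not the most common parameterization of the Gaussian distribution. However, the usual parameters $(\mu_k, \sigma_k^2)$ can be easily computed from the parameters $\theta_k$:

$$-\frac{1}{2\sigma_k^2} = \theta_{k,2} \iff \sigma_k^2 = -\frac{1}{2\theta_{k,2}}, \qquad \frac{\mu_k}{\sigma_k^2} = \theta_{k,1} \iff \mu_k = \sigma_k^2 \cdot \theta_{k,1}. \tag{35}$$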

The parameter space is $\Theta = \{(\theta_1, \theta_2) \in \mathbb{R}^2 \mid \theta_2 < 0\}$. In the following, we assume $\mu_1 = -a$ and $\mu_2 = a$. The parameters of the first and second source are thus $\theta_1 = \left(-\frac{a}{\sigma_1^2}, -\frac{1}{2\sigma_1^2}\right)^T$ and $\theta_2 = \left(\frac{a}{\sigma_2^2}, -\frac{1}{2\sigma_2^2}\right)^T$. As combination function, we choose the addition: $c(\Xi_1, \Xi_2) = \Xi_1 + \Xi_2$. We allow both single labels and the label set $\{1, 2\}$, i.e. $\mathbb{L} = \{\{1\}, \{2\}, \{1, 2\}\}$. The expected values of the observation $X$ conditioned on the label set are

$$\mathbb{E}_{X \sim P_1}[X] = -a, \qquad \mathbb{E}_{X \sim P_2}[X] = a, \qquad \mathbb{E}_{X \sim P_{12}}[X] = 0. \tag{36}$$

Since the convolution of two Gaussian distributions is again a Gaussian distribution, data with the multi-label set $\{1, 2\}$ is also distributed according to a Gaussian. We denote the parameters of this proxy-distribution by $\theta_{12} = \left(0, -\frac{1}{2(\sigma_1^2 + \sigma_2^2)}\right)^T$.

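Lemma 4 Assume a generative setting as described above. Denote the total number of data items by $N$ and the fraction of data items with label set $\mathcal{L}$ by $\pi_{\mathcal{L}}$. Furthermore, we define $w_{12} := \pi_2\sigma_1^2 + \pi_1\sigma_2^2$, $s_{12} := \sigma_1^2 + \sigma_2^2$, and $m_1 := \pi_2\sigma_1^2\sigma_{12}^2 + 2\pi_1\sigma_2^2 s_{12}$, $m_2 := \pi_1\sigma_2^2\sigma_{12}^2 + 2\pi_2\sigma_1^2 s_{12}$, where $\sigma_{12}^2 = \sigma_1^2 + \sigma_2^2$ is the variance of the proxy-distribution. The MSE in the estimator of the mean, averaged over all sources, for the inference methods $\mathcal{M}_{\text{ignore}}$, $\mathcal{M}_{\text{new}}$, $\mathcal{M}_{\text{cross}}$ and $\mathcal{M}_{\text{deconv}}$, is as follows:

$$\text{MSE}(\hat\mu^{\text{ignore}}, \mu) = \frac{1}{2}\left(\frac{\sigma_1^2}{\pi_1 N} + \frac{\sigma_2^2}{\pi_2 N}\right) \tag{37}$$

$$\text{MSE}(\hat\mu^{\text{new}}, \mu) = \frac{1}{3}\left(\frac{\sigma_1^2}{\pi_1 N} + \frac{\sigma_2^2}{\pi_2 N} + \frac{\sigma_{12}^2}{\pi_{12} N}\right) \tag{38}$$

$$\begin{aligned} \text{MSE}(\hat\mu^{\text{cross}}, \mu) ={} & \frac{1}{2}\pi_{12}^2\left(\frac{1}{(\pi_1 + \pi_{12})^2} + \frac{1}{(\pi_2 + \pi_{12})^2}\right) a^2 \\ & + \frac{1}{2}\pi_{12}\left(\frac{\pi_1}{(\pi_1 + \pi_{12})^3 N} + \frac{\pi_2}{(\pi_2 + \pi_{12})^3 N}\right) a^2 \\ & + \frac{1}{2}\left(\frac{\pi_1\sigma_1^2 + \pi_{12}\sigma_{12}^2}{(\pi_1 + \pi_{12})^2 N} + \frac{\pi_2\sigma_2^2 + \pi_{12}\sigma_{12}^2}{(\pi_2 + \pi_{12})^2 N}\right) \end{aligned} \tag{39}$$

$$\begin{aligned} \text{MSE}(\hat\mu^{\text{deconv}}, \mu) ={} & \frac{1}{2}\,\frac{\pi_{12}^2\sigma_2^2 w_{12} + \pi_{12}\pi_2 m_1 + \pi_1\pi_2^2 s_{12}^2}{(\pi_1\pi_2 s_{12} + \pi_{12} w_{12})^2 N}\,\sigma_1^2 \\ & + \frac{1}{2}\,\frac{\pi_{12}^2\sigma_1^2 w_{12} + \pi_{12}\pi_1 m_2 + \pi_1^2\pi_2 s_{12}^2}{(\pi_1\pi_2 s_{12} + \pi_{12} w_{12})^2 N}\,\sigma_2^2. \end{aligned} \tag{40}$$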

The proof mainly consists of lengthy calculations and is given in Sect. 1. We rely on the computer-algebra system Maple for parts of the calculations.
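To compare the four expressions numerically, Eqs. 37 to 40 can be evaluated directly. The sketch below is a plain transcription of the formulas into a hypothetical helper `lemma4_mse`; the example values (N = 1000, a = 3.5, label-set probabilities 0.4, 0.4, 0.2 and unit variances) are illustrative assumptions.

```python
def lemma4_mse(n, a, pi1, pi2, pi12, var1, var2):
    """Evaluate the MSE expressions of Eqs. 37-40 for the estimated means."""
    s12 = var1 + var2
    var12 = s12  # variance of the proxy distribution for label set {1, 2}
    w12 = pi2 * var1 + pi1 * var2
    m1 = pi2 * var1 * var12 + 2 * pi1 * var2 * s12
    m2 = pi1 * var2 * var12 + 2 * pi2 * var1 * s12

    ignore = 0.5 * (var1 / (pi1 * n) + var2 / (pi2 * n))
    new = (var1 / (pi1 * n) + var2 / (pi2 * n) + var12 / (pi12 * n)) / 3

    c1, c2 = pi1 + pi12, pi2 + pi12
    cross = (
        0.5 * pi12**2 * (1 / c1**2 + 1 / c2**2) * a**2            # squared bias
        + 0.5 * pi12 * (pi1 / (c1**3 * n) + pi2 / (c2**3 * n)) * a**2
        + 0.5 * ((pi1 * var1 + pi12 * var12) / (c1**2 * n)
                 + (pi2 * var2 + pi12 * var12) / (c2**2 * n))
    )

    denom = (pi1 * pi2 * s12 + pi12 * w12) ** 2 * n
    deconv = (
        0.5 * (pi12**2 * var2 * w12 + pi12 * pi2 * m1 + pi1 * pi2**2 * s12**2)
        / denom * var1
        + 0.5 * (pi12**2 * var1 * w12 + pi12 * pi1 * m2 + pi1**2 * pi2 * s12**2)
        / denom * var2
    )
    return {"ignore": ignore, "new": new, "cross": cross, "deconv": deconv}


mse = lemma4_mse(n=1000, a=3.5, pi1=0.4, pi2=0.4, pi12=0.2, var1=1.0, var2=1.0)
```

Only the first term of the cross-training expression is independent of $N$; it is the squared bias that remains as the training set grows.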

6.2 Experimental results

To verify the theoretical result, we apply the presented inference techniques to synthetic data, generated with $a = 3.5$ and unit variance: $\sigma_1 = \sigma_2 = 1$. The Bayes error, i.e. the error of the optimal generative classifier, in this setting is 9.59%. We use training data sets of different size and test sets of the same size as the maximal size of the training data sets. All experiments are repeated with 100 randomly sampled training and test data sets.

In Fig. 2, the average deviation of the estimated source centroids from the true centroids is plotted for different inference techniques and a varying number of training data, and compared with the values predicted from the asymptotic analysis. The theoretical predictions agree with the deviations measured in the experiments. Small differences are obtained for small training set sizes, as in this setting, both the law of large numbers and the central limit theorem, on which we rely in our analysis, are not fully applicable. As the number of data items increases, these deviations vanish.

Fig. 2 Deviation of parameter values from true values, plotted against the training set size N (15 to 4164), for (a) $\mathcal{M}_{\text{ignore}}$, (b) $\mathcal{M}_{\text{new}}$, (c) $\mathcal{M}_{\text{cross}}$ and (d) $\mathcal{M}_{\text{deconv}}$: the box plots indicate the values obtained in an experiment with 100 runs, the red line gives the RMS predicted by the asymptotic analysis. Note the difference in scale in Fig. 2c

$\mathcal{M}_{\text{cross}}$ has a clear bias, i.e. a deviation from the true parameter values which does not vanish as the number of data items grows to infinity. All other inference techniques are consistent, but differ in the convergence rate: $\mathcal{M}_{\text{deconv}}$ attains the fastest convergence, followed by $\mathcal{M}_{\text{ignore}}$. $\mathcal{M}_{\text{new}}$ has the slowest convergence of the analysed consistent inference techniques, as this method infers parameters of a separate class for the multi-label data. Due to the generative process, these data items have a higher variance, which entails a high variance of the respective estimator. Therefore, $\mathcal{M}_{\text{new}}$ has a higher average estimation error than $\mathcal{M}_{\text{ignore}}$.

The quality of the classification results obtained by different methods is reported in Fig. 3. The low precision value of $\mathcal{M}_{\text{deconv}}$ shows that this classification rule is more likely to assign a wrong label to a data item than the competing inference methods. Paying this price, on the other hand, $\mathcal{M}_{\text{deconv}}$ yields the highest recall values of all classification techniques analysed in this paper. On the other extreme, $\mathcal{M}_{\text{cross}}$ and $\mathcal{M}_{\text{ignore}}$ have a precision of 100%, but a very low recall of about 75%. Note that $\mathcal{M}_{\text{ignore}}$ only handles single-label data and is thus limited to attributing single labels. In the setting of these experiments, the single-label data items are very clearly separated. Confusions are thus very unlikely, which explains the very precise labels as well as the low recall rate. In terms of the F-score, defined as the harmonic mean of the precision and the recall, $\mathcal{M}_{\text{deconv}}$ yields the best results for all training set sizes, closely followed by $\mathcal{M}_{\text{new}}$. $\mathcal{M}_{\text{ignore}}$ and $\mathcal{M}_{\text{cross}}$ perform worse than $\mathcal{M}_{\text{deconv}}$ and $\mathcal{M}_{\text{new}}$.
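The relation between precision, recall and F-score can be illustrated with a short computation on predicted label sets. The sketch below pools label decisions over all data items (micro-averaging), which is an assumption: the averaging scheme behind the reported values is not specified in this section, and the function name is hypothetical.

```python
def precision_recall_f(true_sets, predicted_sets):
    """Precision, recall and F-score over pooled label decisions."""
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, predicted_sets):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # labels correctly assigned
        fp += len(pred - truth)   # labels wrongly assigned
        fn += len(truth - pred)   # labels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # The F-score is the harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


# A classifier that only assigns single labels misses part of each multi-label set.
precision, recall, f_score = precision_recall_f([{1}, {1, 2}], [{1}, {1}])
```

In this toy case every assigned label is correct, but one of the three true labels is missed, which is the pattern of perfect precision and reduced recall described above for methods limited to single labels.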

[Fig. 3 panels, each plotted against the training set size N: (a) Average Precision, (b) Average Recall, (c) Average F-Score, (d) Balanced Error Rate]

Fig. 3 Classification quality of different inference methods. 100 training and test data sets are generated from two sources with mean ±3.5 and standard deviation 1

Also for the BER, the deconvolutive model yields the best results, with Mnew reaching similar results. Both Mcross and Mignore incur significantly increased errors. In Mcross, this effect is caused by the biased estimators, while Mignore discards all training data with label set {1, 2} and can thus "not do anything with such data".

6.3 Influence of model mismatch

Deconvolutive training requires a more elaborate model design than the other methods presented here, as the combination function has to be specified as well, which poses an additional source of potential errors compared to e.g. Mnew.

To investigate the sensitivity of the classification results to model mismatch, we again generate Gaussian-distributed data from two sources with mean ±3.5 and unit variance, as in the previous section. However, the true combination function is now set to $c((\Xi_1, \Xi_2)^T, \{1, 2\}) = \Xi_1 + 1.5 \cdot \Xi_2$, but the model assumes a combination function as in the previous section, i.e. $c((\Xi_1, \Xi_2)^T, \{1, 2\}) = \Xi_1 + \Xi_2$. The probabilities of the individual label sets are $\pi_{\{1\}} = \pi_{\{2\}} = 0.4$ and $\pi_{\{1,2\}} = 0.2$. The classification results for this setting are displayed in Fig. 4. For the quality measures precision and recall, Mnew and Mdeconv are quite similar in this example. For the more comprehensive quality measures F-score and BER, we observe that Mdeconv is advantageous for small training data sets. Hence, the deconvolutive approach is beneficial for small training data sets even when the combination function is not correctly modelled. With more training data, Mnew catches up

[Fig. 4 panels, each plotted against the training set size N, with curves for Mignore, Mnew, Mcross and Mdeconv: (a) Average Precision, (b) Average Recall, (c) Average F-Score, (d) Balanced Error Rate]

Fig. 4 Classification quality of different inference methods, with a deviation between the true and the assumed combination function for the label set {1, 2}. Data is generated from two sources with mean ±3.5 and standard deviation 1. The experiment is run with 100 pairs of training and test data

and then outperforms Mdeconv. The explanation for this behavior lies in the bias-variance decomposition of the estimation error for the model parameters (Eq. 2): Mnew uses more source distributions (and hence more parameters) to estimate the data distribution, but does not rely on assumptions on the combination function. Mdeconv, on the contrary, is more thrifty with parameters, but relies on assumptions on the combination function. In a setting with little training data, the variance dominates the accuracy of the parameter estimators, and Mdeconv will therefore yield more precise parameter estimators and superior classification results. As the amount of training data increases, the variance of the estimators decreases, and the (potential) bias dominates the parameter estimation error. With a misspecified model, Mdeconv yields poorer results than Mnew in this setting.
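The source of the bias can be illustrated with a minimal simulation sketch of the multi-label observations alone. It assumes the setting described above (sources with means −3.5 and 3.5 and unit variance); the helper names, sample size and seed are hypothetical and not taken from the experiments. Under the true combination function the observations with label set {1, 2} have mean 1.75 and variance 3.25, whereas the assumed additive model implies mean 0 and variance 2, so no amount of data removes the gap.

```python
import random


def simulate_multilabel(weight2, n=100_000, a=3.5, seed=0):
    """Draw observations Xi_1 + weight2 * Xi_2 for label set {1, 2}."""
    rng = random.Random(seed)
    return [rng.gauss(-a, 1.0) + weight2 * rng.gauss(a, 1.0) for _ in range(n)]


def mean_and_variance(values):
    """Empirical mean and (biased) empirical variance."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v


# True combination function: Xi_1 + 1.5 * Xi_2.
mean_true, var_true = mean_and_variance(simulate_multilabel(1.5))
# Combination function assumed by the model: Xi_1 + Xi_2.
mean_assumed, var_assumed = mean_and_variance(simulate_multilabel(1.0))
```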

7 Disjunction of Bernoulli-distributed emissions

We consider the Bernoulli distribution as an example of a discrete distribution in the exponential family with emissions in $\mathbb{B} := \{0, 1\}$. The Bernoulli distribution has one parameter $\beta$, which describes the probability of a 1.

7.1 Theoretical investigation

The Bernoulli distribution is a member of the exponential family with the following parameterization: $\theta_k = \log\left(\frac{\beta_k}{1-\beta_k}\right)$, $\phi(\Xi_k) = \Xi_k$, and $A(\theta_k) = -\log\left(1 - \frac{\exp\theta_k}{1+\exp\theta_k}\right)$. As combination function, we consider the Boolean OR, which yields a 1 if at least one of the two inputs is 1, and 0 otherwise. Thus, we have

\[ P(X = 1 \mid \mathcal{L} = \{1, 2\}) = \beta_1 + \beta_2 - \beta_1\beta_2 =: \beta_{12} \tag{41} \]

Note that $\beta_{12} \ge \max\{\beta_1, \beta_2\}$: When combining the emissions of two Bernoulli distributions with a Boolean OR, the probability of a one is at least as large as the probability that one of the sources emitted a one. Equality $\beta_{12} = \beta_1$ holds if and only if either the partner source never emits a one, i.e. $\beta_2 = 0$, or source 1 always emits a one, i.e. $\beta_1 = 1$. The conditional probability distributions are as follows:

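\[ P(\Xi \mid (X, \{1\}), \theta) = 1_{\{\Xi^{(1)} = X\}} \cdot \mathrm{Ber}(\Xi^{(2)} \mid \theta^{(2)}) \tag{42} \]

\[ P(\Xi \mid (X, \{2\}), \theta) = \mathrm{Ber}(\Xi^{(1)} \mid \theta^{(1)}) \cdot 1_{\{\Xi^{(2)} = X\}} \tag{43} \]

\[ P(\Xi \mid (0, \{1, 2\}), \theta) = 1_{\{\Xi^{(1)} = 0\}} \cdot 1_{\{\Xi^{(2)} = 0\}} \tag{44} \]

\[ P(\Xi \mid (1, \{1, 2\}), \theta) = \frac{P(\Xi, X = 1 \mid \mathcal{L} = \{1, 2\}, \theta)}{P(X = 1 \mid \mathcal{L} = \{1, 2\}, \theta)} \tag{45} \]

In particular, the joint distribution of the emission vector $\Xi$ and the observation X is as follows:

\[ P(\Xi = (\xi_1, \xi_2)^T, X = (\xi_1 \vee \xi_2) \mid \mathcal{L} = \{1, 2\}, \theta) = (1 - \beta_1)^{1-\xi_1} (1 - \beta_2)^{1-\xi_2} \beta_1^{\xi_1} \beta_2^{\xi_2} \]

All other combinations of $\Xi$ and X have probability 0.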
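Eq. 41 can be checked directly against the joint distribution of the emissions by enumerating all four emission pairs. The following is a minimal sketch; the function names are hypothetical, and the parameter values in the usage line are the ones used later in the experimental section.

```python
from itertools import product


def or_probability(beta1: float, beta2: float) -> float:
    """Closed form of Eq. 41: probability that the Boolean OR equals 1."""
    return beta1 + beta2 - beta1 * beta2


def or_probability_by_enumeration(beta1: float, beta2: float) -> float:
    """Sum the joint emission probabilities over all pairs whose OR is 1."""
    total = 0.0
    for xi1, xi2 in product((0, 1), repeat=2):
        p = (beta1 if xi1 else 1.0 - beta1) * (beta2 if xi2 else 1.0 - beta2)
        if xi1 or xi2:
            total += p
    return total


beta12 = or_probability(0.4, 0.2)
```

The enumeration also makes the bound in the text visible: the only pair excluded from the sum is (0, 0), so the result can never fall below either source parameter.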

Lemma 5 Consider the generative setting described above, with N data items in total. Denote the fraction of data items with label set $\mathcal{L}$ by $\pi_{\mathcal{L}}$. Furthermore, define $v_1 := \beta_1(1 - \beta_1)$, $v_2 := \beta_2(1 - \beta_2)$, $v_{12} := \beta_{12}(1 - \beta_{12})$, $w_1 := \beta_1(1 - \beta_2)$, $w_2 := \beta_2(1 - \beta_1)$ and

V1 = -—-2W2 (1 - n12W2) V2 = -—-2W1 (1 - n12W1). (46)

(n1 + n 12)2 (n2 + n 12)2

The MSE in the estimator of the parameter, averaged over all sources, for the inference methods Mignore, Mnew, Mcross and Mdeconv is as follows:

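\[ \mathrm{MSE}(\hat\theta^{new}, \theta) = \frac{1}{3} \left( \frac{\beta_1(1 - \beta_1)}{\pi_1 N} + \frac{\beta_2(1 - \beta_2)}{\pi_2 N} + \frac{\beta_{12}(1 - \beta_{12})}{\pi_{12} N} \right) \tag{47} \]

\[ \mathrm{MSE}(\hat\beta^{ignore}, \beta) = \frac{1}{2} \left( \frac{\beta_1(1 - \beta_1)}{\pi_1 N} + \frac{\beta_2(1 - \beta_2)}{\pi_2 N} \right) \tag{48} \]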

MSE(ß—, ß) = 1 (+ 2 (

2\ n 1 + ni2 / 2\ n2 + ni2 )

1 VL( n22 (ßl - ßl2)2 nivi + ni2 V12 \ '1N ^2 V (ni + ni2)3 (m + ni2)2 )

11 v2 / ni22 (ß2 - ß12)2 , n2 V2 + n 12 V12 \

vj( _ + _

2 nN V2 y (n2 + ni2)3 + (n2 + ni2)2 )

MSE0deconV, 0) = U--n2P12 + n12W2-V1

2 n 1N n12(n1W2 + n2W1) + ^1^2^12

1 1 n1P12 + n12W1

+----v2 (50)

2 n2N n12(n1W2 + n2W1) + ^1^2P12

The proof of this lemma involves lengthy calculations that we partially perform in Maple. Details are given in Section A.3 of (Streich 2010).
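The two simplest expressions of the lemma, Eqs. 47 and 48, are averages of Bernoulli variances divided by the number of data items available per source. The sketch below evaluates them; the function names are hypothetical, and the label-set fractions 0.4, 0.4 and 0.2 in the usage lines are borrowed from the Gaussian experiment purely for illustration, since the fractions of the Bernoulli experiment are not stated here.

```python
def beta_or(beta1: float, beta2: float) -> float:
    """Parameter of the proxy distribution for label set {1, 2} (Eq. 41)."""
    return beta1 + beta2 - beta1 * beta2


def mse_new(beta1, beta2, pi1, pi2, pi12, n):
    """Eq. 47: average over the three sources modelled by Mnew."""
    beta12 = beta_or(beta1, beta2)
    terms = (
        beta1 * (1.0 - beta1) / (pi1 * n),
        beta2 * (1.0 - beta2) / (pi2 * n),
        beta12 * (1.0 - beta12) / (pi12 * n),
    )
    return sum(terms) / 3.0


def mse_ignore(beta1, beta2, pi1, pi2, n):
    """Eq. 48: average over the two single-label sources used by Mignore."""
    terms = (
        beta1 * (1.0 - beta1) / (pi1 * n),
        beta2 * (1.0 - beta2) / (pi2 * n),
    )
    return sum(terms) / 2.0


# Illustrative values only: beta_1 = 0.4, beta_2 = 0.2, N = 1000.
example_new = mse_new(0.4, 0.2, 0.4, 0.4, 0.2, 1000)
example_ignore = mse_ignore(0.4, 0.2, 0.4, 0.4, 1000)
```

With these illustrative values the rarely observed label set {1, 2} contributes the largest term to Eq. 47, which is why the average error of Mnew exceeds that of Mignore.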

[Fig. 5 panels, each plotted against the training set size N (from 15 to 9369): (a) Estimator Accuracy for Mignore, (b) Estimator Accuracy for Mnew, (c) Estimator Accuracy for Mcross, (d) Estimator Accuracy for Mdeconv]

Fig. 5 Deviation of parameter values from true values: the box plots indicate the values obtained in an experiment with 100 runs, the red line gives the RMS predicted by the asymptotic analysis

7.2 Experimental results

To evaluate the estimators obtained by the different inference methods, we use a setting with $\beta_1 = 0.40 \cdot 1_{10 \times 1}$ and $\beta_2 = 0.20 \cdot 1_{10 \times 1}$, where $1_{10 \times 1}$ denotes a 10-dimensional vector of ones. Each dimension is treated independently, and all results reported here are averages and standard deviations over 100 independent training and test samples.

The RMS of the estimators obtained by different inference techniques are depicted in Fig. 5. We observe that the asymptotic values predicted by theory are in good agreement with the deviations measured in the experiments, thus confirming the theoretical results. Mcross yields clearly biased estimators, while Mdeconv yields the most accurate parameters.

Recall that the parameter describing the proxy distribution of data items from the label set {1, 2} is defined as $\beta_{12} = \beta_1 + \beta_2 - \beta_1\beta_2$ (Eq. 41) and is thus at least as large as either of $\beta_1$ and $\beta_2$. While the expectation of the Bernoulli distribution is thus increasing, the variance $\beta_{12}(1 - \beta_{12})$ of the proxy distribution is smaller than the variance of the base distributions. To study the influence of this effect on the estimator precision, we compare the RMS of the source estimators obtained by Mdeconv and Mnew, illustrated in Fig. 6: the method Mdeconv is most advantageous if at least one of $\beta_1$ or $\beta_2$ is small. In this case, the variance of the proxy distribution is approximately the sum of the variances of the base distributions. As the parameters $\beta$ of the base distributions increase, the advantage of Mdeconv in comparison to Mnew decreases. If $\beta_1$ or $\beta_2$ is high, the variance of the proxy distribution is smaller than

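[Fig. 6 panels, with axis ticks from 0.1 to 0.9: (a) $\mathrm{RMS}(\hat\beta^{new}, \beta)$, (b) $\mathrm{RMS}(\hat\beta^{deconv}, \beta)$, (c) $\mathrm{RMS}(\hat\beta^{deconv}, \beta) - \mathrm{RMS}(\hat\beta^{new}, \beta)$, (d) $(\mathrm{RMS}(\hat\beta^{deconv}, \beta) - \mathrm{RMS}(\hat\beta^{new}, \beta)) / \mathrm{RMS}(\hat\beta^{new}, \beta)$]

Fig. 6 Comparison of the estimation accuracy for $\beta$ for the two methods Mnew and Mdeconv for different values of $\beta_1$ and $\beta_2$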

the variance of any of the base distributions, and Mnew yields on average more accurate estimators than Mdeconv.
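The two regimes described above can be checked numerically from the Bernoulli variance formula. This is a minimal sketch; the function names and the parameter values 0.05 and 0.8 are chosen here for illustration and do not come from Fig. 6.

```python
def base_variance(beta: float) -> float:
    """Variance of a Bernoulli source with parameter beta."""
    return beta * (1.0 - beta)


def proxy_variance(beta1: float, beta2: float) -> float:
    """Variance of the proxy distribution for label set {1, 2} under Boolean OR."""
    beta12 = beta1 + beta2 - beta1 * beta2
    return beta12 * (1.0 - beta12)


# Small parameters: the proxy variance is close to the sum of the base variances.
small_proxy = proxy_variance(0.05, 0.05)
small_sum = base_variance(0.05) + base_variance(0.05)

# Large parameters: the proxy variance falls below each base variance.
large_proxy = proxy_variance(0.8, 0.8)
large_base = base_variance(0.8)
```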

8 Conclusion

In this paper, we develop a general framework to describe inference techniques for multi-label data. Based on this generative model, we derive an inference method which respects the assumed semantics of multi-label data. The generality of the framework also enables us to formally characterize previously presented inference algorithms for multi-label data.

To theoretically assess different inference methods, we derive the asymptotic distribution of estimators obtained on multi-label data and thus confirm experimental results on synthetic data. Additionally, we prove that cross training yields inconsistent parameter estimators.

As we show in several experiments, the differences in estimator accuracy directly translate into significantly different classification performances for the considered classification techniques.

In our experiments, we have observed that the size of the quality differences between the considered classification methods depends largely on the quality criterion used to assess a classification result. A theoretical analysis of the performance of classification techniques with respect to different quality criteria will be an interesting continuation of this work.

Acknowledgments We appreciate valuable discussions with Cheng Soon Ong. This work was in part funded by CTI grant Nr. 8539.2;2 EPSS-ES.

Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

Appendix 1: Asymptotic distribution of estimators

This section contains the proofs of the lemmas describing the asymptotic distribution of estimators obtained by the inference methods Mignore, Mnew and Mcross in Sect. 5.

Proof Lemma 1 Mignore reduces the estimation problem to the standard single-label classification problem for K independent sources. The results of single-label asymptotic analysis are directly applicable: the estimators $\hat\theta^{ignore}$ are consistent and converge to $\theta^G$.

As only single-label data is used in the estimation process, the estimators for different sources are independent and the asymptotic covariance matrix is block-diagonal, as stated in Lemma 1. The diagonal elements are given by Eq. 25, which yields the given expression. □

Proof Lemma 2 Mnew reduces the estimation problem to the standard single-label classification problem for $L := |\mathbb{L}|$ independent sources. The results of standard asymptotic analysis (Sect. 3.4) are therefore directly applicable: The parameter estimators $\hat\theta^{new}$ for all single-label sources (including the proxy-distributions) are consistent with the true parameter values $\theta^G$ and asymptotically normally distributed, as stated in the lemma.

The covariance matrix of the estimators is block-diagonal as the parameters are estimated independently for each source. Using Eq. 25, we obtain the values for the diagonal elements as given in the lemma. □

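Proof Lemma 3 The parameters $\theta_k$ of source k are estimated independently for each source. Combining Eqs. 17 and 32, the condition for $\theta_k$ is

\[ \psi_N^{cross}(\theta_k) := \sum_{D \in \mathcal{D}} \psi_{\theta_k}^{cross}(D) = 0 . \]

$\psi_{\theta_k}^{cross}(D) = 0$ in the case $k \notin \mathcal{L}$ thus implies that D has no influence on the parameter estimation. For simpler notation, we define the set of all label sets which contain k as $\mathbb{L}_k$, formally $\mathbb{L}_k := \{\mathcal{L} \in \mathbb{L} \mid k \in \mathcal{L}\}$. The asymptotic criterion function for $\theta_k$ is then given by

\[ \psi^{cross}(\theta_k) = \mathbb{E}_{D \sim P_{\theta^G}}\left[ \mathbb{E}_{\Xi_k \sim P^{cross}_{D,\theta_k}}[\phi(\Xi_k)] \right] - \mathbb{E}_{\Xi_k \sim P_{\theta_k}}[\phi(\Xi_k)] \]

\[ = \sum_{\mathcal{L} \in \mathbb{L}_k} \pi_{\mathcal{L}} \, \mathbb{E}_{X \sim P_{\mathcal{L},\theta^G}}[\phi(X)] + \sum_{\mathcal{L} \notin \mathbb{L}_k} \pi_{\mathcal{L}} \, \mathbb{E}_{\Xi_k \sim P_{\theta_k}}[\phi(\Xi_k)] - \mathbb{E}_{\Xi_k \sim P_{\theta_k}}[\phi(\Xi_k)] \]

Setting $\psi^{cross}(\theta_k) = 0$ yields

\[ \mathbb{E}_{X \sim P_{\hat\theta_k^{cross}}}[\phi(X)] = \frac{1}{1 - \sum_{\mathcal{L} \notin \mathbb{L}_k} \pi_{\mathcal{L}}} \sum_{\mathcal{L} \in \mathbb{L}_k} \pi_{\mathcal{L}} \, \mathbb{E}_{X \sim P_{\mathcal{L},\theta^G}}[\phi(X)] . \tag{51} \]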

The mismatch of $\hat\theta_k^{cross}$ thus grows as the fraction of multi-label data grows. Furthermore, the mismatch depends on the dissimilarity of the sufficient statistics of the partner labels from the sufficient statistics of source k. □

Appendix 2: Lemma 4

Proof Lemma 4 This proof consists mainly of computing summary statistics.

Ignore training (Mignore)

Mean value of the mean estimator As derived in the general description of the method in Sect. 5.1, the ignore training yields consistent estimators for the single-label source distributions: $\hat\theta_{1,1} \to -\frac{a}{\sigma_1^2}$ and $\hat\theta_{2,1} \to \frac{a}{\sigma_2^2}$.

Variance of the mean estimator Recall that we assume to have $\pi_{\mathcal{L}} N$ observations with label set $\mathcal{L}$, and the variance of the source emissions is assumed to be $\mathbb{V}_{\Xi \sim P_k}[\phi(\Xi)] = \sigma_k^2$. The variance of the estimator for the single-label source means based on a training set of size N is thus $\mathbb{V}[\hat\mu_k] = \sigma_k^2 / (\pi_k N)$.

Mean-squared error of the estimator With the above, the MSE, averaged over the two sources, is given by

\[ \mathrm{MSE}(\hat\theta^{ignore}, \theta) = \frac{1}{2} \left( \frac{\sigma_1^2}{\pi_1 N} + \frac{\sigma_2^2}{\pi_2 N} \right) . \]

Since the estimators obtained by Mignore are consistent, the MSE only depends on the variance of the estimator.

New source training (Mnew)

Mean value of the estimator The training is based on single-label data items and therefore yields consistent estimators (Theorem 2). Note that this method uses three sources to model the generative process in the given example: $\hat\theta_{1,1} \to -\frac{a}{\sigma_1^2}$, $\hat\theta_{2,1} \to \frac{a}{\sigma_2^2}$, $\hat\theta_{12,1} \to 0$.

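Variance of the mean estimator The variance is given in Lemma 2 and takes the following values in our setting:

\[ \mathbb{V}[\hat\mu_1] = \frac{\sigma_1^2}{\pi_1 N} \qquad \mathbb{V}[\hat\mu_2] = \frac{\sigma_2^2}{\pi_2 N} \qquad \mathbb{V}[\hat\mu_{12}] = \frac{\sigma_{12}^2}{\pi_{12} N} = \frac{\sigma_1^2 + \sigma_2^2}{\pi_{12} N} \]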

Since the observations with label set L = {1, 2} have a higher variance than single-label observations, the estimator A12 also has a higher variance than the estimators for single sources.

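Mean-squared error of the estimator Given the above, the MSE is given by

\[ \mathrm{MSE}(\hat\theta^{new}, \theta) = \frac{1}{3} \left( \frac{\sigma_1^2}{\pi_1 N} + \frac{\sigma_2^2}{\pi_2 N} + \frac{\sigma_{12}^2}{\pi_{12} N} \right) . \]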

Cross-training (Mcross)

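As described in Eq. 30, the probability distributions of the source emissions given the observations are assumed to be mutually independent by Mcross. The criterion function $\psi_\theta^{cross}(D)$ is given in Eq. 32. The parameter $\theta_k$ is chosen according to Eq. 51:

\[ \mathbb{E}_{X \sim P_{\hat\theta_k^{cross}}}[X] = \frac{1}{1 - \sum_{\mathcal{L} \notin \mathbb{L}_k} \pi_{\mathcal{L}}} \sum_{\mathcal{L} \in \mathbb{L}_k} \pi_{\mathcal{L}} \, \mathbb{E}_{X \sim P_{\mathcal{L},\theta^G}}[X] \]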

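Mean value of the mean estimator With the conditional expectations of the observations given the labels (see Eq. 36), we have for the mean estimate of source 1:

\[ \hat\mu_1 = \mathbb{E}_{X \sim P_{\hat\theta_1^{cross}}}[X] = \frac{1}{\pi_1 + \pi_{12}} \left( \pi_1 \mathbb{E}_{X \sim P_{\{1\},\theta^G}}[X] + \pi_{12} \mathbb{E}_{X \sim P_{\{1,2\},\theta^G}}[X] \right) = -\frac{\pi_1 a}{\pi_1 + \pi_{12}} = -\frac{a}{1 + \frac{\pi_{12}}{\pi_1}} \]

and similarly

\[ \hat\mu_2 = \frac{\pi_2 a}{\pi_2 + \pi_{12}} = \frac{a}{1 + \frac{\pi_{12}}{\pi_2}} . \]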

The deviation from the true value increases with the ratio of multi-label data items compared to the number of single-label data items from the corresponding source.
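The limit of the cross-training mean estimate can be reproduced with a short simulation sketch. It assumes the setting of Sect. 6 (a = 3.5, unit variances, label-set fractions 0.4, 0.4 and 0.2); the function names, sample size and seed are hypothetical. Cross-training pools all observations whose label set contains source 1 and averages them.

```python
import random


def cross_training_mean(n_items, a=3.5, pi1=0.4, pi2=0.4, seed=0):
    """Pooled-mean estimate for source 1 under cross-training.

    Label sets are drawn with probabilities pi1 ({1}), pi2 ({2}) and
    1 - pi1 - pi2 ({1, 2}); both sources have unit variance.
    """
    rng = random.Random(seed)
    pooled = []  # observations whose label set contains source 1
    for _ in range(n_items):
        u = rng.random()
        xi1 = rng.gauss(-a, 1.0)
        xi2 = rng.gauss(a, 1.0)
        if u < pi1:
            pooled.append(xi1)          # label set {1}
        elif u < pi1 + pi2:
            continue                    # label set {2}: not used for source 1
        else:
            pooled.append(xi1 + xi2)    # label set {1, 2}: sum of emissions
    return sum(pooled) / len(pooled)


def predicted_cross_mean(a=3.5, pi1=0.4, pi12=0.2):
    """Limit of the estimate: -pi1 * a / (pi1 + pi12)."""
    return -pi1 * a / (pi1 + pi12)


estimate = cross_training_mean(200_000)
```

In this setting the estimate settles near −2.33 rather than the true mean −3.5, and enlarging the sample does not close the gap.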

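Mean value of the standard deviation estimator According to the principle of maximum likelihood, the estimator for the source variance $\sigma_k^2$ is the empirical variance of all data items which contain k in their label sets:

\[ \hat\sigma_1^2 = \frac{1}{|D_1 \cup D_{12}|} \sum_{x \in (D_1 \cup D_{12})} (x - \hat\mu_1)^2 = \frac{1}{N(\pi_1 + \pi_{12})} \left( \sum_{x \in D_1} (x - \hat\mu_1)^2 + \sum_{x \in D_{12}} (x - \hat\mu_1)^2 \right) \]

\[ = \frac{\pi_1\pi_{12}}{(\pi_1 + \pi_{12})^2} a^2 + \frac{\pi_1\sigma_{G,1}^2 + \pi_{12}\sigma_{G,12}^2}{\pi_1 + \pi_{12}} \tag{52} \]

and similarly

\[ \hat\sigma_2^2 = \frac{\pi_2\pi_{12}}{(\pi_2 + \pi_{12})^2} a^2 + \frac{\pi_2\sigma_{G,2}^2 + \pi_{12}\sigma_{G,12}^2}{\pi_2 + \pi_{12}} . \tag{53} \]

The variance of the source emissions under the assumptions of method Mcross is given by $\mathbb{V}_{\Xi \sim P_{\hat\theta}}[\phi(\Xi)] = \mathrm{diag}(\hat\sigma_1^2, \hat\sigma_2^2)$.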

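Variance of the mean estimator We use the decomposition derived in Sect. 4.6 to determine the variance. Using the expected values of the sufficient statistics conditioned on the label sets and the variances thereof, as given in Table 2, we have

\[ \mathbb{E}_{\mathcal{L}}\left[ \mathbb{V}_{X \sim P_{\mathcal{L},\theta^G}}\left[ \mathbb{E}_{\Xi \sim P^{cross}_{(X,\mathcal{L}),\hat\theta}}[\phi(\Xi)] \right] \right] = \begin{pmatrix} \pi_1\sigma_1^2 + \pi_{12}\sigma_{12}^2 & \pi_{12}\sigma_{12}^2 \\ \pi_{12}\sigma_{12}^2 & \pi_2\sigma_2^2 + \pi_{12}\sigma_{12}^2 \end{pmatrix} . \]

Furthermore, the expected value of the sufficient statistics over all data items is

\[ \mathbb{E}_{D \sim P_{\theta^G}}\left[ \mathbb{E}_{\Xi \sim P^{cross}_{D,\hat\theta}}[\phi(\Xi)] \right] = \begin{pmatrix} -\pi_1 a + \pi_2\hat\mu_1 \\ \pi_1\hat\mu_2 + \pi_2 a \end{pmatrix} . \]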

Table 2 Quantities used to determine the asymptotic behavior of parameter estimators obtained by Mcross for a Gaussian distribution

Quantity

l = {1}

l = {2}

L = {1, 2}

(X,L), 6

.[0(h)]

EX - p

L,6 G PL,6 G

Ev^pcross [0(H)] (XX),6

lHr»jpcross [0(h)] (Xx), 6

—a \

ß2 J 0 oy

ß1 a '0 0

0 0 \ ,0 -12J

.2 * ' 12 u 12

Ec-pn [(ex-p£0g [eh -fcross,, [0(h)] — Ef0g [es-f™ [0(h)]] ])®

(^1^12 a2__^1^12^2 a2 \

__^1^12^2 a2 ^2^12 a2 I

(ni+ni2)(n2 + ni2) ^2 + ^12 /

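The variance of the sufficient statistics of the emissions of single sources and the Fisher information matrices for each label set are thus given by

\[ \mathbb{V}_{\Xi \sim P^{cross}_{(X,\{1\}),\hat\theta}}[\phi(\Xi)] = \begin{pmatrix} 0 & 0 \\ 0 & \hat\sigma_2^2 \end{pmatrix} \qquad I_{\{1\}} = -\begin{pmatrix} \hat\sigma_1^2 & 0 \\ 0 & 0 \end{pmatrix} \]

\[ \mathbb{V}_{\Xi \sim P^{cross}_{(X,\{2\}),\hat\theta}}[\phi(\Xi)] = \begin{pmatrix} \hat\sigma_1^2 & 0 \\ 0 & 0 \end{pmatrix} \qquad I_{\{2\}} = -\begin{pmatrix} 0 & 0 \\ 0 & \hat\sigma_2^2 \end{pmatrix} \]

\[ \mathbb{V}_{\Xi \sim P^{cross}_{(X,\{1,2\}),\hat\theta}}[\phi(\Xi)] = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix} \qquad I_{\{1,2\}} = -\begin{pmatrix} \hat\sigma_1^2 & 0 \\ 0 & \hat\sigma_2^2 \end{pmatrix} \]

The expected value of the Fisher information matrices over all label sets is

\[ \mathbb{E}_{\mathcal{L} \sim P_\pi}[I_{\mathcal{L}}] = -\mathrm{diag}\left( (\pi_1 + \pi_{12})\hat\sigma_1^2, \; (\pi_2 + \pi_{12})\hat\sigma_2^2 \right) \]

where the values of $\hat\sigma_1^2$ and $\hat\sigma_2^2$ are given in Eqs. 52 and 53. Putting everything together, the covariance matrix of the estimator $\hat\theta^{cross}$ is given by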

across , -17,11 v0,12

6 ' 12

/ ve,11 12 \ V V0,12 ve,22 )

with diagonal elements

n1 + n12 n2 + n12

11 = _ _ o . _ _2 . _ -2 v0,22 =

n1n12a2 + n oY + пx2ax22 n2^2a2 + n20{ + ^2 "12

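To get the variance of the mean estimator, recall Eq. 35. The covariance matrix for the mean estimator is

\[ \Sigma_\mu^{cross} = \begin{pmatrix} v_{\mu,11} & v_{\mu,12} \\ v_{\mu,12} & v_{\mu,22} \end{pmatrix}, \quad \text{with} \quad v_{\mu,11} = \frac{1}{\pi_1 + \pi_{12}} \left( \frac{\pi_1\pi_{12}}{(\pi_1 + \pi_{12})^2} a^2 + \frac{\pi_1\sigma_1^2 + \pi_{12}\sigma_{12}^2}{\pi_1 + \pi_{12}} \right) \]

\[ v_{\mu,22} = \frac{1}{\pi_2 + \pi_{12}} \left( \frac{\pi_2\pi_{12}}{(\pi_2 + \pi_{12})^2} a^2 + \frac{\pi_2\sigma_2^2 + \pi_{12}\sigma_{12}^2}{\pi_2 + \pi_{12}} \right) . \]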

The first term in the brackets gives the variance of the means of the two true sources involved in generating the samples used to estimate the mean of the particular source. The second term is the average variance of the sources.

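Mean-squared error of the mean estimator Finally, the mean squared error is given by:

\[ \mathrm{MSE}(\hat\mu^{cross}, \mu) = \frac{1}{2} \pi_{12}^2 \left( \frac{1}{(\pi_1 + \pi_{12})^2} + \frac{1}{(\pi_2 + \pi_{12})^2} \right) a^2 \]

\[ + \frac{1}{2} \pi_{12} \left( \frac{1}{(\pi_1 + \pi_{12})N} \frac{\pi_1}{(\pi_1 + \pi_{12})^2} + \frac{1}{(\pi_2 + \pi_{12})N} \frac{\pi_2}{(\pi_2 + \pi_{12})^2} \right) a^2 \]

\[ + \frac{1}{2} \left( \frac{1}{(\pi_1 + \pi_{12})N} \frac{\pi_1\sigma_1^2 + \pi_{12}\sigma_{12}^2}{\pi_1 + \pi_{12}} + \frac{1}{(\pi_2 + \pi_{12})N} \frac{\pi_2\sigma_2^2 + \pi_{12}\sigma_{12}^2}{\pi_2 + \pi_{12}} \right) \]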

This expression describes the three effects contributing to the estimation error of Mcross:

- The first line indicates the inconsistency of the estimator. This term grows with the mean of the true sources (a and -a, respectively) and with the ratio of multi-label data items. Note that this term is independent of the number of data items.

- The second line measures the variance of the observation x given the label set L, averaged over all label sets and all sources. This term thus describes the excess variance of the estimator due to the inconsistency in the estimation procedure.

- The third line is the weighted average of the variance of the individual sources, as it is also found for consistent estimators.

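The second and third lines describe the variance of the observations according to the law of total variance:

\[ \mathbb{V}_X[X] = \underbrace{\mathbb{V}_{\mathcal{L}}\left[ \mathbb{E}_X[X \mid \mathcal{L}] \right]}_{\text{second line}} + \underbrace{\mathbb{E}_{\mathcal{L}}\left[ \mathbb{V}_X[X \mid \mathcal{L}] \right]}_{\text{third line}} \]

Note that $(\pi_1 + \pi_{12})N$ and $(\pi_2 + \pi_{12})N$ are the numbers of data items used to infer the parameters of source 1 and 2, respectively.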
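The three contributions to the estimation error of Mcross can be evaluated separately. The sketch below writes out the bias term and the two variance terms of the MSE expression; the function name is hypothetical, and the default values (a = 3.5, unit source variances, fractions 0.4, 0.4 and 0.2) are those of the experiments in Sect. 6.

```python
def cross_mse_terms(n, a=3.5, pi1=0.4, pi2=0.4, pi12=0.2, var1=1.0, var2=1.0):
    """Return (squared bias, between-label-set variance, within-label-set variance)."""
    var12 = var1 + var2                 # variance of the sum of both emissions
    f1, f2 = pi1 + pi12, pi2 + pi12     # fractions of data used per source
    bias_sq = 0.5 * pi12 ** 2 * (1.0 / f1 ** 2 + 1.0 / f2 ** 2) * a ** 2
    between = 0.5 * pi12 * (pi1 / (f1 ** 3 * n) + pi2 / (f2 ** 3 * n)) * a ** 2
    within = 0.5 * (
        (pi1 * var1 + pi12 * var12) / (f1 ** 2 * n)
        + (pi2 * var2 + pi12 * var12) / (f2 ** 2 * n)
    )
    return bias_sq, between, within


small_sample = cross_mse_terms(10)
large_sample = cross_mse_terms(10_000)
```

Only the two variance terms shrink as N grows; the squared bias stays fixed, which is the inconsistency described in the first bullet above.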

Deconvolutive training (Mdeconv)

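Mean value of the mean estimator The conditional expectations of the sufficient statistics of the single-label data are:

\[ \mathbb{E}_{\Xi \sim P^{deconv}_{(X,\{1\}),\hat\theta}}[\phi_1(\Xi)] = \begin{pmatrix} X \\ \hat\mu_2 \end{pmatrix} \qquad \mathbb{E}_{\Xi \sim P^{deconv}_{(X,\{2\}),\hat\theta}}[\phi_1(\Xi)] = \begin{pmatrix} \hat\mu_1 \\ X \end{pmatrix} \tag{54} \]

Observations X with label set $\mathcal{L} = \{1, 2\}$ are interpreted as the sum of the emissions from the two sources. Therefore, there is no unique expression for the conditional expectation of the source emissions given the data item $D = (X, \mathcal{L})$; the two extreme cases are

\[ \mathbb{E}_{\Xi \sim P^{deconv}_{(X,\{1,2\}),\hat\theta}}[\phi_1(\Xi)] = \begin{pmatrix} \hat\mu_1 \\ X - \hat\mu_1 \end{pmatrix} \quad \text{and} \quad \mathbb{E}_{\Xi \sim P^{deconv}_{(X,\{1,2\}),\hat\theta}}[\phi_1(\Xi)] = \begin{pmatrix} X - \hat\mu_2 \\ \hat\mu_2 \end{pmatrix} . \]

We use a parameter $\lambda \in [0, 1]$ to parameterize the blending between these two extremes:

\[ \mathbb{E}_{\Xi \sim P^{deconv}_{(X,\{1,2\}),\hat\theta}}[\phi_1(\Xi)] = \lambda \begin{pmatrix} \hat\mu_1 \\ X - \hat\mu_1 \end{pmatrix} + (1 - \lambda) \begin{pmatrix} X - \hat\mu_2 \\ \hat\mu_2 \end{pmatrix} \tag{55} \]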

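Furthermore, we have $\mathbb{E}_{\Xi \sim P_{\hat\theta}}[\phi_1(\Xi)] = (\hat\mu_1, \hat\mu_2)^T$. The criterion function $\psi_\theta^{deconv}(D)$ for the parameter vector $\theta$ then implies the condition

\[ \pi_1 \begin{pmatrix} \bar X_1 \\ \hat\mu_2 \end{pmatrix} + \pi_2 \begin{pmatrix} \hat\mu_1 \\ \bar X_2 \end{pmatrix} + \pi_{12} \begin{pmatrix} \lambda\hat\mu_1 + (1 - \lambda)(\bar X_{12} - \hat\mu_2) \\ \lambda(\bar X_{12} - \hat\mu_1) + (1 - \lambda)\hat\mu_2 \end{pmatrix} \overset{!}{=} \begin{pmatrix} \hat\mu_1 \\ \hat\mu_2 \end{pmatrix} , \]

where we have defined $\bar X_1$, $\bar X_2$ and $\bar X_{12}$ as the average of the observations with label set {1}, {2} and {1, 2}, respectively. Solving for $\hat\mu$, we get

\[ \hat\mu_1 = \frac{1}{2} \left( (1 + \lambda)\bar X_1 + (1 - \lambda)\bar X_{12} - (1 - \lambda)\bar X_2 \right) \qquad \hat\mu_2 = \frac{1}{2} \left( -\lambda\bar X_1 + \lambda\bar X_{12} + (2 - \lambda)\bar X_2 \right) . \]

Since $\mathbb{E}[\bar X_1] = -a$, $\mathbb{E}[\bar X_{12}] = 0$ and $\mathbb{E}[\bar X_2] = a$, the mean estimators are consistent independent of the chosen $\lambda$: $\mathbb{E}[\hat\mu_1] = -a$ and $\mathbb{E}[\hat\mu_2] = a$. In particular, we have, for all $\mathcal{L}$:

\[ \mathbb{E}_{X \sim P_{\mathcal{L},\theta^G}}\left[ \mathbb{E}_{\Xi \sim P^{deconv}_{(X,\mathcal{L}),\hat\theta}}[\phi(\Xi)] \right] = \mathbb{E}_{D' \sim P_{\theta^G}}\left[ \mathbb{E}_{\Xi \sim P^{deconv}_{D',\hat\theta}}[\phi(\Xi)] \right] \]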
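The closed-form solution for the mean estimators can be cross-checked by solving the two-dimensional linear condition numerically. The sketch below rearranges the condition into a 2×2 system, using that the label-set fractions sum to one, and solves it with Cramer's rule. It assumes equal label-set fractions of 1/3 as defaults, for which the solution takes the coefficient-free form with the factor 1/2 quoted above; this choice, the function names and the test values are assumptions made here.

```python
def solve_deconv_means(x1, x2, x12, lam, pi1=1 / 3, pi2=1 / 3, pi12=1 / 3):
    """Solve the criterion condition for (mu1, mu2) given the label-set averages."""
    # Rearranged condition: A @ (mu1, mu2) = b, with pi1 + pi2 + pi12 = 1.
    a11 = pi1 + pi12 * (1.0 - lam)
    a12 = pi12 * (1.0 - lam)
    a21 = pi12 * lam
    a22 = pi2 + pi12 * lam
    b1 = pi1 * x1 + pi12 * (1.0 - lam) * x12
    b2 = pi2 * x2 + pi12 * lam * x12
    det = a11 * a22 - a12 * a21
    mu1 = (b1 * a22 - a12 * b2) / det
    mu2 = (a11 * b2 - a21 * b1) / det
    return mu1, mu2


def closed_form_means(x1, x2, x12, lam):
    """Closed-form solution quoted in the text."""
    mu1 = 0.5 * ((1.0 + lam) * x1 + (1.0 - lam) * x12 - (1.0 - lam) * x2)
    mu2 = 0.5 * (-lam * x1 + lam * x12 + (2.0 - lam) * x2)
    return mu1, mu2


numeric = solve_deconv_means(-3.2, 3.9, 0.4, lam=0.3)
closed = closed_form_means(-3.2, 3.9, 0.4, lam=0.3)
```

Inserting the expected averages −a, a and 0 returns (−a, a) for any blending parameter, and in this sketch also for unequal fractions, which matches the consistency statement above.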

Mean of the variance estimator. We compute the second component $\phi_2(\Xi)$ of the sufficient statistics vector $\phi(\Xi)$ for the emissions given a data item. For single-label data items, we have

pdeconv m[02(S)] =( X, E^ pdeconv (-)[02(-)] =( ^ ^ ^ ^

(X,{1}),8W 2 + (2 ) (Xmr \ X )

For multi-label data items, the situation is again more involved. As when determining the estimator for the mean, we find again two extreme cases:

E-~ Pdeconv №-)] = ( ^ fe ^ = ( X 2^2 ^ ( 2) (x,{1,2}),e \ 112 + °2 / VX - 11 - CT1 /

We use again a parameter X e [0, 1] to parameterize the blending between the two extreme cases and write

pdeconv H (X, {1,2}),

, [*■ H)] = K X ^+- *0 + (1 - w( X 2'i2 5 * ?)

= ApV'll^ ' ii2 +i2

V a 2 + ^

Since the estimators for the mean are consistent, we do not distinguish between the true and the estimated mean values any more. Using Ex~Pjil0G [X2] = ¡f + a2 for I = 1, 2, and

Ex~p(1 2 10g [X2] = ¡2 + ¡2 + a2 + a2 , the criterion function implies, in the consistent case, the following condition for the standard deviation parameters

(l2 + a2\ (n2 + a2 \ _(X{lA + "12 + "22 - + (1 - X){lA + \l2 + a2V + n2\I2 + a22 ) + + "I) + (1 - X)(|2 + "12 + a22 - a2))

I (l2i + aA

V2 + a2/

Solving for $\hat\sigma_1$ and $\hat\sigma_2$, we find $\hat\sigma_1 = \sigma_1$ and $\hat\sigma_2 = \sigma_2$. The estimators for the standard deviation are thus consistent as well.

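Variance of the mean estimator. Based on Eqs. 54 and 55, the variance of the conditional expectation values over observations X with label set $\mathcal{L}$, for the three possible label sets, is given by

\[ \mathbb{V}_{X \sim P_{\{1\},\theta^G}}\left[ \mathbb{E}_{\Xi \sim P^{deconv}_{(X,\{1\}),\hat\theta}}[\phi(\Xi)] \right] = \mathrm{diag}(\sigma_1^2, 0) \qquad \mathbb{V}_{X \sim P_{\{2\},\theta^G}}\left[ \mathbb{E}_{\Xi \sim P^{deconv}_{(X,\{2\}),\hat\theta}}[\phi(\Xi)] \right] = \mathrm{diag}(0, \sigma_2^2) \]

\[ \mathbb{V}_{X \sim P_{\{1,2\},\theta^G}}\left[ \mathbb{E}_{\Xi \sim P^{deconv}_{(X,\{1,2\}),\hat\theta}}[\phi(\Xi)] \right] = \begin{pmatrix} (1 - \lambda)^2 & \lambda(1 - \lambda) \\ \lambda(1 - \lambda) & \lambda^2 \end{pmatrix} \sigma_{12}^2 \]

and thus

\[ \mathbb{E}_{\mathcal{L} \sim P_\pi}\left[ \mathbb{V}_{X \sim P_{\mathcal{L},\theta^G}}\left[ \mathbb{E}_{\Xi \sim P^{deconv}_{(X,\mathcal{L}),\hat\theta}}[\phi(\Xi)] \right] \right] = \begin{pmatrix} \pi_1\sigma_1^2 & 0 \\ 0 & \pi_2\sigma_2^2 \end{pmatrix} + \pi_{12} \begin{pmatrix} (1 - \lambda)^2 & \lambda(1 - \lambda) \\ \lambda(1 - \lambda) & \lambda^2 \end{pmatrix} \sigma_{12}^2 \]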

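The variances of the assumed source emissions are given by

\[ \mathbb{V}_{\Xi \sim P^{deconv}_{(X,\{1\}),\hat\theta}}[\phi(\Xi)] = \mathrm{diag}(0, \sigma_2^2) \]

\[ \mathbb{V}_{\Xi \sim P^{deconv}_{(X,\{2\}),\hat\theta}}[\phi(\Xi)] = \mathrm{diag}(\sigma_1^2, 0) \]

\[ \mathbb{V}_{\Xi \sim P^{deconv}_{(X,\{1,2\}),\hat\theta}}[\phi(\Xi)] = \mathbb{V}_{\Xi \sim P^{deconv}_{(X,\{1,2\}),\hat\theta}}\left[ \begin{pmatrix} \lambda\Xi_1 + (1 - \lambda)(X - \Xi_2) \\ \lambda(X - \Xi_1) + (1 - \lambda)\Xi_2 \end{pmatrix} \right] = \lambda^2 \begin{pmatrix} \sigma_1^2 & -\sigma_1^2 \\ -\sigma_1^2 & \sigma_1^2 \end{pmatrix} + (1 - \lambda)^2 \begin{pmatrix} \sigma_2^2 & -\sigma_2^2 \\ -\sigma_2^2 & \sigma_2^2 \end{pmatrix} \]

With $\mathbb{V}_{\Xi \sim P_\theta}[\phi(\Xi)] = \mathrm{diag}(\sigma_1^2, \sigma_2^2)$, the Fisher information matrices for the single-label data are given by $I_{\{1\}} = -\mathrm{diag}(\sigma_1^2, 0)$ and $I_{\{2\}} = -\mathrm{diag}(0, \sigma_2^2)$. For the label set $\mathcal{L} = \{1, 2\}$, we have

\[ I_{\{1,2\}} = \begin{pmatrix} (\lambda^2 - 1)\sigma_1^2 + (1 - \lambda)^2\sigma_2^2 & -\lambda^2\sigma_1^2 - (1 - \lambda)^2\sigma_2^2 \\ -\lambda^2\sigma_1^2 - (1 - \lambda)^2\sigma_2^2 & \lambda^2\sigma_1^2 + ((1 - \lambda)^2 - 1)\sigma_2^2 \end{pmatrix} \]

Choosing $\lambda$ such that the trace of the information matrix $I_{\{1,2\}}$ is maximized yields $\lambda = \sigma_2^2 / (\sigma_1^2 + \sigma_2^2)$ and the following value for the information matrix of label set {1, 2}:

\[ I_{\{1,2\}} = -\frac{1}{\sigma_1^2 + \sigma_2^2} \begin{pmatrix} \sigma_1^4 & \sigma_1^2\sigma_2^2 \\ \sigma_1^2\sigma_2^2 & \sigma_2^4 \end{pmatrix} \]

The expected Fisher information matrix is then given by

\[ \mathbb{E}_{\mathcal{L} \sim P_\pi}[I_{\mathcal{L}}] = -\begin{pmatrix} \sigma_1^2 \left( \pi_1 + \pi_{12}\frac{\sigma_1^2}{\sigma_{12}^2} \right) & \pi_{12}\frac{\sigma_1^2\sigma_2^2}{\sigma_{12}^2} \\ \pi_{12}\frac{\sigma_1^2\sigma_2^2}{\sigma_{12}^2} & \sigma_2^2 \left( \pi_2 + \pi_{12}\frac{\sigma_2^2}{\sigma_{12}^2} \right) \end{pmatrix} \]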

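With this, we have $\Sigma_\theta^{deconv} = \begin{pmatrix} v_{\theta,11} & v_{\theta,12} \\ v_{\theta,12} & v_{\theta,22} \end{pmatrix}$, with the matrix elements given by

\[ v_{\theta,11} = \frac{\pi_{12}^2\sigma_2^2 w_{12} + \pi_{12}\pi_2 \left( \pi_2\sigma_1^2\sigma_{12}^2 + 2\pi_1\sigma_2^2 s_{12} \right) + \pi_1\pi_2^2 s_{12}^2}{\sigma_1^2 \left( \pi_1\pi_2 s_{12} + \pi_{12} w_{12} \right)^2} \]

\[ v_{\theta,12} = \frac{\pi_{12}^2 w_{12} + \pi_{12}\pi_1\pi_2 \left( 2 s_{12} - \sigma_{12}^2 \right)}{\left( \pi_1\pi_2 s_{12} + \pi_{12} w_{12} \right)^2} \]

\[ v_{\theta,22} = \frac{\pi_{12}^2\sigma_1^2 w_{12} + \pi_{12}\pi_1 \left( \pi_1\sigma_2^2\sigma_{12}^2 + 2\pi_2\sigma_1^2 s_{12} \right) + \pi_1^2\pi_2 s_{12}^2}{\sigma_2^2 \left( \pi_1\pi_2 s_{12} + \pi_{12} w_{12} \right)^2} \]

where, for simpler notation, we have defined $w_{12} := \pi_2\sigma_1^2 + \pi_1\sigma_2^2$ and $s_{12} := \sigma_1^2 + \sigma_2^2$. For the variance of the mean estimators, using Eq. 35, we get

\[ \Sigma_\mu^{deconv} = \begin{pmatrix} v_{\mu,11} & v_{\mu,12} \\ v_{\mu,12} & v_{\mu,22} \end{pmatrix} \]

with

\[ v_{\mu,11} = \frac{\pi_{12}^2\sigma_2^2 w_{12} + \pi_{12}\pi_2 \left( \pi_2\sigma_1^2\sigma_{12}^2 + 2\pi_1\sigma_2^2 s_{12} \right) + \pi_1\pi_2^2 s_{12}^2}{\left( \pi_1\pi_2 s_{12} + \pi_{12} w_{12} \right)^2} \, \sigma_1^2 \tag{56} \]

\[ v_{\mu,12} = \frac{\pi_{12}^2 w_{12} + \pi_{12}\pi_1\pi_2 \left( 2 s_{12} - \sigma_{12}^2 \right)}{\left( \pi_1\pi_2 s_{12} + \pi_{12} w_{12} \right)^2} \, \sigma_1^2\sigma_2^2 \]

\[ v_{\mu,22} = \frac{\pi_{12}^2\sigma_1^2 w_{12} + \pi_{12}\pi_1 \left( \pi_1\sigma_2^2\sigma_{12}^2 + 2\pi_2\sigma_1^2 s_{12} \right) + \pi_1^2\pi_2 s_{12}^2}{\left( \pi_1\pi_2 s_{12} + \pi_{12} w_{12} \right)^2} \, \sigma_2^2 . \]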
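As a plausibility check, the diagonal element of Eq. 56 can be compared with the inverse Fisher information of the two means under the additive Gaussian model. The sketch below assumes known source variances and that the label set {1, 2} is observed as the sum of both emissions with variance $\sigma_1^2 + \sigma_2^2$; the Fisher matrix written in the code is this textbook form rather than a formula quoted from the derivation above, and the function names are hypothetical.

```python
def v_mu_11(pi1, pi2, pi12, var1, var2):
    """Diagonal element of the mean covariance for source 1, following Eq. 56."""
    s12 = var1 + var2
    w12 = pi2 * var1 + pi1 * var2
    numerator = (
        pi12 ** 2 * var2 * w12
        + pi12 * pi2 * (pi2 * var1 * s12 + 2.0 * pi1 * var2 * s12)
        + pi1 * pi2 ** 2 * s12 ** 2
    )
    return numerator / (pi1 * pi2 * s12 + pi12 * w12) ** 2 * var1


def inverse_fisher_11(pi1, pi2, pi12, var1, var2):
    """(1,1) entry of the inverse Fisher information for (mu_1, mu_2)."""
    s12 = var1 + var2
    j11 = pi1 / var1 + pi12 / s12
    j22 = pi2 / var2 + pi12 / s12
    j12 = pi12 / s12
    return j22 / (j11 * j22 - j12 ** 2)


# Setting of Sect. 6: fractions 0.4, 0.4, 0.2 and unit variances.
deconv_var = v_mu_11(0.4, 0.4, 0.2, 1.0, 1.0)
ignore_var = 1.0 / 0.4   # sigma_1^2 / pi_1, the corresponding value for Mignore
```

In the setting of Sect. 6 the value is about 2.08, compared with 2.5 for Mignore, which agrees with the faster convergence of Mdeconv reported in the experiments.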

Mean-squared error of the mean estimator Given that the estimators $\hat\mu^{deconv}$ are consistent, the mean squared error of the estimator is given by the average of the diagonal elements of $\Sigma_\mu^{deconv}$:

\[ \mathrm{MSE}_\mu^{deconv} = \frac{1}{2} \mathrm{tr}\left( \Sigma_\mu^{deconv} \right) = \frac{1}{2} \left( v_{\mu,11} + v_{\mu,22} \right) \]

Inserting the expressions in Eqs. 56 and 57 yields the expression given in the theorem.

References

Arons, B. (1992). A review of the cocktail party effect. Journal of the American Voice I/O Society, 12, 35-50.

Bishop, C. M. (2007). Pattern recognition and machine learning. Information science and statistics. Berlin: Springer.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

Boutell, M., Luo, J., Shen, X., & Brown, C. (2004). Learning multi-label scene classification. Pattern Recognition, 37(9), 1757-1771.

Brazzale, A. R., Davison, A. C., & Reid, N. (2007). Applied asymptotics: Case studies in small-sample statistics. Cambridge: Cambridge University Press.

Cramer, H. (1946). Contributions to the theory of statistical estimation. Skand. Aktuarietids, 29, 85-94.

Cramer, H. (1999). Mathematical methods of statistics. Princeton: Princeton University Press.

Dembczynski, K., Cheng, W., & Hullermeier, E. (2010). Bayes optimal multilabel classification via probabilistic classifier chains. In Proceedings of the 27th International Conference on Machine Learning.

Dembczynski, K., Waegeman, W., Cheng, W., & Hullermeier, E. (2012). On label dependence and loss minimization in multi-label classification. Machine Learning, 88(1-2), 5-45.

Devroye, L., Gyorfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. Stochastic modelling and applied probability. Heidelberg: Springer.

Dietterich, T. G., & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 263-286.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). Hoboken: Wiley-Interscience.

Fisher, R. A. (1925). Theory of statistical estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 22, 700-725.

Gao, W., & Zhou, Z.-H. (2013). On the consistency of multi-label learning. Artificial Intelligence, 199-200, 22-44.

Ghamrawi, N. & McCallum, A. (2005). Collective multi-label classification. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), pp. 195-200.

Godbole, S. & Sarawagi, S. (2004). Discriminative methods for multi-labeled classification. In Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 22-30.

Hastie, T., Tibshirani, R., & Buja, A. (1993). Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89, 1255-1270.

Hershey, J. R., Rennie, S. J., Olsen, P. A., & Kristjansson, T. T. (2010). Super-human multi-talker speech recognition: A graphical modeling approach. Computer Speech and Language, 24(1), 45-66.

Hsu, D., Kakade, S., Langford, J., & Zhang, T. (2009). Multi-label prediction via compressed sensing. In Proceedings of NIPS.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of ECML.

Kawai, K., & Takahashi, Y. (2009). Identification of the dual action antihypertensive drugs using tfs-based support vector machines. Chem-Bio Informatics Journal, 9, 41-51.

Lehmann, E. L., & Casella, G. (1998). Theory of point estimation. New York: Springer.

Liang, P. & Jordan, M. I. (2008). An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In Proceedings of ICML, pp. 584-591, New York, USA. ACM.

Masry, E. (1991). Multivariate probability density deconvolution for stationary random processes. IEEE Transactions on Information Theory, 37(4), 1105-1115.

Masry, E. (1993). Strong consistency and rates for deconvolution of multivariate densities of stationary processes. Stochastic Processes and Their Applications, 47(1), 53-74.

McCallum, A., Corrada-Emmanuel, A., & Wang, X. (2005). The author-recipient-topic model for topic and role discovery in social networks: Experiments with Enron and academic email. Technical report, Department of Computer Science, University of Massachusetts Amherst, Amherst, MA.

McCallum, A. K. (1999). Multi-label text classification with a mixture model trained by EM. In Proceedings of NIPS.

Qi, G.-J., Hua, X.-S., Rui, Y., Tang, J., Mei, T., & Zhang, H.-J. (2007). Correlative multi-label video annotation. In Proceedings of the 15th ACM International Conference on Multimedia, pp. 17-26.

Rao, C. R. (1945). Information and the accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37, 81-91.

Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2009). Classifier chains for multi-label classification. Machine Learning and Knowledge Discovery in Databases, 278, 254-269.

Rifkin, R., & Klautau, A. (2004). In defense of one-vs-all classification. Journal of Machine Learning Research, 5, 101-141.

Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence.

Schapire, R. E., & Singer, Y. (2000). Boostexter: A boosting-based system for text categorization. Machine Learning, 39(2/3), 135-168.

Streich, A. P. (2010). Multi-label classification and clustering for acoustics and computer security. PhD thesis, ETH Zurich.

Streich, A. P. & Buhmann, J. M. (2008). Classification of multi-labeled data: A generative approach. In Proceedings of ECML, pp. 390-405.

Streich, A. P., Frank, M., Basin, D., & Buhmann, J. M. (2009). Multi-assignment clustering for boolean data. In Proceedings of ICML, pp. 969-976. Omnipress.

Tsoumakas, G., & Katakis, I. (2007). Multi label classification: An overview. International Journal of Data Warehousing and Mining, 3(3), 1-13.

Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010). Data mining and knowledge discovery handbook. In O. Maimon & L. Rokach (Eds.), Mining multi-label data (2nd ed.). Heidelberg: Springer.

Ueda, N., & Saito, K. (2006). Parametric mixture model for multitopic text. Systems and Computers in Japan, 37(2), 56-66.

van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge series in statistical and probabilistic mathematics. Cambridge: Cambridge University Press.

Wainwright, M. J., & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2), 1-305.

Yano, T., Cohen, W. W., & Smith, N. A. (2009). Predicting response to political blog posts with topic models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 477-485.

Zhang, M.-L., & Zhou, Z.-H. (2006). Multi-label neural network with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18(10), 1338-1351.

Zhang, M.-L. & Zhou, Z.-H. (2013). A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering. in press.

Zhu, S., Ji, X., Xu, W., & Gong, Y. (2005). Multi-labelled classification using maximum entropy method. In Proceedings of SIGIR.