## Normalized Mutual Information Clustering Evaluation Essay

Each clustering $\mathcal C=\{C_1,\dots,C_k\}$ defines the probability distribution $P_{\mathcal C}$

$$P_{\mathcal C}(i)=\frac{n_i}{N},$$

where $n_i$ is the number of points in the $i$-th cluster $C_i$ and $N$ is the total number of points in the data set. Different clustering algorithms can, of course, produce different numbers of clusters.

For any two such distributions $P_{\mathcal C_1}=(p_1,\dots,p_n)$ and $P_{\mathcal C_2}=(q_1,\dots,q_m)$ the mutual information is

$$I(P_{\mathcal C_1},P_{\mathcal C_2})=\sum_{i,j}R(i,j)\log\frac{R(i,j)}{P_{\mathcal C_1}(i)P_{\mathcal C_2}(j)},$$

where $R(i,j)$ denotes the joint probability distribution. Using the definitions of $P_{\mathcal C_1}$, $P_{\mathcal C_2}$ and $R$, i.e.

$$P_{\mathcal C_1}(i)=\frac{n_i}{N},\qquad P_{\mathcal C_2}(j)=\frac{m_j}{N},\qquad R(i,j)=\frac{n_{i,j}}{N}:=\frac{|C_i\cap C_j'|}{N},$$

we arrive at

$$I(P_{\mathcal C_1},P_{\mathcal C_2})=\sum_{i,j}\frac{n_{i,j}}{N}\log\frac{\frac{n_{i,j}}{N}}{\frac{n_i}{N}\frac{m_j}{N}}=\sum_{i,j}\frac{n_{i,j}}{N} \log\frac{N n_{i,j}}{n_i m_j},$$

as in http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html, formula (185).
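The final formula above can be computed directly from two label assignments. A minimal sketch (the function name and the test labels are mine, not from the post), using the counts $n_i$, $m_j$ and $n_{i,j}$ exactly as defined above:

```python
import math
from collections import Counter

def mutual_information(labels_a, labels_b):
    """I = sum_{i,j} (n_ij / N) * log(N * n_ij / (n_i * m_j)),
    i.e. the probabilistic estimate with p_i = n_i/N, q_j = m_j/N,
    r_ij = n_ij/N."""
    N = len(labels_a)
    n = Counter(labels_a)                     # cluster sizes n_i
    m = Counter(labels_b)                     # cluster sizes m_j
    joint = Counter(zip(labels_a, labels_b))  # intersection sizes n_ij
    return sum((n_ij / N) * math.log(N * n_ij / (n[i] * m[j]))
               for (i, j), n_ij in joint.items())

# Identical clusterings: MI equals the entropy of the clustering.
a = [0, 0, 1, 1, 2, 2]
print(mutual_information(a, a))               # log 3 ≈ 1.0986
# Independent clusterings: every n_ij equals n_i * m_j / N, so MI = 0.
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

Note that the sum runs only over pairs $(i,j)$ with $n_{i,j}>0$; empty intersections contribute nothing, consistent with the convention $0\log 0=0$.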

The same formula for the mutual information given in http://machinelearning.wustl.edu/mlpapers/paper_files/StrehlG02.pdf and cited in http://www.shi-zhong.com/papers/comptext2.pdf omits the factor $\frac{1}{N}$, as you correctly remark.

I would use the above formulation for the mutual information, as it implements the correct probabilistic view of the marginal and joint distributions.

Remark: Note that in http://machinelearning.wustl.edu/mlpapers/paper_files/StrehlG02.pdf the assertion on p. 589 that "I(X,Y) is a metric" is wrong. The mutual information $I(X,Y)$ is a Kullback-Leibler divergence (between the joint distribution and the product of the marginals) and is not a metric (or distance): in particular, $I(X,X)=H(X)$, which is nonzero in general, whereas a metric requires $d(x,x)=0$. The variation of information, by contrast, is a metric. Please look at http://en.wikipedia.org/wiki/Variation_of_information for more details.
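This failure of the metric axioms is easy to check numerically. A small sketch (function names are mine) contrasting $I(X,X)=H(X)\neq 0$ with the variation of information $VI(X,Y)=H(X)+H(Y)-2I(X,Y)$, which does satisfy $VI(X,X)=0$:

```python
import math
from collections import Counter

def entropy(labels):
    # H(X) = -sum_i (n_i / N) log(n_i / N)
    N = len(labels)
    return -sum((c / N) * math.log(c / N) for c in Counter(labels).values())

def mutual_information(a, b):
    # Same formula as above: sum_{i,j} (n_ij/N) log(N n_ij / (n_i m_j))
    N = len(a)
    na, nb = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum((nij / N) * math.log(N * nij / (na[i] * nb[j]))
               for (i, j), nij in joint.items())

def variation_of_information(a, b):
    # VI(X, Y) = H(X) + H(Y) - 2 I(X, Y); a true metric on clusterings
    return entropy(a) + entropy(b) - 2 * mutual_information(a, b)

x = [0, 0, 1, 1, 2, 2]
print(mutual_information(x, x))        # = H(x) = log 3 > 0: not a distance
print(variation_of_information(x, x))  # = 0, as a metric requires
```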

answered Jul 7 '13 at 13:33

## Title: Normalized Mutual Information to evaluate overlapping community finding algorithms

Authors: Aaron F. McDaid, Derek Greene, Neil Hurley

(Submitted on 11 Oct 2011 (v1), last revised 2 Aug 2013 (this version, v2))

Abstract: Given the increasing popularity of algorithms for overlapping clustering, in particular in social network analysis, quantitative measures are needed to measure the accuracy of a method. Given a set of true clusters, and the set of clusters found by an algorithm, these sets of clusters must be compared to see how similar or different the sets are. A normalized measure is desirable in many contexts, for example assigning a value of 0 where the two sets are totally dissimilar, and 1 where they are identical. A measure based on normalized mutual information, [1], has recently become popular. We demonstrate unintuitive behaviour of this measure, and show how this can be corrected by using a more conventional normalization. We compare the results to that of other measures, such as the Omega index [2].

## Submission history

From: Aaron Francis McDaid [view email]
[v1] Tue, 11 Oct 2011 21:45:31 GMT (9kb,D)
[v2] Fri, 2 Aug 2013 11:11:57 GMT (9kb,D)
