Normalized Mutual Information Clustering Evaluation Essay

Each clustering $\mathcal C=\{C_1,\dots,C_k\}$ defines a probability distribution $P_{\mathcal C}$,

$$P_{\mathcal C}(i)=\frac{n_i}{N},$$

where $n_i$ is the number of points in the $i$-th cluster $C_i$ and $N$ is the total number of points in the data set. Different clustering algorithms may, of course, produce different numbers of clusters.

For any two distributions $P_{\mathcal C_1}=(p_1,\dots,p_n)$ and $P_{\mathcal C_2}=(q_1,\dots,q_m)$, the mutual information is

$$I(p,q)=\sum_{i,j}R(i,j)\log\frac{R(i,j)}{P_{\mathcal C_1}(i)P_{\mathcal C_2}(j)},$$

denoting by $R(i,j)$ the joint probability distribution. Using the definitions

$$P_{\mathcal C_1}(i)=\frac{n_i}{N},\qquad P_{\mathcal C_2}(j)=\frac{m_j}{N},\qquad R(i,j)=\frac{n_{i,j}}{N},\quad n_{i,j}:=|C_i\cap C'_j|,$$

where $C_i$ is the $i$-th cluster of the first clustering and $C'_j$ the $j$-th cluster of the second,

we arrive at

$$I(P_{\mathcal C_1},P_{\mathcal C_2})=\sum_{i,j}\frac{n_{i,j}}{N}\log\frac{\frac{n_{i,j}}{N}}{\frac{n_i}{N}\frac{m_j}{N}}=\sum_{i,j}\frac{n_{i,j}}{N} \log\frac{N n_{i,j}}{n_i m_j},$$

as in formula (185) of the cited reference.
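The count-based formula above can be evaluated directly from two per-point label assignments. The following is a minimal sketch (the function name and label-list interface are my own, not from the text):

```python
from collections import Counter
from math import log

def mutual_information(labels_a, labels_b):
    """I(C1, C2) from per-point cluster labels of equal length N, using
    I = sum_{i,j} (n_ij / N) * log(N * n_ij / (n_i * m_j))."""
    n = len(labels_a)
    n_i = Counter(labels_a)                  # cluster sizes of the first clustering
    m_j = Counter(labels_b)                  # cluster sizes of the second clustering
    n_ij = Counter(zip(labels_a, labels_b))  # joint counts |C_i ∩ C'_j|
    return sum(
        (c / n) * log(n * c / (n_i[i] * m_j[j]))
        for (i, j), c in n_ij.items()
    )
```

For identical clusterings this reduces to the entropy of the partition; for independent labelings every term of the sum vanishes and the result is 0.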

The same formula for the mutual information, as given in the other references cited, does not contain the factor $\frac{1}{N}$, as you correctly remark.

I would use the above formulation of the mutual information, as it implements the correct probabilistic view of the univariate and joint distributions.

Remark: the assertion on p. 589 that "$I(X,Y)$ is a metric" is wrong. The mutual information $I(X,Y)$ is a Kullback-Leibler divergence (between the joint distribution and the product of the marginals), and it is not a metric (or distance). The variation of information, by contrast, is a metric; see the cited reference for more details.
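A minimal sketch of the variation of information, $VI(X,Y)=H(X)+H(Y)-2I(X,Y)$, which does satisfy the metric axioms (function names are my own choice):

```python
from collections import Counter
from math import log

def _entropy(labels):
    """Shannon entropy H of a partition given as per-point labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def _mutual_information(labels_a, labels_b):
    """I(C1, C2) from the contingency counts, as in the formula above."""
    n = len(labels_a)
    n_i, m_j = Counter(labels_a), Counter(labels_b)
    return sum(
        (c / n) * log(n * c / (n_i[i] * m_j[j]))
        for (i, j), c in Counter(zip(labels_a, labels_b)).items()
    )

def variation_of_information(labels_a, labels_b):
    # VI(X, Y) = H(X) + H(Y) - 2 I(X, Y): zero iff the two clusterings
    # coincide, symmetric, and satisfies the triangle inequality.
    return (_entropy(labels_a) + _entropy(labels_b)
            - 2 * _mutual_information(labels_a, labels_b))
```

Identical clusterings give $VI=0$; independent ones give $H(X)+H(Y)$.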

answered Jul 7 '13 at 13:33

Title: Normalized Mutual Information to evaluate overlapping community finding algorithms

Authors: Aaron F. McDaid, Derek Greene, Neil Hurley

(Submitted on 11 Oct 2011 (v1), last revised 2 Aug 2013 (this version, v2))

Abstract: Given the increasing popularity of algorithms for overlapping clustering, in particular in social network analysis, quantitative measures are needed to measure the accuracy of a method. Given a set of true clusters, and the set of clusters found by an algorithm, these sets of clusters must be compared to see how similar or different the sets are. A normalized measure is desirable in many contexts, for example assigning a value of 0 where the two sets are totally dissimilar, and 1 where they are identical. A measure based on normalized mutual information, [1], has recently become popular. We demonstrate unintuitive behaviour of this measure, and show how this can be corrected by using a more conventional normalization. We compare the results to that of other measures, such as the Omega index [2].
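For hard (non-overlapping) partitions, the "more conventional normalization" the abstract alludes to can be sketched by dividing $I$ by $\max(H(\mathcal C_1), H(\mathcal C_2))$; the paper's corrected measure for overlapping clusters is more involved, so this function is only an illustrative sketch, not the paper's exact definition:

```python
from collections import Counter
from math import log

def nmi(labels_a, labels_b):
    """Mutual information normalized by max(H1, H2): 1 for identical
    hard partitions, 0 for independent ones."""
    n = len(labels_a)

    def h(counts):
        return -sum((c / n) * log(c / n) for c in counts.values())

    n_i, m_j = Counter(labels_a), Counter(labels_b)
    mi = sum(
        (c / n) * log(n * c / (n_i[i] * m_j[j]))
        for (i, j), c in Counter(zip(labels_a, labels_b)).items()
    )
    denom = max(h(n_i), h(m_j))
    # Two single-cluster (trivial) partitions agree by convention.
    return mi / denom if denom > 0 else 1.0
```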

Submission history

From: Aaron Francis McDaid [view email]
[v1] Tue, 11 Oct 2011 21:45:31 GMT (9kb,D)
[v2] Fri, 2 Aug 2013 11:11:57 GMT (9kb,D)

