Does the Variation of Information metric need to be defined on clusterings?

June 13, 2018

I've read Meilă's original paper, in which she defines the Variation of Information metric as $$VI(X,Y) = H(X|Y) + H(Y|X),$$ where $X$ and $Y$ are two clusterings of a dataset $D$. Her proof that VI is a metric uses properties of the clusterings themselves.
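To make the definition concrete, here is a minimal sketch of how I understand the computation in the clustering setting. It assumes each clustering is given as a sequence of labels over the same data points and that probabilities are the empirical cluster frequencies; the function name `variation_of_information` is my own, not taken from the paper.

```python
from collections import Counter
from math import log


def variation_of_information(labels_x, labels_y):
    """Return VI(X, Y) = H(X|Y) + H(Y|X) in nats for two labelings of the same points."""
    if len(labels_x) != len(labels_y):
        raise ValueError("both clusterings must label the same data points")
    n = len(labels_x)
    count_x = Counter(labels_x)                 # cluster sizes in X
    count_y = Counter(labels_y)                 # cluster sizes in Y
    count_xy = Counter(zip(labels_x, labels_y)) # sizes of pairwise intersections

    vi = 0.0
    for (x, y), n_xy in count_xy.items():
        p_xy = n_xy / n
        # p(x|y) = n_xy / n_y feeds H(X|Y); p(y|x) = n_xy / n_x feeds H(Y|X).
        vi -= p_xy * (log(n_xy / count_y[y]) + log(n_xy / count_x[x]))
    return vi
```

For example, `variation_of_information([0, 0, 1, 1], ["b", "b", "a", "a"])` is zero because the two labelings describe the same partition, while `variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])` equals $2\ln 2$ because the two partitions are independent.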

However, the Wikipedia page on mutual information lists $VI(X,Y)$ as a metric, seemingly for two arbitrary discrete random variables $X$ and $Y$. Is VI a true metric on arbitrary discrete random variables?
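As a sanity check rather than a proof, the triangle inequality can be probed numerically on random joint distributions that do not come from any clustering. The sketch below assumes a joint pmf over three finite variables stored as a dictionary from outcome tuples to probabilities, and uses the identity $VI(X,Y) = 2H(X,Y) - H(X) - H(Y)$, which follows from $H(X|Y) = H(X,Y) - H(Y)$. All function names are hypothetical helpers of my own. It says nothing about whether $VI(X,Y) = 0$ forces $X = Y$.

```python
import random
from itertools import product
from math import log


def entropy(pmf):
    """Shannon entropy in nats of a pmf given as {outcome: probability}."""
    return -sum(p * log(p) for p in pmf.values() if p > 0)


def marginal(joint, axes):
    """Marginalize a joint pmf onto the coordinates listed in `axes`."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in axes)
        out[key] = out.get(key, 0.0) + p
    return out


def vi(joint, i, j):
    """VI between coordinates i and j, via 2*H(A,B) - H(A) - H(B)."""
    h_ij = entropy(marginal(joint, (i, j)))
    return 2 * h_ij - entropy(marginal(joint, (i,))) - entropy(marginal(joint, (j,)))


def random_joint(sizes, rng):
    """Random joint pmf on a product of finite alphabets (normalized uniform weights)."""
    weights = {o: rng.random() for o in product(*(range(s) for s in sizes))}
    total = sum(weights.values())
    return {o: w / total for o, w in weights.items()}


def worst_triangle_slack(trials=1000, sizes=(3, 4, 2), seed=0):
    """Smallest observed value of VI(X,Y) + VI(Y,Z) - VI(X,Z) over random joints."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        joint = random_joint(sizes, rng)
        slack = vi(joint, 0, 1) + vi(joint, 1, 2) - vi(joint, 0, 2)
        worst = min(worst, slack)
    return worst
```

A negative value from `worst_triangle_slack()` (beyond floating-point noise) would be a counterexample to the triangle inequality; a non-negative value is only consistent with it and does not settle the question.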