Tuesday, October 31, 2017

Reading Notes: Co-Clustering

In co-clustering we have two sets of related entities, both of which we want to cluster -- the main goal is that the clusters of these two sets preserve the "meaningful" relationships between the original sets. As a running example, we assume that the two sets are a set of words $\mathcal{W}$ and a set of documents $\mathcal{D}$, and that the relation between these two is the number of times a word appears in a document. We furthermore assume that the relationship between these two sets is described by a $\mathcal{W}\times\mathcal{D}$ matrix $A$.

Co-clustering has tight connections to/applications in community detection in bi-partite graphs.

Block Models

Block models assume that the entries of $A$ are independently drawn from distributions parameterized by the word and document clusters; for example, if $A$ is binary, then for each co-cluster, a different Bernoulli distribution is assumed, while for $A$ being a contingency table, a Poisson distribution with a parameter depending on the co-cluster is often assumed. Clustering can then be performed by finding a model that maximizes the likelihood of the $A$, i.e., by applying a variant of classification EM. Relevant references can be found in the paper "Sparse Poisson Latent Block Model for Document Clustering (2017)" by Ailem et al. and mainly go back to Govaert and Nadif. Ailem et al. assumed that the block model is "block diagonal", i.e., that the words and documents are clustered in the same number of clusters, and they assumed that the off-diagonal blocks are parameterized with the same underlying parameter, leading to fewer parameters to be estimated by the EM-type algorithm.

Information-Theoretic Cost Functions

These cost functions are based on transforming $A$ to a joint probability distribution $P$ of two RVs $W$ and $D$ corresponding to the words and documents, i.e., $W$ has alphabet $\mathcal{W}$ and $D$ has alphabet $\mathcal{D}$. We denote the RVs corresponding to clusters by an overline, i.e., $\overline{W}$ and $\overline{D}$.
  • Not really simultaneous co-clustering, but still co-clustering: In "Document Clustering using Word Clusters via the Information Bottleneck Method (2000)" by Slonim and Tishby, the authors first clustered the words in order to maximize $I(\overline{W};D)$, and then used these word clusters in order to find document clusters such that $I(\overline{W};\overline{D})$ is maximized. They used the agglomerative information bottleneck method in each of these two steps. In "Iterative Double Clustering for Unsupervised and Semi-Supervised Learning (2002)", El-Yaniv and Sourourjon improved the method by clustering the words (to clusters $\overline{W}^2$) again, but this time maximizing $I(\overline{W}^2;\overline{D}^1)$, where $\overline{D}^1$ are the document clusters obtained in the first iteration. They they obtained the document clusters by maximizing $I(\overline{D}^2;\overline{W}^2)$. They then repeated this procedure several times until convergence.
  • The work "A divisive information-theoretic feature clustering algorithm for text classification (2003)" by Dhillon et al. is not on co-clustering, but on word clustering in order to make document classification more robust. The goal of this technique is to find word clusters such that the mutual information $I(D;\overline{W})$, the mutual information between the word clusters and the documents, is maximized. The maximization procedure is similar to k-means, i.e., cluster memberships are computed, on the basis of which new cluster centroids are determined.
  • Based on the previous work, Dhillon et al. wrote the influential paper "Information-Theoretic Co-Clustering (2003)". Co-clustering was suggested to yield improved performance compared to one-way clustering, presumably because of an implicit regularization/dimensionality reduction. The goal was to maximize $I(\overline{D};\overline{W})$, i.e., to maximize the information the word clusters share with the document clusters. This is equivalent to approximating $P$ by a non-negative matrix trifactorization $\tilde{P}=A_W Q A_D$, where $Q$ is the joint probability distribution between $\overline{D}$ and $\overline{W}$ (it is hence a $\overline{\mathcal{W}}\times\overline{\mathcal{D}}$ matrix), and where the cost function is the Kullback-Leibler divergence $D(P\Vert \tilde{P})$. The authors then suggested a sequential optimization algorithm, i.e., one starts with a co-clustering, then one computes the block model, based on which all word clusters are updated, which then changes the block model before the document clusters are updated.
  • The paper "Information Bottleneck Co-Clustering (2010)" by Wang et al. follows the spirit of the information bottleneck method for multiple variables, i.e., using multi-information. The multi-information of a collection of RVs, described as a Bayesian network, is the sum of the mutual information between each RV and its parent RVs in this network. The goal is then to minimize the multi-information of the "input graph" (described by $P$ and the clustering functions) while at the same time maximizing the multi-information of the co-clustering to the target variables (described by $Q$ as in the previous section and the mutual information between document clusters and words/word clusters and documents). The cost function thus becomes
    $$ I(W;\overline{W}) + I(D;\overline{D}) + I(W;D) - \beta I(W;\overline{D}) - \beta I(D;\overline{W}) - \beta I(\overline{W};\overline{D}).$$ They minimize this cost function either using an agglomerative technique or by a Blahut-Arimoto-type fixed-point iteration in combination with some heuristic (continuation method) to escape local optima.
  • An information-theoretic cost function was also used by Bekkerman et al. in "Multi-Way Distributional Clustering via Pairwise Interactions (2005)", where they investigated simultaneous clustering of $m\ge 2$ sets (or $m$ RVs). Rather than using multi-information, they depended on a graph structure with $m$ vertices and an edge $e_{ij}$ if they wanted to maximize $I(\tilde{X}_i;\tilde{X}_j)$, i.e., the mutual information between the clusterings of the $i$-th and $j$-th RV. While they essentially proposed a simple sequential technique, they combined it with splits and merges, i.e., they started with an initial clustering solution in which $\tilde{X}_i=X_i$ for some $i$, while $\tilde{X}_j=const.$ for others.

Spectral/Modularity-Based Methods

Spectral and modularity-based methods assume that the word-document matrix $A$ can be interpreted as the biadjacency matrix of a bipartite graph. In these techniques, one either tries to minimize some cut score, or one tries to maximize modularity, which is always defined relative to a null model. In these models, one always has $\overline{\mathcal{W}}=\overline{\mathcal{D}}$.
  • In "Co-clustering documents and words using bipartite spectral graph partitioning (2001)", Dhillon suggested co-clustering by cutting the graph, minimizing the normalized cut. Relaxing the cut criterion, for $\overline{\mathcal{W}}=\overline{\mathcal{D}}=2$, one can compute the second singular vectors of a normalization of $A$, stack them, and cluster the elements of this single vector into two clusters using k-means. By unstacking, these two clusters then correspond to the two word clusters and the two document clusters. This technique can be extended to $\overline{\mathcal{W}}=\overline{\mathcal{D}}=k$ by computing $\ell=\lceil\log_2 k\rceil$ singular vectors, stacking them to a matrix $Z$, and perform k-means for $k$ clusters on the $\ell$-dimensional rows of $Z$. There is a Matlab implemenation of this method.
  • The paper "Co-clustering for Binary and Categorical Data with Maximum Modularity (2011)" by Labiod and Nadif uses modularity as an objective function; modularity is maximized solving the relaxed generalized eigenvalue problem and then clustering the eigenvalues using k-means just as Dhillon proposed. They focused on binary and categorical data.
  • Ailem et al. proposed modularity as an objective function in "Co-clustering document-term matrices by direct maximization of graph modularity (2015)". They proposed searching for modules in the bipartite graph, by alternatingly maximizing the modularity over word clusters and document clusters, i.e., they first fix the word clusters and maximize modularity over the document clusters, then fix document clusters to find optimal word clusters.

Other Methods

  • Co-clustering can be seen also as a non-negative matrix trifactorization problem; Sra and Dhillon present update rules for both Euclidean distance and Kullback-Leibler divergence in their report " Generalized nonnegative matrix approximations (2006)". The latter cost function is also used in Information-Theoretic Co-Clustering by Dhillon et al.
  • The paper "Co-clustering through optimal transport (2017)" by Laclau et al. formulates the co-clustering problem as the problem to transport the empirical probability mass from the words to the documents, the solution of which is a joint probability distribution $\tilde{P}$ approximating the empirical one. Entropically regularized transport yields again the Kullback-Leibler divergence as cost function, thus minimizing $D(\tilde{P}\Vert P)$, where $P$ is obtained from $A$ (note the difference to Information-Theoretic Co-Clustering), and where $\tilde{P}$ is connected to variational inference. The authors showed that co-clustering can be solved via the Sinkhorn-Knopp algorithm and suggested a heuristic to determine the number of clusters.
  • Not exactly a co-clustering method, but rather a meta-heuristic, was proposed by Cheng et al. in "Co-ClusterD: A Distributed Framework for Data Co-Clustering with Sequential Updates (2015)". They analyzed the "concurrent" update rule of, e.g., Information-Theoretic Co-Clustering of Dhillon et al. and replaced it by a "sequential" update rule: Essentially, they propose updating the statistics for a co-cluster as soon as a single element of a set changes its cluster membership, thus influencing the next cluster update immediately. Concurrent update rules, on the other hand, update the co-cluster statistics only after all words have been reassigned to word clusters (and similarly for document clusters). They then present a framework for performing co-clustering in a distributed fashion.
  • A hierarchical co-clustering method was introduced by Li et al. in "Hierarchical co-clustering: a new way to organize the music data (2012)". Their "divisive" version applies Dhillon's spectral method recursively, while their "agglomerative" version simply merges clusters in order to minimize cluster heterogenity.
  • Banerjee et al. view co-clustering as a matrix approximation task in "A generalized maximum entropy approach to Bregman co-clustering and matrix approximation (2007)". Essentially, they view the co-clustering as a way to get a summary statistic of $A$ based on the co-clusters, and they proposed several different types of summary statistics (e.g., co-cluster means). They then show that, given $A$ and the type of summary statistic, the optimal summary statistic and the optimal co-clustering is obtained via minimizing any Bregman divergence, such as the squared Euclidean distance or the Kullback-Leibler divergence. Their algorithm is similar to the one proposed by Dhillon et al. for Information-Theoretic Co-Clustering.
  • In "Locally Discriminative Coclustering (2012)", Zhang et al. proposed a cost function mapping a word and a document to the same co-cluster if the corresponding entry in $A$ is large. In addition, they proposed also enforcing co-clusters that respect dependencies between words alone, and between documents alone. They showed that the complete problem can be relaxed to an eigenvalue problem.

Friday, October 6, 2017

Reading Notes: Community Detection in Bipartite Graphs

A bipartite graph consists of two sets $V_a$ and $V_b$ that unite to the set of vertices, and a set of edges $E\subseteq V_a\times V_b$. The bipartite adjacency matrix $B$ contains a 1 in position $(a,b)$ iff $(a,b)\in E$; the adjacency matrix $A$ of the graph is given by

$$ A = \left[\begin{array}{cc} 0 & B \\ B^T & 0 \end{array}\right].$$

A classic paper is the one by Freeman, "Finding Social Groups: A Meta-Analysis of the Southern Women Data (2003)", who compared and analyzed 21 approaches to community detection.

Block Models

  • A block model approach has been proposed by Doreian, Bagatelj, and Ferligolj in their paper "Generalized Blockmodeling of Two-mode Network Data (2004)". Their approach is based on modeling blocks via specific attributes (all-ones, all-zeros, row-regular, etc.) and then inferring the position of these blocks in $B$. They do so by proposing a sequential heuristic in which they minimize the difference between the ideal block model and the actual adjacency matrix (more details are here). Hence, they require not only to specify the number of communities, but also the type of blocks used in the model. Different choices yield different solutions, as they illustrate for the Southern Women dataset. Their approach can also be used for weighted graphs.
Stochastic block models assume that the number of edges between two nodes is drawn from some probability model, and that the probabilities are different for edges within blocks and between blocks. Usually, the block probabilities are estimated, and the blocks are inferred by maximizing the (log-)likelihood of the graph given the block model.
  • The paper "Efficiently inferring community structure in bipartite networks (2014)" from Larremore, Clauset, and Jacobs proposes a block model in which the number of edges are assumed to be drawn from a Poisson distribution, admitting different Poisson parameters for every block and every pair of blocks. They introduced a second model which admits modelling bipartite graphs with broad degree distributions. The authors suggested a sequential algorithm with a heuristic to escape local minima. The algorithm is implemented for MATLAB and R. The algorithm seems to work also for weighted graphs, but requires the numbers of blocks (i.e., communities) as input.

Spectral/Modularity-based Methods

Modularity is a measure of how well a graph can be split into modules. It is based on a null model (typically one in which the edges are randomized such that the vertex degrees remain constant) and the comparison of a given partition of the network to a partition of this null model. Modularity is usually maximized by spectral methods. Although these methods seem to work also for weighted graphs, the derivation of the null model is only justified for unweighted graphs (i.e., for $B$ comprising of 0s and 1s).
  • Barber proposed a modularity measure for bipartite networks in "Modularity and Community Detection in Bipartite Networks (2007)". He proposed optimizing this measure recursively: Fixing the row partition, the column partition can be obtained easily, and vice-versa. He hence alternatively optimizes these two partitions, calling his algorithm BRIM. Barber also suggested a bisection approach to determine the number of communities. A fundamental drawback is that his method requires the same number of clusters for both rows and columns (see eq. (21) in the paper). The cost function has been used in a MATLAB implementation, which is described in this paper.
  • A modularity measure was also proposed by Guimera, Sales-Pardo, and Nunes Amaral in "Module Identification in Bipartite and Directed Networks (2007)". They optimize their cost function via simulated annealing, but show that their results do not differ from modularity-based methods applied to $B^TB$. 
  • Dormann and Strauss proposed a simulated-annealing Monte Carlo method for maximizing modularity in "A method for detecting modules in quantitative bipartite networks (2014)". They measure modularity using Barber's definition. Their algorithm, QuanBiMo, is based on fitting a tree to a graph, a method proposed here. An R package is available.

Projection Methods

Projection Methods usually work on $B^TB$ (weighted projection) or on $\max\{1,B^TB\}$ (unweighted projection) rather than on $A$. In that sense, they work on less information than is available in the original model, but admit using community detection methods for unipartite graphs. If $B$ is binary, $B^TB$ and $BB^T$ contain all "relevant" information about $A$ (thus, $B$), hence a dual-projection approach could potentially work as well as an approach targeted at $A$. This, at least, is argued by Everett and Borgatti in "The dual-projection approach for two-mode networks (2013)".

Other Techniques

Friday, September 15, 2017

Invariant Distribution of a Higher-Order Markov Chain (Pt. 2)

Some time ago I was looking into the uniqueness of the invariant distribution of a second-(or higher-)order Markov chain, and I was running into problems. After a few weeks of work, a manuscript was ready in which I found some useful sufficient conditions for this invariant distribution to be unique. The manuscript is available on arXiv, and published in the Statistics and Probability Letters. You can access the full-text of the manuscript for free until October 1st 2017 (afterwards, the arXiv version remains available, of course). I'm happy to get feedback, so please don't hold back :)!

Tuesday, April 11, 2017

An Inequality between Conditional Entropy and Error Probability

Suppose that $X$ is a random variable that can assume $N$ different values, and suppose that $Y$ is another random variable that is linked to $X$ in an arbitrary way. We want to estimate $X$ from observing $Y$, i.e., we define a (possibly random) function $\hat{X}(Y)$. (We may assume that $X$ is a signal we want to estimate and that $Y$ is a noisy or distorted observation of $X$.) We define the detection probability as $P_d:=\mathbb{P}(X=\hat{X}(Y))$, where the probability is taken w.r.t. the joint distribution $p_{X,Y}$ of $X$ and $Y$. A famous inequality relating $P_d$ to the conditional entropy of $X$ given $Y$ is Fano's inequality:
 H(X|Y) \le h_2(P_d) + (1-P_d)\log (N-1)
where $h_2(p):=-p\log p-(1-p)\log(1-p)$. Fano's inequality depends on the alphabet size $N$ of $X$. In what follows we derive a different inequality that is independent of $N$.

To this end, suppose that we use the maximum a posteriori estimate (MAP) of $X$ given $Y$, i.e.,
  \hat{x}(y) = \arg\max_x p_{X,Y}(x,y) = \arg\max_x p_{X|Y}(x|y).
Given that $Y=y$, we can thus define
P_d(y) = \sum_{x} p_{X|Y}(x|y) \mathbb{I}(x=\hat{x}(y)) = \max_x p_{X|Y}(x|y)
from which $P_d=\sum_y p_Y(y) P_d(y)$ follows. Comparing the right-hand side of this with Renyi's entropy of order infinity, we observe that
 P_d(y) = 2^{-H_\infty(X|Y=y)}.
Renyi's entropy is non-increasing in the order, i.e., we have $H_\infty(X|Y=y)\le H(X|Y=y)$. Hence, $P_d(y)\ge 2^{-H(X|Y=y)}$. The function $2^{-x}$ is convex in $x$. Hence, Jensen's inequality yields
 P_d = \sum_y p_Y(y) P_d(y) \ge \sum_y p_Y(y) 2^{-H(X|Y=y)} \ge 2^{- \sum_y p_Y(y) H(X|Y=y)} = 2^{-H(X|Y)}.
Thus, the MAP detection probability is bounded from below by a decreasing function of the conditional entropy. The bound does not depend (explicitly) on the alphabet size of $X$.

Monday, January 9, 2017

New Features for WhatsApp & Co?

This was originally intended as a submission to the Science Park Graz idea contest (that's why it's in German) - but then I thought that the idea is not innovative enough, even though I would love to see those features in WhatsApp or Hangouts. Who knows, maybe Google is reading this ;).

Zwei Use Cases, an denen WhatsApp, Facebook & Co scheitern

Use Case 1: Wochenendausflug mit Freunden teilen

Du verbringst das Wochenende in München und möchtest deinen Freunden zuhause ein paar Fotos vom Marienplatz, der Allianz-Arena und dem Olympiaturm schicken. Du könntest die Fotos auf Facebook, Instagram oder Google+ teilen. Jene deiner Freunde, die diese sozialen Netzwerke selten nutzen, könnten diese Fotos verpassen weil der Newsfeed in der Zwischenzeit von anderen Neuigkeiten überschwemmt wurde. Du könntest deine Fotos deinen Freunden per WhatsApp schicken - vielleicht existiert sogar schon eine passende Gruppe. Aber einerseits werden deine Freunde mit mehreren Benachrichtigungen belästigt, andererseits fühlen sie sich vielleicht dazu verpflichtet, auf deine Fotos mit Kommentaren zu reagieren. Der Abend im Hofbräuhaus wird so schnell zum Message-Marathon.

Use Case 2: Weihnachtswünsche und einen Jahresrückblick an Bekannte schicken

Du möchtest entfernten Bekannten (ehemalige ArbeitskollegInnen, entfernte Familienmitglieder, etc.) ein frohes Fest wünschen und sie wissen lassen, was im vergangenen Jahr bei dir passiert ist. Was du zu sagen hast sprengt den Rahmen der durchschnittlichen WhatsApp-Nachricht, und Facebook und Google+ sind dir zu öffentlich - zudem besteht das große Risiko, dass deine Bekannten die Nachricht aus den im letzten Use Case genannten Gründen gar nicht bekommen.


Um diese Use Cases abzudecken, bietet das App verschiedene Features, die über die Spezifikationen der derzeit verfügbaren Messaging-Apps hinausgehen. Zur besseren Übersicht sind die neuartigen Features (wie zum Beispiel Nachrichtentypen, Gruppenrollen und Gruppeneinstellungen) fett markiert.


Die Darstellung ähnelt WhatsApp bzw. den Abos von Youtube: Der Startbildschirm listet die Gruppen auf, in denen man Mitglied ist und gibt mit einer Zahl neben dem Gruppennamen die Anzahl der neuen Nachrichten in der Gruppe an. Bei Touch auf den Gruppennamen kommt man zum Gruppenchat, der wieder WhatsApp ähnelt. Ein wesentlicher Unterschied sind zwei neue Nachrichtentypen (Stories und Textnachrichten im eReader-Stil, siehe unten), die im Gruppenchat nur kompakt als Überschrift angezeigt werden. Alle Nachrichten haben außerdem eine Fußzeile mit der Anzahl von Likes bzw. Kommentaren (siehe unten). Das App hat außerdem ein Webinterface welches vor allem für das Verfassen längerer Nachrichten hilfreich ist.


Neben dem Gruppenadministrator und dem Gruppenmitglied, die es auch in WhatsApp und anderen Messaging-Gruppen gibt, gibt es auch Leser, also eine rein passive Gruppenrolle.

  • Gruppenadministrator: Legt die Gruppeneinstellungen fest; lädt Mitglieder in die Gruppe ein und teilt ihnen Gruppenrollen zu
  • Gruppenmitglied: Teilt Nachrichten mit anderen Gruppenmitgliedern bzw. kommentiert und liked Nachrichten anderer Gruppenmitglieder
  • Leser: Passive Teilnahme; kann ggf. Nachrichten lesen, kommentieren und liken, aber keine neuen Nachrichten teilen


Das App erlaubt neben Textnachrichten im Sprechblasenstil natürlich auch das Teilen von Fotos, Videos, Dokumenten und Internetlinks. Ein besonderes Feature sind Textnachrichten im eReader-Stil. Dieser Nachrichtentypus ist für lange Textnachrichten ausgelegt und ermöglicht ein angenehmes Lesen durch Umblättern statt Scrollen. Außerdem gibt es Stories, also Nachrichten, die über einen längeren Zeitraum geschrieben werden und mit Fotos oder Videos versehen sind. Stories werden vom Gruppenmitglied wie normale Nachrichten geschrieben und abgeschickt, erreichen die anderen Gruppenteilnehmer aber nicht sofort. Erst wenn die Story als abgeschlossen markiert wurde, erscheinen alle kumulierten Nachrichten im Gruppenchat.
Auf Nachrichten kann auf dreierlei Art reagiert werden: Einerseits durch eine neue Nachricht -- ganz im Stile von WhatsApp. Andererseits können Nachrichten kommentiert und geliked werden -- ganz im Stile von Social Networks wie Facebook oder Google+.


Die Gruppeneinstellungen werden vom Gruppenadministrator festgelegt. Sie regeln unter anderem

  • Welche Nachrichtentypen dürfen in der Gruppe verwendet werden?
  • Gibt es Mindestgrößen für Stories bzw. Textnachrichten im eReader-Stil?
  • Welche Nachrichtentypen führen zu Push-Benachrichtigungen (es sei denn, das Gruppenmitglied hat Push-Benachrichtigungen für diese Gruppe deaktiviert)?
  • Wie darf auf Nachrichten reagiert werden? Sind Kommentare erlaubt? Gibt es die Möglichkeit zu liken?

Individuelle Einstellungen

Ferner gibt es Einstellungen, die jeder Nutzer für sich selbst festlegen kann. Dazu gehört unter anderem die Deaktivierung von Push-Benachrichtigungen und natürlich die Schriftgröße von Textnachrichten im eReader-Format. Außerdem kann jeder Nutzer entscheiden, wer Kommentare auf Nachrichten sehen kann: Alle Mitglieder einer Gruppe, oder nur der Verfasser der Nachricht. Die Reihenfolge der Gruppenliste ist ebenfalls anpassbar. So können die Gruppen nach der Anzahl oder Aktualität ungelesener Nachrichten, nach der Rolle des Nutzers in der jeweiligen Gruppe oder nach einer selbst gewählten Priorisierung geordnet sein. Nutzer können ferner entscheiden, wie sie in Gruppen hinzugefügt werden wollen: Hinzufügen ohne Bestätigung, Hinzufügen mit Bestätigung durch Nachricht oder Hinzufügen durch persönlichen Kontakt (z.B. über NFC-Transfer oder Fotografieren eines QR-Codes am Bildschirm des Gruppenadministrators). Diese Einstellungen sind für jede Gruppenrolle, die dem Nutzer zugeteilt werden soll, und für ausgewählte einladende Personen separat möglich. So erlaube ich zum Beispiel vertrauten Kontakten, mich zu jeder Gruppe als Leser hinzuzufügen, in der sie Gruppenadministrator sind; als Gruppenadministrator möchte ich nur hinzugefügt werden, wenn ich andere Gruppenadministratoren der selben Gruppe persönlich treffe.

Zurück zu den Use Cases

Use Case 1: Wochenendausflug mit Freunden teilen

Als Gruppenadministrator legst du eine Gruppe mit deinen Freunden als Gruppenmitgliedern an. Du erlaubst nur den Nachrichtentypus Stories und verbietest Kommentare - liken ist erlaubt. Die Push-Benachrichtigung ist aktiviert.
Im Laufe des Wochenendes schickst du mehrere Fotos an diese Gruppe. Deine Freunde bekommen am Sonntagabend die Benachrichtigung, dass du eine Story geteilt hast. Sie haben die Möglichkeit, die Fotos anzuschauen und zu liken, aber die Gruppeneinstellung verbietet Kommentare. Niemand ist verpflichtet, dir zu antworten -- wer es dennoch will, findet sicher einen Weg dazu (z.B. eine Direktnachricht oder SMS).

Use Case 2: Weihnachtswünsche und einen Jahresrückblick an Bekannte schicken

Als Gruppenadministrator legst du eine Gruppe mit deinem Bekanntenkreis als Lesern an. Du deaktivierst die Push-Benachrichtigung, erlaubst aber Kommentare.
Du schreibst im Webinterface deine Nachricht und schickst sie als Textnachricht im eReader-Stil an die Gruppe. Deine Leser sehen beim nächsten Öffnen der App einen roten Punkt vor dem Gruppennamen am Hauptbildschirm. Wenn sie die Nachricht öffnen, sehen sie keine scrollbare Nachricht, sondern eine Anzeige im eReader-Format: Durch Berühren des Bildschirms wird die Seite umgeblättert. Da die Option von dir aktiviert wurde, können deine Bekannten deine Nachricht kommentieren. Sie können dabei selbst entscheiden, ob alle Leser oder nur du das Kommentar sehen können.

Wednesday, November 30, 2016

(Lossy) Wyner's Common Information and Gacs-Körner's Common Randomness Revisited

Recently, I was involved (not so much, to be honest) in a research project dealing with caching, i.e., the storage of data close to a user such that it need not be transmitted anymore. A typical use case is Netflix: Netflix offers $N$ movies/series, and the user can choose any of these anytime. Most users watch during prime time, which taxes the communication network. During the day, the communication network has unused capacities, so one could try storing part of the Netflix database on a hard drive connected to the user's player. This reduces the network load during prime time and improves the user experience.

I our work, we not only looked at the fundamental limits of this caching scheme, but we also got a good idea of what should actually be stored in the user's hard drive ("cache"). The answers are decades old: One should store either what is known as Gacs-Körner's common randomness, or what is known as Wyner's common information.

For simplicity, we assume that there are only two files available on Netflix, namely $X$ and $Y$. The common information of these two random variables (RVs; in information theory, everything is random) is another RV $W$ that minimizes
$$I(X,Y; W)$$
subject to $X-W-Y$, i.e., subject to the fact that, given $W$, $X$ and $Y$ are conditionally independent. The common randomness of these two RVs is an RV $U$ that maximizes
$$ I(X,Y; U) $$
subject to the constraint that $X-Y-U$ and $Y-X-U$. It can be shown that
$$ I(X,Y; U) \le I(X;Y) \le I(X,Y; W).$$

Common randomness is something that is truly common, in the usual sense of the word. In particular, if $A$, $B$, and $C$ are independent RVs, and if $X=(A,B)$ and $Y=(A,C)$, then $U=A$. So much for the math. Now here comes an interpretation (and probably one for which an information theorists wants to punch me in the face): If $X$ and $Y$ are two different episodes of your favorite series, then $U$ is the opening theme. If you have a small hard drive, storing the opening theme is probably the most useful idea. Common information is a bit more intricate, but an intuitive explanation is the following: Suppose $X$ and $Y$ are two different episodes of your favorite series, and suppose the only thing $X$ and $Y$ have in common is an actress called $A$. In $A$, some parts of $A$ are seen, while in $Y$ other parts of $A$ are seen. (Say, $X$ shows $A$ running, while $Y$ shows $A$ talking to a person.) Obviously, if you could store all information about actress $A$ as a whole on your hard drive, the small computer in your player could reconstruct running scenes and talking scenes in which $A$ participates. Thus, if your hard drive is large enough, that's what you should store.

Let's take this one step further. Suppose that you are fine with watching not $X$ or $Y$, but lower-quality versions $\hat X$ or $\hat Y$, respectively. This requires the definition of lossy versions of common randomness and common information. And indeed, such definitions have been proposed in 1403.8093. Namely, the lossy common information is an RV $W$ that minimizes
$$ I(X,Y; W)$$
subject to $\hat X - W - \hat Y$ and $(X,Y) - (\hat X,\hat Y) - W$. Lossy common randomness is an RV $U$ that maximizes
$$ I(X,Y; U) $$
subject to the constraints $X-Y-U$, $Y-X-U$, $X-\hat X - U$, and $Y-\hat Y-U$. These definitions are all meaningful because they solve communication problems: The get their operational characterization in the Gray-Wyner problem (1403.8093) or in lossy caching (1610.07304).

The lossy version of common information seems fine to me: I only need to store as much information about actress $A$ in order to be able to reconstruct approximate depictions of her running and speaking. This might need more or less space on the hard drive than the perfect reconstruction: I might be able to store not only this reduced version of the actress $A$, but I might also store a, say, generic building, that allows to reconstruct two different but similar houses in episodes $X$ and $Y$. And indeed, lossy common information may be larger or smaller than lossless common information (see Lemma 1 in 1403.8093).

The lossy version of common randomness, however, is not fully satisfactory to me. If I am not interested in perfect reconstruction, should I not be able to store more on my internal hard drive? For example, if the credits of my two episodes $X$ and $Y$ differ only in a few actors' names, can I not store a "noisy" version of the credits in addition to the opening theme? Nevertheless, the lossy version of common randomness is usually smaller than the lossless version (see Corollary 2 in 1403.8093).

I would thus suggest different definitions of lossy common randomness: The lossy common randomness is an RV $U'$ that maximizes
$$ I(X,Y; U') \text{ or } I(\hat X,\hat Y; U')$$
subject to the constraints $\hat X-\hat Y-U'$, $\hat Y-\hat X-U'$, $X-\hat X - U'$, and $Y-\hat Y-U'$. The difference is subtle and mirrors the definition of lossy common information. While lossy common information requires $W$ to make the reconstructions $\hat X$ and $\hat Y$ conditionally independent, the original definition of lossy common randomness requires that $U$ is a common part of $X$ and $Y$ (the first two Markov constraints). The definition proposed here would require that $U'$ is only a common part of the reconstructions $\hat{X}$ and $\hat Y$. Given this definition, we may ask the following questions:

  1. Does the new definition satisfy the desired property that it is larger than the lossless common randomness?
  2. Is there a communications problem for which this definition is a single-letter characterization?

Monday, October 24, 2016

Invariant Distribution of a Second-Order Markov Chain?

A second-order Markov chain $X$ on a finite state space $\mathcal{X}$ is a stochastic process that satisfies
 \mathbb{P}(X_n=x|X_{n-1}=x_{n-1},\dots,X_1=x_1) = \mathbb{P}(X_n=x|X_{n-1}=x_{n-1},X_{n-2}=x_{n-2})
If the second term is invariant of $n$, we call the second-order Markov chain homogeneous and write
  Q_{x,y\to z}= \mathbb{P}(X_3=z|X_2=y,X_1=x)
We say that this Markov chain is irreducible, if and only if from every pair $(x,y)$ every other state $z$ can be reached in any number of steps. In other words, let
  Q^n_{x,y\to z}= \mathbb{P}(X_n=z|X_2=y,X_1=x).
Then, $X$ is irreducible, if and only if for every $(x,y)$ and every $z$ there exists an $n=n(x,y,z)\ge 1$ such that $Q^n_{x,y\to z}>0$. An even stronger condition is regularity: A second-order Markov chain $X$ is regular if and only if this integer $n$ does neither depend on $(x,y)$ nor on $z$. In this case, we write that $Q^n>0$.

We are now interested in the invariant distribution of $X$. In other words, we look for a probability distribution $\pi_{x,y}$ on $\mathcal{X}^2$ such that
  \pi_{y,z} = \sum_{x\in\mathcal{X}} \pi_{x,y}Q_{x,y\to z}.

More precisely, we are interested in the question whether there exists a unique invariant distribution. It is known, for example, that if $X$ is a first-order Markov chain, a unique invariant distribution exists if $X$ is irreducible. A fortiori, for a first-order Markov chain, a unique invariant distribution exists if $X$ is regular. Moreover, for a first-order Markov chain, a unique invariant distribution exists if $X$ is a so-called ergodic unichain, i.e., a chain with a single communicating class.

It is easy to show that if $X$ is a second-order Markov chain, that then the process $X^{(2)}$ with samples $(X_1,X_2)$, $(X_2,X_3)$, $(X_3,X_4)$, etc. is a first-order Markov chain. The hope is that this fact allows us to compute the invariant distribution (or prove its uniqueness) with the help of the simple first-order case (see, e.g., here). Unfortunately, in the setting we have here, this is not possible. It turns out that even if $X$ is a regular second-order Markov chain (and thus irreducible), the chain $X^{(2)}$ need not be irreducible. To this end, consider the following example:

Let $X$ be a second-order Markov chain on $\{1,2,3,4\}$ with transition matrix $Q$, where
$$Q=\begin{bmatrix} 0.5 & 0.5 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0.5 & 0.5 \\
0 & 0.5 & 0.5 & 0 \\
0.5 & 0 & 0 & 0.5 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 \\
0 & 0.5 & 0.5 & 0 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0.5 & 0.5\end{bmatrix}$$
In this matrix, the columns are labeled with states $\{1,2,3,4\}$, while the rows are labeled with the state tuples, i.e., $\{11,12,13,14,21,\dots,44\}$.

This Markov chain is regular, since $Q^{10}>0$, $Q^{11}>0$,... It turns out, however, that $X^{(2)}$ is not irreducible: $X$ is such that, depending on the initial states, we either have $1-2-3-4-1$ and $1-2-3-1$ or $1-4-3-2-1$ and $1-3-2-1$. It follows that $X^{(2)}$ has transient states $\{(1,1),(2,2),(2,4),(3,3),(4,2),(4,4)\}$ and communicating classes:
It follows that there is no unique distribution $\pi_{x,y}$.

To find out (a little bit) more, I recommend Chapter 7 of this book. Unfortunately, this is as much reference as I have found, the original works being only available in Romanian and on paper. I would be extremely grateful for any reference linked here!