Translational Immunology - Cross-entropy-test

What is a t-SNE?

t-Distributed Stochastic Neighbor Embedding (t-SNE) is the most commonly used non-linear dimensionality reduction algorithm for single cell biology. In its common usage for visualising high-dimensional single cell data, the algorithm starts with the single cells distributed at random points, drawn from a Gaussian distribution, in the transformed space. In an iterative process the cells move along a cost gradient, which penalises mismatch between the distance separating two cells in the original high-dimensional space and their distance in the representational low-dimensional space. The cost gradient of t-SNE places greater weight on pairs of cells close to each other, with medium- and long-range pairs largely ignored. When sufficient iterations have occurred to reach stability, the outcome is a set of clusters of similar cells, based on the input data. Membership of a cluster indicates shared properties; however, the non-linear nature of the penalty cost does not allow relationships to be inferred from the relative positioning of the clusters.
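The asymmetry in the penalty described above can be made concrete with a small sketch. It follows the textbook t-SNE cost (Gaussian affinities in the original space, Student-t affinities in the representation, Kullback-Leibler divergence between the two), but it is a simplified illustration rather than any particular package's implementation: the fixed `sigma` stands in for the per-cell bandwidth that real t-SNE tunes from the perplexity, and the four "cells" and both layouts are made-up values.

```python
import math

def sq_dists(points):
    """Squared Euclidean distance between every pair of points."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]

def high_dim_affinities(points, sigma=1.0):
    """Symmetrised Gaussian affinities P in the original space.
    Assumption: one fixed sigma, instead of a perplexity-tuned sigma per cell."""
    n = len(points)
    d2 = sq_dists(points)
    cond = []
    for i in range(n):
        row = [0.0 if i == j else math.exp(-d2[i][j] / (2.0 * sigma ** 2)) for j in range(n)]
        total = sum(row)
        cond.append([v / total for v in row])
    return [[(cond[i][j] + cond[j][i]) / (2.0 * n) for j in range(n)] for i in range(n)]

def low_dim_affinities(embedding):
    """Student-t (one degree of freedom) affinities Q in the representation."""
    n = len(embedding)
    d2 = sq_dists(embedding)
    w = [[0.0 if i == j else 1.0 / (1.0 + d2[i][j]) for j in range(n)] for i in range(n)]
    total = sum(sum(row) for row in w)
    return [[v / total for v in row] for row in w]

def kl_cost(P, Q):
    """t-SNE cost: Kullback-Leibler divergence between P and Q."""
    return sum(p * math.log(p / q)
               for row_p, row_q in zip(P, Q)
               for p, q in zip(row_p, row_q) if p > 0.0)

# Hypothetical data: two tight pairs of cells, far apart from each other.
cells = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 0.0, 0.0), (5.1, 0.0, 0.0)]
P = high_dim_affinities(cells)

# Layout 1 keeps each pair together but distorts the long-range distance.
stretched = [(0.0, 0.0), (0.1, 0.0), (20.0, 0.0), (20.1, 0.0)]
# Layout 2 keeps the long-range distance but pulls each close pair apart.
split = [(0.0, 0.0), (5.0, 0.0), (0.1, 0.0), (5.1, 0.0)]

cost_stretched = kl_cost(P, low_dim_affinities(stretched))
cost_split = kl_cost(P, low_dim_affinities(split))
print("long-range distance distorted:", cost_stretched)
print("close pairs separated:       ", cost_split)
```

In this toy case, quadrupling the gap between the two groups costs almost nothing, whereas separating neighbouring cells is penalised heavily, which is why the distance between clusters in a t-SNE plot carries little information.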

Running the same dataset through a t-SNE multiple times results in visually distinct stable states, owing to the random placement of the input data at the first stage. The high cost of violating local distances ensures that local clusters are maintained across runs, while the low cost of medium- and long-range pairs permits multiple stable states with rotational symmetry to develop. An under-appreciated aspect of the t-SNE algorithm is the early exaggeration of the penalty for violating local distances for the first 50 iterations. Visualising t-SNE runs at each iteration demonstrates that the early exaggeration phase involves a sharp contraction of all points, which then expand out into separate clusters when the exaggeration factor is removed. This early exaggeration is integral to the t-SNE calculation: maintaining the high penalty throughout results in dense overlapping clusters, while maintaining the low penalty throughout permits splitting of clusters, as cells close together in high-dimensional space do not come into close enough proximity in representational space to drive clustering. For consistent results, it is important to allow both phases, and sufficient iterations, for the t-SNE to reach stability.

Iteration-by-iteration visualisation of the t-SNE for two unique seeds of the same dataset. Note the initial collapse of the sample into very small distances, and the rotational symmetry observed between the two runs as the samples slowly expand with extra iterations.

Iteration-by-iteration visualisation of the t-SNE for two unique seeds of the same dataset, with no initial exaggeration of the penalty. As the initial collapse of the sample is reduced, similar cells can avoid coming into close enough contact to drive cluster formation. As a result, biological clusters are split in the final representation and repeat runs vary greatly.
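The contraction seen during the early exaggeration phase follows from how the exaggeration factor enters the gradient. The sketch below uses the standard t-SNE gradient with the high-dimensional affinities multiplied by a factor; the function name, the factor of 4, the step size and the two-cell example are illustrative assumptions, not the settings of any specific implementation.

```python
def tsne_gradient(P, Y, exaggeration=1.0):
    """Gradient of the t-SNE cost with respect to each embedded point.
    P: symmetric high-dimensional affinities (sums to 1 over all pairs).
    Y: current low-dimensional coordinates.
    exaggeration: factor applied to P (1.0 means no exaggeration)."""
    n = len(Y)
    # Unnormalised Student-t weights between embedded points.
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                w[i][j] = 1.0 / (1.0 + d2)
    total = sum(sum(row) for row in w)
    grad = []
    for i in range(n):
        g = [0.0] * len(Y[i])
        for j in range(n):
            if i == j:
                continue
            # Attraction scales with (exaggerated) P, repulsion with Q = w / total.
            coeff = 4.0 * (exaggeration * P[i][j] - w[i][j] / total) * w[i][j]
            for k in range(len(g)):
                g[k] += coeff * (Y[i][k] - Y[j][k])
        grad.append(g)
    return grad

# Hypothetical two-cell example: with a single pair, P and Q are both 0.5,
# so the unexaggerated gradient vanishes at any separation.
P = [[0.0, 0.5], [0.5, 0.0]]
Y = [[0.0, 0.0], [1.0, 0.0]]

grad_plain = tsne_gradient(P, Y, exaggeration=1.0)
grad_exag = tsne_gradient(P, Y, exaggeration=4.0)

# One gradient-descent step (illustrative step size) under exaggeration.
step = 0.1
Y_next = [[y - step * g for y, g in zip(yi, gi)] for yi, gi in zip(Y, grad_exag)]
print("gradient without exaggeration:", grad_plain)
print("gradient with exaggeration:   ", grad_exag)
print("positions after one step:     ", Y_next)
```

Without exaggeration the pair is already at equilibrium; with it, attraction outweighs repulsion and the two points are pulled together. Once the factor returns to 1, repulsion reasserts itself and the points can spread back out.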

Not just a visualisation tool

The low-dimensional representation of high-dimensional data makes t-SNE an attractive visualisation tool, yet it also has value as an analytical tool. We have developed the Cross Entropy test, a statistical test capable of distinguishing biological differences in single cell t-SNE representations, while being robust against false detection of differences in technical replicates or the seed-dependent variation in t-SNE generation. As the t-SNE algorithm is driven by the cross entropy of the individual cells in the dataset, and the t-SNE fixes the average point entropy, each t-SNE can be considered a distribution of cross entropy divergences. Deriving a distribution of cross entropy divergences per t-SNE plot therefore allows the use of the Kolmogorov-Smirnov test to evaluate the degree of difference between two, or more, t-SNE plots.
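The general idea can be sketched as follows. This is an illustration under stated assumptions and not the published implementation: the per-cell cross entropy is taken here as the sum over partners of -p log q, the affinity matrices and the three lists of per-cell values are made-up numbers, and the helper names are hypothetical. The exact definitions used by the test are given in the paper.

```python
import bisect
import math

def point_cross_entropies(P, Q, eps=1e-12):
    """One cross entropy value per cell: H_i = -sum_j P[i][j] * log(Q[i][j]).
    P, Q: affinities in the original space and in the representation."""
    return [-sum(p * math.log(max(q, eps)) for p, q in zip(row_p, row_q) if p > 0.0)
            for row_p, row_q in zip(P, Q)]

def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    between the empirical cumulative distributions of xs and ys."""
    xs, ys = sorted(xs), sorted(ys)
    gap = 0.0
    for v in xs + ys:
        fx = bisect.bisect_right(xs, v) / len(xs)
        fy = bisect.bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(fx - fy))
    return gap

# Toy affinities for two cells (hypothetical values).
P = [[0.0, 1.0], [1.0, 0.0]]
Q = [[0.0, 0.5], [0.5, 0.0]]
entropies = point_cross_entropies(P, Q)
print("per-cell cross entropies:", entropies)

# Made-up per-cell cross entropy distributions for three plots.
plot_a = [0.61, 0.64, 0.70, 0.72, 0.75]
plot_b = [0.62, 0.66, 0.69, 0.74, 0.76]  # similar to plot_a
plot_c = [1.10, 1.15, 1.22, 1.30, 1.41]  # shifted away from plot_a

d_ab = ks_statistic(plot_a, plot_b)
d_ac = ks_statistic(plot_a, plot_c)
print("KS statistic, a vs b:", d_ab)
print("KS statistic, a vs c:", d_ac)
```

The statistic is small when two plots have similar distributions of per-cell cross entropy and approaches 1 when they barely overlap. In practice a library routine such as `scipy.stats.ks_2samp` returns this statistic together with a p value.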

The Cross Entropy test is a useful tool for calculating p values on the difference between any two t-SNE or UMAP plots, whether the data come from flow cytometry, mass cytometry or single cell sequencing. Further, the test generates a quantitative comparison of the extent of differences, allowing you to compare multiple t-SNE or UMAP plots and identify outgroups and clustered samples. For a full explanation of the test, see our paper on arXiv. If you just want to run the test yourself, we have scripts for both cytometry and single cell sequencing on GitHub. Oh, and if you are a biologist who is worried about R, we’ve got you covered: here is a gentle walk-through on how to use the test.

The Liston-Dooley Laboratory