tSNEJS demo

Intro

t-SNE is a visualization algorithm that embeds things in 2 or 3 dimensions according to some desired distances. If you have some data and you can measure their pairwise differences, t-SNE visualization can help you identify various clusters.

Example

In the example below, we identified the 500 most followed accounts on Twitter, downloaded 200 of their tweets, and computed similarities between them based on the types of things they talk about (bigram tf-idf dot products). These similarities are then fed to t-SNE and visualized in 2 dimensions. People who use similar words and phrases end up nearby in the visualization. I documented the creation of this demo in a blog post.

Go ahead, pan and zoom around with the mouse!
(Note: if nothing is showing up, it can take a while to load all the images.)
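
For readers curious what "bigram tf-idf dot products" might look like in practice, here is a minimal sketch. It is not the code used for the demo: the function names (`bigrams`, `tfidfVectors`, `tfidfSimilarities`), the whitespace tokenizer, and the `log(n / df) + 1` idf variant are all assumptions made for illustration.

```javascript
// Split text into lowercase word bigrams, e.g. "a b c" -> ["a b", "b c"].
function bigrams(text) {
  var words = text.toLowerCase().split(/\s+/).filter(function (w) {
    return w.length > 0;
  });
  var out = [];
  for (var i = 0; i + 1 < words.length; i++) {
    out.push(words[i] + " " + words[i + 1]);
  }
  return out;
}

// Build one L2-normalized tf-idf vector (a sparse object) per document.
function tfidfVectors(docs) {
  // Raw bigram counts per document.
  var counts = docs.map(function (doc) {
    var c = Object.create(null);
    bigrams(doc).forEach(function (b) { c[b] = (c[b] || 0) + 1; });
    return c;
  });
  // Document frequency: in how many documents each bigram appears.
  var df = Object.create(null);
  counts.forEach(function (c) {
    Object.keys(c).forEach(function (b) { df[b] = (df[b] || 0) + 1; });
  });
  var n = docs.length;
  return counts.map(function (c) {
    var vec = Object.create(null);
    var norm = 0;
    Object.keys(c).forEach(function (b) {
      // Assumed idf variant: log(n / df) + 1.
      var w = c[b] * (Math.log(n / df[b]) + 1);
      vec[b] = w;
      norm += w * w;
    });
    norm = Math.sqrt(norm);
    if (norm > 0) {
      Object.keys(vec).forEach(function (b) { vec[b] /= norm; });
    }
    return vec;
  });
}

// Dot product of two sparse vectors.
function sparseDot(a, b) {
  var sum = 0;
  Object.keys(a).forEach(function (key) {
    if (key in b) { sum += a[key] * b[key]; }
  });
  return sum;
}

// N x N matrix of tf-idf dot products between all documents.
function tfidfSimilarities(docs) {
  var vecs = tfidfVectors(docs);
  return vecs.map(function (a) {
    return vecs.map(function (b) { return sparseDot(a, b); });
  });
}
```

Because the vectors are normalized, each entry lies between 0 (no shared bigrams) and 1 (identical bigram profile). In this sketch each document would be the concatenated tweets of one account.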

Second Example

In a second example, we take word embeddings that are trained using an unsupervised learning algorithm and visualize them using t-SNE. Head over here to see it!

JavaScript Library

The tSNEJS library implements the t-SNE algorithm and can be downloaded from GitHub. The API looks as follows:

var opt = {epsilon: 10}; // epsilon is learning rate (10 = default)
var tsne = new tsnejs.tSNE(opt); // create a tSNE instance

// initialize data. Here we have 3 points and some example pairwise dissimilarities
var dists = [[0.0, 0.1, 0.2], [0.1, 0.0, 0.3], [0.2, 0.3, 0.0]]; // symmetric, zero diagonal
tsne.initDataDist(dists);

for(var k = 0; k < 500; k++) {
  tsne.step(); // every time you call this, solution gets better
}

var Y = tsne.getSolution(); // Y is an array of 2-D points that you can plot
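
If you start from raw feature vectors rather than a ready-made dissimilarity matrix, you first need to build that matrix yourself. The following is a minimal sketch of one way to do it, assuming Euclidean distance is a sensible measure for your data. The helper name `pairwiseDistances` and the example points are made up for illustration and are not part of the library.

```javascript
// Build a symmetric N x N matrix of Euclidean distances between points.
// Each point is an array of numbers; all points must have the same length.
function pairwiseDistances(points) {
  var n = points.length;
  var D = [];
  for (var i = 0; i < n; i++) {
    D.push(new Array(n).fill(0)); // diagonal stays 0
  }
  for (var i = 0; i < n; i++) {
    for (var j = i + 1; j < n; j++) {
      var sum = 0;
      for (var k = 0; k < points[i].length; k++) {
        var diff = points[i][k] - points[j][k];
        sum += diff * diff;
      }
      var d = Math.sqrt(sum);
      D[i][j] = d; // fill both halves so the matrix is symmetric
      D[j][i] = d;
    }
  }
  return D;
}

// Hypothetical example data: three points in 2-D.
var examplePoints = [[0, 0], [3, 4], [6, 8]];
var exampleDists = pairwiseDistances(examplePoints);
// exampleDists has the same shape as `dists` above and could be
// handed to tsne.initDataDist(exampleDists).
```

Any other dissimilarity measure can be swapped in for the inner loop, as long as the result is symmetric with zeros on the diagonal.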

Embed your CSV data with the Web Demo

Using the web demo here, you can run t-SNE on your own data in CSV format.
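
If you would rather load CSV data in your own script, a rough sketch of the parsing step follows. This is not the web demo's code: the helper name `parseNumericCsv` is made up, and it assumes a purely numeric file with comma-separated values, no header row, and no quoted fields.

```javascript
// Parse CSV text into an array of rows, each row an array of numbers.
// Assumes: comma delimiter, no header line, no quoted fields.
function parseNumericCsv(text) {
  return text
    .split(/\r?\n/)                                         // handle \n and \r\n
    .map(function (line) { return line.trim(); })
    .filter(function (line) { return line.length > 0; })    // skip blank lines
    .map(function (line) {
      return line.split(",").map(function (cell) {
        return parseFloat(cell);                            // non-numeric cells become NaN
      });
    });
}
```

Each resulting row is one data point. From there you can compute pairwise dissimilarities between the rows and pass them to `initDataDist`.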

Algorithm Details

The algorithm is described in this paper:

L.J.P. van der Maaten and G.E. Hinton.
Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008.
[ PDF ] [ Supplemental Material (24MB) ]

In short, the algorithm computes a matrix P that is related to distances between all elements in the original space. The variables of the problem are the embedding point locations, which similarly give rise to their own matrix Q. The algorithm then minimizes a cost function that measures the difference between P and Q (in the paper, the Kullback-Leibler divergence).
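
To make the P versus Q idea concrete, here is a small sketch of how Q and the cost could be computed for a given embedding, following the paper's definitions (a Student-t kernel for Q, Kullback-Leibler divergence for the cost). It is an illustration and not the library's internal code. The function names `lowDimAffinities` and `klDivergence` are made up, and computing P from the original distances (which involves the perplexity search) is left out.

```javascript
// Q matrix for an embedding Y (array of points):
// q_ij is proportional to 1 / (1 + ||y_i - y_j||^2), normalized over all i != j.
function lowDimAffinities(Y) {
  var n = Y.length;
  var Q = [];
  var total = 0;
  for (var i = 0; i < n; i++) {
    Q.push(new Array(n).fill(0));
  }
  for (var i = 0; i < n; i++) {
    for (var j = 0; j < n; j++) {
      if (i === j) { continue; }
      var d2 = 0;
      for (var k = 0; k < Y[i].length; k++) {
        var diff = Y[i][k] - Y[j][k];
        d2 += diff * diff;
      }
      Q[i][j] = 1 / (1 + d2);
      total += Q[i][j];
    }
  }
  for (var i = 0; i < n; i++) {
    for (var j = 0; j < n; j++) {
      Q[i][j] /= total; // off-diagonal entries now sum to 1
    }
  }
  return Q;
}

// KL(P || Q) = sum over i != j of p_ij * log(p_ij / q_ij).
function klDivergence(P, Q) {
  var cost = 0;
  for (var i = 0; i < P.length; i++) {
    for (var j = 0; j < P.length; j++) {
      if (i !== j && P[i][j] > 0) {
        cost += P[i][j] * Math.log(P[i][j] / Q[i][j]);
      }
    }
  }
  return cost;
}
```

The cost is zero when Q matches P exactly and positive otherwise. The optimization moves the embedding points to lower this number.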

About

Maintained by @karpathy.