0:45

And so we've looked at this in terms of visualizing graphs.

Graphs encode relationships between data points where the length of the edge is

irrelevant.

But if we want to look at a graph where the length is relevant,

we might want to look at ways of plotting the nodes of a graph or data,

based on the relationship between data points, based on distance.

So that we can try to preserve the distance in, for example,

a low dimensional, a two dimensional visualization of high dimensional data.

We may want the distance between data points in two dimensions to somehow

indicate the distance between the data points in higher dimensions.

And so that sets up a metric MDS,

metric multidimensional scaling where you try to preserve the distance between

data points even though you're displaying them at a lower dimensional space.

So here's a nice example.

We're not doing dimensionality reduction because we'll go from a two dimensional

space to a two dimensional space.

But it illustrates how we would compute a multidimensional scaling to try to

preserve the distance between points.

So here's distances between cities in the US, for example.

So we've got Boston, NYC, DC, Miami, Chicago, Seattle, SF, LA, and Denver.

And so the distance between, for example, Chicago and

NYC would be 803 miles in this example.

So the lower triangle of this data is showing the distance between

the city on the left and the city in the row, and the city in the column.

And so there'd be 0 distance between a city and itself.

And this will be symmetric, so

I haven't bothered to fill in the upper triangle of this matrix.

2:36

So given those distances, dij, between data points i and j,

between city i and city j, can we recover the positions of those cities?

So I'm not taking the coordinates of those cities,

I'm just taking the distances between those cities.

Basically taking a graph where I'm specifying the length of the edges between

the nodes.

And then I want to find the positions of the nodes that satisfies

those edge lengths.

So we can do that by minimizing this function.

We basically take the distance between the two data points,

and then we want to subtract their actual distance, and we want to minimize that.

So if the distance between two data points equals the desired distance

between those two data points, then this should be close to zero.

And we square things here so that we keep everything positive so

that negative values aren't messing up our minimization.

So we can solve this with any number of non-linear optimization methods.

In the example I'm going to describe, I solved it just using

the Excel spreadsheet program with an optimizer plug-in.

So here's the results that I got,

I minimized that distance relationship between the actual

distance from my optimized variables and the desired distance.

And as you can see, things are obviously in the wrong place.

I've got the East Coast on the left and West Coast on the east, but

that should still preserve distance.

And there's a few other differences,

but basically I've got this general east to west trend.

Seattle is above SF and LA, Denver is here, Chicago is where it's at.

So I've preserved the distance of my

points even though I haven't preserved the exact geometry of the points.

And so this can be a useful layout method, especially for high dimensional data when

you're trying to preserve the distances between the points without necessarily

representing accurately their spatial configuration, their original coordinates.

So we've thrown away the original coordinates of these points.

And just using their distance relationship,

tried to preserve that in a reprojection, in this case in two dimensions.

So there's all sorts of ways you can use MDS, multidimensional scaling.

You can visualize affinities between data points,

areas of collaboration based on co-authorship of papers.

If you're visualizing papers or the number of any other attributes two data

points might have in common, anything you can describe as a distance.

It doesn't have to be coordinate distance, it can be other similarities.

It's used in human-computer interaction for

user interface design to basically lay out buttons.

If you have a dashboard with a lot of buttons, you can lay out the buttons by

organizing them based on how similar the buttons are to a given task.

And then,

let multidimensional scaling compute the coordinates of where the buttons should be

located so that buttons that are used together are close to each other.

And it's also used in marketing to create what are called perceptual maps

that are based on survey data of what people think of different products,

what attributes people assign to different products.

And then you can figure out what products are similar to other products and

create a map of products that way.

So, multidimensional scaling is enabled by optimization.

You don't have to write your own optimization program,

you can use pre-existing optimization packages.

You just need something that's going to minimize a function, and so

you need some form of non-linear minimization in an optimization package.

I found one that was enabled in Excel in order to do

the example I described in the slides.

[MUSIC]