Hi, my name is Maggie Nguyen.

I'm a PhD student in the Department of Statistical Science at Duke University.

>> Hi, my name is Sunith Suresh. I'm a current master's student in the Department

of Statistics at Duke University.

>> And here we have Steve Scott at Google.

He's going to share his expert insight about Bayesian statistics in the industry.

So thank you so much, Steve, for joining this interview.

The first question I have for you is: why does Google use Bayesian statistics?

>> Hi guys, it’s great to be with you.

So Google uses Bayesian statistics in lots of different ways, and I guess it depends

on which part of the company and which analyst you’re dealing with.

One of the big uses that we have is for people that want to fit fancy models.

Bayesian model averaging and Bayesian variable selection get a lot of use in

terms of averaging linear models and logistic regression type models.

We have a product called Google Consumer Surveys where sometimes we need to

expand the group of people that it is applicable to.

And Bayesian variable selection turns out to be very, very helpful in that context.

Another reason people use it is that they need correct and

coherent representations of uncertainty.

So getting the error bars right on your forecasts, that sort of thing.

There's a product that we have that takes the Bayesian model averaging tool,

pairs it up with a time series tool, and uses that for

computing the counterfactual in a causal inference problem.

So you can tell after someone runs an advertising campaign,

what did they get out of it?

And a real major use case that combines both of those features is

multi-armed bandits for website optimization.

So Google Analytics has a product that somebody could use to improve their

website if they want to figure out whether the button should be red or

blue or they should use this picture or that picture.

Then we can use this multi-armed bandit tool to make the A/B testing that you would

go through for that better, faster, and cheaper.

And Bayesian modeling is the engine behind the multi-arm bandit engine.
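
As a rough illustration of how a Bayesian model can drive a bandit, here is a minimal sketch of Thompson sampling with a Beta-Bernoulli model for two page variants. The interview does not say which algorithm or priors Google uses, so the function name `thompson_sampling`, the uniform Beta(1, 1) priors, and the simulated conversion rates are all assumptions made for this example.

```python
import random


def thompson_sampling(true_rates, n_rounds=5000, seed=0):
    """Simulate a Beta-Bernoulli bandit (illustrative sketch only).

    true_rates: hypothetical conversion rates, one per page variant (arm).
    Returns per-arm counts of pulls, successes, and failures.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms

    for _ in range(n_rounds):
        # Draw one plausible conversion rate per arm from its posterior,
        # Beta(1 + successes, 1 + failures), assuming a Beta(1, 1) prior.
        draws = [
            rng.betavariate(1 + successes[a], 1 + failures[a])
            for a in range(n_arms)
        ]
        # Show the variant whose sampled rate is highest.
        arm = max(range(n_arms), key=lambda a: draws[a])

        # Simulate whether this visitor converts on the chosen variant.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1

    return pulls, successes, failures
```

Calling `thompson_sampling([0.05, 0.20])` simulates a red-versus-blue button test. Because traffic shifts toward the better variant as evidence accumulates, fewer visitors see the weaker one than in a fixed 50/50 split.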

>> That's quite interesting,

could you tell us a little bit more about Bayesian statistics at Google scale,

since it's such a big company and you have a tremendous amount of data?

>> It's a very big company.

It's got a tremendous amount of data, and we have a reputation of not solving

a problem unless it's terabytes and terabytes.

But that's not actually true. In addition to all of the search and

ad system type problems that Google's famous for, we're also a company and

we need to know how we're doing.

So there's lots and

lots of analyses that get done here by analysts that are working in R,

like on a laptop or a desktop. That is not to say that the people who use the really,

really big models wouldn't use Bayesian statistics if it were available.

I think right now, being able to do the types of things we normally associate with

Bayesian stats, in terms of Monte Carlo-based inference,

that's a research topic that we're making progress on, but

I don't think we're quite there yet.

So as I mentioned, a lot of these analyses get done kind of at laptop scale or

desktop scale in R.

Some of those do get automated and they can be pushed out where every user,

every advertiser, every publisher gets that analysis done for

them in an automated fashion.

So that's a way that a Bayesian inference might get scaled up.

So you might have a bunch of independent analyses running in parallel and

we're really good at that sort of thing.

That's the sort of task where the analyst would originally do the analysis and

sort of figure out how it should work.

And then an engineer would get involved to help replicate it for everybody.
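
The idea of scaling up by running many independent analyses in parallel can be sketched as follows. This is only an illustration under stated assumptions: the record layout `(name, clicks, impressions)`, the Beta(1, 1) prior, and the helper names `posterior_summary` and `analyze_all` are hypothetical, and a thread pool stands in for whatever infrastructure would actually distribute the work.

```python
from concurrent.futures import ThreadPoolExecutor


def posterior_summary(record):
    """Small Bayesian analysis for one advertiser (hypothetical layout).

    record: (name, clicks, impressions).
    With a Beta(1, 1) prior and a binomial likelihood, the posterior for
    the click rate is Beta(1 + clicks, 1 + impressions - clicks).
    Returns (name, posterior mean, posterior standard deviation).
    """
    name, clicks, impressions = record
    a = 1 + clicks
    b = 1 + impressions - clicks
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return name, mean, variance ** 0.5


def analyze_all(records, max_workers=4):
    """Run the same analysis independently for every record.

    Each analysis touches only its own record, so the work can be
    farmed out without any coordination; results keep the input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(posterior_summary, records))
```

For example, `analyze_all([("advertiser_a", 10, 100), ("advertiser_b", 3, 40)])` returns one posterior summary per advertiser. The analyst designs `posterior_summary` once, and the replication step is just mapping it over everyone.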

>> And kind of following up on that question, so could you tell us in your

opinion what skills would make a successful data scientist at Google?

>> Sure, I think the main thing you want to start off with

is a solid education in statistics.

So there's a question about what is data science and

I know that data science isn't all statistics.

But to be a data scientist particularly in a place like Google you

want to start off being a really good statistician.

And what that means is that you want to have deep knowledge in a subject matter

area that is going to be relevant for Google.

And that might be machine learning, it might be Bayesian statistics,

it might be experimental design, it might be whatever.

But you want to be really, really good in one area of statistics.

But that is not enough, because you also need to be broad.

You need to learn a lot about the field.

You need to have broad knowledge, because you're not necessarily going to be

able to choose the problems that cross your desk,

like you would if you were in an academic job and

you're in control of your research agenda.

So you may have to solve a problem that's not directly in your deep area of

expertise.

But the problem still needs to be solved and so

breadth is certainly an important part of the job as well.

So that's statistics knowledge, but there's parts

of being a data scientist that are not statistics and those are important too.

One of them, and it sounds weird to say this to a statistician, is data.

So there is a whole lot about being a data scientist that is

not about the modelling aspect of the problem.

It is about sort of understanding what data might be available to answer

a particular question.

And how to find those data, understanding how to get the data out of a different

system, out of the format that they're in, and into the format that you need them in.

And being willing to think through the steps: you know, with the huge

data in the database, somebody put those fields in there and gave them names.

And you need to make sure that the numbers that are in the database are actually what

the name suggests that they would be, and that they're reliable.

And there's a whole bunch of unsexy stuff like that that turns out to be really,

really important and it's not super fun to sit in a class and learn it.

It's not super fun to sit in a class and teach it, but

in the world out there it's pretty important.
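
A tiny example of the kind of check being described: confirming that a field whose name suggests a rate really holds values between 0 and 1. The field name `click_rate`, the list-of-dicts row format, and the helper `check_rate_column` are hypothetical and chosen only for illustration.

```python
def check_rate_column(rows, field):
    """Return (row index, value) pairs where `field` does not look like a rate.

    A value is flagged if it is missing, not a number, or outside [0, 1].
    NaN is flagged too, because every comparison with NaN is False.
    """
    problems = []
    for i, row in enumerate(rows):
        value = row.get(field)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not (0.0 <= value <= 1.0):
            problems.append((i, value))
    return problems
```

Running this on rows where someone stored percentages (such as `12.0`) in a column named like a proportion would surface the mismatch before any model is fit.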

And that segues into a third thing, which is that in order to be

good at the data management side of things,

you need to be comfortable working in a computer language other than R.

R is a fantastic language for the modeling side of things, the visualization side.

But for the data management side,

you want to be willing to kind of learn a new language as you need it.

So Python's a really great language,

if you're going to talk about the stuff that's out there in the world.

But if you work at Google then there's a language that we have for

data manipulation.

If you work at Facebook or Amazon, you can sort of expect that all these places will

have written their own language to solve this problem, and

you need to not be scared of that.

You want to be comfortable enough with computing to make that just not be

a problem.