[MUSIC] Welcome back. So this time I want to talk about what this term scalable means. We made the point that working with really large data is an important aspect of data science, and we've mentioned the word scalability before, but we haven't really talked about what it might mean. So there are a couple of different ways to think about it that I want to cover on this slide. Operationally, and in the past, one way to think about it was: it needs to work on data that doesn't fit in main memory on a single machine. So maybe you still only have one machine to work with, but you need to be able to bring data off disk in pieces, operate on it, and then maybe write it out in pieces. A bundle of algorithms that could work on data in this fashion, bringing data piece by piece into memory so that the memory footprint at any given point was small, is something the database provided. You could write a query and you knew for sure that it was going to finish, as long as the data was there on disk and you had at least a minimal amount of memory to get started. Okay. And I might use the term out-of-core processing here. Out of core means it uses the disk to operate; in core means everything you're doing fits entirely in main memory. And so databases, the database community, were specialists at out-of-core processing of large data sets. But increasingly, this notion of scalability wasn't really enough. You saw this pretty acutely with websites that were coming online in the 2000s, where no matter how big a server you bought, you couldn't bring data off of disk fast enough to meet all the requests. So you had to start making sure that things were in memory, and the only way to do that is to start adding more machines. Okay. Google is especially known for this, although many of the large media companies do this as well. 
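To make the out-of-core idea concrete, here's a minimal sketch in Python. The file name and the statistic computed are made up for illustration; the point is that iterating over a file object pulls data off disk in buffered pieces, so only a tiny amount is in memory at any moment, no matter how large the file is.

```python
import os
import tempfile

def out_of_core_mean(path):
    """One pass over the file; only one line is in memory at a time."""
    total, count = 0.0, 0
    with open(path) as f:
        for line in f:            # the file object buffers reads from disk
            total += float(line)
            count += 1
    return total / count

# Tiny demo: write a "big" file to disk, then aggregate it piece by piece.
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt") as f:
    for i in range(10_000):
        f.write(f"{i}\n")
    path = f.name

print(out_of_core_mean(path))     # 4999.5
os.remove(path)
```

A real database query engine does something similar but far more sophisticated, spilling intermediate results to disk so a query finishes even when the working set exceeds RAM.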
What scalable really means now is being able to use up to thousands, or maybe even tens of thousands, of cheap computers and apply them all to the same problem. We might call this scale out, while getting a bigger and bigger main memory and more and more cores would be scale up. Okay. Fine, so another way of looking at this that is maybe a little bit more precise is to think about it in terms of algorithmic complexity, which you may or may not be familiar with, depending on how much computer science you've taken. But let me give you just a flavor of what's going on here. In the past, you might call an algorithm scalable if, given N data items, your algorithm does no more than N^m operations on them. Here m may be one, in which case it's a linear-time algorithm, or maybe two, which means it's a quadratic-time algorithm, and so on. But this was deemed tractable. You would prove that you could find a polynomial-time algorithm to solve some problem, and it was thought, well, assumed, to be scalable; that was the definition of what scalable was. Things that took more than this, for example exponential algorithms, where you might have on the order of m^N operations, grew much, much faster, and they didn't scale, okay. But this isn't a very tight bound on scalability in practice. A quadratic-time algorithm may be sort of feasible, but once you start getting into the fourth power and so forth it becomes pretty difficult for very large data sets. Okay? So now you would say that it really can't just be N^m. It's got to be N^m / K for some pretty large K, K being the number of computers you can apply to the problem. And so you have to come up with an algorithm that can really exploit this properly. 
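The scale-out idea, dividing the N items across K workers so each one touches only about N/K of the data, can be sketched as follows. This is a toy, not a distributed system: the "workers" here are threads standing in for separate machines, and the data and K value are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def shard(data, k):
    """Split data into k roughly equal pieces, one per worker."""
    n = len(data)
    return [data[i * n // k:(i + 1) * n // k] for i in range(k)]

def worker_sum(piece):
    """The per-worker job: operate only on this worker's shard."""
    return sum(piece)

data = list(range(1_000_000))
K = 4  # number of "machines" applied to the problem

# Each worker computes a partial result over N/K items; we then combine.
with ThreadPoolExecutor(max_workers=K) as pool:
    partials = list(pool.map(worker_sum, shard(data, K)))

print(sum(partials))              # same answer as sum(data)
```

The combine step at the end is the crucial part: the algorithm has to be restructured so partial results from K independent workers can be merged, which is exactly the shape that frameworks like MapReduce exploit.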
And then one more point that I'm going to make now, but we're not going to return to in this segment, though I hope to at the end of the course: it could be that soon even this isn't good enough, and for N data items, you really should do no more than N*log(N) operations. The N here means for every data item that comes in over the wire. So this is applicable to streaming applications, where the data is coming in so fast that you only get one pass at it. For every data item, I'm allowed to process that item and then I'm allowed to put it into some sort of an index, and that's this log(N) factor, okay? And whenever you see log, you should think trees. So I'm allowed to take each item, inspect it and work with it, and then stick it into some tree data structure, but that might be about it. The data is just too big to make multiple passes over. An example of this might be the Large Synoptic Survey Telescope that we heard about, taking in 30 terabytes a night. You can't make too many passes over that data. So this whole area we think of as streaming data, which I guess I have written here, but I'll write it again. And we'll come back to some of the techniques for dealing with big data in a streaming context. [MUSIC]
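The one-pass, log-per-item pattern above can be sketched with a heap, which is one of those tree-shaped structures: each arriving item is seen exactly once and inserted in O(log n) time, for N*log(N) work overall. The top-k task and the simulated stream here are made up for illustration.

```python
import heapq
import random

def top_k_stream(stream, k):
    """Keep the k largest items of a stream in a single pass."""
    heap = []
    for x in stream:                # each item is seen exactly once
        heapq.heappush(heap, x)     # O(log n) tree-structured insert
        if len(heap) > k:
            heapq.heappop(heap)     # discard the smallest; heap stays small
    return sorted(heap, reverse=True)

# Simulate a stream too big to store: a generator yields items one at a time.
random.seed(0)
stream = (random.randint(0, 10**6) for _ in range(100_000))
print(top_k_stream(stream, 5))
```

Note the stream is a generator, so once an item has been consumed it is gone; the algorithm never gets a second pass, which is exactly the constraint in the telescope example.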