Let's look at some of the technical challenges involved in astronomical data and calculations, and how we tackle those challenges. We're going to look at how much data we want to store, how long it takes to search through it, and how long it takes to do calculations. In each of those cases we'll see there's a brute-force method and a smart method, and in fact we usually want to use both. Then right at the end we'll wrap up by looking at how we try to make data access over the internet as easy as possible.

So let's talk about how much data we need. First of all, let's think about a single CCD image. Maybe one CCD image is about 4,000 pixels across, and each of those pixels stores a number with 16 bits, which is 2 bytes, of information. If you do the sums, that adds up to 32 megabytes. So that's what a single CCD image might require for storage - not too much by modern standards. However, as you've seen, we tend to use large mosaic cameras with many CCDs put together in a big array. With those gigapixel cameras, a single image might be one to a few gigabytes. So that's a big image.

Now what if we want to survey the whole sky? Well, you could ask how many pixels there are over the whole sky. If we try to pave the sky with CCD images, what do we need? It depends how fine the pixels are, but let's suppose the pixels are about a third of an arcsecond - 0.3 arcseconds - which makes them small enough to get reasonably decent images. That makes about 5 trillion pixels over the whole sky. Again, assume each pixel is a 16-bit number - by the way, 16 bits is enough to store numbers up to about 65,000, which is typically what CCDs produce. So covering the whole sky, 5 trillion pixels makes about 10 terabytes. Now of course we actually want to do the same thing at several different wavelengths - different colors - and we also want to repeat the survey over time. So in practice a typical modern survey of the whole sky ends up at something like petabyte scales.

Now, a reminder about all of these petas and gigas and so on. A factor of 10 to the power 3 - a thousand, in scientific notation - that's kilo, as in kilobyte, etcetera. 10 to the power 6, a million, is a big M for mega. 10 to the power 9, a billion, is a big G for giga. 10 to the power 12, a trillion, is a big T for tera. And then we get up to 10 to the power 15, a thousand trillion: that's a big P for peta. So a petabyte is a thousand trillion bytes, and remember each byte is 8 bits in computer speak. So that's how big a sky survey needs to be.

Now today, if you've got a reasonably good laptop, you could get yourself a 1 terabyte disk - it's not that unusual. So to store a sky survey for yourself, you'd need about 1,000 laptops. It's doable, but really a bit daft. If every astronomer has to have 1,000 laptops to store their own copies of all their favourite databases, that's just silly. So here's the smart solution. What it drives us into is what's known as a service economy. Around the world there's a handful of big data centers - one of which is here in Edinburgh, and there are a number of others - where the major sky surveys are stored on big computers with lots of disks. Astronomers around the world then go over the internet and fetch only the data they want, or run calculations or searches on our servers here. So that's the smart way to do it.
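Just to make that arithmetic concrete, here is a quick back-of-envelope sketch in Python. The CCD size, pixel depth and pixel scale are the figures quoted above; the sky area of roughly 41,253 square degrees is a standard value; and the number of wavebands and repeat visits at the end are purely illustrative assumptions, not numbers from any particular survey.

```python
# Back-of-envelope storage estimates, using the numbers quoted in the text.

BYTES_PER_PIXEL = 2                       # 16 bits = 2 bytes per pixel

# A single CCD image: 4,000 x 4,000 pixels.
ccd_bytes = 4000 * 4000 * BYTES_PER_PIXEL
print(f"One CCD image: {ccd_bytes / 1e6:.0f} MB")            # ~32 MB

# Paving the whole sky with 0.3-arcsecond pixels.
# The sky covers about 41,253 square degrees; one square degree is
# 3600 x 3600 square arcseconds.
sky_sq_arcsec = 41_253 * 3600 * 3600
sky_pixels = sky_sq_arcsec / 0.3**2
sky_bytes = sky_pixels * BYTES_PER_PIXEL
print(f"Whole sky: {sky_pixels:.1e} pixels, about {sky_bytes / 1e12:.0f} TB")
# -> roughly the "5 trillion pixels, about 10 terabytes" quoted above.

# Several wavebands and repeat visits multiply this up towards petabytes.
# (5 bands and 20 epochs are illustrative assumptions only.)
print(f"5 bands x 20 epochs: about {sky_bytes * 5 * 20 / 1e15:.1f} PB")
```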
It's not just the pixels - the images - that astronomers want, but also catalogues of objects. Imagine an image of the sky with lots of stars and galaxies on it. We have software that goes over the image and spots each of those objects. Each object becomes a row in a table, and there are lots of columns, each one holding a different piece of information about that object. That table is our catalogue. So how big are these catalogues? Well, nothing like as big as the pixel data. Out of a sky survey we might have a billion objects, or a few billion, and maybe there are 50 of these columns. If each entry is a couple of bytes, we end up with something like 100 gigabytes for a big sky survey. So that's no problem to store; however, searching through it is something else, and that's what we'll look at next.

So how do we search through a table of a billion objects and find just the one we want - that redshift seven quasar, or the killer rock, or whatever? Imagine we've got lots of rows in our table sitting on the hard drive of our computer, and over here is the CPU, the bit that does the calculating. In order to do our search, essentially what we have to do is take one row of data, bring it into the CPU, do a calculation, and decide whether we want that one or not. Then we take the next row and do it again, and the next row, and so on. Now, imagine all this data streaming from the hard drive to the CPU. A good PC will run at gigahertz rates, so in principle you can stream a billion rows of data through in a split second - not a problem. However, it doesn't really work like that. In any search process like this, any transfer from the hard drive to the CPU is done in lots of chunks, and each of those has some kind of overhead. That overhead may be only a few milliseconds, but when you multiply a billion by a few milliseconds, you're into a much longer time - it could take days to search through your big database. Now, modern solid state disks, as opposed to spinning hard drives, have much smaller overheads - they're faster - but they're expensive, so data centers, at least scientific ones, aren't using them for bulk storage. So we need a smarter way to do the searching.

The key point is that you don't necessarily have to search through everything every time. The first thing - and this is just the same as, say, Google or Amazon do - is to save the most popular searches so they can be brought back quickly the next time somebody asks pretty much the same thing. The next thing is to look at the various columns, figure out which ones are queried most often, and sort your database on those so you can search through them quickly. And then you can pick some of those columns to build an index on, and search through those particularly quickly - the little sketch below shows the difference that makes.
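Here is a minimal sketch of that difference in Python. The billion-row catalogue and the millisecond-scale per-access overhead are the figures quoted above; the small random "catalogue", the redshift column, and the cut at z > 6.5 are illustrative stand-ins, not part of any real survey.

```python
import numpy as np

# Brute-force scan: a billion row accesses, each paying a millisecond-scale
# overhead -- the back-of-envelope picture described in the text.
n_rows = 1_000_000_000
overhead_per_access = 1e-3        # seconds -- "a few milliseconds" order of magnitude
print(f"Brute-force scan: about {n_rows * overhead_per_access / 86400:.0f} days")

# The smart alternative: keep the column you query most often sorted in advance
# (a simple form of index) and use binary search, which touches only ~log2(N) rows.
rng = np.random.default_rng(42)
redshift = np.sort(rng.uniform(0.0, 7.0, size=1_000_000))    # toy stand-in catalogue
lo, hi = np.searchsorted(redshift, [6.5, 7.0])                # objects with 6.5 < z < 7
print(f"Indexed search touched ~{int(np.log2(redshift.size))} rows, "
      f"found {hi - lo} candidates")
```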
So let's talk about astronomical calculations and how long they take. I'll take as an example the so-called N-body calculations that cosmological theorists use. The idea is that we take lots of fake matter particles; take any two of those particles and we can calculate the gravitational force between them, which tells us how they're going to move over time. But we need to take every possible pair of particles and calculate the forces between them all to understand how the whole ensemble of particles is going to evolve with time. Now, a big calculation might have a million fake matter particles, and a really big one might have 100 million. That step from a million to 100 million makes a big difference, as we'll see.

But first, let's make the basic point. Imagine we've got one of these big simulations with 100 million particles - that's 10 to the power 8 particles. Let's assume, to begin with, that we're doing a single calculation on each particle. On a fast computer that's going to take less than a second; it's not a problem. Let's just say it takes 1 second. However, as I just described, we need to do not one calculation per particle, but one calculation for every pair of particles. So we need to do 10 to the power 8 times 10 to the power 8 calculations - every particle in that 10 to the 8 has to be paired with all the others. That's an enormous amount of time: it's going to take years. And that would essentially make one frame, one time step, in the simulation movie we saw earlier - and you need lots of those to see how the universe evolves. So this is just very difficult.

The brute-force solution is to say, okay, if this is what one PC can do, we need a supercomputer. A supercomputer is really just thousands of computers chained together working in parallel. A big supercomputer might have several thousand nodes, and each of those nodes might have 10 or 20 cores in it, so there could effectively be many thousands of CPUs working in parallel. But even then there are two snags. It still wouldn't be fast enough to get this sort of calculation done really quickly - and the other snag is that those machines cost millions of pounds. We'd like to do something a bit smarter.

The smart solution is to do with being approximate. What I've described is what you have to do if you're going to do the calculation exactly: every particle and its effect on every other particle. But there are shortcuts, which roughly speaking exploit the fact that particles further away are, as individual pairs, less important. (They do add up to a lot.) We haven't got time to explain exactly how this works, but for the mathematically minded: instead of a problem that scales as N times N, we get a speed-up so that it scales as N times the logarithm of N. And that makes a very serious difference to the speed of the calculation.

So for instance, imagine we have 10 to the 6 particles, and say that doing N calculations takes us 1 second. If we then do the N times N version, it's going to take 12 days - a million seconds. If we do the N log N version - the approximate but smart calculation - it takes 14 seconds. That's a huge difference. Then if we move up to 10 to the 8 particles, the simple N calculations take a couple of minutes, the naive N times N calculation comes out at about 317 years - obviously impossible, and infeasible even with a supercomputer - and the N times log N method comes out at about 30 minutes, as the little sketch below spells out.
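Here is that arithmetic as a short Python sketch. The only assumption beyond what's quoted above is the baseline of roughly a microsecond per force calculation, which is what "10 to the 6 calculations in about a second" implies; the logarithm is taken as the natural log.

```python
import math

# Reproducing the timing estimates quoted above, assuming roughly one
# microsecond per force calculation.
TIME_PER_CALC = 1e-6   # seconds

def pretty(seconds):
    """Express a time in a convenient unit."""
    if seconds < 120:
        return f"{seconds:.1f} seconds"
    if seconds < 2 * 86400:
        return f"{seconds / 60:.0f} minutes"
    if seconds < 2 * 3.154e7:
        return f"{seconds / 86400:.0f} days"
    return f"{seconds / 3.154e7:.0f} years"

for n in (10**6, 10**8):
    print(f"N = {n:.0e}")
    print(f"  N       (one calc per particle): {pretty(n * TIME_PER_CALC)}")
    print(f"  N*N     (every pair, exact)    : {pretty(n**2 * TIME_PER_CALC)}")
    print(f"  N*log N (approximate, smart)   : {pretty(n * math.log(n) * TIME_PER_CALC)}")
```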
Now notice that 30 minutes is still quite a long time to calculate one timestep in the simulation, but at least it's something plausible that we can do. So big calculations are genuinely difficult.

So we've been talking about technical challenges: how much data there is to store, how long it takes to search, how long it takes to do calculations. All those problems are about machine time - how long it takes a computer or a supercomputer to perform these tasks. But the real-world problem is not just about machine time; it's about human time as well. For example, what we don't want is that every time we go and get some data we have to spend a whole afternoon working out how this particular website works, and then, when we get the data, write a special piece of software to deal with it and plot it on top of something else. All of that just uses up astronomer time, even if the machines are very fast.

Now, the modern internet that we're used to - for looking up information, doing our shopping and so on - is very point-and-click. It's very easy. It took a lot of effort to get it that way, but it's automated. We want astronomy to work the same way: grabbing data, mixing and matching it. That ideal is known as the virtual observatory. And the secret - whether for the virtual observatory or for the internet as a whole - is the same thing: standardization. What we need, essentially, is for everything to have the same screw threads, so the bits fit together - the different web pages, the data sets, and so on. For the internet that's about things like TCP/IP, the basic internet protocols; or HTTP, which defines how you speak to a website; or HTML, how you write the content of a web page. All of those are standards that were agreed internationally, and that's what makes it all seem magic and easy. In the same way, in astronomy we want to standardize the format of the data, the data access protocols, and so on. And we're in the middle of that process as we speak.
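To give a flavour of what one of those astronomy standards looks like in practice, here is a minimal sketch of a query using the IVOA Simple Cone Search protocol, which defines a plain HTTP request with RA, DEC and SR (search radius) parameters in decimal degrees and returns a table in the standard VOTable format. The service URL below is a made-up placeholder, not a real endpoint; the point is that any compliant data center's service accepts the same parameters and returns the same kind of table.

```python
import requests

# Hypothetical cone-search endpoint -- a placeholder, not a real service.
SERVICE_URL = "https://example-datacentre.org/scs/query"

# Standardized parameters: sky position and search radius, in decimal degrees.
params = {"RA": 180.0, "DEC": -1.5, "SR": 0.1}

response = requests.get(SERVICE_URL, params=params, timeout=30)
response.raise_for_status()

# The reply is a VOTable (an XML table format standardized by the IVOA), so any
# VO-aware tool can parse it and overlay the results on other catalogues.
print(response.text[:500])
```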