Today we're going to talk about estimating the camera pose from an image. The question we'll ask today is: where am I? This is an easy question to answer if you have a cell phone; you can just take it out and look up your GPS location. Your cell phone, in fact, also comes with a gyro and a gravity sensor, so it tells you the orientation of your phone as well. Therefore you know where you are. But what if I'm in a situation where I don't have a GPS signal? How do I figure out where I am just by looking at a picture? Today we're going to talk about how to estimate the pose from images alone.

Recall that we have already seen an example of estimating camera pose by looking at vanishing points. Again, we're back at the street corner where we see two streets, a building, and the corner. If we can identify the streets in our image, we know each road is bounded by a pair of parallel lines, and each pair of lines converges at a point at infinity, which we can see in the image. Once we localize that point at infinity, typically by intersecting the lines in the image at a single point, we know that point in pixel space, together with the optical center, forms a ray in 3D space. That ray, from the optical center through the vanishing point in the image and out into space, is in fact parallel to the street we are looking at in the physical world. Therefore, if I know that optical ray, we know exactly the orientation of the street, the z direction. And that allows us to compute the orientation angles for two of the axes: we know the pan and tilt angles of the camera in this case.

Just to be sure we know how to do this, recall that we always need to take the pixel coordinate and convert it back to the optical world through the camera calibration matrix K. That is, K inverse times the vanishing point in the image equals the optical ray, and the optical ray is in fact the direction of the z-axis, the street axis, in the 3D world.
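The back-projection step just described can be sketched in a few lines of numpy; the intrinsics K and the vanishing point below are hypothetical values used only for illustration.

```python
import numpy as np

# Hypothetical camera intrinsics (focal length 800 px, principal point 320, 240).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# A hypothetical vanishing point in homogeneous pixel coordinates.
vp = np.array([480.0, 260.0, 1.0])

# Back-project through K inverse to get the optical ray, then normalize:
# this unit vector is the 3D direction of the street (the z-axis).
ray = np.linalg.inv(K) @ vp
z_axis = ray / np.linalg.norm(ray)
```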
We normalize that vector to have norm one, and from that we can estimate two of the rotation angles. So how do we get all three rotation angles? That requires us to look at two vanishing points, coming from two perpendicular directions in 3D space. So we must find two lines in the physical world that are perpendicular to each other, and those two perpendicular directions must each produce a vanishing point in the image. The procedure is the same: identify the two vanishing points in the image, compute their pixel coordinates, convert each pixel coordinate to an optical ray through K inverse, and then normalize them, forming two axes of the rotation. By taking the cross product between those two axes, we obtain the third axis and hence the full three-dimensional rotation matrix. So we know the orientation of the camera.

Recall we also have another way to compute the camera pose: if you have two friends taking pictures of the same objects in the 3D scene, and they are willing to share the pictures. From the two pictures we can compute the epipolar geometry between them by clicking on eight corresponding points, and that epipolar geometry allows us to recover the translation and the rotation between the two views.

So estimating camera rotation and translation is an important concept. It is important for moving in 3D space: if we were a robot needing to navigate relative to this building, we would need to know how we rotate and translate from one position to another. If you were working on augmented reality, you would need to estimate the camera position relative to the world so you can insert an object correctly into the 3D scene. And this is also important for virtual reality, where we have head-mounted displays and we need to know exactly the orientation of the head relative to the space as we move through it.
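The two-vanishing-point procedure above can be sketched as follows; the function name and the small orthogonalization step against noise are my own additions, not from the lecture.

```python
import numpy as np

def rotation_from_vanishing_points(K, vp1, vp2):
    """Build a camera rotation matrix from the vanishing points of two
    perpendicular scene directions (illustrative sketch)."""
    Kinv = np.linalg.inv(K)
    # back-project each vanishing point to an optical ray and normalize
    r1 = Kinv @ np.array([vp1[0], vp1[1], 1.0])
    r1 /= np.linalg.norm(r1)
    r2 = Kinv @ np.array([vp2[0], vp2[1], 1.0])
    # with noisy detections the two rays are not exactly perpendicular,
    # so project out the component of r2 along r1 before normalizing
    r2 = r2 - (r1 @ r2) * r1
    r2 /= np.linalg.norm(r2)
    # the third axis comes from the cross product of the first two
    r3 = np.cross(r1, r2)
    return np.column_stack([r1, r2, r3])
```

The returned matrix is orthonormal with determinant +1, i.e. a proper rotation.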
In the following sequence, we will reexamine how to do this through a known physical object in the scene, a three-dimensional object in the scene. The first case, which we have seen already, is when we have a planar object situated in the scene. What we need to know is some property of this plane, or of points on this plane. If I have a checkerboard, we know exactly the size of each checkerboard square. If I don't have a checkerboard, all I need is to identify a plane in space and to know the physical positions or dimensions of objects on that plane. This is often achievable because we have many man-made objects in the world, indoors as well as outdoors, and many of them lie on planar surfaces. For example, we can identify the size of my laptop in the picture, or our TV screens, or a window.

Once we have a known planar object in the world and we take a picture of it, that allows us to recover our position and orientation in 3D space relative to the object. This happens through a homography. Recall we can look at this planar object from a straight bird's-eye view. From the bird's-eye view the planar object looks like, in this case, a checkerboard, which is shown on the right-hand side, and the image itself is shown on the left. The checkerboard in the image is skewed, rotated, or deformed according to the camera orientation. Again, we assume we have some known points on the surface, for example this point x1: we have measured the position of x1 relative to an origin on the world plane. And we can identify the corresponding point in the image as well, as illustrated on the left. So on the right we have a planar surface in 3D; we mark one of the locations as the origin (0,0), and we measure every point relative to that origin in two-dimensional plane coordinates. Similarly, we measure points in the image pixel space as well.
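To make the mapping concrete, here is a small sketch of applying a homography to a plane point; the helper name and the identity example are mine.

```python
import numpy as np

def apply_homography(H, pt):
    """Map a 2D point on the world plane to pixel coordinates
    through a 3x3 homography H."""
    p = H @ np.array([pt[0], pt[1], 1.0])  # lift to homogeneous coordinates
    return p[:2] / p[2]                    # divide out the unknown scale

# With the identity homography a point maps to itself.
apply_homography(np.eye(3), (3.0, 4.0))    # → array([3., 4.])
```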
Again, we form homogeneous coordinates of the two-dimensional points by attaching a 1 as the third dimension. With this, we have shown that the two homogeneous coordinates are related through a three by three transformation matrix called H, standing for the homography mapping. This H is three by three, so it has nine elements in total. The H matrix in fact encodes both the camera calibration matrix K as well as the rotation matrix R and the translation t between the plane and the camera center, all shrunk into this compact form of nine elements. So if we have a way of computing H, by identifying a few corresponding points between the three-dimensional world planar surface and the two-dimensional image, we can estimate H. And once we have H estimated, we can go back and estimate the rotation and translation.

So the key question is: how do we compute H? Given a corresponding pair, one point on the world plane and one point in the image, we have the homography mapping x2 = Hx1. Here x1 and x2 are known, each three-dimensional in homogeneous coordinates, and H, color-coded in orange, is the unknown we want to estimate. The first step is to obtain an equation with 0 on the right-hand side and a linear function of H on the left-hand side. This can be done by taking the cross product with the vector x2 itself, since any vector crossed with itself equals 0. So we obtain x2 crossed with Hx1 = 0. We continue to do this for many different corresponding points; each time we obtain a set of equations that are linear in the unknown H, with the known 0 on the right-hand side.

Expanding this equation out a little further, we denote by u2, v2 the pixel coordinates of x2 in the image, and we break H down into rows: h1 is the first row, h2 is the second row, h3 is the third row.
We multiply the vector x1 with H and further expand the equation as shown here. Just to be visually clear: h1, h2, h3, color-coded in orange, are row vectors, one by three each, and x1 is replicated in each of the columns. So the product between each orange row vector and the gray vector produces a scalar. As such, we can take the transpose of each scalar without changing its value. This is equivalent to converting the column vector x1 into a horizontal row vector x1 transpose, and converting each horizontal row h_i into a vertical column vector h_i transpose. Nothing has changed; we still get the same three scalars, and mathematically it is written out more clearly in this form: the first row is x1 transpose times h1 transpose, and so on.

Next we take the vector (u2, v2, 1) and expand its cross product. As we know, a cross product can be written as a skew-symmetric matrix, shown on the left and color-coded in green. We further expand the dot products between x1 transpose and the h_i transpose as a matrix-vector multiplication: the matrix consists of three copies of x1 transpose laid out on a block diagonal, with the rest all 0s, shown in the middle. And H is written out as a single long column vector, where the first three elements are simply the first row h1, the next three elements are the second row h2, and the last three elements are the last row h3.

So together we obtain a matrix of the following form. The part in purple consists of products between x2, the pixel coordinates in the image, and x1, the position on the planar surface. That matrix has a size of 3 rows by 9 columns, because each x1 transpose is a three-dimensional row vector and we have three of them, for a total of 9 columns. So we have this purple 3x9 matrix times the orange column vector containing all nine unknown values of H, equal to 0.
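The 3x9 constraint described above can be built per correspondence as follows: the skew-symmetric matrix implements the cross product with x2, and a Kronecker product lays the three copies of x1 transpose on the block diagonal. This is a sketch with H flattened row by row; the function names are mine.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix such that skew(v) @ w == np.cross(v, w)."""
    return np.array([[ 0.0, -v[2],  v[1]],
                     [ v[2],  0.0, -v[0]],
                     [-v[1],  v[0],  0.0]])

def constraint_rows(x1, x2):
    """3x9 matrix A for one correspondence, so that A @ h = x2 x (H x1),
    where h is H flattened row by row into a 9-vector."""
    return skew(x2) @ np.kron(np.eye(3), x1.reshape(1, 3))

# Sanity check: when x2 = H x1 the constraint vanishes (here H = I).
x1 = np.array([2.0, 3.0, 1.0])
A = constraint_rows(x1, x1)
A @ np.eye(3).ravel()   # → zeros, since x1 crossed with x1 is 0
```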
So now we have obtained the familiar form we know from the least squares setting: Ax = 0. Given a single point correspondence, we obtain one set of equations Ax = 0 where A is a 3x9 matrix. And the rank of this 3x9 matrix is in fact 2, because each correspondence only gives us two independent quantities, the two image coordinates. H has nine elements, but we can scale it any way we want, so there is one degree of freedom we can remove; that leaves 9 - 1 = 8 elements to estimate. Therefore, if we have four corresponding points from the plane in 3D space to the image, we obtain enough constraints on this linear system to identify the elements of H. Of course, if we have more points we have more constraints, and this leads to a least squares problem.

Once we have the least squares problem Ax = 0 set up, we can obtain the solution for H as the singular vector corresponding to the smallest singular value of A. We take the SVD of A, decomposing it into U D V transpose, and we take the last column of V, the ninth column: that is the H matrix in vector form. We reshape that vector into a 3x3 matrix, and that allows us to recover the rotation and translation. We first convert the H matrix back into the optical space through the K inverse transformation. The first two columns of the transformed matrix K inverse H are, in fact, the first and second columns of the rotation matrix, followed by the translation vector t. And we renormalize to obtain a proper rotation matrix, ensuring that the first column of R has norm equal to 1, as shown in the set of equations here.
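Putting the whole pipeline together, a minimal sketch of the SVD solution and the pose recovery might look like this; the function names are mine, and there is no noise handling or sign disambiguation beyond the basic renormalization described above.

```python
import numpy as np

def estimate_homography(pts_plane, pts_image):
    """DLT estimate of H from four or more plane-to-image correspondences."""
    def skew(v):
        return np.array([[ 0.0, -v[2],  v[1]],
                         [ v[2],  0.0, -v[0]],
                         [-v[1],  v[0],  0.0]])
    rows = []
    for (X, Y), (u, v) in zip(pts_plane, pts_image):
        x1 = np.array([X, Y, 1.0])
        x2 = np.array([u, v, 1.0])
        # 3x9 block: cross product with x2 applied to the replicated x1^T
        rows.append(skew(x2) @ np.kron(np.eye(3), x1.reshape(1, 3)))
    A = np.vstack(rows)
    # the solution is the right singular vector of the smallest singular value
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]          # fix the arbitrary scale

def pose_from_homography(K, H):
    """Recover R and t from H = K [r1 r2 t], assuming the plane is Z = 0."""
    B = np.linalg.inv(K) @ H    # proportional to [r1 r2 t]
    lam = np.linalg.norm(B[:, 0])
    r1, r2, t = B[:, 0] / lam, B[:, 1] / lam, B[:, 2] / lam
    r3 = np.cross(r1, r2)       # complete the rotation matrix
    return np.column_stack([r1, r2, r3]), t
```

In practice one would also orthogonalize r1 and r2, since under noise they are only approximately perpendicular, and check the sign of t so that the plane lies in front of the camera.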