In this video, I will describe the seminal Viola-Jones face detection algorithm. I believe it is useful to understand its key ideas even in our deep learning era.

In object detection with sliding windows, the number of positive windows is several orders of magnitude lower than the number of background windows. For example, if we apply a face detector to a one-megapixel image, we have to classify approximately one million windows. In order to expect less than one false positive in the image, the false positive rate should be lower than 10 to the power of minus six. And in order for the detector to be fast, we should quickly reject windows without faces.

The Viola-Jones detector combines four key ideas: simple Haar features, the use of integral images for fast feature computation, boosting for feature selection and, the main thing, the attentional cascade for fast rejection of windows without faces.

Haar features form a very large set of simple functions. Each feature specifies a set of rectangles in the image window. We divide these rectangles into two groups, white and black. The feature value is calculated as the difference between the sum of pixels in the white rectangles and the sum of pixels in the black rectangles. We define weak classifiers by thresholding Haar features. Such weak classifiers are sensitive to image gradients and other critical features in the image.

We use integral images for fast computation of Haar features. The integral image stores, at each pixel, the sum of the pixel values above and to the left of this pixel, inclusive. It can be quickly computed in one pass through the image. Let A, B, C, D be the values of the integral image at the corners of the rectangle, with A and D at diagonally opposite corners. Then the sum of original image values within the rectangle can be computed as A plus D minus B minus C. So only three additions or subtractions are required for a rectangle of any size. This allows us to compute Haar features very fast.

The number of possible Haar features is very large.
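
To make the integral image idea concrete, here is a minimal sketch in plain Python. The function names (`integral_image`, `rect_sum`, `two_rect_feature`) and the tiny test image are my own illustration, not code from the original detector, and I assume the image is a list of rows of numbers. The feature shown is just one possible two-rectangle layout, with the white rectangle on the left and the black one on the right.

```python
def integral_image(img):
    """Return ii where ii[y][x] is the sum of img over rows 0..y and columns 0..x, inclusive."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0  # running sum of the current row
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii


def rect_sum(ii, top, left, bottom, right):
    """Sum of the original pixels in rows top..bottom and columns left..right, inclusive."""
    def at(y, x):
        # Positions outside the image (index -1) contribute zero.
        return ii[y][x] if y >= 0 and x >= 0 else 0

    a = at(top - 1, left - 1)  # just above and to the left of the rectangle
    b = at(top - 1, right)     # just above the top-right corner
    c = at(bottom, left - 1)   # just left of the bottom-left corner
    d = at(bottom, right)      # bottom-right corner
    return a + d - b - c


def two_rect_feature(ii, top, left, height, width):
    """Illustrative two-rectangle Haar feature: white left half minus black right half."""
    half = width // 2
    bottom = top + height - 1
    white = rect_sum(ii, top, left, bottom, left + half - 1)
    black = rect_sum(ii, top, left + half, bottom, left + 2 * half - 1)
    return white - black


# Hypothetical 3x4 test image.
img = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
ii = integral_image(img)
inner = rect_sum(ii, 1, 1, 2, 2)             # 6 + 7 + 10 + 11
feature = two_rect_feature(ii, 0, 0, 3, 4)   # left two columns minus right two columns
```

Note that `rect_sum` does the same four lookups whether the rectangle covers four pixels or the whole window, which is exactly why the feature cost does not depend on the rectangle size.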
For a 24 by 24 detection region, the number of possible rectangle features is more than 100,000. In the Viola-Jones algorithm, boosting is applied to select good features and combine them linearly.

Boosting is an exhaustive search. The training consists of several boosting rounds. In each round, we evaluate each rectangle filter on each example. We select the best threshold for each weak classifier. We select the best combination of filter and threshold. Then we re-weight all examples. So the overall computational complexity of learning is proportional to the product of the number of rounds, the number of examples and the number of features.

It is insightful to visualize the two features that were selected first. You can see on the slide that they are sensitive to the eyes on the face. This combination of features can yield an almost 100 percent detection rate with a 50 percent false positive rate for frontal faces, which is quite good for just two features. But a 200-feature classifier trained by boosting with Haar features yields only a 95 percent detection rate and a false positive rate of one in 14,000. This is not good enough for a face detector.

The next key idea of the Viola-Jones detector is the attentional cascade. We start with simple classifiers which reject many of the negative windows while detecting almost all positive windows. A positive response from the first classifier triggers the evaluation of the second classifier, which is more complex, and so on. A negative outcome at any point leads to immediate rejection of the window. Such an attentional cascade allows us to apply slow classifiers only to a small subset of windows, which gives a significant speedup to the algorithm. This idea is extensively used in object detection methods.

We chain classifiers that are progressively more complex and have lower false positive rates. Each additional classifier improves the performance of the cascade. We continue adding classifiers until we reach the desired false positive rate.
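
The early rejection logic of the cascade can be sketched in a few lines of Python. This is only a toy illustration under my own assumptions: the names `make_stage` and `cascade_classify` are hypothetical, the windows are flat lists of numbers, and the two "features" (`value_range` and `mean_value`) are simple stand-ins for Haar features. Each weak classifier votes when `polarity * feature < polarity * threshold`, and a stage passes a window when the weighted votes reach the stage threshold.

```python
def make_stage(weak_classifiers, stage_threshold):
    """Build one cascade stage: a weighted vote of thresholded features."""
    def stage(window):
        score = 0.0
        for feature, threshold, polarity, alpha in weak_classifiers:
            # The polarity (+1 or -1) chooses the direction of the inequality.
            if polarity * feature(window) < polarity * threshold:
                score += alpha
        return score >= stage_threshold
    return stage


def cascade_classify(stages, window):
    """Return (is_positive, number_of_stages_evaluated) for one window."""
    evaluated = 0
    for stage in stages:
        evaluated += 1
        if not stage(window):
            return False, evaluated  # immediate rejection, later stages never run
    return True, evaluated


# Stand-in features for illustration only.
def mean_value(window):
    return sum(window) / len(window)


def value_range(window):
    return max(window) - min(window)


stages = [
    # Stage 1: one cheap test, passes windows whose value range exceeds 10.
    make_stage([(value_range, 10, -1, 1.0)], 1.0),
    # Stage 2: two tests, both must vote (mean below 200 and range above 50).
    make_stage([(mean_value, 200, 1, 1.0), (value_range, 50, -1, 1.0)], 2.0),
]

flat = [100] * 16        # no contrast at all
weak = [90, 105] * 8     # a little contrast
strong = [0, 100] * 8    # a lot of contrast

result_flat = cascade_classify(stages, flat)
result_weak = cascade_classify(stages, weak)
result_strong = cascade_classify(stages, strong)
```

In this toy setup the flat window is rejected after a single stage, while the other two windows reach the second, more expensive stage. In a real detector most background windows behave like the flat one, which is where the speedup comes from.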
[inaudible] Viola-Jones algorithm was trained on a data set of 5000 faces, all frontal, re-scaled to 24 by 24 pixels. Training took weeks on a Sun workstation. The final cascade consists of 38 stages with 6000 features. But on average only 10 features were evaluated per window on the test set. So on a Pentium III processor, this face detector can process 300 by 200 pixel images at 15 frames per second. It was 15 times faster than the previous detector of comparable accuracy, the Rowley detector from 1998, which was based on a neural network. On this slide, I give several examples of frontal face detection by the Viola-Jones algorithm.

The Viola-Jones detector has been extensively improved since its inception. Gradient-based features in the form of integral channel features have been added to the detector. Faster ways of computing such features at multiple scales have been proposed. Soft cascades and crosstalk cascades have been proposed to improve the detector speed. So even now, derivatives of the Viola-Jones detector are used quite often for object detection, especially if high speed is required.