Welcome to Unit 2, in which we're going to talk about lexical analysis. In the previous unit, we observed that the syntax analyzer that we're going to develop will consist of two main modules, a tokenizer and a parser, and the subject of this unit is the tokenizer. The tokenizer is an application of a more general area of theory and practice known as lexical analysis.

So, here's an example of tokenizing in action. Suppose that we have this input file, which presumably contains a Jack program. Now, the first thing that I would like you to observe is that, as far as lexical analysis is concerned, this input file is nothing more than a stream of characters. When we say tokenizing, we refer to the act of transforming this bare-bones, primitive stream of characters into a stream of meaningful tokens, meaningful within the language that we are analyzing. Now, once we come up with this stream of tokens, we can hand it over to the compiler, and from this point onward we can completely forget about the original input file. Which is very nice indeed, because the original input contains all sorts of noise, like white space and comments, which are completely irrelevant for the compiler. Therefore, the tokenizer performs a very simple yet important preliminary processing of the file, one that lets compilation start on the right foot, so to speak.

All right, now what does it mean to be a token? Well, a token is a string, or a sequence of characters, that makes sense within the language that we are seeking to understand, and different programming languages have different definitions of tokens. For example, consider the statement x++. In the C language this statement makes a lot of sense. Why? Because we can process it into two meaningful tokens, x and ++. However, the very same input, if we give it to a Jack tokenizer, will not make sense. Well, it will make sense as far as tokens go, because we'll get the three tokens x, +, and another +, but later on in the compilation process someone is going to raise a red flag and say: you cannot have a + immediately after another +, because in the Jack language there is no ++ operator. So if you have the task of writing a tokenizer, you should demand a very well specified document that says what it takes to be a token in the language you are trying to analyze.

And so in the Jack language specifically, we have five categories of tokens, and here they are. First of all, we have keywords, of which we have about 20. Then we have symbols, like times, divide, and so on. Then we have integer constants, which are numbers that range from 0 to 32767. Then we have string constants, which are anything that exists within two double quotes. And finally, we have identifiers, which we use to name variables, methods, classes, and so on. Every one of these categories is very well defined, without any ambiguity or uncertainty. Therefore, if we have to write a tokenizer, we can write a program that starts with such an input file and, based on this lexical definition, carries out the tokenizing for us.

Typically, the tokenizer ends up being some class or some service, some module, call it anything you want depending on the language that you use to implement it, and it will also provide some useful services. First of all, it will allow us to view the input as a stream of tokens. It will allow us to ask questions like: are there more tokens to process? And if so, can you give me the next token? Can you tell me what is the value of this token, what is the type of this token, and so on?
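To make these services concrete, here is a minimal sketch, in Java, of what such a tokenizer API might look like. The interface and method names here (JackTokenizer, hasMoreTokens, advance, tokenType, tokenValue) are illustrative assumptions for this unit only; the specific and complete API is given later in this module.

```java
// A sketch of the tokenizer services described above (illustrative names,
// not the official API, which is given later in this module).
public interface JackTokenizer {

    // The five token categories of the Jack language.
    enum TokenType { KEYWORD, SYMBOL, INT_CONST, STRING_CONST, IDENTIFIER }

    boolean   hasMoreTokens(); // are there more tokens to process?
    void      advance();       // makes the next token the current token
    TokenType tokenType();     // the category of the current token
    String    tokenValue();    // the current token itself, as a string
}
```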
So the tokenizer provides all these nice services, and here's an example of a program that uses them. TokenizerTest is a program that constructs a tokenizer object and then goes on to process the input file. And here is what it does: it goes through the input and, for every token, it writes the token to the output file, surrounding the token with two tags that describe the type of that token according to the lexical definition of the Jack language. And it does this all the way until the end of the file, and what we get is a very useful documentation of not only the tokens that we have in the program, but also the token types.

Now, this test here is not just an arbitrary example of what can be done with a tokenizer. This is actually exactly the way in which our parser will later on use the tokenizer. The information that we have here is obviously very relevant for the compiler, so it's nice that we can easily generate it using the tokenizer's services. And notice that once we do it, we no longer need the source file, because from now on the tokenizer is going to represent the input for the compiler.

In particular, we can tell this object to advance, and once we do this, the first token in the file becomes the current token. Then we can enter a loop that says: as long as we have more tokens to process, here's what we want to do. Well, take a look at the desired output, and notice that in each line we have to output the classification of the token, and we have to do it twice. So I'm going to get the current token's classification from the tokenizer and put it in some local variable that I call tokenClassification. And once I do this, I can actually print this line to the output: I print the opening tag, then I print the value of the current token, the token itself, then I print the closing tag with the token classification, and then I print a new line. Once I do this, I advance to get the next token from the tokenizer, go back to the loop, and this will go on until we consume all the tokens from the input file (a code sketch of this loop appears at the end of this unit).

So, this is an example of writing a program that uses the services of the Jack tokenizer. Now, obviously, we are going to supply to you a very specific and complete API of this tokenizer, so if you're not completely sure how to invoke all these methods and services, don't worry about it; we have much more to say about it later on in this module.

So to recap: the tokenizer is the first module in our syntax analyzer. And now that we know how to group characters into tokens, we can proceed to talk about parsing. But before we do that, we have to say a few words about grammars, and this will be the subject of the next unit.
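Finally, here is the promised sketch of the TokenizerTest loop, written against the illustrative JackTokenizer interface sketched earlier in this unit. The flow mirrors the walkthrough above; the class name, method names, and tag strings are assumptions, not the final specification.

```java
// A sketch of the TokenizerTest loop described in this unit, assuming the
// illustrative JackTokenizer interface sketched above.
public class TokenizerTest {

    static void printTokens(JackTokenizer tokenizer) {
        tokenizer.advance(); // the first token becomes the current token
        while (tokenizer.hasMoreTokens()) {
            // the classification of the current token, e.g. "keyword"
            String tokenClassification =
                    tokenizer.tokenType().toString().toLowerCase();
            // print one line per token, e.g.: <keyword> class </keyword>
            System.out.println("<" + tokenClassification + "> "
                    + tokenizer.tokenValue()
                    + " </" + tokenClassification + ">");
            tokenizer.advance(); // get the next token from the tokenizer
        }
    }
}
```

Each pass through the loop prints one tagged line, so the output documents both the tokens in the program and their types, which is exactly the documentation described above.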