Welcome to the unit about grammars. In the previous unit, we talked about tokenizing and we developed the conceptual ability to think about the program as a set of tokens. Now I think that you will agree with me that the effect that we have a set of valid tokens does not necessarily imply that we also have a valid text or a valid program. For example, if we deal with the English language, consider the following set of tokens. Red drives a he. Maybe you want me to repeat that. Red drives a he. It mean nothing, right? Well, it means nothing even though each one of the tokens is perfectly legitimate. Red, drives, a, and he, four legitimate English words. However they don't come in the right order. Had I said something, like he drives a red car, everything will become crystal clear. So it's not only the tokens that are important. The order of the tokens is critically important as well. And the artifact that defines in what order we can put tokens together legibly is call the grammar. So the grammar, technically speaking, is a set of rules. Each rule define a certain pattern that indicates how tokens can be strung together in order to make meaningful sentences or statements in the underlying language. So for example, what we see here is a subset of the English language. In English, we have sentences and each sentence is a noun phrase followed by a verb phrase. What is a noun phrase? Well, a noun phrase is possible determiner, which is optional followed by a mandatory noun. So notice that I use question mark to refer to something that may or may not appear in this pattern, in the pattern known as noun phrase. What about verb phrase? Well, a verb phrase is a verb followed by a noun phrase. So we see that this grammar is quite recursive. And then we have a set of allowable nouns, which are constants, like lexical constants, like dog, school, dina, and so on. We have a set of allowable verbs and we have a set of allowable determiners. And we can add more rules and more constants to this grammar and make it as fancy as we please. So once again, this is just a subset of English language. But the subset is sufficiently rich to support or accept such inputs as, for example, Dina went to school. She said, the dog ate my homework. Every one of these sentences will be accepted by this grammar. For example, consider the dog ate my homework. Well, let's take a look at it. The is a determiner and dog is a noun. So we have the determiner followed by a noun, which taken together are a noun phrase. So we have something that begins with a noun phrase. So it looks like we may have a sentence here because a sentence is something that begins with a noun phrase. Then we have ate my homework. Well, ate is a verb and my homework is a noun phrase. So we have a verb followed by a noun phrase, which together constitute a verb phrase. So taken together, we have a noun phrase followed by a verb phrase and, therefore, we have a legitimate sentence, according to this grammar. The sentence may not be legitimate in a French grammar, but it is suddenly acceptable by the English grammar. And we can do the same exercise and convince ourselves that the previous two sentences are acceptable as well. Now looking at this grammar, I can view or observe that it is made of, as I said, a set of rules, and these rules fall into two broad categories. First of all, I have terminal rules and terminal rules are rules whose right-hand side consists of constants only. And then I have non-terminal rules and these are the rules whose right-hand side consists of names of other rules as well as possibly constants. So taken together, grammar is a set of terminal and non-terminal rules. This is an observation that will be useful later on in our work. All right, now let us shift our attention to the Jack grammar. Because after all, this course is not about computational linguistics in general. It is about compilation, or at least this module is about compilation. So a Jack program is a set of statements. And I want to focus on a subset of the Jack language in which we have three possible statements, if, while, and let. So I use this bar, the vertical bar, to denote choice. So either, and if, or while, or let is a acceptable statement in this subset of the language. And then statements in plural are zero or more occurrences of statement in singular and I use the asterisk to indicate zero or more. So once again, what is statements? It is zero or more or Karen says, of the statement pattern. And the statement pattern is either an ifStatement, while, or let. Okay, then I have an ifStatement. An ifStatement is something that begins with a constant if, the token if, if you will. It must be followed by a left param symbol, then comes an expression. Then comes a right param, a left curly bracket, statements, and a right curly bracket. If you have an input which corresponds to this definition, you have a legitimate ifStatement. Likewise, we define a whileStatement and a letStatement, and then we go on to define the things which are still missing here. For example, what is an expression? Well, an expression is a term followed by the optional pair up, followed by term. Why optional? Because I use question mark to denote zero or one. So an expression is either a term standing alone or it's a term followed by up, and another term. What is a term? A term is either a variable name or a constant. A variable name is a string not beginning with a digit in this subset of the language. A constant is any decimal number and op is one of these symbols here, okay? So taken together, what we have here is the complete grammar of a subset of the Jack language. And now we can begin to consider possible inputs and ask ourselves do they cut mustards so to speak? Are they accepted by this grammar? For example, let x = 100. Well, let's see. It looks like it should correspond to this pattern. And indeed, it's start with the token let, and then we have x. Now what is x? Well, I look at the grammar and I realize that x qualifies as a varName. Because a varName is a string not beginning with a digit and an x is a string not beginning with a digit. So we have varName. Now a vowel name is also a term and I'm sorry, all we need is a varName, right? Because after let, we need the varName, we have it. Then according to the rule, we have to have an equal symbol and we do have an equal symbol. Moving along, we have the token 100. What is 100? Well, according to the grammar, 100 must be a constant because it's a decimal number and a constant is also a term. So what we have here, is a term, [COUGH] I'm sorry. And a term is also an expression. So we have an expression and that's what the rule requires, right? The rule expected to see an expression. And then the rule says, I also expect to see a semicolon. And indeed, we have a semicolon. So taken as a whole, this input is acceptable by the grammar. What about this input? Well, once again, we have a let pattern and let x = x is acceptable, according to the analysis that you did before. And then we have + 1, now what about this x + 1, what does it mean? Well, x is a term and +1 is an op followed by another term. And if you look at the rule that defines expression, you will see that x + 1 qualifies as an expression in this grammar. So once again, we have something that will be accepted by this grammar. What about this input here? Well, let's see, it looks like we have a whileStatement. Indeed, we have a while at the beginning, then we have let friend, very nice indeed,. Then we have n < lim. And if you will analyze it, if you will see that n < lim is a legitimate expression so, so far we are okay. Then we have a right parenthesis, very nice indeed. And then we have to have a left curly bracket, which we don't have in the input. And that's where your compiler is going to throw an error message and tell you that you have a syntax error in your source code. All right, so here's an example of an input, which is not acceptable where this grammar. What about this input? Well, here we have an ifStatement. And we see that the ifStatement allows to have statements within the curly brackets and statements is zero or more occurrences of statement. And indeed, we have here two such occurrences, so once again, we have a legitimate input. Finally, what about this input example? Well, you're welcome to stop the tape and convince yourself that this input will also be accepted by the grammar. But it's a good exercise to follow, at least informally, what we did before and realize that this is actually the case. So parsing is the art and science of determining if a given input conforms to a certain grammar. And notice that in the process of doing this confirmation, I also uncover the entire grammatical structure of the given input. Because I analyze how the input corresponds to the various components of the rule. And by doing so, I once again construct the underline grammar, grammatical structure of the input. And by the way, that's exactly what our brain does in some mysterious way, which is increasingly becomes more and more understandable. But basically, when we hear a sentence, we try to match it on certain patterns, which are sort of wired into our brain, have been wired from our learning of language throughout our life. And our brain has this remarkable capacity of understanding sentences, even if they don't match the grammar perfectly. If we didn't have this ability, all the poets in the world would be out of business. But computer programs and compilers don't have this remarkable ability. When you deal with a compiler, the input must match the grammar perfectly. And if there's any discrepancy, the compiler will raise its hands and say, we have a syntax error, you have to fix your program. So that's basically what I wanted to say about that parsing. And with that, the next unit, we'll discuss the notion of parse trees, which is a data structure that we use in order to describe the grammatical structure of our input.