
Main point

Full script

Okay, let's get started with the lecture. I hope everyone did a good job on the quiz. I can see your scores, and the average is 77 or so, which is pretty good, but I still have to take some time to moderate your answers to see if there are any grading issues, and I hope I can release the quiz results tomorrow or sometime before Saturday. From what I have seen so far, everyone has done a good job. If you have any comments about how I should organize the online quiz better, let me know. It's quite likely that the exam will also be organized in much the same way, as an online remote exam. We'll see. Okay, now for the lecture. Today I'm going to talk about something I am most familiar with, and that is information theory. Indeed, I must say that my research is in the area of information theory. The lecture notes for today are under week seven, titled "Information theory", so you can open that one. Okay, just to check if everyone can hear me.

Main point

Full script

If so, send me a chat message. Ah, great. Okay, so what is information theory? Again, let's motivate the idea. What do we want to know about information theory? We want to know enough for data science in particular. We have learned how to build a decision tree, and in the quiz you probably calculated the following quantities: Info(D), Info_A(D), the information gain, and the split information. I have taught you the formulas and also given you some rough intuition. Today I'm going to give them precise mathematical meaning, and that meaning comes from the theory of information. So what is information? Can we put a definition to it? Here we go.

Main point

Full script

Here is the definition: information is the resolution of uncertainty. That definition is due to the father of information theory, Claude Shannon, and it dates back to around the time of the Second World War. Now, what does "resolution of uncertainty" mean? I presume everyone has a rough idea of what information is. Do you agree with this definition? Is it useful? Let's play a game to understand this definition better, and this game is the resolution-of-uncertainty game.

Main point

Full script

How do you play it? What I'm going to do is pick a random variable, y, that is not shown to you. You have to ask questions to guess the value of y. You can ask me any questions, but they have to be true/false questions. I will answer true or false, and based on my answers, you should be able to guess the value of y. The point of this game is that you want to ask as few questions as possible. Imagine every single question you ask costs you some money; you don't want to lose a lot of money. So how do you ask the questions?

Main point

Full script

Suppose the value of y must be binary, just like in a binary classification problem: y can take value 0 or 1. Now, how many true/false questions do you need to guess the value of y correctly? One. And what would you ask? Is y equal to zero? We can ask whether y is equal to zero, and if my answer is true, then the value of y is zero; if the answer is false, then the value is one. Just one question is enough. And can anyone win, that is, resolve the uncertainty of y, with zero questions? I guess not, because you don't know whether y is zero or one. Okay, so what about the case when y takes values from 0, 1, 2, 3? There are four possible cases. Now how many questions do we need to ask? Two questions. And how are we going to ask these two questions? What is the first question, and what is the second?
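The counting argument here can be sketched in a few lines of Python. This is an illustrative sketch, not part of the lecture materials; the names `questions_needed` and `guess` are made up for the example, and the questions asked are the threshold questions "is y > t?" used in the lecture.

```python
import math

def questions_needed(num_values: int) -> int:
    """Minimum number of true/false questions to pin down one of
    `num_values` equally likely values: ceil(log2(num_values))."""
    return math.ceil(math.log2(num_values))

def guess(y: int, num_values: int) -> tuple[int, int]:
    """Play the game with threshold questions ("is y > t?").
    Returns (guessed value, number of questions asked)."""
    lo, hi = 0, num_values - 1
    asked = 0
    while lo < hi:
        mid = (lo + hi) // 2
        asked += 1
        if y > mid:   # answer "true": y lies in the upper half
            lo = mid + 1
        else:         # answer "false": y lies in the lower half
            hi = mid
    return lo, asked

# Two values -> one question; four values -> two questions.
print(questions_needed(2), questions_needed(4))  # 1 2
print(guess(3, 4))                               # (3, 2)
```

Each question halves the set of remaining possibilities, which is why the count is logarithmic in the number of values.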

Main point

Full script

Okay, first question: whether y is equal to zero or not? Let's instead ask whether y is bigger than 1.5, and I will answer either true or false. Suppose my answer is true; then what is the next question to ask? If it is true, there are two possible cases: the value of y can be either two or three. To resolve this remaining uncertainty, we are back to the first case: when there are two possibilities, we just need one question. And what is that? Whether y is bigger than 2.5. If true, we know it is three; if false, we know it is two. Similarly, on the other branch I ask whether y is bigger than 0.5. So I can resolve the uncertainty of y with two questions. Can anyone do it with fewer than two questions? Probably not. So in this case, the number of questions you need to resolve the value of y is two; that's the minimum. Back to the definition of information as the resolution of uncertainty: how hard is it to resolve the uncertainty of y? That is related to the number of questions you need to ask. So can we think of the number of questions as a measure of information? Now, what is the tree that I am drawing on the left-hand side? Is it a decision tree, yes or no? Basically, it is like a decision tree: I'm making decisions when I reach the leaf nodes. But what is wrong when I say it is exactly a decision tree? It is actually not exactly a decision tree. Why not? Think about the splitting attribute. Yes, I reused the attribute. And for a decision tree, if the attribute is numeric, can we reuse it?
Imagine one of the attributes can take any value between 0 and 1; I can split at 0.5, I can split at 0.25, and so forth. So can I reuse the attribute? Yes, and it might even be a good idea, because we can refine the information with multiple splits, just like this one. I reused the attribute y. However, suppose I use this tree for classification: I'm trying to predict y, which takes one of the four possible values, while splitting on y itself. What is wrong with that? Are we overfitting, are we underfitting, or is something else wrong? We are splitting on the attribute y, which is exactly the class attribute. Are we allowed to predict y by splitting on y in a classification problem? You do not know y; you don't get to observe it. If you split on y, you assume you observe y, but that is exactly what is not possible. So it is like a decision tree, but it is really cheating, because you are splitting on the class attribute, which is exactly what you don't observe. Of course, you can train a decision tree using the labels observed from examples; training is different from classification. When you train, you may be guided by the class attribute, but when you actually classify, say to predict whether a patient has a disease or not, you don't get to observe whether he has the disease; you have to predict it. So what do we call this tree, then? It is a coding tree. The point is that we try to encode the information in y, rather than trying to predict y from some input attributes. To make this clearer, what we are going to do is translate the answer "false" to 0 and "true" to 1.
This is often the case in programming: zero usually has the boolean value false, and anything other than zero has the boolean value true. Now we are going to encode each value of y into a sequence of bits, 0 or 1, and those bits are constructed from the translated answers to the questions. For example, the value y = 0 will be encoded by the sequence 00. Why? Because my first answer is false, which translates to 0, and my second answer is also false, which also translates to 0. So 00 represents the value y = 0: I'm encoding this particular value into the binary sequence 00. This may sound confusing, so let's look at another example: how do I encode the value one? I will encode it as 01. When y takes the value one, my two answers to the two questions are false and true, which translate to 0 and 1. Similarly, I encode the value two as 10 and the value three as 11. This is what we mean by encoding, and this tree, which allows us to translate the answers into a sequence of bits, is called a coding tree. So it is not a decision tree but a coding tree. A coding tree is almost like a decision tree, but it splits on the class attribute, the attribute we are trying to encode. In this case, the number of questions corresponds to the number of bits required to encode y: I have two questions, therefore two answers, and therefore two bits. Two bits are needed to resolve the uncertainty of y when y has four possible values. And that is what we mean by resolution of uncertainty as a definition of information. Now let's take the idea further.
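The two-question tree above can be written out as a tiny encoder. This is an illustrative sketch, not lecture code; `encode` is a made-up name, and the thresholds 1.5, 2.5, 0.5 are the ones from the tree, with false written as 0 and true as 1.

```python
def encode(y: int) -> str:
    """Encode y in {0, 1, 2, 3} by the answers to the questions
    "is y > 1.5?" and then "is y > 2.5?" or "is y > 0.5?"."""
    b1 = int(y > 1.5)                          # first answer
    b2 = int(y > 2.5) if b1 else int(y > 0.5)  # second answer
    return f"{b1}{b2}"

for y in range(4):
    print(y, "->", encode(y))
# 0 -> 00, 1 -> 01, 2 -> 10, 3 -> 11
```

Reading the bits back simply replays the answers, so the code resolves the uncertainty of y exactly as the question game does.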

Main point

Full script

Okay, what if y takes values from zero up to 2 to the power of m, minus one? Altogether, how many possible values are there? Two to the power of m. Now, can anyone tell me how many questions would be needed to resolve the uncertainty? M. Very good. This is really like counting the number of leaf nodes when you have m levels of splits in the tree: if you have just one split, you have two leaf nodes, and two leaf nodes correspond to the two possible values of y you can resolve. If you have m questions, m binary splits, how many leaf nodes do you have? Exactly 2 to the power of m, so you can resolve that many possibilities. Very good. So it sounds like y has m bits of information content. Is that intuitive? Yes? Okay. Now let's look at another example. Suppose y has three possible values: 0, 1, 2. How many questions do I need? Is it the logarithm, log base 2 of 3? Not really, because log base 2 of 3 is not an integer; it is about 1.585. Can you ask 1.585 questions? What would that even mean? Very good, so it should be two questions. Why two? Because one is not enough to resolve three possibilities, and two is more than enough; indeed two is the minimum, the smallest integer bigger than 1.585. Another way to write it is to take the ceiling, which rounds up to the next integer. Now, as information theorists, we always want things to be precise: can we really ask log base 2 of 3 questions? Can we define things so that I can say the information content is indeed 1.585,
instead of two? After all, I only have three possibilities, not four. Okay, let's change the problem. The game now is: I'm going to have n iid samples of y, and your goal is to resolve all the uncertainty of y1 up to yn. How many questions do you need? What I'm going to do is divide the number of questions by n, to get the information per symbol. So basically I concatenate, which means putting things next to each other one by one, n variables, and I want to resolve them all. Now, suppose each y takes three possible values, 0, 1, 2, and I have n of them to resolve. How many questions do I need in total? Is it the ceiling of log base 2 of 3, times n? Very good: no, you shouldn't multiply by n after taking the ceiling. What you need to do is multiply by n first and then take the ceiling, that is, the ceiling of n times log base 2 of 3. These are not the same, and this is a confusion I should address. Now, suppose I have y1 and y2. Each of them can take three possible values, so when I concatenate the two, the pair has 3 to the power of 2 possible values. Why? Because every possible value of y1 can be combined with each of the different values of y2, since they are independent. So you have 0, 1, 2 against 0, 1, 2, and altogether you have 9 combinations: 9 possible values for the pair (y1, y2).
Then you would have to ask log base 2 of 9 questions, but of course this is not in general an integer, so you take the ceiling. Now, what about the information rate? The information rate is the total number of questions divided by the number of symbols you resolve. To emphasize: when I talk about rate, it means the amount of information per symbol, and we have n symbols here. When you do this and let n go to infinity, guess what happens: the rate converges to log base 2 of 3, and that is what we wanted, because they are iid samples. The cost is amortized. In computer science, we can often solve a whole collection of problems efficiently by avoiding this kind of integer effect; we say the cost is amortized over the set of problems, averaged out. Exactly like this. How do I prove the convergence? Because I can bound the ceiling to within one: the total number of questions is n times log base 2 of 3, plus an order-one term that is between zero and one and does not grow with n. When I divide by n to calculate the rate and let n go to infinity, that term goes away; the n in the first term cancels, and you get exactly log base 2 of 3. I hope this is clear. Now, can we do even better, better than log base 2 of 3? What I haven't told you yet is whether this is the minimum possible number of questions needed to resolve the value of y. It seems to be. Well, let me show you that indeed you can do even better than log base 2 of 3. After I describe it, it won't be a surprise to you.
The trick is that the distribution of y need not be uniform; I never said it has to be. Suppose the distribution is like this: you have half a chance of getting zero, and a quarter chance each of getting one and two. Now let's look at two trees: one tree is like this, the other tree is like that.
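The amortization argument can be checked numerically: the per-symbol number of questions, the ceiling of n times log2(3) divided by n, approaches log2(3). A minimal sketch:

```python
import math

# Rate = ceil(n * log2(3)) / n : the ceiling cost is amortized over n symbols.
for n in (1, 2, 10, 100, 10_000):
    rate = math.ceil(n * math.log2(3)) / n
    print(n, rate)

print("log2(3) =", math.log2(3))  # about 1.585
```

Note the order of operations: taking the ceiling before multiplying by n would give 2 bits per symbol forever, which is exactly the confusion addressed above.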

Main point

Full script

These are coding trees. The splitting points are exactly the same in both; it's just the order of the splits that differs. In the first tree, I ask whether y is bigger than 0.5 first. If it is smaller, then I know for sure the value of y is zero. But if it is bigger, I still have two possibilities: y may be one, it may be two. How do I resolve that uncertainty? I can ask again whether y is bigger than 1.5. If it is, I know for sure it's two; otherwise I know for sure it is one. So this is a valid tree, and if I translate it into a code, I get a variable-length code: encoding zero takes only one bit, but encoding the other two values, one and two, takes two bits each. Now compare this with the tree on the right-hand side. The only difference is that I swapped the order of the two questions. I now ask whether y is bigger than 1.5 first. If it is, I know for sure the value of y is two. Otherwise, there are two possibilities: it may be zero, it may be one. I ask the other question, is y bigger than 0.5, and then I can resolve the uncertainty. Which tree is better? Do they both need log base 2 of 3 questions, rounded up to two? I still need at most two questions in each tree, but notice that sometimes I can get away with just one question. In the left tree, if the value of y is zero, I need just one question. In the right tree, if the value of y is two, I need one question, but if y is zero, I actually need two questions. So the trees differ slightly across the cases. The question is which one is better, say in terms of the expected number of questions, that is, the expected codeword length.
Can anyone calculate the expected codeword length of the left tree and of the right tree, and compare whether they perform the same or one is better? On the left-hand side: half of the time I get zero, so half of the time the code length is 1; a quarter of the time the codeword length is 2, and another quarter of the time it is 2. Summing up, I get an expected codeword length of 1.5. Very good. What about the tree on the right? I have one half times two, plus a quarter times two, plus a quarter times one. Summing up: 1, plus 0.5, plus 0.25, which gives 1.75. Notice this is longer, meaning it is worse. So the left tree is better, because its expected codeword length is smaller. I hope the idea is clear: if you know the distribution of the symbols, you can calculate the expected codeword length, and for the better tree it is indeed smaller than log base 2 of 3 (for the tree on the right it is actually bigger). Remember, log base 2 of 3 is about 1.585, but what we can achieve with the better tree is 1.5. So we do better. Okay, in general, how do we encode?
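The expected-length computation can be replayed in a few lines. This is a sketch; the dictionaries just record the per-symbol code lengths read off each tree in the example.

```python
# Distribution from the example: P(0) = 1/2, P(1) = 1/4, P(2) = 1/4.
p = {0: 0.5, 1: 0.25, 2: 0.25}

# Left tree (ask "y > 0.5?" first): lengths 1, 2, 2.
left_len  = {0: 1, 1: 2, 2: 2}
# Right tree (ask "y > 1.5?" first): lengths 2, 2, 1.
right_len = {0: 2, 1: 2, 2: 1}

def expected(lengths):
    """Expected codeword length: sum of p(y) * l(y)."""
    return sum(p[y] * lengths[y] for y in p)

print(expected(left_len))   # 1.5
print(expected(right_len))  # 1.75
```

Giving the shorter codeword to the more probable symbol is what makes the left tree win.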

Main point

Full script

How do we encode in such a way that we have a very small expected codeword length? Do I have to try out all the possible coding trees? If so, that would be very time-consuming. Well, there is something called entropy coding, and from the name of this coding strategy you can already associate it with an information quantity I'm going to talk about: entropy. I briefly mentioned entropy before, and I defined it the same as Info(D). Indeed, entropy is a notion from information theory, and when you apply it in data science to evaluate the impurity of a data set, it gets another name: the info of a particular data set D. So what is entropy, and what is entropy coding? Let me use some mathematical notation: denote the codeword length of a particular value y by l(y). Entropy coding says we can have a variable-length code for y with the length of symbol y equal to the ceiling of the log of the reciprocal of its probability. Let's calculate the reciprocal of the probability, then the log, then the ceiling, on this particular example. Note that, unlike in calculus where log usually means the natural logarithm, in all my notation log means log base 2. When I take the reciprocal of the probability one half, I get two, so the length is log of 2, which equals 1: I assign one bit to the symbol zero. For the next symbol, the reciprocal is 4 and log of 4 is 2; and log of 4 is 2 again for the last one. So this assignment corresponds exactly to the better coding tree we just identified.

Main point

Full script

So in this example, the code length I use to encode a more probable symbol is shorter, because I take the reciprocal of the probability before I take the log. Log is an increasing function of its argument, and the reciprocal is a decreasing function, so overall the length is a decreasing function of the probability. The more probable a symbol is, the shorter the code we assign to it. That makes perfect sense, because it reduces the expected codeword length: more probable symbols receive fewer bits. So if I use entropy coding, what is the expected codeword length? First of all, I would write p_Y(y) times l(y) and sum over all the possible values of y. Very good; in expectation notation, that is E[l(Y)]. Notice that I use the capital Y to denote the random variable and the little y for its realization; this notation makes clear what is actually random and what randomness we are averaging over. This quantity is the analogue of Info(D); we don't have data here, but the class distribution is precisely specified. However, there is something slightly awkward: there is a ceiling in there. The expected codeword length is almost the entropy, because if I substitute l(y) = log of 1 over p_Y(y), that is exactly the definition of the entropy of Y; but I have a ceiling there. In this case, the expected codeword length is precisely 1.5 and the ceiling changes nothing,
because each log of a reciprocal is already an integer. So this is exactly the entropy in this example.
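The entropy-coding length assignment and the resulting expected length can be computed directly. A sketch, assuming the example distribution P(0) = 1/2, P(1) = P(2) = 1/4:

```python
import math

p = {0: 0.5, 1: 0.25, 2: 0.25}

# Entropy coding: l(y) = ceil(log2(1 / p(y))).
lengths = {y: math.ceil(math.log2(1 / py)) for y, py in p.items()}

expected_len = sum(py * lengths[y] for y, py in p.items())
entropy = sum(py * math.log2(1 / py) for py in p.values())

print(lengths)                # {0: 1, 1: 2, 2: 2}
print(expected_len, entropy)  # both 1.5 here: every log2(1/p) is an integer
```

For distributions whose probabilities are not powers of 1/2, the ceiling makes the expected length slightly larger than the entropy, which is exactly the gap the concatenation argument below removes.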

Main point

Full script

Now, how can I say that the expected codeword length is precisely the entropy? I have to find some way of getting rid of the ceiling, and we have already applied the trick just now, when we insisted that asking log base 2 of 3 questions is possible: the strategy is concatenation. If I concatenate a sequence of symbols and find the expected codeword length of the whole sequence, maybe it gets exactly to the entropy per symbol. You can probably prove that in your head, but indeed I'm going to prove something stronger: the coding rate is almost surely the entropy. So let's prove it formally. According to entropy coding, all we need to do is assign to the sequence a codeword length equal to the ceiling of the log of the reciprocal of its probability. That is exactly entropy coding, just applied not to one symbol but to a sequence of symbols. Now I'm going to simplify this expression so I can prove what I just said: the codeword length, not just the expected one, is almost surely equal to the entropy per symbol. How do I do that? Because all the y's are independent and identically distributed, I can rewrite the joint probability of the sequence as a product of the individual marginal probabilities; this is by the assumption of independence and identical distribution. Once I do this, what can I do? We often apply the trick that log of a times b equals log a plus log b, and I can apply it here to simplify further. That gives me the sum over the symbols i of the log reciprocals of the individual marginal probabilities. So I get a sum.
Now, to be concrete, let me apply this derivation directly to the example we have just seen.

Main point

Full script

That is the case where y1 and y2 are iid, with probability of 0 equal to one half and probabilities of 1 and 2 equal to 0.25 each. I can calculate the joint probability: p(y1, y2) is just the product of p(y1) and p(y2). When a symbol is zero, its probability is 0.5; when it is one or two, its probability is 0.25. I deliberately write each probability as 2 to the power of minus the code length, so the code length sits in the exponent. When I combine them, the code length needed to encode the pair (0, 1), for example, is indeed 3 bits; you can see it here, and when I take the log of the reciprocal of the joint probability I get three. So you get a list of codeword lengths, and you can divide each by n to see whether we reach the entropy of 1.5. Going over the possible pairs, the total lengths are 2, 3, and so forth, and dividing each by n = 2, you can see the rate is random: I get 1, 1.5, 1.5, 1.5, 2, 2, then 1.5, 2, and 2. The value 1.5 appears quite often, and what I'm going to say is that as n goes to infinity, almost surely you will see 1.5: with probability one, you will see a value very close to 1.5. What do I mean? I apply the same trick as before: the ceiling contributes only an order-one term to the expression, and when I divide by n, that becomes order 1/n, which goes away as n goes to infinity. So as n goes to infinity, we are looking at just the remaining expression. What is that expression, and why do I say it is almost surely the entropy?
Well, that is by the law of large numbers, because this is a sample average. An average of what? Of the log reciprocal of the probability. I'm thinking of the log reciprocal as a random variable; what is random about it is the argument y, the value of the symbol. I'm using the entropy coding formula, which is the ceiling of the log of the reciprocal. When I look at the expression, it is nothing but the sample average of iid realizations of the random variable log of 1 over p_Y(Y), and by the law of large numbers, the sample average converges to the true expectation almost surely, with probability going to 1 as n goes to infinity. And what is this expectation? It is exactly the entropy of Y. So now I can give a very precise operational meaning to the entropy of Y: it is the codeword length per symbol you need to encode the random variable Y, and therefore it is the information content. I hope this idea is clear. But then the question, of course, is: is entropy coding the best possible? Can we do better than entropy coding? Entropy coding is an achievability scheme: a particular scheme that achieves a coding rate equal to the entropy of Y. That by itself doesn't say you cannot do better; it is conceivable that you might come up with a coding scheme even better than entropy coding. Is it possible? The answer turns out to be no. Indeed, you can prove, using Kraft's inequality, that you can never encode with an expected codeword length smaller than the entropy of Y, and this is what we mean by a converse result in information theory. A converse is a result that proves an information-theoretic limit applying to any coding scheme. What is very elegant about information theory is that it can somehow give you the precise limit of how much you can do.
You can ask: if I have a file, what is the minimum file size I can compress that file to? That is what a converse result in information theory can tell you. Indeed, the entropy of Y is exactly that quantity: if you think of Y as a file you want to compress, then the entropy of Y is the minimum size that file can be compressed to. Speaking of this converse result: the argument is somewhat involved, but you can do it. Basically, there is something called the Kraft inequality, which says that no matter what coding scheme you use, if you require the scheme to resolve the uncertainty of y, then the sum over i of 2 to the power of minus l_i must be less than or equal to one, where l_i is the length of the code you assign to symbol i. Once you have this inequality, you can think of it as a constrained optimization problem: you minimize the expected codeword length, which is also a function of the l_i, subject to this constraint. Then you apply a relaxation, treating l_i as potentially a real number rather than an integer, and you solve the problem. You find that the optimal l_i is exactly what entropy coding prescribes, log of 1 over the probability, which gives the entropy, and that is the minimum possible. So I've told you the proof idea, but not the full proof; it's a bit too heavy and not required for this course. But it is important to understand the interpretation of entropy, the operational meaning of entropy.
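The almost-sure convergence of the coding rate to the entropy can be simulated. An illustrative sketch, not lecture code; `coding_rate` is a made-up helper that encodes a whole block of n iid symbols with a single ceiling, as in the derivation above.

```python
import math
import random

random.seed(0)
p = {0: 0.5, 1: 0.25, 2: 0.25}
entropy = sum(q * math.log2(1 / q) for q in p.values())  # 1.5 bits

def coding_rate(n: int) -> float:
    """Draw n iid symbols, encode the block with
    l = ceil(log2(1 / P(y1..yn))), and return bits per symbol."""
    ys = random.choices(list(p), weights=list(p.values()), k=n)
    log_recip = sum(math.log2(1 / p[y]) for y in ys)  # log2 of 1/P(block)
    return math.ceil(log_recip) / n

for n in (10, 1000, 100_000):
    print(n, coding_rate(n))
# The rate concentrates around the entropy, 1.5, as n grows.
```

The sum inside `coding_rate` is exactly the sample average argument: by the law of large numbers it converges to E[log2(1/p_Y(Y))], the entropy.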

Main point

Full script

Now, we can also talk about entropy not just for one random variable but for a concatenation of random variables, and these variables need not be iid. All we need to do is think of them as a random tuple or random vector, and apply the same formula to measure the information content: instead of the probability of y alone, we use the joint probability. We can also define something called the conditional entropy, and the definition is this: you take the entropy of (x, y) and subtract the entropy of x. What it is trying to capture is: what is the information content in y that is not in x? The joint entropy is the information content of both x and y together, and when I remove the information content of x, what remains is exactly the conditional entropy, the information content in y but not in x. You can apply linearity of expectation to simplify this into a direct formula for the conditional entropy; here I apply Bayes' rule to rewrite the ratio of probabilities as the conditional probability. The formula is actually quite easy to remember: the conditional entropy of y given x is nothing but the expectation of the log reciprocal of the conditional probability, the probability of y given that you have observed x. The joint entropy is defined in a similar way: we modify the original entropy formula by replacing the probability of x with the joint probability of x and y. Indeed, a better way to remember all of this is something called the Venn diagram interpretation. Think of information as a set, with area proportional to the information content: if you have more content, the set becomes bigger.
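The definitions can be checked on a small example. This is a sketch with a hypothetical joint distribution, not one from the lecture; `H` and `pxy` are illustrative names.

```python
import math

# A hypothetical joint distribution P(x, y) over x in {0, 1}, y in {0, 1}.
pxy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}

def H(dist):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return sum(q * math.log2(1 / q) for q in dist.values() if q > 0)

# Marginal of X, obtained by summing the joint over y.
px = {x: sum(q for (a, _), q in pxy.items() if a == x) for x in (0, 1)}

H_xy = H(pxy)               # joint entropy H(X, Y)
H_y_given_x = H_xy - H(px)  # conditional entropy H(Y|X) = H(X, Y) - H(X)
print(H_xy, H_y_given_x)
```

The same `H` formula handles the single variable, the tuple, and (via the subtraction) the conditional case, which is the point of the definition.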

Main point

Full script

The information content of x and y may overlap; maybe two files actually contain the same content. Maybe you are talking about, let's say, the Bible. The Bible can be written in English or in Chinese, but essentially the two versions have the same content, so if you draw them, they will have a lot of overlap. Now, the joint entropy is the area of the union. Now, if you had to describe the moon-shaped area, how would you describe it in terms of entropy? Can anyone tell me the entropy value for this? Is it the entropy of x? Well, no, because the entropy of x is actually the area of the entire red part. What is it, then? The moon-shaped area is part of the information content of x, but that part has nothing to do with y, because y is on the right-hand side. So, exactly, the area can be calculated as the entropy of x and y minus the entropy of y. And what is that? Well, as defined on the left-hand side, it is the conditional entropy of x given y. So understanding conditional entropy is very easy: you look at the Venn diagram and you see that this area corresponds to the region of x that is not contained in y, so it is the information in x but not in y, that is, x given y. Similarly, the moon on the other side is y given x. Now what about the middle one? How do you express the middle area in terms of entropy? Can anyone write a formula? Very good: if I take the entropy of x, which is the red part, and then subtract the moon-shaped area, I get the middle area. But of course there are other ways to calculate it as well. I can take the entropy of y, that is, the blue shape, and then subtract the other moon-shaped area; that also works. Indeed, you can also write it as the entropy of x plus the entropy of y minus the joint entropy. How come?

Main point

Full script

Well, you are taking the area of x plus the area of y, and then you subtract the area of the union. When you add the area of x and the area of y, the middle part is double counted, because the middle region appears in both x and y. If I just want the area of the middle region, I have to subtract the area of the union, and I get exactly this. This is called the inclusion-exclusion principle in combinatorics. I think it is pretty easy to see, and indeed you can also apply the formula for the conditional entropy of y given x to get the same expression: the entropy of y given x is nothing but the entropy of x and y minus the entropy of x, and substituting this in, you get the entropy of x plus the entropy of y minus the joint entropy of x and y. I hope this is clear. We call this area the mutual information between x and y. Why do we call it the mutual information? Well, just from the Venn diagram: this is the information that appears not only in x but also in y, so it is mutual; it is the information common to both x and y. That is why we call it the mutual information. And this quantity is extremely important, because this is the quantity you use to build a decision tree.
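The inclusion-exclusion identity can be checked numerically; here is a small sketch (the joint distribution and names are my own), computing I(X;Y) = H(X) + H(Y) − H(X,Y) and confirming it agrees with H(Y) − H(Y|X):

```python
from math import log2

# A made-up joint distribution p(x, y).
p_xy = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.375, (1, 1): 0.125}

def entropy(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

def marginal(joint, axis):
    """Sum out the other coordinate to get the marginal along `axis`."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0) + p
    return out

H_x = entropy(marginal(p_xy, 0).values())
H_y = entropy(marginal(p_xy, 1).values())
H_xy = entropy(p_xy.values())

# Inclusion-exclusion: the "middle" region of the Venn diagram.
I_xy = H_x + H_y - H_xy
# The same region computed the other way: H(Y) - H(Y|X), with H(Y|X) = H(X,Y) - H(X).
assert abs(I_xy - (H_y - (H_xy - H_x))) < 1e-12
```

For this table the mutual information comes out small but positive, reflecting the weak dependence between the two variables.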

Main point

Full script

Well, not quite: the early version, which we call ID3, the Iterative Dichotomiser 3, used the information gain. But later on we use the information gain ratio instead, because later algorithms such as C4.5 realized that there is a bias towards features with many, many outcomes, and to normalize this away we need to use something called the split information. Now, before I say all that, it would be a good idea for me to point out that the formula you have been using is exactly mutual information. First of all, you already know that Info(D) is nothing but the entropy of the class distribution of y. Info of D_j is the conditional version: it is the entropy of D restricted to the samples where x takes the particular value j. But this alone is not yet the conditional entropy of y given x.

Main point

Full script

So this alone is not the conditional entropy of y given x; we also have to average over the possible values of j. This is what we usually write as x equal to j, a particular realization of x. If we average it out over all possible outcomes of x, then we get exactly the conditional entropy. Notice that here x is a random variable that can take the different values j, and this averaging gives you the conditional entropy of y given x. So Info_X(D) is nothing but the conditional entropy of y given x. Does it make sense? It is saying: how much information content is left in y after you have observed x? So after you have split using x, what is the impurity that is left for you to resolve? That is why Info_X(D) is what we look at when we try to decide how good an attribute is at reducing the class impurity. Now, when I compute the difference, it is exactly the entropy of y minus the conditional entropy of y given x. And what is this? This is the mutual information between x and y. So the information gain we calculate, that is, the drop in impurity measured in terms of entropy, is exactly mutual information. It says how much information content x and y share; indeed, it says how informative the feature x is in predicting y. If x and y are independent, then the mutual information is equal to zero and there is no drop in impurity. Now, again, an attribute with many, many outcomes tends to have a large mutual information. When we apply the decision tree induction algorithm, we normally do things greedily, and if we split by an attribute x that has many outcomes, it is often an over-commitment. Indeed, the number of comparisons required for a random variable with many outcomes is larger, so in terms of the amount of information gained per question asked,
having a large mutual information alone is not enough, and we normalize it by the split information. What is the split information, though? Well, the split information is indeed the entropy of x; I would like you to verify this later on. So what we are doing is looking at, per bit of information in x, how much is related to y. I am just calculating the ratio between this overlap region and the entire region for x. If this ratio is high, it means most of the information in x is actually related to y. So the gain ratio is the mutual information between x and y divided by the entropy of x. Indeed, I would say that this Venn diagram is like: one picture, and you understand everything. It is very useful. Well, now let's look at a more complicated setting.
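To make these decision tree quantities concrete, here is a hedged sketch (the toy dataset and all names are made up for illustration) computing Info(D) = H(Y), Info_X(D) = H(Y|X), the information gain I(X;Y), the split information H(X), and the gain ratio:

```python
from math import log2
from collections import Counter

# Toy dataset of (feature x, class label y) pairs; rows are made up.
rows = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
        ("rain", "yes"), ("rain", "yes"), ("rain", "no"),
        ("overcast", "yes"), ("sunny", "yes")]

def entropy_of(labels):
    """Entropy of the empirical distribution of a list of labels, in bits."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

info_D = entropy_of([y for _, y in rows])            # Info(D) = H(Y)

# Info_X(D) = H(Y|X): average the entropy of each branch, weighted by size.
branches = {}
for x, y in rows:
    branches.setdefault(x, []).append(y)
info_X = sum(len(ys) / len(rows) * entropy_of(ys) for ys in branches.values())

gain = info_D - info_X                               # information gain = I(X;Y)
split_info = entropy_of([x for x, _ in rows])        # split information = H(X)
gain_ratio = gain / split_info
```

The weighted average over branches is exactly the averaging over realizations x = j described above.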

Main point

Full script

Suppose I have two features, x1 and x2, and I want to think about how relevant they are with respect to y. How can I compare, and what kind of information measures can I use? One thing is something called the chain rule. If you ask how much information x1 and x2 together share with y, that corresponds to this overlap, and you can further decompose this overlap into two parts. How come? Let's label the areas of the different regions: I write a, b, c, d, corresponding to the areas of the four regions. Now look at the formula on the left-hand side: this mutual information corresponds to the shaded region, that is, b plus c plus d. What we would like to do is verify the equality: is it true that b plus c plus d is equal to the sum of these two mutual informations? Of course, I need to translate the mutual informations into the variables I use. The mutual information between x1 and y, what is it? It is this region here, and it corresponds to b plus d. And what about the other one, the mutual information between x2 and y given x1? Given x1 means you do not look at the randomness in x1; what is the information that is mutual to x2 and y? That is this region here, once I cover up x1: it is c. So now you have basically proved the identity: using the Venn diagram, you proved that the mutual information on the left is exactly equal to the sum of the mutual informations on the right-hand side, and there is a notion of conditional mutual information that can be easily defined using this diagram. This is what we call the chain rule of mutual information. Now, another fundamental property of information is that conditioning reduces entropy.
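The chain rule I(X1,X2;Y) = I(X1;Y) + I(X2;Y|X1) can be checked numerically on any joint distribution; below is a small sketch (the distribution and names are my own), expressing each quantity via entropies of marginals:

```python
from math import log2
from itertools import product

# A made-up joint distribution p(x1, x2, y) over three bits (weights sum to 36).
p = {triple: w / 36 for triple, w in
     zip(product([0, 1], repeat=3), [1, 2, 3, 4, 5, 6, 7, 8])}

def H(group):
    """Entropy (bits) of the marginal over the given coordinate indices."""
    marg = {}
    for outcome, prob in p.items():
        key = tuple(outcome[i] for i in group)
        marg[key] = marg.get(key, 0) + prob
    return sum(q * log2(1 / q) for q in marg.values() if q > 0)

# Coordinates: 0 -> X1, 1 -> X2, 2 -> Y.
I_x1x2_y = H([0, 1]) + H([2]) - H([0, 1, 2])                     # I(X1,X2;Y)
I_x1_y = H([0]) + H([2]) - H([0, 2])                             # I(X1;Y)
I_x2_y_given_x1 = H([0, 1]) + H([0, 2]) - H([0, 1, 2]) - H([0])  # I(X2;Y|X1)

# The chain rule: total shared information splits into the two parts.
assert abs(I_x1x2_y - (I_x1_y + I_x2_y_given_x1)) < 1e-9
```

The identity holds for any distribution because both sides expand into the same combination of entropies.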
So if I have the entropy of y and I condition it on x1, what happens? It drops, because if you incorporate more information, then you are able to resolve the uncertainty of y better. It reduces the impurity of y, so the remaining information content reduces. And if I have more random variables, say two features x1 and x2, then the conditional entropy of y given x1 and x2 is even smaller. This is the idea that conditioning reduces entropy. But how do you prove it? You may think it is obvious from the Venn diagram: the entropy of y is at least the entropy of y given x1, because one region contains the other. The tricky thing is that, first of all, entropy may be negative: if x or y is a continuous random variable, the entropy can actually be negative. But the other issue is that the center region d may not have positive area, even when x1, x2 and y are discrete random variables and all three entropies are positive. It is hard to think of an area being negative, but indeed you can come up with an example where the middle part is negative. So then, what can we say? Can information be negative? It turns out mutual information must be non-negative; the center part d, indeed, cannot be written purely as a mutual information, it has to be written as a difference of two mutual informations. Now, why am I saying this? Well, the entropy of y minus the entropy of y given x1 is precisely the mutual information between x1 and y, so the fact that conditioning reduces entropy, if you can prove it, is equivalent to saying that the mutual information between x1 and y is non-negative.
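A standard example where the middle part is negative is the XOR; here is a minimal check (names are my own): with X1, X2 independent fair bits and Y = X1 XOR X2, each pairwise mutual information is zero, yet conditioning reveals a full bit, so the center region d = I(X1;Y) − I(X1;Y|X2) comes out as −1.

```python
from math import log2
from itertools import product

# XOR example: X1, X2 independent fair bits, Y = X1 xor X2.
p = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product([0, 1], repeat=2)}

def H(group):
    """Entropy (bits) of the marginal over the given coordinate indices."""
    marg = {}
    for outcome, prob in p.items():
        key = tuple(outcome[i] for i in group)
        marg[key] = marg.get(key, 0) + prob
    return sum(q * log2(1 / q) for q in marg.values() if q > 0)

I_x1_y = H([0]) + H([2]) - H([0, 2])                             # I(X1;Y) = 0
I_x1_y_given_x2 = H([0, 1]) + H([1, 2]) - H([0, 1, 2]) - H([1])  # I(X1;Y|X2) = 1
center = I_x1_y - I_x1_y_given_x2                                # region d = -1
```

So the Venn diagram picture is only a mnemonic: individual regions like d need not be non-negative, even though every mutual information and conditional mutual information is.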
Not only that: if you look at the second inequality there and subtract one side from the other, you get the mutual information between x2 and y given x1, and that is non-negative if the inequality holds. So proving that conditioning reduces entropy is indeed equivalent to proving that mutual information and conditional mutual information are non-negative. But can you prove that? That is in the tutorial. Indeed, it follows from a property of the KL divergence, or information divergence, and that in turn follows from Jensen's inequality; people also use the log-sum inequality. So what is the meaning of this mutual information? Let's say you want to predict the value of y, and you can combine two features, x1 and x2, together to predict it. The question you would like to answer is: when I combine the two features, how relevant is the combination to y? This mutual information captures exactly that. Sometimes x1 and x2 might carry the same information content; you have probably noticed this in the attribute selection part of project one, because when two random variables share exactly the same information, you do not really need the other one, you can just use one of them to predict. What is the meaning of the semicolon? The semicolon is just a way to separate the two parts of the expression; in information theory, the mainstream notation uses a semicolon to separate the two variables we calculate the mutual information of. There is also another way to write it using the wedge sign; in lattice theory the wedge means the meet operator. If you do not know lattice theory, do not worry about it; it is basically the intersection, and that is just the meaning of that symbol.
But the semicolon is easier to type; in LaTeX you would have to write \wedge to get the other symbol. Now, the comma means concatenation: you just concatenate x1 and x2 together as one random variable and you find its shared information with y. So that is the idea. What this chain rule means, let me repeat, is that the total information x1 and x2 contain about y can be split into two parts. One is the information x1 contains about y; of course, if you look at x1 and x2 together, one component should be just the information x1 has about y. But then there is another component: other than the information in x1, what is the information x2 contains about y? That is the other component of the total mutual information. Both parts are non-negative. You can show this from a property of the KL divergence, and also because it makes sense to think of mutual information as non-negative; I would not be able to make sense of negative information, that does not sound right to me. Indeed, the fact that an additional observation, say of x1 or x2, can only help resolve the uncertainty is exactly the same as saying that mutual information is non-negative. So conditioning reduces entropy is equivalent to the non-negativity of mutual information. All of this makes perfect sense. The reason why I like information theory is exactly because it is intuitive: I can understand the intuitive concepts very well, and mathematically it is also precise. The beauty of information theory is that it puts our intuitive understanding into concrete mathematical statements. And why is it important to do that?
Well, because for the decision tree induction algorithm, when you ask the computer to do it, you had better describe precisely what is relevant and what is not, and the formula of mutual information is used to tell whether a feature is relevant in predicting the target. So that formula is the golden rule; that is why I asked you to remember it. Now, another very intuitive property is called the data processing inequality. What is it? Suppose this conditional mutual information is equal to zero. What does it mean? When I condition on x1, the mutual information between x2 and y is this part, c, and I am saying that this part is equal to zero. That means there is no information x2 contains about y in addition to x1; indeed, all the relevant information x2 has about y is already contained in x1. So think about feature selection: if you have to select between x1 and x2, which one would you use? Well, x1 already contains all the information x2 has about y. Another way to write this is as a Markov chain: given x1, x2 is independent of y. These two are equivalent: writing this Markov chain versus writing this equation. Now, when we say data processing, what do we mean? Often we want to pre-process features: from the feature x1 I can pre-process it and generate another feature x2. If I do that, then it will satisfy this Markov chain condition, because all the information x2 has has to go through x1; the information x2 has about y is obtained through x1.
Now, what the data processing inequality says is: if you have this Markov chain, and you compare how relevant the individual features are to y, that is, x1 with y versus x2 with y, which one has the larger mutual information? Because all the information content x2 shares with y has to come from x1, intuitively the mutual information on the left should be bigger than the mutual information on the right. This intuitive understanding again follows from the chain rule. Once you have the chain rule, which we proved just now using the Venn diagram, I can write it in another way: instead of expanding as x1 with y first and then the conditional term, I can start with x2 with y, and then add the term conditioned on x2, that is, the mutual information between x1 and y given x2. This is just a symmetric way of writing the same chain rule. Now, what is the point of doing this? Sometimes doing seemingly redundant things actually gives you something meaningful, and this is one of those cases. Suppose this conditional mutual information is equal to zero; that is the premise under which the data processing inequality holds, so one of the terms, this one in particular, is equal to zero. Then can I deduce something about these two mutual informations, which one is bigger? Yes, we want to show an inequality like this. How can we get it? Well, first of all, we have equality if I also include this conditional mutual information, but that conditional mutual information is non-negative.

Main point

Full script

Okay, this conditional mutual information is non-negative, so I can lower-bound the expression by the mutual information between x2 and y. Now notice that the other term goes away by the premise of the data processing inequality, so we are left with only this term on the left and this term on the right, and that is exactly the inequality we wanted to prove. So this is the data processing inequality. Indeed, there are many inequalities you can prove; some are not very obvious. What I have shown you are the obvious ones, but there can be more complicated inequalities, and indeed all of them follow from the basic ones shown here. A lot of these inequalities are based on a property of entropy called submodularity, and this is well beyond what even an information theory course would require. Now, there is a program called ITIP, which will automatically prove whether a given information inequality holds or not. You can try it out: all of these inequalities can be proved using this information inequality prover. There are also inequalities that are hard to prove, that cannot be derived from the submodularity of entropy or by this prover. Those are the non-Shannon-type inequalities, and they are still an active area of research; we actually know very little about those inequalities.
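The data processing inequality can also be checked numerically; in this sketch (all names and the random construction are my own) X2 is generated from X1 through a channel, so Y → X1 → X2 is a Markov chain by construction, and I(X1;Y) ≥ I(X2;Y) must hold:

```python
from math import log2
from itertools import product
import random

random.seed(0)

# Build a random joint p(x1, y), then pass X1 through a random binary
# channel q(x2 | x1), making Y -> X1 -> X2 a Markov chain by construction.
vals = [0, 1]
w = {pair: random.random() for pair in product(vals, vals)}
Z = sum(w.values())
p_x1y = {pair: v / Z for pair, v in w.items()}
q = {x1: random.random() for x1 in vals}      # q[x1] = P(X2 = 1 | X1 = x1)

p = {}                                        # joint p(x1, x2, y)
for (x1, y), pr in p_x1y.items():
    p[(x1, 0, y)] = pr * (1 - q[x1])
    p[(x1, 1, y)] = pr * q[x1]

def H(group):
    """Entropy (bits) of the marginal over the given coordinate indices."""
    marg = {}
    for outcome, prob in p.items():
        key = tuple(outcome[i] for i in group)
        marg[key] = marg.get(key, 0) + prob
    return sum(t * log2(1 / t) for t in marg.values() if t > 0)

# Coordinates: 0 -> X1, 1 -> X2, 2 -> Y.
I_x1_y = H([0]) + H([2]) - H([0, 2])
I_x2_y = H([1]) + H([2]) - H([1, 2])
assert I_x1_y >= I_x2_y - 1e-9                # the data processing inequality
```

Changing the seed changes the numbers but never violates the inequality, since the Markov structure is built in.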

Main point

Full script

Okay, with that, this completes the lecture today. This lecture indeed touched on the fundamental work of Shannon half a century ago. You can read the paper; it is a long paper, many pages, but it is very easy to read, because it is written in a style that is very intuitive and convincing. Often an author who needs to start a new field has to put a lot of effort into convincing other people to follow his ideas, and you can see a lot of very elegant ideas in this paper. The other book I would recommend is the information theory textbook, Elements of Information Theory. So if you are into information theory, which I think is fundamental in data science, then I would recommend you start with Elements of Information Theory. It is not about data science, though; it is more about the communication problem. Indeed, information theory has many applications: it has applications to telecommunication, network coding and so on, and it also has applications to security problems. The source coding problem is indeed what gives a very concrete operational meaning to entropy, or information. We have described something called entropy coding today; there is also something called Huffman coding you can explore, and if you want to encode something without knowing the underlying probability distribution, then there is also the Lempel-Ziv algorithm, which is the zip algorithm that compresses your files. Of course, the zip algorithm we use now has additional features beyond the original Lempel-Ziv algorithm; you can check the DEFLATE algorithm and so on. So with that, yes, there are a few interesting stories behind all this.
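Since Huffman coding is only mentioned in passing, here is a hedged sketch of the idea (function name and the example frequencies are my own): repeatedly merge the two least probable groups, prefixing 0 and 1. For a dyadic distribution like the one below, the average code length matches the entropy exactly, 1.75 bits.

```python
import heapq

def huffman_code(freq):
    """Build a binary prefix code for a {symbol: probability} dict."""
    # Heap entries: (weight, unique tiebreaker, {symbol: code-so-far}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)       # two least probable groups
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (p0 + p1, tie, merged))
        tie += 1
    return heap[0][2]

freq = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(freq)
avg_len = sum(p * len(code[s]) for s, p in freq.items())  # 1.75 bits
```

This is the operational meaning of entropy mentioned in the lecture: no uniquely decodable code can do better than the entropy on average.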

Main point

Full script

You can take a look at this: there are fundamental axioms that are satisfied by entropy. And then there is also the question of attribution: who coined the term 'bit', who first came up with it? Shannon is the person who first used 'bits' in a paper, but he attributed the name to John Tukey.

Main point

Full script

And von Neumann, who, among other things, applied the first electronic computer, the ENIAC, to design the hydrogen bomb using Monte Carlo simulation: he is actually the one who suggested the term entropy to Shannon. There is an interesting quote from him on why to choose the word entropy. He said you should call it entropy for two reasons. In the first place, this uncertainty function has already been used in previous work under the name entropy. The other reason is that no one knows what entropy really is, so in a debate you will always have the advantage, because when you say something that sounds very technical, you are sometimes convincing. Okay, good, that's all. Sorry for overrunning.

Main point

Full script

And now let me just mention that there are two tutorial notebooks. They are a bit theoretical; indeed, I ask you to prove something. The proofs should be manageable, but yes, they go somewhat beyond what we require for the course. Again, I am regarding all the tutorial notebooks as optional, so that should be okay: if you do not want to do them, that is fine, but if you do them, they may count towards your bonus. Now, last week's tutorial has been delayed by one week, so the deadline for last week's tutorial will be next week, but the tutorial for this topic, information theory, will also be due on the same day, because I want the grader to be able to respond to you soon after the lecture has been released.

Main point

Full script

That is, within two weeks after the lecture has been released; so, to avoid too much delay, both will be due on the same day. With that said, that is all for the lecture today. Any questions? I will stay behind for questions, so if you do not have questions, you may leave. If you have questions about grading, please send us an email and let Shu know; he is the TA responsible for the homework grading.

Main point

Full script

So please let him know and he should be able to resolve it. Okay, there are some private questions. What should I do if I fail the midterm? Well, you cannot really fail the midterm, because the midterm only counts for 20 percent of the grade; there is no passing threshold for the midterm, and 20 percent is not a lot compared to the total score from the projects and the final examination, which is 50 percent. So do not worry too much if you do not get a good grade in the midterm; it is just a learning experience, it is there for you to practice, and I am sure you can catch up. Another question: okay, I think that is fine; I will read the email about any questions about the quiz. Ah, for the final exam: yes, you do need at least 30 marks in the final exam to pass. That is a requirement from the department, but many people misunderstand this minimum requirement: it is a necessary but not sufficient condition to pass the course. So even if you get 30 or above in the final examination, it does not mean you can choose not to submit the projects and still pass the course. We have two components: one is your continuous assessment, which includes the two projects and also the quiz, and the other is the final examination. If your continuous assessment is very poor, below 40, often I would have a hard time defending you. So trust me that I actually defend students so they can pass the course, but I cannot break the rules; if the grade is too low, it is impossible for me to pass you. And yes, if you think you did not get a good grade in the midterm, then try harder; and if you want some push, I can tell you that a lot of students actually did very well in the midterm, because I can see the scores as they come in.
I can monitor the quiz in real time: your attempt is sent to the server immediately, so I can even see your detailed attempts. I do not mean to stress everyone out, but if you need motivation: some people work well under stress and comparison with others, and some do not. If you do, I can tell you, just from what I observed, that over 10 out of 40-something people got 90 or above, which is pretty high, and one student got 100 marks within 15 minutes, which is amazing; I could not do it myself. But do not be discouraged by these extreme cases either; they are outliers, and the average scores are okay. One more question: when will the final exam be held? Some students need to book a ticket. I do not know, because the final exam date is organized by ARRO, and I believe it will come out towards the end of March; I do not know exactly when, but you can check, it might be on ARRO's website. I am not very up to date, but we have to follow their schedule. From past experience, the exam may be quite late, because we are not a big class; if we had a lot of students, then I could often ask for the exam to be held earlier. Yes, the midterm grades: I should say the latest will be this weekend, Saturday, because I have an all-day meeting on Friday; if I do not finish the job tomorrow, then I would probably delay it to Saturday. Again, once you have done the midterm, do not worry too much about the grade. This is hard to do, but I would say: you have a life out there, enjoy it.
I mean, once you finish the midterm, it is a relief for you, so enjoy, and if it did not go well, that is fine; you just aim higher for the final exam and the projects. You can always catch up, as long as you put hard work into it. Okay, the final exam timetable will be released on March 24th; very good. And will there be a high penalty for a wrong check? Yes, for multiple choice questions, because those are very easy to answer by random guessing. To discourage that, there will be a heavy penalty on multiple choice. And what do I mean by multiple choice? A drop-down menu is also a multiple choice question, and a true/false question gives you two choices as well; these are very prone to random guessing, so for those I have to impose a high penalty. For a true/false question, if you check it and get it wrong, then all the marks for that item will be gone. Ah, no, you will not get a negative score on a question; I think the question was whether several wrong checks will eventually be deducted down to a negative score, and the answer is no. For numerical questions it is very hard to guess, so the penalty there will be lower. But I hesitate to tell you the exact penalty; what I actually prefer you to do is use the check to avoid careless mistakes. Indeed, one student already told me that for one of the questions, where I asked for the minimum and maximum scores, he swapped the minimum and maximum, and this careless mistake went unchecked and was eventually submitted, and it actually cost the entire mark for the question. So once again, please make good use of the checking capability. This is actually what I wish I had when I was doing my exams; often I would get something wrong just because of carelessness, like swapping the answers.
And I am not a very careful person myself, as you could see at the beginning of the quiz: nobody could enter the quiz because I set the date to 2021, and thanks to the one of you who pointed out this mistake. Will there be a double penalty for entering the same wrong answer twice? I would say for most question types, no, but if the system senses that you modified your answer, say you added a space or changed your algebraic expression (you can have two expressions that lead to the same answer), then it could potentially cost you twice. So again, be conservative when you press the check button: try your best to fix errors, and do not do it by guessing. Is the wrong-check penalty for multiple choice applied to the whole question, or is it specific to the blank? Well, I guess what you are asking is what happens if you do not fill in some of the parts; maybe you are talking about a true/false question. If you do not specify a statement to be true, will you be penalized? If you do not specify it, it will often be regarded as you not having finished the question, so you will not be penalized. But if you specify something as true, and it is indeed false, then you will be penalized. Will the final exam also have the check function? I very much would like to provide the check function again. However, do not take it for granted, because previously there were issues of academic dishonesty precisely because of the check function; some students also complained that the check function makes it very easy to cheat. To counter this, I actually put a lot of work into writing those questions, because I still want to provide the check function to you.
In order for me to do that, what I need to do is randomize the questions, so that even for multiple choice questions everyone sees a different set of options. That takes significant effort behind the scenes, and I am actually writing a lot of code to randomize those questions. You can see that for the questions asking you to compute Info(D) and so on, I had to implement the formula for calculating the information in a different language, in Maxima, not in Python, and after that I had to randomly and intelligently generate the data so that the question is doable. Computers can do complicated things fast, but on your side, if the data is too complicated, it would take you too much time. So I think the midterm was quite a success; I mean, the fastest time to complete the quiz was 15 minutes. Next question: "Actually, I got several wrong checks on only one choice in question one; will it affect other choices or only itself?" Ah, it will only affect that part; it will not affect other parts. Yes, if you want to explore all this, you can check the Moodle question types, because there are different question types; I mainly use Moodle STACK. It is very sophisticated; you can try installing it and writing questions yourself. If you are interested, you can help me out and contribute more questions to build a question bank. Indeed, I am running a project basically to improve the current state of the online quiz, so that everyone will be able to check answers easily without worrying about being accused of plagiarism or copying answers from others. If the questions are not randomized, there is always a possible accusation that your answer might have been obtained from others.
But once I randomize the questions, no such accusation is possible, because everyone is working on different things. Another thing, from one of the hidden questions: "So just one wrong check will get a very small penalty?" Not really very small. Once again, the check penalty can be high, especially for multiple-choice and true/false questions: once you check a wrong answer, you may not get the score anymore. So be conservative about checking. But it is also necessary to check, because you make careless mistakes. Of course, if you don't check you can still get a perfect score, but I don't encourage that, because we are human, not machines, and we make mistakes. For numerical answers that are difficult to guess, I use a smaller penalty relative to multiple choice. "Will we get feedback about our wrong answers?" I'll go through some of the questions in lecture. I can actually get statistics on the performance of different questions, and based on the statistics I can discuss the issues I see. "For the split information question, I checked one question twice, and only then found out that I had to add a bracket around a certain term." I guess this is about the numerical expressions. Yes, sometimes typing out the equations is not easy; with so many brackets, you might be missing a few. The nice thing is that Moodle STACK will check and validate your expression. If a piece is missing, it will warn you and let you know there's an issue.
But sometimes, yes, it is hard to see that your braces or brackets match but a needed bracket was left out. The other thing I've done is define the entropy function h, so that you can just use h instead of writing out log base two of everything. It makes it very easy to answer the questions about calculating information gain and so on, just using the function h; you don't have to write out the logarithms. Okay, good, I think there are no more questions. I hope everyone has a good night's sleep. Don't worry about the quiz result; I'll try to finish moderation as soon as possible. Okay, have a good night.
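As an aside, the entropy shorthand h mentioned above can be sketched in Python. The actual quiz version is written in Maxima, so this is only an illustrative reimplementation under assumptions: I'm assuming h takes the class probabilities directly and returns the entropy in bits.

```python
from math import log2

def h(*probs):
    """Entropy in bits of a probability distribution.

    A sketch of the quiz shorthand: h(p1, p2, ...) returns
    -sum(p * log2(p)), using the convention 0 * log2(0) = 0.
    """
    return -sum(p * log2(p) for p in probs if p > 0)

# A fair coin has 1 bit of entropy.
print(h(0.5, 0.5))  # → 1.0

# Info(D) for a split like 9 positive / 5 negative examples:
print(h(9/14, 5/14))
```

With a helper like this, an information-gain answer becomes a short expression such as `h(9/14, 5/14) - (7/14)*h(4/7, 3/7) - (7/14)*h(5/7, 2/7)` instead of a long chain of logarithms with many brackets.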

Main point