L5/7 Information Theory Recap - YouTube - Dive into Deep Learning, UC Berkeley Open Course - P1 (Probability) - Cupoy


Dive into Deep Learning, UC Berkeley STAT 157. Original video: https://www.youtube.com/watch?v=A6ZY1SXv0RY&list=PLZSO_6-bSqHQHBCoGaObUljoXAyyqhpFW&index=26

This lecture is a background recap of the information theory material that was not fully covered in the previous class. Today there is a proof of the Kraft inequality and an explanation of why it is needed. It starts with a short review of entropy from last time: entropy is the expectation of the negative log-probability, a measure of information content. It then moves on to the Kraft inequality, beginning with a brief review of prefix codes: a prefix code is one where, while reading a stream, you can easily tell where a codeword ends, because no other codeword begins with it; in the slide's example, the code on the left is not a prefix code, while the one on the right is.

Subtitles [auto-generated]
------
So here's a little bit of a background recap of the information theory part that we didn't quite get covered on Tuesday, and I got a few questions about the proof of the Kraft inequality and why it's needed, so let's briefly review this. Remember that the entropy is given by the expectation of the negative log-probability, H(p) = -sum_j p_j log p_j, and this is a measure of the number of bits, the amount of information, the number of symbols that I need in order to store and transmit that data. Shannon's source coding theorem tells us that the lower bound on the number of bits is the entropy of the distribution taken base 2, i.e. -sum_j p_j log_2 p_j (the natural-log entropy divided by log 2). Furthermore, the entropy is concave, since p log p is convex; what that means is that the entropy of a mixture of two distributions is always greater than or equal to the mixture of the entropies. In other words, if I take two data sources and mix them up, the uncertainty can only increase relative to what I would have gotten if I hadn't mixed them, which I guess is kind of what you would expect.
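To make the recap concrete, here is a minimal Python sketch (my own illustration, not from the lecture) that computes the entropy of a discrete distribution and numerically checks the concavity statement above; the distributions p, q and the mixing weight lam are made-up examples.

```python
import numpy as np

def entropy(p):
    """Entropy in bits: H(p) = -sum_j p_j * log2(p_j), ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

print(entropy([0.5, 0.5]))              # a fair coin carries 1 bit

# Two made-up source distributions and a made-up mixing weight.
p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.1, 0.2, 0.3, 0.4])
lam = 0.3

# Concavity of H: mixing the sources can only increase the uncertainty,
# H(lam*p + (1-lam)*q) >= lam*H(p) + (1-lam)*H(q).
mixed = lam * p + (1 - lam) * q
print(entropy(mixed), lam * entropy(p) + (1 - lam) * entropy(q))
assert entropy(mixed) >= lam * entropy(p) + (1 - lam) * entropy(q) - 1e-12
```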
Okay, so then we reviewed the entropy of a couple of simple things like coins and, if you play Dungeons & Dragons, dice, and then we got to the Kraft inequality. The first thing to briefly review is prefix codes. A prefix code is one where, if I read a stream of bits, at some point I can easily determine that this is the end of a codeword, because no other codeword could have started with those bits. On the left-hand side of the slide we have something that is decidedly not a prefix code; the one on the right-hand side is a prefix code, and it is much easier to decode as well: a maps to 0, b maps to 10, and so on. So basically, a codeword cannot be the prefix of another codeword.

Okay, and then there's the Kraft inequality. The inequality says that if I have an encoding where the length l(x) of each codeword satisfies sum_x 2^(-l(x)) <= 1, then: first, every prefix code satisfies this; and second, if I have a set of lengths of that type, then I can find a prefix code for it.

Here's a slightly more illustrated version of how you would prove it. For the forward part, you look at the probability of actually getting a collision. In this case I just created a random string, 1 1 0 ..., and you can see that it collides with c. The probability of a random string colliding with that codeword is 1/8, no matter how long the overall string is, because I'm only looking at prefix collisions; for the codeword that starts with 0, the probability that an arbitrary string collides with it is 1/2. Since these collision events are disjoint for a prefix code and the overall probability of a collision is at most 1, I get the Kraft inequality for prefix codes.

To prove the backward part, suppose I have a set of lengths: one codeword of length 1, one of length 2, one of length 3 and one of length 5 (I just picked the numbers because I'm being lazy), and you can easily see that this satisfies the Kraft inequality. Now, if I want to actually generate a prefix code based on that, what I do is pick the smallest of those numbers first (there might be multiple of them; obviously if I had two of length 1 it wouldn't work, because I would have used up all my probability). So I pick the length-1 codeword, use the code 0 for it, and use the prefix 1 for the rest. Then I subtract one from all the other numbers, so my 2, 3, 5 turns into 1, 2, 4; I pick the smallest again and use 0 for it, so the second codeword is 1 0 with 1 1 as the prefix for the rest, and the remaining set is now 1, 3 because 2, 4 gets shifted. At every step I pull some of my symbols off the list, so at some point I run out of symbols, and at no point would the residual probability ever have increased beyond what I had available. That's why it works.

So the reason for actually having this Kraft inequality is to show that if I pick a prefix code with lengths based on -log_2 p(x), I can actually engineer such a code; that is basically why I need the Kraft inequality for the constructive proof. Picking the lengths as the next largest integer, l(x) = ceil(-log_2 p(x)), is a sensible idea, because ceil(-log_2 p(x)) >= -log_2 p(x), so each term 2^(-l(x)) <= 2^(log_2 p(x)) = p(x), and the sum over x is therefore at most 1. So the Kraft inequality holds, and by the construction we saw before we have actually engineered ourselves a code. That code is optimal to within one bit, because at most I waste one bit between -log_2 p(x) and the next largest integer, and I can turn this into something that wastes only 1/k of a bit per symbol by coding blocks of k symbols at a time.
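The backward construction can be written out directly. Below is a minimal sketch (my own, not the lecturer's code) of the canonical variant of this greedy idea: given codeword lengths that satisfy the Kraft inequality, such as the {1, 2, 3, 5} example above, it assigns binary codewords shortest-first; the exact codewords may differ from the ones read off the board, but they form a valid prefix code.

```python
def prefix_code_from_lengths(lengths):
    """Build binary prefix codewords for the given lengths (canonical construction).

    Works whenever the lengths satisfy the Kraft inequality: sum_i 2**(-l_i) <= 1.
    """
    assert sum(2.0 ** -l for l in lengths) <= 1.0, "Kraft inequality violated"
    codewords, code, prev_len = [], 0, 0
    for l in sorted(lengths):       # assign the shortest codewords first, as in the proof
        code <<= (l - prev_len)     # extend the running codeword to length l
        codewords.append(format(code, "0{}b".format(l)))
        code += 1                   # skip past the subtree this codeword uses up
        prev_len = l
    return codewords

codes = prefix_code_from_lengths([1, 2, 3, 5])
print(codes)                        # ['0', '10', '110', '11100']
# Sanity check: no codeword is a prefix of any other codeword.
for a in codes:
    for b in codes:
        assert a == b or not b.startswith(a)
```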
Now the optimality proof goes through the KL divergence; it's omitted here, but we'll pretty much do it on the very next slide, because we need the KL divergence anyway. Okay, so the Kullback-Leibler divergence is a way of measuring the distance between two distributions, and sometimes people think of this KL divergence as something very, very magical where it's very hard to understand what's really going on; but if you think about it, it's actually a very, very tangible quantity. Let's for the moment pretend that we know the optimal code uses -log p(x), i.e. entropy-many bits (or actually nats, if you take the entropy with the natural logarithm). Then I can ask: if I pick a code based on the wrong distribution, how many bits would I use in expectation? I can therefore measure the distance between the true distribution and the wrong distribution by taking the difference between the optimal number of bits and the number of bits with the wrong code; intuitively that feels like the right thing to do. Now, the wrong number of bits would be -log q(x) symbols to store some x, whereas the correct thing would be -log p(x), so I just take the expectation of that difference: D(p||q) = E_p[-log q(x) - (-log p(x))] = E_p[log(p(x)/q(x))], which is just a really complicated way of saying the expected log of p over q. If you've done measure theory you will be horrified by me being so cavalier here; you would actually want Radon-Nikodym derivatives, something like dP/dQ, but I'm not going to worry about those niceties here; after all, this is an undergrad class, and that's about as worried about the theory as you want to be here anyway.

So what we want to do is prove that this KL divergence (Kullback-Leibler is kind of hard to pronounce, so everybody just says KL divergence) is actually well behaved. The first thing to check is that the divergence between P and P vanishes. That's easy, because it's just the expectation of log(p/p), which is log 1, which is 0. Quick question: why am I calling it a divergence and not a distance? Yes, it's not symmetric. So can you tell me why it's not symmetric? Exactly, yes: because I take the expectation with respect to P. If you think about it, the divergence between P and Q is the number of extra bits that I have to pay to encode data drawn from P with a code built for Q, and the other way around is the number of extra bits I need to pay if the data are drawn from Q and encoded with a code built for P; those two things need not be the same.
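As a quick numeric illustration (my own sketch, with made-up distributions p and q), the snippet below computes D(p||q) in bits, checks that D(p||p) = 0, and shows that the two directions generally differ.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log2(p(x) / q(x)): the extra bits paid, on average,
    when data drawn from p are encoded with a code designed for q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

p = np.array([0.5, 0.25, 0.125, 0.125])   # made-up "true" distribution
q = np.array([0.7, 0.1, 0.1, 0.1])        # made-up "wrong" distribution

print(kl_divergence(p, p))   # 0.0: coding p with its own code costs nothing extra
print(kl_divergence(p, q))   # about 0.17 bits
print(kl_divergence(q, p))   # about 0.14 bits -- not the same, hence "divergence"
```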
Now, the next thing is that we need to show that this KL divergence is non-negative, because if I show that it's non-negative and only vanishes for P equal to Q, then I'm home: I know that this is actually a meaningful measure. Furthermore, if this is the difference between the inefficient and the efficient number of bits, then I know that Shannon's coding theorem holds, with the entropy as the Shannon limit, because I have just shown that any other code will be less efficient than the one I would have picked this way. So what I'm basically doing (this is the last line on the slide) is writing the divergence as minus the expectation under p of log(q(x)/p(x)). Now the logarithm is a concave function, so if I want to pull it out: the negative logarithm, mind you, is of course a convex function, and for a convex function the line between two points lies above the graph of the function; in other words, the average of the values is greater than or equal to the value at the average. This is called Jensen's inequality, and that's exactly the inequality I'm invoking here in order to pull the log out. So I get D(p||q) = -E_p[log(q(x)/p(x))] >= -log E_p[q(x)/p(x)] = -log integral of p(x) q(x)/p(x) dx; the p's cancel out, and then I have minus the log of one, which is zero. And that's just how I prove this: it's very straightforward convexity. Even though the implications are quite deep, the math is actually fairly elementary.

Okay, so now we did all this algebra, but what for? Let's go back to the cross-entropy loss. Remember that the cross-entropy loss between, let's say, y and o, where y now might actually be a probability distribution, is l(y, o) = log sum_i exp(o_i) - y^T o. Now, if I compute the KL divergence between the truth q and softmax(o), which is the distribution I'm estimating, I get sum_i [q_i log q_i - q_i log softmax(o)_i]. I plug in the definition of the softmax, whose log is nothing else than o_i - log sum_j exp(o_j), and by straightforward algebra I get exactly the cross-entropy loss out of it, except that I also have this entropy term of q sitting around; but since that term doesn't actually depend on o, I can drop it. So that's how we get to the cross-entropy loss from the KL divergence.
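To check that last step numerically, here is a small sketch (my own, with a made-up logit vector o and target distribution q): the cross-entropy loss log sum_i exp(o_i) - q^T o equals KL(q || softmax(o)) plus the entropy of q, a constant that does not depend on o.

```python
import numpy as np

o = np.array([2.0, -1.0, 0.5])   # made-up logits
q = np.array([0.7, 0.2, 0.1])    # made-up target distribution (could also be one-hot)

log_softmax = o - np.log(np.sum(np.exp(o)))         # log softmax(o)_i = o_i - log sum_j exp(o_j)
cross_entropy = np.log(np.sum(np.exp(o))) - q @ o   # the loss from the lecture: log-sum-exp minus q^T o
kl = np.sum(q * (np.log(q) - log_softmax))          # KL(q || softmax(o)), in nats
entropy_q = -np.sum(q * np.log(q))                  # the constant that does not depend on o

print(cross_entropy, kl + entropy_q)                # the two agree
assert np.isclose(cross_entropy, kl + entropy_q)
```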
So when people try to intimidate you by telling you that they are doing maximum entropy methods, or that they are using information-theoretic approaches, most of the time they are just adding or removing that constant in order to intimidate people, so don't let them do that to you.

Okay, and now this just barely scratches the tip of the iceberg; there's a lot more to information theory. Cover and Thomas is an awesome book; it's old by now (I remember it was already an old book when I did my PhD) and it's still an awesome book even today. If you search for it you can maybe find some versions online as well. There's also an information theory course that's pretty decent, with some slides on entropy. And then there's the book by the late David MacKay; it's a very nice book, probably a bit more detail than what you need, but if you want to dive into information theory and how it all relates to machine learning, it's a good place. Then there are a whole lot of papers by Martin Wainwright and Mike Jordan; they're both faculty in the CS department and have published extensively on exponential families and other things, and it's very nice work. Okay, and that concludes our discussion of information theory. I hope I didn't scare you too much so far.