L5/6 Information Theory - YouTube - Dive into Deep Learning, UC Berkeley Open Course - P1 (Probability) - Cupoy

Dive into Deep Learning, UC Berkeley, STAT 157. Original video: https://www.youtube.com/watch?v=Ggh3JPGQoxw&list=PLZSO_6-bSqHQHBCoGaObUljoXAyyqhpFW&index=25

Information theory is a branch of applied mathematics, electrical engineering, and computer science concerned with the quantification, storage, and communication of information. It is used to find fundamental limits on signal processing and communication operations, such as data compression, reliable storage, and data transmission. The "entropy" discussed in this lecture is a key measure of information, usually expressed as the average number of bits needed to store or transmit one symbol of a message. Entropy measures the amount of uncertainty involved in predicting the value of a random variable. For example, specifying the outcome of a coin flip (two equally likely outcomes) provides less information (lower entropy) than specifying the outcome of a die roll (six equally likely outcomes).

Subtitles [machine auto-translated]

Okay, so remember that before we had terms like cross-entropy loss thrown around, and the question is: what does that even mean? This is really the super-light, skim-milk version of information theory — I'm going to cover just enough to make sense of the other things we are doing.

Entropy. The question Shannon started out with is: say I have some data source producing observations x1 through xn — how much information is there in this source? Can we quantify it in some meaningful way? For instance, if I have a fair coin, at each step it surprises you whether it comes up heads or tails. If I roll a fair die, I get six possible outcomes, and there should be more information in that than in tossing a coin. And maybe tossing two coins has more information than a die — or maybe not; you can argue about it, it's kind of tricky. Or think of it this way: I could take a picture of a white wall versus a picture of this lecture theater, and obviously a picture of all of you contains a lot more information than just a picture of a white wall.
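The coin-versus-die comparison the lecturer gestures at can be made precise: a source with k equally likely outcomes carries log₂ k bits. A minimal sketch (an illustration, not part of the lecture; `uniform_entropy_bits` is a hypothetical helper name):

```python
import math

def uniform_entropy_bits(k: int) -> float:
    """Entropy in bits of a source with k equally likely outcomes."""
    return math.log2(k)

print(uniform_entropy_bits(2))  # one fair coin: 1.0 bit
print(uniform_entropy_bits(6))  # one fair die: ~2.585 bits
print(uniform_entropy_bits(4))  # two fair coins (4 outcomes): 2.0 bits
```

So two coin tosses (2 bits) actually carry less information than one die roll (about 2.58 bits), which settles the "maybe not" above.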
The nice way Shannon formulated this is to say that the entropy is the minimum number of bits needed to store the data. Now, how do we formalize that? This is Shannon's ingenious definition: he defined the entropy H(P) to be minus the sum over all outcomes j of p_j times log p_j — in other words, the expected value of minus log P. And then there is the very famous source coding theorem: the entropy is a lower bound on the number of bits (or, with base e, nats) that I need. We're going to prove this in this class; it's probably one of the more complex proofs we will do, but afterwards the entropy will feel a lot less mysterious than it would otherwise.

A couple of nice properties first. The entropy itself is concave: p log p is convex, because it bends up beyond the linear function, so minus the sum of p log p terms is concave.

But before we do that, let's look at a couple of examples. Take a fair coin: its entropy is minus 1/2 log₂(1/2) minus another 1/2 log₂(1/2), so one bit — and indeed, if I wanted to encode ten coin tosses on a computer, I could just do that with a ten-bit sequence. Now take a biased coin that comes up heads with 90% probability and tails otherwise. If you work out the numbers, that's 0.9 log₂(1/0.9) plus the corresponding 0.1 log₂(1/0.1).
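The two coin calculations above can be sketched in a few lines (an illustrative check, not from the course materials; `entropy_bits` is an assumed helper name):

```python
import math

def entropy_bits(probs):
    """Shannon entropy H(P) = -sum_j p_j * log2(p_j), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy_bits([0.9, 0.1]))  # biased coin: ~0.469 bits
```

The biased coin needs less than half a bit per toss on average, which is exactly why it should be cheaper to store.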
That comes out to about 0.47 bits. So you can already see intuitively that a coin which mostly spits out, say, tails should be easier to store as a sequence than one that produces outcomes completely at random. And if you've played Dungeons and Dragons, you know they don't muck around with regular dice — they use a 20-sided die, and each roll of that die requires about 4.3 bits to store. So there's a fair amount more information in each roll of such a die than in a standard coin toss.

In order to prove the theorem, we need to look a little at what prefix codes are, and at something called the Kraft inequality. [In response to a question] Yes — that's correct; you're absolutely right, thank you.

So now, let's look at the Kraft inequality. A code is a so-called prefix code if I can map every symbol x into a code word — say, a string of zeros and ones with some length l(x) — such that no code word is a prefix of another code word. For instance, I couldn't have "dog" as one code word and "doghouse" as another: that wouldn't work, because "dog" would be a prefix of "doghouse". Prefix codes are very nice because they're very easy to decode: I scan along, find a complete code word, look it up, and then go and pick out the next one. So, for instance, if you look at the first example, that's not a prefix code, because the code for A would be a prefix of
the code for B, the code for B a prefix of the code for C, and the code for C a prefix of the code for D — so that's pretty bad. On the other hand, on the right-hand side, I encode A as 0. Now the only option I have left for B is to start with 1, so I pick 10; in that case I've burned up the code words starting with 10 as well, so for C I can only do 110, and for D I'm just left with 111. Okay, so that would be a prefix code.

Now here's a really cool inequality, the so-called Kraft inequality, which says: we have a prefix code if and only if 1 is greater than or equal to the sum over all x of 2 to the minus the length of that code word. It's a pretty powerful inequality, and to prove it is actually not that hard.
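The code just constructed, {A: 0, B: 10, C: 110, D: 111}, makes a handy sanity check of both the prefix property and the Kraft inequality (an illustrative check, not part of the lecture):

```python
code = {"A": "0", "B": "10", "C": "110", "D": "111"}
words = list(code.values())

# Prefix property: no code word is a prefix of another.
is_prefix_free = all(
    not w2.startswith(w1)
    for i, w1 in enumerate(words)
    for j, w2 in enumerate(words)
    if i != j
)
print(is_prefix_free)  # True

# Kraft inequality: sum of 2^(-len(w)) must be <= 1.
kraft_sum = sum(2 ** -len(w) for w in words)
print(kraft_sum)  # 1/2 + 1/4 + 1/8 + 1/8 = 1.0
```

Meeting the bound with equality means the code is complete: the construction above used up the entire budget, with no spare code words left.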
到該字符串的負長度它是一個隨機 binary string right we say how you know on this you're in 1 then the probability 二進製字符串 對我們說你如何知道你在 1 那麼 would be 1/2 or 1/2 you know 0 0 then you know the probability would be 1/4 概率是 1/2 或 1/2 你知道 0 0 然後你知道概率會 是 1/4 because I have now two symbols and so on and since 1 the very left-hand side of 因為我現在有 兩個符號等等,從 1 開始, this we have as an upper bound one we've just proven that if we have a prefix 我們有一個上界作為上界,我們剛剛證明瞭如果我們有一個前綴 code then the inequality holds 代碼,那麼現在不等式成立, now comes the heart a bit to prove that if the inequality holds then we can 現在有點想證明如果 不等式成立,那麼我們 actually engineer prefix code and this is about the simplest proof I can come 實際上可以設計前綴代碼,這是我能想出的最簡單的證明 up with there are lots of other slightly more elaborate proofs for instance ,還有很多其他稍微複雜的證明,例如 actually Wikipedia has one but that may not be the simplest one anyway this is 實際上維基百科有一個,但無論如何這可能不是最簡單的,這是 the simplest I could come up with so we're actually going to construct a 我最簡單的 可以想出,所以我們實際上要以 prefix code explicitly recursively so what I'm going to do is I'm going to 遞歸方式顯式構造一個前綴代碼,所以我要做的是 pick the set of eggs which have the shortest sequence well which have the 選擇具有最短序列的雞蛋集合,其中 smallest L over L of X okay so I know that for those guys please you know to L 比 L 最小 X 好的,所以我知道對於那些人來說,請你知道 lift in some inside of that sum is one right so I know that for those guys I 在那個總和中取出一些是對的,所以我知道對於那些人,我 can find a corresponding you know binary string and make them all unique quite 可以找到一個對應的你知道的二進製字符串,然後很容易地使它們都是唯一 easily all right and then I have some rest of all the probabilities now what I 的,然後 我有一些資源 現在我 do is let's say for instance the length of the string is 3 so I get 2 to the 要做的就是假設字符串的長度是 3,所以我得到 2 到 minus 3 right so I get 1/8 and maybe I have 5 of those guys so I have 3 left 負 3,所以我得到 1/8,也許我有 5 個這樣的人,所以我現在還剩下 3 個 now I take the remaining probabilities which have to be less than 3/8 and split 我取剩餘的概率必須小於 3/8 並將 them up into groups of 188 I don't know corresponding for that and now I 它們分成 188 
multiply everything within each group by 2³, which just recurses my algorithm: each remaining chunk of probability gets its own unique 3-bit prefix, and then the same procedure is applied to the rest. In other words, all I'm doing is working my way from the head to the tail of my set of numbers, and whenever I reach the next-shortest length I generate a prefix and apply the same mechanism to the remainder. This way, at some point I run out of things to encode, and since at no step did I exhaust more than my budget of summing up to one, I'm done. One caveat: if I have an infinite number of symbols, this proof doesn't work, but for a finite alphabet it works.

So why do we need all of this? For a very simple reason: now we are going to prove the source coding theorem, namely that the entropy is a lower bound on the number of bits. The first thing is to generate a prefix code with lengths l(x) = ⌈−log₂ p(x)⌉ — that is, I take the binary logarithm of the probability of every event and round it up to the next integer.
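These ⌈−log₂ p(x)⌉ lengths automatically satisfy the Kraft inequality, which can be checked numerically (an illustrative sketch; the four-outcome distribution is an assumption, not from the lecture):

```python
import math

def shannon_code_lengths(probs):
    """Code lengths l(x) = ceil(-log2 p(x)) for each probability."""
    return [math.ceil(-math.log2(p)) for p in probs]

probs = [0.5, 0.25, 0.15, 0.1]  # example distribution (assumed for illustration)
lengths = shannon_code_lengths(probs)
print(lengths)  # [1, 2, 3, 4]

kraft_sum = sum(2 ** -l for l in lengths)
print(kraft_sum <= 1)  # True, so a prefix code with these lengths exists
```

Because the sum is at most 1, the recursive construction above guarantees a prefix code with exactly these lengths.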
Now 2 to the minus those lengths, summed over all symbols, has to come out less than or equal to 1, because all I've done is round up. Let me write it out — it's easier to explain. The sum over x of p(x) is 1, which is nothing other than the sum over x of 2^(−(−log₂ p(x))) — nothing special has happened — and that is of course greater than or equal to the sum over x of 2^(−⌈−log₂ p(x)⌉), since rounding up the exponent can only shrink each term. And this is exactly where the Kraft inequality kicks in: it says that if this sum is at most 1, I can always find a prefix code with those lengths.

So what does that show? We also know that this last sum is itself greater than or equal to 1/2 — why does that lower bound hold? If I go from −log₂ p(x) to ⌈−log₂ p(x)⌉, those two numbers never differ by more than one, since I just round up to the next integer. So each term 2^(−⌈−log₂ p(x)⌉) can never be less than one half of 2^(−(−log₂ p(x))) = p(x); and since that holds term by term and the unrounded sum is 1, the rounded sum is at least one half. That's what it means to say this code is within one bit of optimal — because the theorem claims the cost is exactly −log₂ p_j, not the next integer up.
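The "within one bit" claim is easy to check numerically: the expected length of the ⌈−log₂ p⌉ code always lies between H(P) and H(P) + 1 (an illustrative check, not from the lecture; helper names are assumptions):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_shannon_length(probs):
    """Expected code length using lengths ceil(-log2 p)."""
    return sum(p * math.ceil(-math.log2(p)) for p in probs)

probs = [0.9, 0.1]  # the biased coin from earlier in the lecture
H = entropy_bits(probs)
L = expected_shannon_length(probs)
print(H, L)  # H ~ 0.469, L = 1.3; indeed H <= L < H + 1
```

For the biased coin the gap is large (1.3 bits spent per toss versus 0.47 bits of entropy), which is exactly the overhead that combining symbols into k-tuples, discussed next, drives down.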
Okay, now how do we go from −log₂ p_j to making this work exactly? What we can do is combine the data into k-tuples, which drives all the probabilities down to something much smaller. And since I'm now combining k symbols, the rounding error of up to one bit gets split over k symbols, so the per-symbol overhead drops to 1/k rather than 1. If I make those k-tuples long enough, everything works out fine.

The optimality part — that there is no better code — goes via the Kullback–Leibler divergence, and we'll do that on Thursday. Just one quick aside: you might think this is actually really nice — coding theory is easy, so why don't we have optimal codes everywhere? After all, I could just go and design some encoding algorithm. The problem is that to make things work really well you need very long sequences, and very long sequences make for very expensive decoders. That's where, for instance, turbo codes come in. If at some point you want to take a graduate-level information theory class, they'll cover turbo codes and low-density parity-check codes and all of that in a lot more detail.

Okay, so thanks for today. The homework will be up later tonight, including the solutions for two weeks ago. Good luck — see you on Thursday.