Thu, 09/02/2010 - 15:33 — npatwari

I just finished dusting off some Matlab code that estimates the entropy of English character sequences from a text source. In my opinion, this is a good tool for teaching entropy rate. One might use the same idea to calculate the entropy rate of another language, or of another discrete-valued data source, such as numerical data or Twitter tweets. My code isn't particularly smart: its storage and computation grow as $L^N$, where $L$ is the number of characters considered and $N$ is the character sequence length. I'm sure someone more adept at programming could implement a more efficient version (perhaps with a hash table?). However, the code does work, and it computes the entropy for a sequence of characters (I've tested up to 4) from a given text file. I used Shakespeare's *Romeo and Juliet* and found per-character entropies of 4.12, 3.73, 3.35, and 2.99 for $N=$ 1, 2, 3, and 4, respectively. Info on how this is done is in my lecture 4 notes from today's Advanced Random Processes class; the letter entropy Matlab code and the Shakespeare text are also posted.
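For readers who want to try the hash-table idea, here is a minimal sketch in Python rather than Matlab. It is not the posted code. It assumes the per-character figure is the plug-in block entropy divided by the block length, $\hat{H}_N / N$ with $\hat{H}_N = -\sum_b \hat{p}(b) \log_2 \hat{p}(b)$ taken over the length-$N$ blocks $b$ that actually occur in the text. The function names `normalize` and `per_char_entropy`, the 27-symbol alphabet (a-z plus space), and the use of overlapping blocks are all assumptions made for illustration; the original code may handle these differently.

```python
from collections import Counter
from math import log2


def normalize(raw):
    """Map text onto an assumed 27-symbol alphabet: a-z plus a single space."""
    # Assumption: case is folded and every non-letter becomes a space.
    kept = (ch if "a" <= ch <= "z" else " " for ch in raw.lower())
    # Collapse runs of spaces and trim the ends.
    return " ".join("".join(kept).split())


def per_char_entropy(text, n):
    """Plug-in estimate of the entropy per character (in bits) of length-n blocks."""
    if n < 1 or len(text) < n:
        raise ValueError("need n >= 1 and at least n characters of text")
    # Hash table of overlapping n-character blocks. Only blocks that occur
    # are stored, so memory depends on the text, not on all L**n possibilities.
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    # Empirical block entropy: -sum p(b) * log2 p(b) over observed blocks b.
    block_entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    # Divide by the block length to get a per-character figure.
    return block_entropy / n
```

To use it, read a text file into a string, pass it through `normalize`, and call `per_char_entropy` on the result for each $N$ of interest. Because the dictionary only holds blocks that appear, the number of entries can never exceed the length of the text, however large $L^N$ gets. The estimates for larger $N$ are still biased low on a finite text, since many possible blocks are never observed.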
