tsujigiri

The editorial comments of Chris and James, covering the news, science, religion, politics and culture.

"I'd take the awe of understanding over the awe of ignorance any day." -Douglas Adams

Tuesday, June 10, 2003

Months ago, I started this blog on a whim because I was annoyed with intelligent design creationism and wanted to vent. I also wanted to talk about science-related things, especially information theory and evolution. It seems that the most popular theme on Corpse Divine right now is music (as evidence: the lively discussion of Blur vs Radiohead, compared to the tumbleweeds blowing through other posts). In this post, I plan to address all of those topics, which will make this either a really cool post or another obnoxiously over-extended one.

In this month's issue of Scientific American, there's an article about tracing the evolutionary history of chain letters. The authors collected a few dozen chain letters spanning several years and analyzed them using what they call a relatedness measure. By measuring the relatedness of each pair of chain letters, they were able to arrange the letters into groups of common ancestry, and to infer which versions were the oldest. The same method was also used to identify the common ancestry of different mammals by measuring the relatedness of their genomes. I thought it was pretty cool. The method has been extended to things like detecting plagiarism and detecting spam email.

Their method works like this (you can skip this part if the math makes you sleepy): let X and Y be two files whose relatedness we want to measure. The complexity of X, written K(X), is roughly the size of the file X after it has been compressed with a good compression algorithm (like zip). The joint complexity of X and Y, written K(XY), is the size of X and Y when they have been compressed together as a single file. The relatedness R is
R = {K(X) + K(Y) - K(XY)} / K(XY)
When X and Y are totally different, compressing them together saves nothing, so K(XY) = K(X) + K(Y) and R=0. When X and Y are identical files, then (approximately) K(XY) = K(X) = K(Y), so R=1. This is because a good compression algorithm works by finding repeated patterns and replacing them with shorter descriptions. For example, if X is a long document and Y=X, then to create the combined document XY I only need to write down X, followed by an instruction to repeat everything. So the compressed size of XX is roughly the same as the compressed size of X.

Anyway, here's what I did with it. I wrote my own program to scan through a folder and measure the relatedness of all the files in it. I used it to trace the history of revisions to a bunch of programs I wrote a few months ago. The results made it easy to spot the places where I had branched from one approach to another. I could also tell when I had eliminated a file by dividing its functions among other files: the other files all had a 20% relatedness to the original file. I could also instantly spot the version in which I updated the coding style of an important module. It was an afternoon of unadulterated geek excitement.

Then I thought, I wonder if there is any way to apply this to music or photos? My program used gzip to do the compression. This is a very inappropriate method for music files, but I thought I'd give it a shot just to see what happened. Recalling the lawsuit brought by Wire against Elastica, I decided to compare Elastica's "Connection" with Wire's "Three Girl Rhumba". To make the study scientific, I collected a bunch of other songs, including other Elastica and Wire songs, some Ween, Frank Sinatra, Ben Folds Five, etc. As expected, the results were a bit counterintuitive. "Three Girl Rhumba" did not compare well with "Connection." Some songs which did rate highly with "Connection":
  • Various other Wire songs besides "Three Girl Rhumba."
  • The Gourds, "Gin and Juice" (a bluegrass rendition of a hip-hop song).
  • Milli Vanilli, "Blame It on the Rain" (far and away the highest-rated song when judged against "Connection").
Grounds for another lawsuit? I think so.

So the method, in its current form, doesn't work well on music files (and is pretty much guaranteed not to work on photos either). But it works extremely well on text files, and on information which can be expressed as a string of letters, such as DNA.

The intelligent design people (and their variously named predecessors) love to claim that "intelligent design" is something which can be inferred. Their methods are always vague and their arguments bogus. Relatedness measures provide precise, well-defined scientific tools which can be used to infer common ancestry on the basis of information theory alone. This method allows us to actually measure evolution. In so doing, it provides an elegant demolition of intelligent design theories.
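
For anyone who wants to play with the relatedness measure, here is a minimal Python sketch. It is not the program described in this post, just one way the formula and the folder scan might look. It assumes zlib as a stand-in for "a good compression algorithm" (zlib uses the same DEFLATE method as gzip), and the names complexity, relatedness and scan_folder are made up for illustration.

```python
import zlib
from itertools import combinations
from pathlib import Path


def complexity(data: bytes) -> int:
    """Approximate K(X) as the size in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))


def relatedness(x: bytes, y: bytes) -> float:
    """R = (K(X) + K(Y) - K(XY)) / K(XY), where XY is X followed by Y."""
    kx, ky, kxy = complexity(x), complexity(y), complexity(x + y)
    return (kx + ky - kxy) / kxy


def scan_folder(folder: str) -> list[tuple[float, str, str]]:
    """Score every pair of files in a folder, most related pair first."""
    files = sorted(p for p in Path(folder).iterdir() if p.is_file())
    contents = {p: p.read_bytes() for p in files}
    scores = [
        (relatedness(contents[a], contents[b]), a.name, b.name)
        for a, b in combinations(files, 2)
    ]
    return sorted(scores, reverse=True)
```

Calling scan_folder on a directory of source files returns (R, name, name) triples, so the most closely related pairs come first. One caveat: DEFLATE only looks back about 32 KB for repeated material, so with this compressor two large identical files will score well below 1. That is probably fine for chain letters and small programs, and one more reason not to expect much from it on music.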
