
Thursday, August 6, 2009

ILL ESOTERICS: From Procreation to Text Generation

By THOM HAWKINS

In 1202, Leonardo Pisano (~1170-1240, aka Fibonacci) wrote Liber Abaci, a Book of Calculations that included a math problem related to fornicating rabbits. If you start with a pair of rabbits (one male and one female) who have just been born, and assume they reach sexual maturity after one month and have a gestation period of one month, how will this population of rabbits grow?

At the start of the experiment, we have one pair. After one month, they are sexually mature, and do what bunnies that age do. A month later, the female gives birth to a litter of two: again, one male and one female. (Okay, a few more assumptions: each time a rabbit gives birth, the litter will consist of one male and one female, both of whom survive and mate. Yes, yes, incest and the gene pool and all that, but that's how you get freaks like the Easter Bunny, Bunnicula, and the Killer Rabbit of Caerbannog.) So far, we have one pair at the beginning, still one pair after the first month, and two pairs after two months. In the third month, the first couple has another litter, but the second couple is just reaching sexual maturity and doing the horizontal bunny hop. And so on: 1, 1, 2, 3, 5, 8, 13, etc. Oh yeah, one more assumption: rabbits never die.

This series of numbers, not invented or discovered by Fibonacci but merely popularized in his math text, has over the centuries taken on cultural proportions (The Da Vinci Code, Fibonacci trading), although the numbers themselves are quite arbitrary and reproductively unrealistic. Leonardo's lagomorphs are so reliable that the series is often read simply as each number being the sum of the two previous numbers, the assessment of sexual maturity notwithstanding.

The Shao sequence takes two whole numbers and finds the difference between them, placing that number on each side of the initial pair, and working outward; for example, given a,b: a-b, a, b, a-b; then (a-b) - a, a-b, a, b, a-b, (a-b) - b, etc.
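The two rules described so far can be sketched in a few lines of Python. This is only an illustration: the function names `rabbit_pairs` and `shao` are made up for this sketch, and it assumes that each "difference" in the Shao sequence is taken as a positive (absolute) value.

```python
def rabbit_pairs(months):
    """Count rabbit pairs at the start and after each month.

    Newborn and mature pairs are tracked separately, so the
    sexual-maturity delay is modeled instead of being assumed away.
    """
    newborn, mature = 1, 0  # one just-born pair, no breeding pairs yet
    counts = [newborn + mature]
    for _ in range(months):
        # Each pair that was already mature delivers one new pair;
        # last month's newborns join the mature group. Nobody dies.
        newborn, mature = mature, mature + newborn
        counts.append(newborn + mature)
    return counts


def shao(a, b, rounds):
    """Grow the sequence outward from the seed pair (a, b).

    Assumption: differences are absolute values. Each round adds the
    difference of the two outermost numbers to each end of the list.
    """
    seq = [a, b]
    for _ in range(rounds):
        left = abs(seq[0] - seq[1])
        right = abs(seq[-1] - seq[-2])
        seq = [left] + seq + [right]
    return seq


print(rabbit_pairs(6))      # [1, 1, 2, 3, 5, 8, 13]
print(shao(2, 3, 1))        # [1, 2, 3, 1]
print(shao(2, 3, 5)[-3:])   # [1, 1, 0]
```

Tracking newborn and mature pairs separately produces the same totals as adding the two previous numbers, which is why the shortcut works.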
In numbers, given 2,3: 1, 2, 3, 1; then 1, 1, 2, 3, 1, 2, etc. (taking each difference as a positive number). No matter which two whole numbers are used at the start, the series will eventually begin repeating a three-number cycle: 1, 1, 0, 1, 1, 0, etc., when the starting numbers share no common factor, or the same pattern scaled by their greatest common divisor when they do. This demonstrates that given a random set, if a transformation is applied consistently, order will result.

Both the Fibonacci sequence and the Shao sequence involve two elements: a rule (or set of rules) and a seed. Both sequences can also be treated as examples of a Markov chain, provided the "present state" is taken to be the most recent pair of numbers. Named for Russian mathematician Andrey Markov (1856-1922), a Markov chain is a sequence in which the next state depends only on the present state, and not on any past state. That is, by examining the present state and any applicable rules, one can produce the next state without examining the history of the sequence.

I first encountered the concept in Nicholson Baker's Double Fold: Libraries and the Assault on Paper: "As director of MIT's Operations Research Center, [Philip Morse] had the idea of applying OR's mathematical methods to the workings of MIT's library; out of that grew Morse's thickly mathematical treatise, Library Effectiveness (1968), which uses a technique called Markov analysis to determine whether a book of a particular age and number of previous circulations is likely to remain useful; in order to gather detailed circulation statistics, Morse wanted to computerize the library. The modern library, he felt, 'cannot now be operated as though it were a passive repository for printed material.'"

There is a bit of a disconnect in this passage, because it refers to the book's previous circulation, which is an examination of the past. However, it reads differently now in light of recommendation engines, which can base selections not only on the past purchases of the user in question, but on purchases made by all users in conjunction with the current book. This collective knowledge is referred to as a corpus, and it provides the basis for Markov analysis.

The letter E is the most common in the English language.
This fact is based on analysis of any significantly sized English-language corpus. Each letter has a relative frequency of appearance, as analyzed by Claude Shannon in his seminal paper, A Mathematical Theory of Communication. Letter frequency is also the basis for the letter points in Scrabble: the more frequently a letter is used in English, the lower its point value. In addition, by analyzing a corpus, one can determine how frequently a letter appears in context with other letters. By increasing the context, that is, the number of preceding letters taken into account, one can generate increasingly intelligible text.

Apellicon of Teos, attempting to reconstruct the worm-eaten works of Aristotle, aroused ire by filling in the gaps based on his own scholarly judgment. Given a large enough corpus of Aristotle's work, the gaps could have been at least informed by, if not written by, Aristotle himself (Basbanes, A Gentle Madness, pp. 65-66). William Henry Ireland, who forged papers and ultimately an entire play attributed to Shakespeare, might have proven more difficult to detect had he reverse engineered some of the very techniques used to identify fraudulent works (Basbanes, A Gentle Madness, p. 66). In fact, the composer David Cope has done something like this, examining the corpora of Mozart and Bach to generate "new works" by these composers.

Not bad for animal husbandry.
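For the curious, the letters-in-context idea can be sketched as a tiny character-level text generator. This is a minimal illustration, not Shannon's or Cope's actual method: the names `build_model` and `generate` and the toy corpus are invented here, and `order` (assumed to be at least 1) stands for the number of preceding letters used as context.

```python
import random
from collections import defaultdict


def build_model(corpus, order=2):
    """Map every run of `order` characters to the characters that follow it."""
    model = defaultdict(list)
    for i in range(len(corpus) - order):
        context = corpus[i:i + order]
        # Repeats are kept on purpose, so frequent followers are drawn more often.
        model[context].append(corpus[i + order])
    return model


def generate(model, seed, length=80, rng=None):
    """Extend `seed` one character at a time, looking only at the current context.

    The seed's length is taken as the context size, so it should match
    the `order` the model was built with.
    """
    rng = rng or random.Random()
    order = len(seed)
    out = seed
    while len(out) < length:
        followers = model.get(out[-order:])
        if not followers:  # this context never appeared in the corpus
            break
        out += rng.choice(followers)
    return out


corpus = "the rabbit and the hare and the hat sat on the mat"
model = build_model(corpus, order=2)
print(generate(model, "th", length=40))
```

Raising `order` makes the output hew more closely to the corpus, while a small `order` produces letter salad. The generator never consults anything but the last `order` characters, which is the Markov property at work.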
