The New Data Revolution – Chomsky Against The Real World | Uncle Guido's Facts

Yarden Katz, in a long but thorough piece in The Atlantic (11.1.12), has written about the conflict between traditional science and the new statistics-based models of investigation, which are concerned not so much with the nature or origin of phenomena as with their results. We do not have to understand how a gene works, for example; we can mine billions of bits of data to record what that gene, or combinations of genes, produce in the way of disease:

The sequencing revolution has just begun and a staggering amount of data has already been obtained, bringing with it much promise and hype for new therapeutics and diagnoses for human disease. For example, when a conventional cancer drug fails to work for a group of patients, the answer might lie in the genome of the patients, which might have a special property that prevents the drug from acting. With enough data comparing the relevant features of genomes from these cancer patients and the right control groups, custom-made drugs might be discovered, leading to a kind of "personalized medicine." Implicit in this endeavor is the assumption that with enough sophisticated statistical tools and a large enough collection of data, signals of interest can be weeded out from the noise in large and poorly understood biological systems.

On the surface this may seem obvious to most of us brought up and living in the computer age. We are aware that the computer enables the programmer to collect unimaginable amounts of data and assemble them in meaningful ways. A Google search is a perfect example. All it takes is a few keystrokes for the software to anticipate what the final words of a search will be; and based on the millions of similar inquiries made about that particular subject, it can list those sites most likely to answer it. Google doesn’t care who you are personally, but is only interested in figuring out – in milliseconds – how you fit within a larger data-group. If millions of searchers who type in the letters ‘W’ and ‘E’ end up typing ‘Weather’, then it can anticipate your search before you have completed your entry and can immediately direct you to a site preferred by hundreds of millions of other searchers.
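The idea in the paragraph above can be sketched in a few lines of Python. This is only a toy illustration, not how Google actually works: the function name `suggest` and the tiny `query_log` are made up for the example, and the only "intelligence" is counting how often earlier searches began with the same letters.

```python
from collections import Counter


def suggest(prefix, past_queries, k=3):
    """Return the k most frequent past queries that start with `prefix`.

    Hypothetical helper for illustration only: it knows nothing about
    the searcher, just how often each completion was typed before.
    """
    prefix = prefix.lower()
    counts = Counter(
        query.lower()
        for query in past_queries
        if query.lower().startswith(prefix)
    )
    return [query for query, _ in counts.most_common(k)]


# A made-up log of earlier searches (assumption, not real data).
query_log = (
    ["weather"] * 5
    + ["wedding dresses"] * 3
    + ["website builder"] * 2
    + ["news"]
)

print(suggest("we", query_log))
```

Typing ‘we’ brings ‘weather’ to the top of the list simply because it is the most common completion in the log, which is the whole point: no theory of the searcher is needed, only the counts.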

What many of us do not know is that this data-mining ‘sequencing revolution’ is truly revolutionary, for it rejects the idea that we have to go back to first principles and understand the nature of something and how it works before we take the next step of figuring out how to modify it.

At a recent MIT conference, “Brains, Minds, and Machines” (5.11), some of the world’s most renowned experts met to discuss the relationships among these three components of modern life. One of the most interesting debates was between Noam Chomsky, one of the pioneers of cognitive theory who rejected Skinner’s Behaviorism, and Peter Norvig of Google, who argued strongly for a neo-Skinnerian model of data mining.

Noam Chomsky and others worked on what became cognitive science, a field aimed at uncovering the mental representations and rules that underlie our perceptual and cognitive abilities. Chomsky and his colleagues had to overthrow the then-dominant paradigm of behaviorism, championed by Harvard psychologist B.F. Skinner, where animal behavior was reduced to a simple set of associations between an action and its subsequent reward or punishment.

Chomsky has always argued that the only way to understand language is to probe the inner workings of the brain – in other words, to find the locus and operational dynamics of language and to understand it from this fundamental point of view – while Norvig said that language can be understood simply by tracking its use over trillions of sentences. In other words, we do not need to know the origin of language or how it results from configurations in the brain. We only need to observe language as it is actually used to understand what it comprises and what the similarities and differences among human languages are, and, by mining those data, to produce language software sophisticated enough to mimic native speakers. Not only that – and to the real point of the MIT conference – we can enable computers to act human, to be artificially intelligent.

Quite apart from the fact that data sequencing is producing revolutionary results in every scientific field, the very fact that traditional science is being so completely upended is even more revolutionary.

Just as the computing revolution enabled the massive data analysis that fuels the "new AI", so has the sequencing revolution in modern biology given rise to the blooming fields of genomics and systems biology. High-throughput sequencing, a technique by which millions of DNA molecules can be read quickly and cheaply, turned the sequencing of a genome from a decade-long expensive venture to an affordable, commonplace laboratory procedure. Rather than painstakingly studying genes in isolation, we can now observe the behavior of a system of genes acting in cells as a whole, in hundreds or thousands of different conditions.

This reliance on data rather than more personalized inquiry is having an impact in many areas, not just hard science. Crowdsourcing is now being looked at as a way of eliminating polling, pollsters, and pundits. If you collect data from a large enough population on a particular issue, the results have been shown to be far more accurate than any ‘expert’ analysis. A simple example taken from my recent blog post (http://www.uncleguidosfacts.com/2012/11/crowdsourcing-and-predictive-markets.html) is the famous gumball experiment. Researchers asked both mathematicians and a large random sample of ordinary Americans to guess the number of gumballs in a large glass container; and in all trials the average number guessed by the random sample was remarkably close to the actual number and far more accurate than the ‘educated’ guess of the experts.
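Why averaging many guesses can work is easy to see in a small simulation. This is a sketch under stated assumptions, not a re-creation of the gumball experiment: the function name `crowd_vs_individual`, the jar size, and the spread of the guesses are all invented, and each guess is assumed to be independent and wrong in no consistent direction. It shows only that such errors tend to cancel in the average; it says nothing about how real experts would do.

```python
import random
import statistics


def crowd_vs_individual(true_count, n_guessers, spread, seed=0):
    """Compare the crowd's average error with a typical individual's error.

    Assumption: every guess is the true count plus independent Gaussian
    noise with standard deviation `spread` (no shared bias).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    guesses = [rng.gauss(true_count, spread) for _ in range(n_guessers)]

    # Error of the crowd: how far the average guess is from the truth.
    crowd_error = abs(statistics.mean(guesses) - true_count)

    # Error of a typical guesser: the median individual miss.
    individual_errors = [abs(guess - true_count) for guess in guesses]
    typical_error = statistics.median(individual_errors)

    return crowd_error, typical_error


# Made-up numbers: a jar of 1,000 gumballs and 1,000 noisy guessers.
crowd_error, typical_error = crowd_vs_individual(
    true_count=1000, n_guessers=1000, spread=300
)
print(f"crowd average misses by {crowd_error:.1f}")
print(f"typical individual misses by {typical_error:.1f}")
```

Under these assumptions the high guesses and the low guesses largely cancel, so the average lands much closer to the true number than a typical single guess does. If everyone were wrong in the same direction, the average would be wrong too.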

Crowdsourcing has been used to predict the outcomes of political elections, the likelihood of a world event happening, and the results of football games. I no longer pay attention to the sports analysts’ NFL predictions reported in the Saturday papers; I look instead at ‘Readers’ Predictions’, for these are based on large samples.

So, the legacy of B.F. Skinner has been rehabilitated. He knew that it was enough to cure a patient of agoraphobia. He did not have to know why some obscure childhood incident provoked it. No one cares anymore about the origin of human language. We only want intelligent search engines, websites that can anticipate our buying habits, cars that understand our verbal commands. We don’t care how many degrees Dr. Know-It-All has or what he says, we just want to know the right answer, or at least the most likely one.

In the social world subjectivity is out. No one needs to know what ‘You’ think; just what ‘millions’ think. There is no longer any need for Creatives in ad agencies – millions of random people can (and have) come up with brilliant sales pitches. Forget the deliberations of Wolfowitz and the Neocons – millions of crowd-sourced Americans could have told George W that it was a bad idea.


Friday, November 2, 2012