"Whenever I go into a restaurant, I order both a chicken and an egg to see which comes first"

Saturday, November 24, 2012

Big Data–The Hottest New Idea of 2012

Billions of bits, bytes, megabytes, terabytes, and godzillabytes are being generated and stored every day.  Think of the millions of surveillance cameras, weather sensors, traffic monitors, Geiger counters, and seismic instruments there are in the United States alone.  Add to that the unimaginable number of emails, tweets, and Facebook posts that are transmitted and received every day.  Then add all the text messages, Internet searches, and GPS systems, and the data collected by Toyota, Honda, and Ford from the onboard computer of every vehicle brought in for service.  Then add the data captured and recorded by Amazon or Google from Internet searchers and online consumers.  The amount of data being collected and stored today is staggering.

The problem is how to use it.  Such huge databases allow for more sophisticated analysis of patterns and trends than smaller ones.  Data drawn from one large database can produce more reliable and useful correlations than the same amount of data collected from separate individual sources.  The following example is illustrative:

Tobias Preis et al. used Google Trends data to demonstrate that Internet users from countries with a higher per capita gross domestic product (GDP) are more likely to search for information about the future than information about the past. The findings suggest there may be a link between online behavior and real-world economic indicators.

The results hint that there may potentially be a relationship between the economic success of a country and the information-seeking behavior of its citizens captured in big data. (Wikipedia)

In other words, Preis and his colleagues had access to Google search data from countries around the world and isolated just one factor: future-oriented inquiries.  These could be anything from economic and financial predictions to the likelihood of scientific breakthroughs or the likely impact of global warming.  The data suggested either that economic progress encourages future-oriented searches, or that a tendency toward such searches indicates a greater degree of optimism or entrepreneurial spirit than is found among those who look only backward.  In either case, business planners could use such data to build investment models: better to invest in countries with future-oriented populations than history-oriented ones.

Google scans Gmail messages and search queries and, through the application of sophisticated algorithms, can discern trends both in individual user preferences (e.g. someone who often refers to Persian carpets might be interested in buying one) and in large population groups (an increase in searches for ‘cough’ and ‘fever’ in Midwestern states might offer a clue to how the flu epidemic is progressing).  Retailers can use this information to generate sales, and the CDC can use it to set up flu treatment centers ahead of the epidemic.

The analysis of big meteorological data can improve forecasting and the identification of long-term trends.  In the days of the clipper ships, data were collected from thermometers and instruments that measured ocean temperature, currents, and wind direction and velocity, and were used to guide future navigation.  Today similar data are collected from billions of micro-sensors distributed throughout the planet and fed into supercomputers or thousands of parallel-processing computers to discern trends and correlate them with related meteorological events.

On a smaller scale, the US Department of Health and Human Services collects data on every Medicare patient in the US, and with the advent of electronic medical records and the consolidation of much patient data through Obamacare, trends and correlations can be determined that will help guide the channeling of both public and private resources.  Already much of the received wisdom about breast and prostate cancer screening has been thrown into doubt by the mining of big data.  Analysis has shown that there is little correlation between repeated mammography or PSA screening and cancer survival rates.  The increased number of genetic screening tests to isolate genes for various types of disease can generate data which can be correlated with outcomes: that is, how many individuals with a suspect gene actually develop the disease with which it is associated?
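The gene-and-outcome question above comes down to comparing two proportions: the share of carriers who develop the disease and the share of non-carriers who do.  Here is a minimal sketch in Python, assuming each patient record has been reduced to a hypothetical pair of yes/no flags (`has_gene`, `developed_disease`).  The function name and record layout are my own illustration, not any agency's actual data format.

```python
from typing import Iterable, Tuple


def disease_rates(records: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (rate among gene carriers, rate among non-carriers).

    Each record is a hypothetical (has_gene, developed_disease) pair.
    A group with no members yields NaN rather than a division error.
    """
    carriers = carriers_sick = others = others_sick = 0
    for has_gene, developed in records:
        if has_gene:
            carriers += 1
            carriers_sick += int(developed)
        else:
            others += 1
            others_sick += int(developed)

    carrier_rate = carriers_sick / carriers if carriers else float("nan")
    other_rate = others_sick / others if others else float("nan")
    return carrier_rate, other_rate


if __name__ == "__main__":
    # Made-up illustrative records, not real patient data.
    sample = (
        [(True, True)] * 3 + [(True, False)] * 7
        + [(False, True)] * 1 + [(False, False)] * 9
    )
    carrier_rate, other_rate = disease_rates(sample)
    print(f"carriers: {carrier_rate:.0%}, non-carriers: {other_rate:.0%}")
```

A large gap between the two rates would suggest the gene matters.  With small samples the gap can easily be noise, which is exactly why the size of the database counts.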

At the corporate level, industries can track everything about employee behavior, from hours worked to productivity, from sick days to performance.  A large company with enough data from its workforce can isolate the determinants of productivity and high performance.  A corporation with thousands of employees working in hundreds of sites can correlate efficiency and productivity with flextime, office configuration, window access, or the number and qualifications of employees.

According to McKinsey, there are five broad ways in which using big data can create value:

First, big data can unlock significant value by making information transparent and usable at much higher frequency.

Second, as organizations create and store more transactional data in digital form, they can collect more accurate and detailed performance information on everything from product inventories to sick days, and therefore expose variability and boost performance.

Third, big data allows ever-narrower segmentation of customers and therefore much more precisely tailored products or services.

Fourth, sophisticated analytics can substantially improve decision-making.

Finally, big data can be used to improve the development of the next generation of products and services. For instance, manufacturers are using data obtained from sensors embedded in products to create innovative after-sales service offerings such as proactive maintenance (preventive measures that take place before a failure occurs or is even noticed).

(McKinsey Global Institute: “Big Data”, Manyika et al.)

There are other emerging uses of big data.  One of them is crowdsourcing (http://www.uncleguidosfacts.com/2012/11/crowdsourcing-and-predictive-markets.html), an innovative use of large, often unrelated groups of contributors to make predictions, offer solutions to complex problems, or suggest new and innovative ideas.  The simplest example of crowdsourcing is to ask a large number of random people to bet on something, such as the number of gumballs in a jar or the winner of a presidential election.

When researchers gave people the chance to guess the number of gumballs in the jar, the average estimate was remarkably near the correct answer.  The random guessers did far better than the expert mathematicians.  In the 19th century, when betting on political campaigns was still legal, bettors called presidential elections more accurately than pundits did.  The principle is that people who bet on a particular issue or event have at least some analytical reason for their choice.  Although their reasons may be very subjective, astrological, or rigorously mathematical, the average tends to come out better than considered expert opinion.
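The averaging effect in the guess-the-jar experiment is easy to try out.  Below is a small Python simulation, a sketch under one loud assumption: each guess is the true count plus independent random noise with no shared bias.  The function names, the crowd size, and the spread of the guesses are all hypothetical choices of mine, not figures from any actual experiment.

```python
import random
import statistics
from typing import List, Tuple


def simulate_guesses(true_count: int, n_guessers: int, spread: float,
                     seed: int = 0) -> List[float]:
    """Generate hypothetical guesses: true count plus independent Gaussian noise.

    Assumption: errors are unbiased and independent. Negative guesses are
    clipped to zero, since nobody guesses a negative number of sweets.
    """
    rng = random.Random(seed)
    return [max(0.0, rng.gauss(true_count, spread)) for _ in range(n_guessers)]


def crowd_vs_individuals(guesses: List[float],
                         true_count: int) -> Tuple[float, float, float]:
    """Return (crowd average, crowd error, share of guessers the average beats)."""
    crowd_estimate = statistics.mean(guesses)
    crowd_error = abs(crowd_estimate - true_count)
    beaten = sum(1 for g in guesses if abs(g - true_count) > crowd_error)
    return crowd_estimate, crowd_error, beaten / len(guesses)


if __name__ == "__main__":
    guesses = simulate_guesses(true_count=850, n_guessers=1000, spread=300.0)
    estimate, error, share = crowd_vs_individuals(guesses, true_count=850)
    print(f"crowd average: {estimate:.1f}, off by {error:.1f}")
    print(f"the average beats {share:.0%} of individual guessers")
```

The simulation works because independent errors cancel when averaged.  If the whole crowd leaned the same way, say everyone overestimating, the average would inherit that bias, so the sketch shows the principle under favorable conditions only.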

Highly complex problems that have been broken down into component parts have been farmed out, or crowdsourced, to individuals or teams who were not chosen but who volunteered to work on the problem in return for a reward if they solve it.  This is a type of human parallel computing on a big data scale.

Netflix, in a well-known contest, offered a million dollars to anyone who could improve the accuracy of its movie-recommendation algorithm by 10 percent.  Although the company had its own engineers working on the problem, it chose to outsource it.  The effort was successful: the prize was awarded in 2009.

Big data and crowdsourcing are not simply new ways to mine and use data.  They represent seismic shifts in intellectual inquiry.  The individual expert, scientist, pundit, or academician is becoming increasingly irrelevant or peripheral.  If masses of unrelated individuals can arrive at better conclusions than the experts every time, why bother with the experts?

Big data is a truly revolutionary phenomenon.  The amount of data generated and collected increases geometrically every week.  Software programs that organize, exploit, and correlate these data in increasingly sophisticated ways are being developed at almost the same rate.  The two phenomena fit perfectly.  Our world will become increasingly knowable not through reflection and speculation but through hard, objective data.

Ironically, I have also written about the invasions of privacy that accompany this generation, mining, and use of data.  Although big data can be of great use to science, business, and government, the nearly universal and constant surveillance of our every keystroke should give shivers to even the most committed data freak.  I have written extensively about this (http://www.uncleguidosfacts.com/2012/01/invasion-of-privacywere-all-at-fault.html and search words ‘invasion of privacy’) but have given up.  Obviously most of us perceive more benefit than harm in our computer cookies, willingly give up privacy rights in the interests of anti-terrorism and crime-fighting, and are very happy that Amazon can recommend books and movies that we will like.

In any case, the great tsunami of big data has already inundated us, and it can only increase in size and importance.  I think it is a good thing.
