Sunday, September 29, 2013

Popular Analytics: Not Just for Google Anymore



In an article entitled “Google vs. Death,” Time magazine’s Harry McCracken and Lev Grossman explore Google’s corporate strategy of simultaneously investing in both mainstream services and risky long shots.  Referred to by Google’s CEO Larry Page as “moon shots,” the research and development efforts in question are longer term and more ambitious than those undertaken by other companies.  Among these investments is a start-up called Calico.

Calico, which will be run by Arthur Levinson, the former CEO of Genentech, will focus on medical issues associated with the aging process.  Among Calico’s operating premises is the technologist’s holy trinity:  Specifically, there is no problem that can’t be solved through a) hefty infusions of capital, b) the use of innovative technology and c) the application of huge amounts of processing power.  This is especially true in medicine, where doctors, researchers and other healthcare providers routinely access and analyze large patient data sets as part of diagnosis and treatment plans.

In many ways, Google’s investment in Calico is reflective of medicine’s return to its roots as an information science.  Indeed, the distinction between physicians and surgeons goes back almost a millennium.  For much of that time the two were members of different – and rival – professions.  English King Henry VIII chartered the London Royal College of Physicians in 1518.  It wasn’t until 1540 that the Company of Barber/Surgeons was granted a charter.  

The classical contrasts were clear:  Physicians performed an analysis of information about the patient’s condition and history within the context of human physiology and pathology.  Based on the information analysis, a diagnosis was made and a course of treatment prescribed.  Surgeons, on the other hand, sliced, diced, hacked, sawed, set, sewed and otherwise mechanically addressed injuries and wounds.  The distinction, at the time, was akin to that between white collar and blue collar workers; physicians practiced physic (roughly comparable to modern internal medicine) while surgeons engaged in, well, manual labor. (That’s not just a turn of phrase.  The word “surgery” derives from the Greek: χειρουργική cheirourgikē (composed of χείρ, "hand", and έργον, "work"), via Latin: chirurgiae, meaning "hand work.")  Fortunately for all concerned, the distinctions between the two in terms of professional standing and credibility as well as the use of information analysis to diagnose and treat have largely evaporated.

The points that Google is making in starting Calico, however, seem to be that:

  • Surgery, while sometimes unavoidable and demanding the utmost talent and capability on the practitioner’s part, is essentially reactive and remedial in nature;
  • The need for such remediation can be dramatically reduced, or at least more specifically targeted, by taking proactive measures; and
  • The necessary proactive measures can be accurately determined by running powerful analytics against ever-growing medical data sets.

These aren’t especially groundbreaking concepts.  Medical professionals typically engage in diagnostic analyses prior to embarking upon a course of treatment.  However, many of these analyses are mental, relying on, and limited to, the individual practitioner’s experience and innate capabilities.  Even when the analyses are computer assisted, they are often based on relatively small data sets and inefficient processing capabilities.

For example, a Board Certified Behavior Analyst (BCBA) treating a child with Autism Spectrum Disorder (ASD) in New Jersey generally has access only to her own experiences, supplemented with information published by professional organizations, when designing an Applied Behavior Analysis (ABA) treatment plan for the child.  In a fortunate circumstance, the BCBA may have access to records from other analysts in the same practice against which she could compare the child’s case and treatment.  However, this is still a limited information pool from which to draw, especially when compared to the data available on a state or nationwide basis.

As important as the information pool is the processing mechanism.  In the example above, the BCBA only has so many hours in a day to read case files, make sense of them, determine whether they apply to her case and, if so, how.  And all that is prior to making any determination as to what kinds of therapies or treatments the information in the files may indicate or suggest.  A critical aspect of information’s utility is its timeliness.  The best ASD therapy in the world is of little use if the time necessary to analyze existing data exceeds the time available for diagnosis prior to treatment.

These data problems are not unique to medicine.  They also acutely impact national security enterprises including defense and the intelligence community (IC) and local enterprises such as law enforcement.  As with the medical community, these entities are faced with critical problems, large data sets and a need for rapid, accurate analysis leading to effective solutions.  All three, defense, the IC and the medical community, also share a need for reliable, robust information security.  While it’s essential to provide the right information rapidly and efficiently, it’s absolutely crucial that the information be appropriately sanitized and that unauthorized parties be denied access.

Looking at these needs from an acquisitions perspective, a requirement emerges for a generic analytics capability that can be applied to domains ranging from medicine to intelligence to warfighting to academic research, business analysis and law enforcement.  Characteristics of such a capability might include:

  • The ability to define analytics parameters at the user or administrator level;
  • Data source and type agnosticism and the ability to add, remove or change data sources without significant impact to the overall capability;
  • Single Sign-On across the enterprise and/or across multiple domains;
  • Fine grained access control and automated data sanitization based on user attributes; and
  • Rapid analytics processing on commodity hardware.
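
As a rough illustration of the fourth characteristic above (fine grained access control with automated sanitization based on user attributes), the following minimal Python sketch redacts record fields that a user's attributes do not entitle them to see.  Everything in it is hypothetical: the `User` class, the `FIELD_POLICY` table, the attribute names and the sample record are assumptions made for illustration, not the design of any particular product.

```python
from dataclasses import dataclass

REDACTED = "[REDACTED]"


@dataclass(frozen=True)
class User:
    """A user described only by a set of attributes (hypothetical model)."""
    name: str
    attributes: frozenset


# Hypothetical policy: for each field, the attributes a user must ALL hold
# in order to see that field.  An empty set means any user may see it.
FIELD_POLICY = {
    "patient_name": frozenset({"clinician", "treating"}),
    "diagnosis": frozenset({"clinician"}),
    "therapy_outcome": frozenset(),
}


def sanitize(record: dict, user: User) -> dict:
    """Return a copy of `record` with fields the user may not see redacted."""
    clean = {}
    for field_name, value in record.items():
        required = FIELD_POLICY.get(field_name)
        # Deny by default: a field with no policy entry is always redacted.
        if required is not None and required <= user.attributes:
            clean[field_name] = value
        else:
            clean[field_name] = REDACTED
    return clean


# Illustrative data only.
record = {
    "patient_name": "Jane Doe",
    "diagnosis": "ASD",
    "therapy_outcome": "improved",
    "internal_note": "not covered by the policy",
}
researcher = User("researcher", frozenset({"researcher"}))
treating_bcba = User("bcba", frozenset({"clinician", "treating"}))

researcher_view = sanitize(record, researcher)
bcba_view = sanitize(record, treating_bcba)
```

In this sketch the researcher sees only the therapy outcome, while the treating analyst also sees the name and diagnosis; the field missing from the policy is hidden from both.  A deny-by-default rule is one reasonable choice here, since a data source added later cannot leak a field nobody remembered to classify.
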

There are dozens (if not hundreds) more such technical requirements.  However, one of the most important requirements isn’t technical but logistical.  In order for such a capability to make a difference, in order for it to be truly useful, it must be readily proliferated.  It’s one thing to have an entire company built around a specialized analytics capability, as Google has done with Calico.  It’s quite another to provide a drop-in, generic analytics tool that data intensive organizations of varying size can rapidly deploy and use, regardless of their area of specialization or domain.

Put another way, unless the BCBA’s contract information technology (IT) support can rapidly install and configure the analytics tool (regardless of whether it’s on-premises or in the Cloud), and unless the BCBA can start to use it with a minimum of set-up and training time, the need isn’t being met.  As importantly, the whole exercise also fails unless small and medium-sized organizations can afford to acquire and use the tool.  

There must be data upon which the tool can operate.  However, the ubiquity of affordable, secure and readily employed analytics tools, whether across the medical or the military communities, can be expected to create a groundswell of grass-roots, popular demand for the secure, but open, availability of organizationally (and especially governmentally) maintained data, a demand that industry and government policymakers and data owners will be unable to resist.  For an example of such data, one need look no further than the headlines.  The Affordable Care Act, regardless of whether one loves it or hates it, will create unprecedented stores of medical information that could be used in the search for remedies, cures and truly effective therapies.  Google, for one, is banking on this.

For a precedential example of such demand, one need only look at the effect of the rapid spread of mobile applications and APIs for accessing government data.  Widely proliferated mobile computing devices and processing capability created a popular expectation that government would open the data floodgates to make things faster, easier, more accurate and more convenient.  The result?  Among other things, one can now download the Internal Revenue Service’s (IRS) IRS2Go app.  In other words, “if you build it, they (or at least the data) will come.”

Perhaps the most surprising part of the “popular analytics” puzzle is that inexpensive, powerful analytics tools aren’t yet taking the IT landscape by storm.  The components necessary to create such tools are not only widely available and robust, most of them are open source and can be had without licensing or acquisition costs.  A few examples:

Capability | Product | License Type
Secure storage with cell-level access control | | Open Source; Apache 2.0
Fine-grained, attribute-based access control | | Open Source; Apache 2.0
Scalable, rapidly definable data analytics | | Open Source; Apache 2.0
Single Sign-On and authentication management | | Open Source; Apache 2.0
Data access and data loose coupling | | Open Source; Apache 2.0
Management of APIs to internal and external data stores | | Open Source; Apache 2.0

The case for democratizing analytics is compelling.  There’s always a possibility that one person looking at a limited data set may engage in the analysis that will lead to a breakthrough.  However, the odds of such a breakthrough increase significantly when many people look at a very large data set using powerful tools.  In the case of the BCBA doing what she can for one family’s autistic child, aren’t the benefits of increasing the odds of discovering an effective therapy obvious?  Similarly, how much more effective could a law enforcement organization be in protecting a municipality if affordable, powerful and effective analysis of criminal activity, trends and behaviors were the rule rather than the exception?
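
The "odds" argument above can be made concrete with a back-of-the-envelope calculation.  The Python sketch below assumes, purely for illustration, that each analyst independently has the same small chance of making the breakthrough; under that assumption the chance that at least one of them succeeds is 1 - (1 - p)^n.  The function name and the sample figures are hypothetical, and real analysts working from shared data would not be fully independent, so this is a toy model rather than an estimate.

```python
def chance_of_breakthrough(p_single: float, analysts: int) -> float:
    """Probability that at least one of `analysts` independent analysts,
    each succeeding with probability `p_single`, makes the breakthrough."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if analysts < 0:
        raise ValueError("analysts must be non-negative")
    # Complement rule: 1 minus the chance that every analyst fails.
    return 1.0 - (1.0 - p_single) ** analysts


# Illustrative figures only: a 1% individual chance of success.
one_analyst = chance_of_breakthrough(0.01, 1)
hundred_analysts = chance_of_breakthrough(0.01, 100)
```

With these made-up numbers, a single analyst has a 1% chance, while one hundred analysts working independently have a combined chance of roughly 63%.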

Democratizing – effectively crowd-sourcing – analytics can have profoundly beneficial results.  The need is there, the tools are there.  Why should Google have all the fun?