Most people and companies involved in predictive analytics will applaud the level of sophistication, reliability and ease of use of modern data mining software. However, these same people will acknowledge that extracting value from data mining software is much more than a "technology play".
Important additional components of the development of a successful predictive analytical solution are "people" and "process". In a real-time application, "people" includes business users, business analysts, database administrators, system integrators, and "data scientists".
The effective data scientist must be an analyst who is not only a master of the technical aspects of data mining, but is also able to stimulate and guide a creative dialogue between business requirements and opportunities on the one hand, and the opportunities gleaned from the data analyses on the other. He or she has to be a team player within the diverse group progressing the project, with an understanding that mathematical modelling is only one cog in a larger wheel. These qualities can be quite hard to find, and companies are advised to be discerning in their choice of partners for the implementation of enterprise level solutions.
The data scientist also has to be aware of the limitations of sophisticated data mining software, and the many pitfalls that bedevil predictive analytics projects. Issues such as statistical tunnel vision, inability to redefine the business objectives in statistical terms, variable circularity, inappropriate use of predictor variables, poor choice of modelling algorithms, insensitivity to views expressed by business users and business analysts, inflexibility in matching analytical results to business requirements, or missing glaring and valuable opportunities can spell doom for a large and very expensive enterprise level project.
Because of the nature of their educational background, and the personality types that are drawn to mathematical subjects, data analysts run the risk of unproductively locking on to specific statistical objectives, effectively a kind of tunnel vision. One example is the rampant pursuit of Gini coefficient maximisation. This unwavering focus can blind the data scientist to the startling fact that their results have absolutely no value or, worse, no relevance, or, equally risky, that results which appear valueless on the basis of the Gini are actually profitable. A Gini coefficient on its own is a fairly limited indicator. A baseline economic analysis of costs and benefits provides much more credible guidance.
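The gap between a Gini coefficient and a cost-benefit view can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy scores and the monetary figures are all hypothetical, and the Gini is taken in its usual form as 2 x AUC - 1.

```python
from typing import Sequence


def gini(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Gini = 2 * AUC - 1, with AUC computed by pairwise comparison."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return 2.0 * wins / (len(pos) * len(neg)) - 1.0


def net_benefit(scores, labels, threshold, benefit_per_hit, cost_per_action):
    """Money gained by acting on every case scored at or above the threshold."""
    acted = [y for s, y in zip(scores, labels) if s >= threshold]
    return sum(acted) * benefit_per_hit - len(acted) * cost_per_action


# Hypothetical toy data: model scores and known outcomes (1 = target event).
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
example_labels = [1, 1, 0, 1, 0, 0, 0, 0]

print(round(gini(example_scores, example_labels), 3))
# Same model, same Gini, two hypothetical cost structures.
print(net_benefit(example_scores, example_labels, 0.5, 100, 90))
print(net_benefit(example_scores, example_labels, 0.5, 100, 20))
```

With these made-up figures the same well-ranked model loses money when each action costs 90 and makes money when it costs 20, which is the kind of information the Gini alone cannot provide.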
Real data mining involves much more than simply pointing some predictors at a target variable and pushing "run". Considerable experience is required to translate a business requirement into a statistical problem. The executive or team that defined the opportunity may have attempted an initial translation without fully communicating their real requirements: "we need to predict churn", "we need to identify high credit risk individuals", or "we need to identify which of these security alarms are genuine". Each of these headings covers much complexity and many possibilities, and most likely a high degree of company specificity. Even flagging historic data with known outcomes of "churn", "defaulter", or "real alarm" is challenging. Properly understanding the true intent of a project requires good communication skills.
Experience is a prerequisite for recognising when a result is just too good to be true. If the ubiquitous problem of over-fitting can be ruled out as the explanation, then more subtle forces may be at work. Very common is the inclusion in the data set of variables involved in, or closely related to, the definition of the target variable, especially in binary classification problems. These have to be weeded out systematically, with constant reference to business logic, typically starting with the top ranked predictors.
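One simple way to start this weeding-out is to check how well each predictor separates the target on its own, and to send any that do so almost perfectly back for review against business logic. The sketch below is a minimal illustration, not a complete leakage test: the column names, the toy rows and the 0.95 cut-off are hypothetical, and it assumes numeric predictors and a 0/1 target.

```python
from typing import Dict, List, Sequence


def auc(values: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive case has a higher value than a random negative."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def flag_suspect_predictors(rows: List[Dict[str, float]], target: str,
                            threshold: float = 0.95) -> List[str]:
    """Return predictors that separate the target almost perfectly on their own."""
    labels = [row[target] for row in rows]
    suspects = []
    for name in rows[0]:
        if name == target:
            continue
        a = auc([row[name] for row in rows], labels)
        # An AUC near 0 is as suspicious as one near 1 (inverse relationship).
        if max(a, 1.0 - a) >= threshold:
            suspects.append(name)
    return suspects


# Hypothetical churn data: 'account_closed_flag' is effectively the target itself.
rows = [
    {"tenure_months": 3, "account_closed_flag": 1, "churn": 1},
    {"tenure_months": 20, "account_closed_flag": 1, "churn": 1},
    {"tenure_months": 8, "account_closed_flag": 1, "churn": 1},
    {"tenure_months": 12, "account_closed_flag": 0, "churn": 0},
    {"tenure_months": 5, "account_closed_flag": 0, "churn": 0},
    {"tenure_months": 30, "account_closed_flag": 0, "churn": 0},
    {"tenure_months": 24, "account_closed_flag": 0, "churn": 0},
]

suspects = flag_suspect_predictors(rows, "churn")
print(suspects)
```

A flagged variable is only a candidate for removal; whether it is genuinely circular still has to be decided with the business users who know how the target was defined.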
In dealing with a kind of algorithmic inferiority complex, software vendors have loaded their software with increasingly exotic mathematical algorithms which can easily mislead the user into assuming that more is better. In fact, in the world of statistics, less, i.e. more parsimony, is usually better. In the high cost world of the predictive analytic solution, confidence building to achieve ongoing funding approval is a real requirement, and this cannot be achieved by dazzling executives with complex "black box" algorithms. Simple tools such as decision trees, logistic regression and linear regression, which can be more easily communicated, or developed in collaboration with business users, are the best for this, and are often more statistically robust as well. The experienced data scientist will have no need for the extra bells and whistles of the exotic algorithm. But he or she needs to be able to recognise that one-in-a-hundred occasion when the exotic is obligatory, when a generalised linear mixed model (GLMM) using a Tweedie distribution is the right fit.
Business opportunities in data mining results can be hard to spot, especially if you are chasing the simplistic definition of the problem: "churn", "defaulter", or "real alarm". For starters, opportunities associated with the converse must be explored. Short-term insurance faces a massive problem of fraud, which has led to the implementation of real-time fraud detection solutions. However, shifting the risk profile of claims to better detect fraud also makes it possible to more reliably identify safe low-risk claims. Closer economic analysis shows that although repudiations of risky and fraudulent claims can be increased, the cost of forensics places an upper limit on this. The economics reveals that, in fact, a far greater economic opportunity lies in the reduction in claims handling costs by increasing the number of claims which are fast-tracked.
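The arithmetic behind this argument can be sketched as follows. Every number and name here is a hypothetical placeholder rather than an industry figure; the point is only the shape of the calculation, in which forensic cost caps the number of claims worth investigating while every low-risk claim contributes a handling saving.

```python
from typing import Sequence


def worthwhile_investigations(fraud_probs: Sequence[float],
                              avg_claim_value: float,
                              forensic_cost: float) -> int:
    """Count claims whose expected recovered value exceeds the cost of forensics."""
    return sum(1 for p in fraud_probs if p * avg_claim_value > forensic_cost)


def fast_track_saving(fraud_probs: Sequence[float], safe_below: float,
                      standard_cost: float, fast_cost: float) -> float:
    """Handling cost saved by fast-tracking every claim scored below 'safe_below'."""
    n_fast = sum(1 for p in fraud_probs if p < safe_below)
    return n_fast * (standard_cost - fast_cost)


# Hypothetical fraud scores for ten claims, highest risk first.
fraud_probs = [0.9, 0.6, 0.3, 0.1, 0.05, 0.02, 0.02, 0.01, 0.01, 0.01]

# Assumed figures: average claim value 1000, forensic cost 400 per claim,
# standard handling cost 50 per claim, fast-track handling cost 10 per claim.
n_investigate = worthwhile_investigations(fraud_probs, 1000, 400)
saving = fast_track_saving(fraud_probs, 0.05, 50, 10)

print(n_investigate, saving)
```

In a real portfolio low-risk claims vastly outnumber suspicious ones, so the fast-track term scales with volume in a way the investigation term cannot.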
The take-home message from this is that, of the three components, technology, people and process, people is one of the most important differentiators in the choice of an implementation partner.