Book Reviews

Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data

3rd Edition, by Bruce Ratner. Chapman and Hall/CRC, Taylor & Francis Group, Boca Raton, FL, 2020; ISBN 978-0-367-57360-7; xxxiii + 655 pp., $54.95 (pbk)

The 3rd edition of the book has been significantly extended, to 44 chapters from the 31 chapters of the 2nd edition of 2011 (the 1st edition appeared in 2003), with the previous text rewritten and elaborated on using recent methods and methodologies of statistical modeling, predictive analytics, machine learning, and data mining. The author, a renowned expert in operating with big data, discusses various specific approaches to predictive modeling, statistical analysis, and the interpretation of results in numerous practical problems, mostly connected to marketing research. The new chapters cover the relation between statistics and data science, market share estimation, share-of-wallet modeling without survey data, latent market segmentation, regression modeling with incomplete data, decile analysis for assessing the predictive power of data, net effects and lift models, and user-friendly text mining, with multiple selected SAS subroutines that can be converted to other languages.

Chapter 1, “Introduction,” starts with the concepts of statistical data mining and machine-learning data mining, refers to John Tukey’s practically oriented approach of Exploratory Data Analysis (EDA) with its idea “let your data be your guide,” and describes working with small and big data. Personal computers (PCs) have been widely employed both for Machine Learning (ML), which corresponds to finding relations or structure among the variables in the columns of a dataset, and for aggregating the observations in the rows of the dataset, which is sometimes called data lifting. The ML approach of automatic interaction detection (AID) for regression trees was developed beginning in the 1960s; other algorithms followed, such as theta AID (THAID), multivariate AID (MAID), chi-squared AID (CHAID), and classification and regression trees (CART). The integration of statistics and ML began in the 1980s and can be summarized by the mnemonic Data Mining = Statistics + Big Data + ML. Chapter 2, “Science Dealing with Data: Statistics and Data Science,” reviews various authors’ opinions on these terms and concludes that they denote almost the same subject, although the statistical approach requires some assumptions on distributions to perform the corresponding hypothesis testing, while data science uses the observations themselves to find the needed estimates via computational data processing. Chapter 3, “Two Basic Data Mining Methods for Variable Assessment,” considers estimation of the pair correlation between two variables by presenting data in scatterplots, smoothing the scatterplots, and applying the general association test. Chapter 4, “CHAID-Based Data Mining for Paired-Variable Assessment,” proposes applying this tool for scatterplot smoothing and correlation estimation. Chapter 5, “The Importance of Straight Data Simplicity and Desirability for Good Model-Building Practice,” continues with examples of scatterplots and linear correlations with many variables, using genetic programming (GP), GP-based data mining (GP-DM), and the suggested DM tool, the GenIQvar model, used to transform variables to reach higher correlations and better model fit. Chapter 6, “Symmetrizing Ranked Data: A Statistical Data Mining Method for Improving the Predictive Power of Data,” describes measurements in different scales (nominal, ordinal, interval, and ratio) and displays data in stem-and-leaf displays, box-and-whiskers plots (boxplots), and histograms, with the additional construction of an approximate interval variable.
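The book presents its EDA tools through SAS code; purely as a hedged illustration of the slice-means idea behind the scatterplot smoothing described for Chapter 3 (and not the author's own procedure), a short Python sketch might look as follows, with the function name and toy data invented for the example.

```python
import numpy as np

def smooth_scatterplot(x, y, n_slices=10):
    """Smooth a scatterplot by slicing x into equal-count groups and
    replacing each group with its (mean x, mean y) point (an EDA-style smooth)."""
    order = np.argsort(x)
    x_sorted, y_sorted = np.asarray(x)[order], np.asarray(y)[order]
    x_means = np.array([s.mean() for s in np.array_split(x_sorted, n_slices)])
    y_means = np.array([s.mean() for s in np.array_split(y_sorted, n_slices)])
    return x_means, y_means

# Toy data: a noisy nonlinear relation becomes much clearer after smoothing.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)
y = np.log1p(x) + rng.normal(scale=0.5, size=500)
xs, ys = smooth_scatterplot(x, y)
print(np.corrcoef(x, y)[0, 1])    # correlation of the raw points
print(np.corrcoef(xs, ys)[0, 1])  # correlation of the smoothed points
```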

Chapter 7, “Principal Component Analysis: A Statistical Data Mining Method for Many-Variable Assessment,” deals with applications of this multivariate technique, commonly abbreviated as PCA, describing its properties through numerical evaluations. Chapter 8, “Market Share Estimation: Data Mining for an Exceptional Case,” illustrates the application of PCA to the market shares of some products, with several appendices presenting the needed SAS codes.

Chapter 9, “The Correlation Coefficient: Its Values Range between Plus and Minus 1, or Do They?,” considers an adjusted estimate of the pair correlation. Chapter 10, “Logistic Regression: The Workhorse of Response Modeling,” is devoted to the binary-outcome logistic regression model, describing its features through several case studies, with special attention to assessing the importance of variables and plotting the results. Chapter 11, “Predicting Share of Wallet without Survey Data,” discusses evaluating the share of a customer’s total spending captured by a company, with an illustration from the credit card industry and appendices of SAS subroutines. Chapter 12, “Ordinary Regression: The Workhorse of Profit Modeling,” presents the popular multiple linear regression model and discusses its goodness-of-fit characteristics, such as the coefficient of multiple determination R-squared and the F-statistic, with application to predicting sales and profits. Chapter 13, “Variable Selection Methods in Regression: Ignorable Problem, Notable Solution,” continues with various tests of predictor importance in the ordinary least squares (OLS) model, using forward-selection and backward-elimination stepwise techniques and suggesting enhanced and EDA-based approaches.
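The variable-selection methods of Chapter 13 are illustrated in the book with SAS; as a minimal sketch of the generic forward-selection idea only (not the author's enhanced or EDA-based procedures), consider the following Python fragment, where the R-squared stopping threshold and the toy data are assumptions made for the example.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an OLS fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def forward_selection(X, y, names, min_gain=0.01):
    """Greedy forward selection: repeatedly add the predictor that most
    improves R-squared until the gain falls below min_gain."""
    selected, best_r2 = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        r2, j = max((r_squared(X[:, selected + [j]], y), j) for j in remaining)
        if r2 - best_r2 < min_gain:
            break
        selected.append(j)
        remaining.remove(j)
        best_r2 = r2
    return [names[j] for j in selected], best_r2

# Toy data: y depends on x1 and x3 only.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = 2 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=300)
print(forward_selection(X, y, ["x1", "x2", "x3", "x4"]))
```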

Chapter 14, “CHAID for Interpreting a Logistic Regression Model,” considers how to evaluate the contribution of each predictor in the logit model with the help of this data mining tool, giving examples of multivariate CHAID trees and their graphs for market segmentation and other problems. Chapter 15, “The Importance of the Regression Coefficient,” returns to OLS regression and the predictors’ influence on prediction, and also discusses the question of p-values for parameter estimates in the case of big data. Chapter 16, “The Average Correlation: A Statistical Data Mining Measure for Assessment of Competing Predictive Models and the Importance of the Predictor Variables,” uses the mean level of pair correlations to distinguish more and less important variables in multiple regression, and discusses the multicollinearity problem and model reliability and validity. Chapter 17, “CHAID for Specifying a Model with Interaction Variables,” returns to analyzing linear regression with interaction effects defined by products of predictors.

Chapter 18, “Market Segmentation Classification Modeling with Logistic Regression,” extends the binary logit to the polychotomous logistic regression (PLR) with several categories of the outcome, considered not as a multinomial logit but as several pairwise-related logit models, with CHAID tree applications to market classification data. Chapter 19, “Market Segmentation Based on Time-Series Data Using Latent Class Analysis,” reviews PCA and factor analysis (FA), clustering by k-means, and latent class analysis (LCA), applied to dividing customers into groups with similar products and services. Chapter 20, “Market Segmentation: An Easy Way to Understand the Segments,” continues with modeling for developing effective marketing strategies, and presents several appendices of SAS codes for this aim. Chapter 21, “The Statistical Regression Model: An Easy Way to Understand the Model,” suggests an empirical procedure for estimating the impact on the dependent variable in OLS or logistic regression (LR) by building its indexed decile profiles for a change in one predictor while the other predictors are held constant, with several appendices containing SAS codes for this process. Chapter 22, “CHAID as a Method for Filling in Missing Values,” describes missing at random (MAR) and missing completely at random (MCAR) data, and imputation with regression and classification tree techniques. Chapter 23, “Model Building with Big Complete and Incomplete Data,” adds to missing-data imputation with the help of PCA and response modeling. Chapter 24, “Art, Science, Numbers, and Poetry,” serves as a relaxation in the middle of the book, citing Einstein, sharing the author’s math verse, and contemplating the development of a golden rule in statistics.
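Chapter 22's CHAID-based imputation is given in the book as SAS code; a loose analogue using an ordinary regression tree (CART via scikit-learn rather than CHAID, so only an approximation of the idea) can be sketched in Python as below; the column layout and toy data are invented for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def tree_impute(X, target_col, max_depth=3):
    """Fill missing values in one column by predicting them with a
    regression tree grown on the complete rows of the other columns."""
    X = X.copy()
    miss = np.isnan(X[:, target_col])
    other = [j for j in range(X.shape[1]) if j != target_col]
    ok_rows = ~np.isnan(X[:, other]).any(axis=1)   # rows complete in predictors
    train, predict = ok_rows & ~miss, ok_rows & miss
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    tree.fit(X[np.ix_(train, other)], X[train, target_col])
    X[predict, target_col] = tree.predict(X[np.ix_(predict, other)])
    return X

# Toy example: income depends on age and spending; 20% of income is missing.
rng = np.random.default_rng(2)
age = rng.uniform(20, 70, 1000)
spend = rng.uniform(0, 5, 1000)
income = 1.2 * age + 8 * spend + rng.normal(scale=5, size=1000)
data = np.column_stack([age, spend, income])
data[rng.random(1000) < 0.2, 2] = np.nan
filled = tree_impute(data, target_col=2)
print(np.isnan(filled[:, 2]).sum())  # 0 missing values after imputation
```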

Chapter 25, “Identifying Your Best Customers: Descriptive, Predictive, and Look-Alike Profiling,” discusses demographic segments and how to find them with the help of CHAID. Chapter 26, “Assessment of Marketing Models,” discusses decile analysis, the precision and separability concepts for response and profit models, the Hosmer-Lemeshow goodness-of-fit test, the coefficient of variation (CV), and the smoothed weighted mean absolute deviation (SWMAD) measure. Chapter 27, “Decile Analysis: Perspective and Performance,” continues with assessing the incremental predictive gains of response models, with several appendices of SAS codes. Chapter 28, “Net T-C Lift Model: Assessing the Net Effects of Test and Control Campaigns,” proposes an algorithm for estimating the net difference between test and control (T-C) response models for campaigns used in direct marketing, and presents several appendices of SAS subroutines for such evaluations. Chapter 29, “Bootstrapping in Marketing: A New Approach for Validating Models,” deals with estimation in response and profit modeling, including means, standard deviations, confidence intervals, decile validation, and assessment of model implementation performance and efficiency. Chapter 30, “Validating the Logistic Regression Model: Try Bootstrapping,” adds considerations for validating the binary-outcome model.
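The bootstrap validation of Chapters 29 and 30 is implemented in the book with SAS subroutines; as a hedged sketch of the general percentile-bootstrap idea only (not the author's exact validation scheme), one might bootstrap a decile statistic as follows, with the choice of statistic and the toy data assumed for the example.

```python
import numpy as np

def top_decile_response_rate(score, response):
    """Response rate among the 10% of individuals with the highest scores."""
    top = score >= np.quantile(score, 0.9)
    return response[top].mean()

def bootstrap_ci(score, response, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a model statistic."""
    rng = np.random.default_rng(seed)
    n = len(score)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)          # resample rows with replacement
        stats[b] = stat(score[idx], response[idx])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# Toy data: scores mildly predictive of a binary response.
rng = np.random.default_rng(3)
score = rng.normal(size=5000)
response = (rng.random(5000) < 1 / (1 + np.exp(-score))).astype(float)
est = top_decile_response_rate(score, response)
lo, hi = bootstrap_ci(score, response, top_decile_response_rate)
print(f"top-decile response rate {est:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```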

Chapter 31, “Visualization of Marketing Models: Data Mining to Uncover Innards of a Model,” focuses on ways of graphically presenting data, particularly via star graphs and decile profile curves, with illustrations and several appendices of SAS subroutines for building the graphs. Chapter 32, “The Predictive Contribution Coefficient: A Measure of Predictive Importance,” describes identifying each predictor’s contribution in accordance with the standardized regression coefficients. Chapter 33, “Regression Modeling Involves Art, Science, and Poetry, Too,” praises the author’s machine learning method called GenIQ (previously called GenIQvar, for instance, in Chapter 5 and elsewhere) and claims its superiority to the common alternative models of OLS and LRM (the latter being the logistic regression model, previously also called simply LR; see Chapter 21; both the LR and LRM terms are used in the following chapters). A “Shakespearian Modelogue,” starting with the recognizably derived phrase “To fit or not to fit,” is devoted by the author to his invention, the GenIQ model. Some properties of GenIQ are discussed, and it is suggested as a replacement for OLS and LRM, although the model itself is not yet exactly defined at this point. Chapter 34, “Opening the Dataset: A Twelve-Step Program for Dataholics,” describes some checks needed when starting work with a dataset, and supplies SAS codes for them. “This eclectic chapter” (in the author’s words on p. 471), besides the famous E = mc² and exp(iπ) + 1 = 0, contains a new equation, Love = 1/Ego, evidently invented by the author himself. Chapter 35, “Genetic and Statistic Regression Models: A Comparison,” considers the performance of the linear OLS model and logistic regression (again denoted LR), describes GenIQ as the genetic alternative to the statistical LR model, and illustrates GenIQ at work in filling the upper deciles. Chapter 36, “Data Reuse: A Powerful Data Mining Effect of the GenIQ Model,” continues describing this nonparametric technique, presenting it on the original data and on transformed variables for a better fit of the dependent variable, with an example of the profit model. Chapter 37, “A Data Mining Method for Moderating Outliers Instead of Discarding Them,” tells us about GenIQ’s abilities for handling outliers. Chapter 38, “Overfitting: Old Problem, New Solution,” discusses how GenIQ deals with overfitting by random splitting in decile analysis. Chapter 39, “The Importance of Straight Data: Revisited,” returns to Chapters 5 and 12 to describe in more detail how GenIQ was applied to the profit model to make it more highly related to income, even if this requires unusual trigonometric and logarithmic transformations of the income covariate.

Chapter 40, “The GenIQ Model: Its Definition and an Application,” explains in detail what this model is. It starts with the note that GenIQ uses genetic modeling (GM) as the optimization technique for addressing such problems as direct and database marketing and customer relationship management (CRM). In the GM approach, an initial population of randomly chosen sets of predictors and their transformations is employed to build models and calculate their fitness characteristic. These models are then used to create a new population of models through genetic operators, which choose models with a probability based on fitness, so that a better model has a higher chance of being selected. Models are copied from the current population into the new one by the reproduction operator; crossover creates two offspring models for the new population by recombining randomly chosen parts of the parent models; and mutation adds random changes to the new models. The resulting best-in-generation model with the highest fitness value serves as the approximate solution of the problem. Goals of marketing modeling are considered, and indices of aggregated performance from decile analyses are introduced. Selected respondents who are likely to increase the response or the profit over a random set of individuals are used in the Cumulative (Cum) Lift indices, for instance, the Cum Response Lift or Cum Profit Lift. A model whose incremental outcome yields more results in the upper deciles (top, second, third, or fourth) is better than one concentrated in the lower deciles, and this concept is employed in GenIQ modeling, illustrated by case studies for response and profit models and the corresponding trees. Chapter 41, “Finding the Best Variables for Marketing Models,” applies the Cum Lift criterion in the GenIQ approach, discussed through several examples. Chapter 42, “Interpretation of Coefficient-Free Models,” notes that a parameter in the OLS model equals the change in the dependent variable due to a unit change in the predictor when the other predictors are held constant. Such an evaluation is extended to define quasi-regression coefficients (quasi-RC) for nonlinear models, including the GenIQ model, which is discussed through multiple examples. Chapter 43, “Text Mining: Primer, Illustration, and TXTDM Software,” briefly describes data mining of textual data, natural language processing, computational linguistics, information retrieval, and statistics of the words in a corpus, gives examples with applications, and concludes with appendices containing SAS codes. Chapter 44, “Some of My Favorite Statistical Subroutines,” includes multiple specific subroutines referenced in the book, and the Index concludes the volume.
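The Cum Lift indices from the decile analysis of Chapters 40 and 41 are computed in the book with SAS; purely as an illustration of the standard construction (rank individuals by model score, split them into ten equal-count deciles, and compare cumulative response rates with the overall rate, so that 100 corresponds to a random selection), a minimal Python sketch follows; the toy data are invented for the example and this is not the author's code.

```python
import numpy as np
import pandas as pd

def decile_analysis(score, response):
    """Decile table with the Cum Response Lift: decile 1 holds the highest
    scores; cum_lift = 100 * cumulative response rate / overall response rate."""
    df = pd.DataFrame({"score": score, "response": response})
    df = df.sort_values("score", ascending=False).reset_index(drop=True)
    df["decile"] = pd.qcut(df.index, 10, labels=list(range(1, 11)))
    table = df.groupby("decile", observed=True)["response"].agg(["count", "sum"])
    table["cum_resp_rate"] = table["sum"].cumsum() / table["count"].cumsum()
    table["cum_lift"] = 100 * table["cum_resp_rate"] / df["response"].mean()
    return table

# Toy data: a score correlated with the probability of a binary response.
rng = np.random.default_rng(4)
score = rng.normal(size=10000)
response = (rng.random(10000) < 1 / (1 + np.exp(-1.5 * score))).astype(int)
print(decile_analysis(score, response)[["cum_resp_rate", "cum_lift"]])
```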

Each chapter presents numerical examples, SAS codes, and a list of references. The book is written as a sharing of personal experience with useful discoveries made in mining the data of different projects, which may explain the book's structure, in which a topic or technique is considered repeatedly on different data in different chapters. Such an arrangement of material makes it difficult to find the needed descriptions of a question dispersed throughout the book, but it might be useful for recursive study and a better understanding of a subject. There are many innovative heuristic enhancements of statistical techniques that proved valuable thanks to the author's personal involvement in multiple projects. The link http://www.geniq.net/articles.html#section9 leads to the author's website with many dozens of notes, essays, articles, SAS codes, solutions, and discussions of various related problems and statistical methods. The book can be useful to practitioners and researchers in data science and machine-learning data mining.

DOI: 10.1080/00401706.2021.2020521
Stan Lipovetsky
Minneapolis, MN
