Book Reviews

Exploratory Data Analysis With MATLAB

by Wendy L. Martinez, Angel R. Martinez, and Jeffrey L. Solka. Boca Raton, FL: CRC Press, 2017, xxv + 590 pp., $100.00 (Hardback), $46.36 (eBook), ISBN-13: 978-1-4987-7606-6.

This book provides extensive coverage of exploratory data analysis (EDA) using MATLAB. Although this software is used throughout the book, readers can modify the algorithms for other statistical packages. The book is divided into three parts consisting of 11 chapters and 5 appendices: Part I is an introduction to EDA, Part II covers EDA as pattern discovery, and Part III is devoted to graphical methods for EDA. Each chapter is followed by a good number of exercises and a guide to further reading. The book is suitable for several disciplines, including statistics, computer science, data mining, machine learning, and engineering. In this third edition, Chapter 2 has additional content on random projections and estimating the local intrinsic dimensionality. Chapter 3 adds a description of deep learning, autoencoders, and stochastic neighbor embedding. Chapter 5 now covers a clustering approach based on the minimum spanning tree, as well as a discussion of several cluster validity indices. Chapter 9 has an additional section on kernel density estimation; violin plots, beanplots, and new variants of boxplots have also been added to this chapter. Chapter 11 is a newly added chapter focusing on an important issue in data analysis, namely visualizing categorical data, and includes methods for visualizing the distribution shapes of univariate categorical data and tabular data.

Chapter 1 covers some background on EDA and how it compares with other data analysis techniques. This includes a definition of EDA, an overview of the textbook, the datasets used in the book along with a brief explanation of each, and the need for transforming data, together with some transformation methods such as standardization by the standard deviation or range and sphering the data.
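As a rough illustration of these transformations (a sketch on toy data in plain MATLAB, not code from the book), standardizing and sphering a data matrix might look as follows:

% Toy n-by-p data matrix with some correlation (illustrative only).
X = randn(100, 3) * [2 0 0; 1 1 0; 0 0 5];

% Standardize each variable: subtract its mean and divide by its standard deviation.
Xz = (X - mean(X)) ./ std(X);

% Sphere the data: rotate and rescale so the sample covariance becomes the identity.
[U, S] = svd(cov(X));
Xs = (X - mean(X)) * U * diag(1 ./ sqrt(diag(S)));
disp(cov(Xs))   % approximately the identity matrix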

Chapter 2 is concerned with dimensionality reduction: a mapping (linear or nonlinear) from the higher-dimensional space to a lower-dimensional one is introduced while retaining as much information as possible about the available variables. The coverage begins with principal component analysis (PCA), followed by methods for choosing a suitable number of dimensions. Factor analysis is then presented as another dimensionality reduction method, along with Fisher's linear discriminant analysis (LDA), which is closely related to regression analysis and ANOVA in that it relates one dependent variable to a combination of independent variables. The chapter also introduces intrinsic dimensionality, the smallest number of dimensions or variables needed to model the data without loss, and ends with a discussion of several estimators of it, including the nearest neighbor approach, the correlation dimension, and maximum likelihood.
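As a minimal sketch of PCA-based dimension reduction (using the pca function from MATLAB's Statistics and Machine Learning Toolbox on toy data, not the book's own code):

X = randn(200, 5);                        % toy n-by-p data matrix
[coeff, score, latent] = pca(X);          % loadings, projected data, component variances

% Keep enough components to explain, say, 90% of the total variance.
explained = cumsum(latent) / sum(latent);
d = find(explained >= 0.90, 1);
Xreduced = score(:, 1:d);                 % data mapped to d dimensions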

In Chapter 3, the treatment of dimensionality reduction continues and is extended to nonlinear methods. The chapter begins with multidimensional scaling (MDS), a set of techniques for discovering hidden patterns in a dataset. The authors then present metric MDS, classical MDS, and nonmetric MDS, followed by locally linear embedding as an unsupervised learning algorithm, isometric feature mapping (ISOMAP), Hessian eigenmaps, and artificial neural network approaches including self-organizing maps, generative topographic maps, and curvilinear component analysis. The chapter ends with deep learning as a supervised or unsupervised technique.
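A brief sketch of classical MDS, assuming the Statistics and Machine Learning Toolbox functions pdist and cmdscale and purely illustrative data:

X = randn(150, 10);               % toy high-dimensional data
D = pdist(X);                     % pairwise Euclidean distances
[Y, eigvals] = cmdscale(D);       % classical (metric) MDS configuration
scatter(Y(:,1), Y(:,2));          % two-dimensional embedding for plotting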

In the next chapter, dimensionality reduction is approached by touring the space and looking at the data from different perspectives. Three tour methods are presented: grand, interpolated, and guided tours. In a grand tour, the data are viewed from all possible perspectives. An interpolation tour starts with two subspaces, an initial subspace and a target subspace; the tour proceeds by traveling between the two and continues by moving from one target space to another. Guided tours can be either partly or completely guided by the data. An example of this type of tour is the EDA projection pursuit method, where the tour is guided by the data in the sense that it keeps touring until a possible structure is found.
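To convey the flavor of a tour (this is only a single random projection on toy data, not the book's tour algorithm), one view of the data can be obtained by projecting onto a random orthonormal two-dimensional basis:

X = randn(300, 6);                        % toy p-dimensional data
[Q, ~] = qr(randn(size(X, 2), 2), 0);     % random orthonormal 2-D basis
Xproj = X * Q;                            % one "view" of the data
scatter(Xproj(:,1), Xproj(:,2));          % a grand tour strings many such views together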

In Chapter 5, the authors turn their attention to finding groups or clusters in the data, that is, organizing the data into groups in such a way that observations within a group are more similar to each other than they are to observations belonging to a different group or cluster. The steps of clustering are presented, along with several approaches to clustering, including hierarchical methods, optimization methods such as k-means, the very popular method of spectral clustering, and the minimum spanning tree. The authors then look at evaluating the quality of clustering results and at estimating the correct number of groups in the data. The chapter also addresses how well a method retrieves natural clusters and how sensitive a method is to missing data. Additionally, we see the silhouette statistic (as a way of estimating the number of groups in a dataset) and the gap statistic (for estimating the number of clusters obtained by any technique).
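A hedged sketch of k-means clustering with a silhouette-based choice of the number of groups, assuming the Statistics and Machine Learning Toolbox (evalclusters, kmeans, silhouette) and toy two-group data:

X = [randn(100, 2); randn(100, 2) + 4];                      % toy data with two groups
eva = evalclusters(X, 'kmeans', 'silhouette', 'KList', 2:6); % compare k = 2,...,6
k = eva.OptimalK;                                            % estimated number of clusters
idx = kmeans(X, k);                                          % cluster labels
silhouette(X, idx);                                          % per-observation silhouette plot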

In the next chapter, the authors discuss an approach to clustering based on estimating univariate and multivariate probability densities as finite mixtures. They discuss the techniques used in model-based clustering: finite mixtures, the expectation-maximization (EM) algorithm as a general method for optimizing likelihood functions, and model-based agglomerative clustering, a bottom-up approach that provides initial parameter values for any given number of groups. They also show how to use a GUI tool to generate random samples from the finite mixture models presented in the chapter. The authors show how model-based clustering can be used to estimate probability density functions as finite mixtures and how this applies within a supervised learning framework. Further, they present the Bayes approach to pattern recognition as a fundamental technique, as well as more complicated methods such as neural networks, classification trees, and support vector machines.
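As a small sketch of model-based clustering with a finite Gaussian mixture fitted by EM (using fitgmdist from the Statistics and Machine Learning Toolbox on toy data; the book's own implementation may differ):

X = [randn(100, 2); randn(100, 2) + 3];     % toy two-component data
gm = fitgmdist(X, 2, 'Replicates', 5);      % EM fit of a two-component Gaussian mixture
idx = cluster(gm, X);                       % model-based cluster labels
p = pdf(gm, X);                             % estimated mixture density at each observation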

In Chapter 7, scatterplot smoothing is presented as a bridge between parametric and nonparametric approaches that allows the analyst to search for patterns. The next topic is lowess, and its more general counterpart loess, a locally weighted regression procedure for fitting a curve or surface by smoothing the dependent variable as a function of the independent variable. The chapter then covers smoothing splines for uniformly spaced data and smoothing parameters for bivariate distributions, and ends with polar smoothing.
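A minimal lowess example on simulated data, assuming MATLAB's built-in smoothdata function (R2017a or later); the book's own lowess/loess implementation may differ:

x = linspace(0, 10, 200)';
y = sin(x) + 0.3 * randn(size(x));                      % noisy response
ys = smoothdata(y, 'lowess', 2, 'SamplePoints', x);     % lowess smooth, window of width 2 in x
plot(x, y, '.', x, ys, '-');                            % raw points and smoothed curve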

The authors then focus on visualizing clusters, including the well-established dendrogram or tree diagram, where the user can specify a value along the axis and different clusters or partitions are obtained depending on the value specified. Readers are introduced to some terminology, including node (internal, or terminal, also known as a leaf) and root, and the tree may be vertically or horizontally oriented. Because dendrograms become inefficient for large datasets, treemaps (initially designed to display very large datasets using rectangles organized from largest to smallest) are covered next. Further, rectangle plots, which are similar to treemaps but display the points as glyphs and determine the splits in different ways, and ReClus plots (rectangle clusters, developed by Martinez in 2002) are introduced as ways to view the output of nonhierarchical clustering methods such as k-means and model-based clustering.
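A short sketch of building and cutting a dendrogram, assuming the Statistics and Machine Learning Toolbox (linkage, dendrogram, cluster) and toy data:

X = [randn(30, 2); randn(30, 2) + 5];                     % toy data
Z = linkage(X, 'ward');                                   % agglomerative clustering tree
dendrogram(Z);                                            % tree diagram of the merges
T = cluster(Z, 'Cutoff', 10, 'Criterion', 'distance');    % partition obtained by cutting at a given height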

In Chapter 9, we see various methods for visualizing the shapes of distributions, including histograms and the univariate kernel density estimator, which appeared in Chapter 2, where it was used to visualize the density of univariate datasets. Boxplots are covered next, and violin plots are used to add information to the summary statistics displayed in a boxplot. Also presented is the beeswarm, a one-dimensional plot similar to a stripchart but with closely packed, nonoverlapping points. In addition, several types of quantile-based plots, including probability plots and q-q plots, are covered here.
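A quick sketch of some of these univariate displays on a simulated sample, assuming the Statistics and Machine Learning Toolbox for ksdensity, boxplot, and qqplot:

x = [randn(200, 1); 2 + 0.5 * randn(100, 1)];    % toy sample with two modes
subplot(2, 2, 1); histogram(x);                  % histogram
subplot(2, 2, 2); ksdensity(x);                  % kernel density estimate
subplot(2, 2, 3); boxplot(x);                    % boxplot
subplot(2, 2, 4); qqplot(x);                     % q-q plot against a normal distribution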

Multivariate visualization is presented next, where several methods for visualizing and exploring multivariate data, most suitable for small datasets, are presented. These include glyph plots, whose idea is to represent each observation, with all of its dimensions, by a cartoon face, along with the disadvantages of this approach. Scatterplots and enhanced scatterplots, linking and brushing (which uses a square or rectangle called a brush), coplots, dot charts, Andrews' curves, and biplots (an extension of simple two-variable scatterplots) are among the other topics covered in this chapter.
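As an illustrative sketch on toy data (glyphplot and andrewsplot are Statistics and Machine Learning Toolbox functions, not the book's own routines):

X = randn(20, 4);                           % small multivariate dataset
figure; plotmatrix(X);                      % scatterplot matrix
figure; glyphplot(X, 'glyph', 'face');      % Chernoff-face glyphs, one per observation
figure; andrewsplot(X);                     % Andrews' curves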

The last chapter presents methods for the exploration and visualization of categorical data. Discrete distributions such as the binomial and Poisson are discussed, and the Poissonness plot and its extensions (the leveled Poissonness plot) are covered here. In addition, bar plots of contingency tables, spine plots, mosaic plots, sieve plots (a way to visualize frequencies in two-way contingency tables), and log odds plots are among the last topics included in this chapter.
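A basic Poissonness plot can be sketched directly from its definition, phi(n_k) = log(k! n_k / N), which is approximately linear in k when the counts are Poisson; the sample below is simulated (poissrnd is a Statistics and Machine Learning Toolbox function), and this is not the book's own code:

x = poissrnd(3, 500, 1);                     % toy Poisson(3) sample
k = 0:max(x);
nk = histcounts(x, -0.5:1:max(x) + 0.5);     % frequency n_k of each count k
N = sum(nk);
phi = log(factorial(k) .* nk / N);           % Poissonness metameter (-Inf where n_k = 0)
plot(k, phi, 'o');                           % near-linear pattern supports the Poisson model
xlabel('k'); ylabel('\phi(n_k)');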

This book is intended for a wide audience, including statisticians, computer scientists, and engineers. A wide range of topics is covered, along with MATLAB code, and each chapter ends with a good number of exercises that help to consolidate the material. It is a great resource for students and researchers and is suitable for senior undergraduate or graduate courses in the targeted areas. Although MATLAB is used throughout the book, the algorithms can easily be converted to other platforms.

Morteza Marzjarani
Saginaw Valley State University (retired)
