
A First Course in Data Science

Abstract

Data science is a discipline that provides principles, methodology, and guidelines for the analysis of data to produce tools, value, or insight. Driven by a huge workforce demand, many academic institutions have started to offer degrees in data science, many at the graduate level and a few at the undergraduate level. Curricula may differ across institutions because of varying levels of faculty expertise and the involvement of different disciplines (such as mathematics, computer science, and business) in developing the curriculum. The University of Massachusetts Dartmouth started offering degree programs in data science in Fall 2015, at both the undergraduate and the graduate level. Quite a few articles have been published that deal with graduate data science courses, but far fewer deal with undergraduate ones. Our discussion focuses on undergraduate course structure and function, and specifically on a first course in data science. Our design of this course centers on a concept called the data science life cycle. That is, we view the tasks or steps in the practice of data science as forming a process, consisting of states that indicate how a project comes to life and how different tasks depend on or interact with one another until the birth of a data product or a conclusion. Naturally, the different pieces of the data science life cycle then form individual parts of the course. The details of each piece are filled in with concepts, techniques, and skills that are popular in industry. Consequently, the design of our course is both “principled” and practical. A significant feature of our course philosophy is that, in line with activity theory, the course is based on the use of tools to transform real data to answer strongly motivated questions about those data.

1 Introduction

We discuss our implementation of a first-year undergraduate course in data science, part of a 4-year university-level BS in data science, and we elaborate on what we see as important principles for any beginning undergraduate course in data science. Our principal aim is to stimulate discussion of relevant principles and criteria for a productive introduction to data science.

2 Background on Data Science

The term “data science” was coined by C. F. Jeff Wu in his Carver Professorship lecture at the University of Michigan in 1997 (Wu Citation1997). In this and a subsequent 1998 Mahalanobis Memorial Lecture (Wu Citation1998), Wu advocated the use of “data science” as a modern name for statistics. This was the first time the term “data science” was used in the statistical community. Cleveland (Citation2001) outlined a plan for a “new” discipline, broader than statistics, that he called “data science,” but did not reference Wu’s use of the term. The International Council for Science: Committee on Data for Science and Technology began publication of the Data Science Journal in 2002, and Columbia University began publication of The Journal of Data Science in 2003.

Data science became popular during the last decade with the booming of many major Internet corporations, such as Yahoo, Google, LinkedIn, Facebook, and Amazon, and many startups built from data, such as Palantir, Everstring, the Climate Corporation, and Stitch Fix. Nowadays, “data science,” along with “big data,” has become one of the most frequently used phrases in venues such as business, news, media, social networks, and academia, with “data scientist” becoming one of the most popular job titles (Davenport and Patil Citation2012; Columbus Citation2018).

Despite the fact that data science has become so popular, and we use products enabled by data science on almost a daily basis, there is currently no consensus on the definition of data science. While Wu’s proposal to use the name “data science” adds a modern flavor to traditional statistics, we, along with a majority of working data scientists, consider data science a broader concept than statistics. We view data science as the science of learning from data: a discipline that provides theory, methodology, principles, and guidelines for the analysis of data to produce tools, value, or insight. Here, tools may include those that help the user carry out better analysis, such as tools for visualization, data collection, or exploration, and value refers mainly to commercial or scientific value.

Our view of data science has ingredients from several sources, including traditional statistics—Leo Breiman’s “two cultures” argument about modeling (Breiman Citation2001)—and, in terms of coverage of topics, David Donoho’s “50 years of data science” lecture at Princeton University in 2015 (Donoho Citation2015; see also Donoho Citation2017). In particular, our view of data science includes both the generative and the predictive “culture.” Effectively, this includes machine learning—mostly predictive in nature—as part of data science, thus putting these two subjects of learning from data, namely statistics and machine learning, under a common umbrella. This allows a unified treatment of a wide range of problems, including estimation, regression, classification, ranking, as well as unsupervised (or semi-supervised) learning, under the broad term “modeling” (or analysis). The benefit is immediate: developments and expertise in these two historically separate subjects can inform each other, and many redundant course offerings caused by administrative barriers can be removed. Another crucial element of our view is that one could start with a large amount of data without any particular question in mind, and relevant questions would be figured out while exploring the data. This is what drives the recent surge of interest in data science, given the prevalence of data-generating sources such as the Internet and mobile and portable devices, and the increasing feasibility of collecting large amounts of data. A third point is that data science should also include an interface layer that interacts with domain knowledge or the business aspect, as well as the algorithms or techniques that deal with implementation, that is, the computer science aspect. So, in our view, data science is an interdisciplinary subject that encompasses the traditional regimes of statistics and machine learning, business or domain sciences, and computer science.

3 Introductory Undergraduate Data Science Courses

Driven by a huge demand for data science (Manyika et al. Citation2011; PwC Citation2015; Columbus Citation2017), many academic institutions have started offering degrees in data science, many at the graduate and a few at the undergraduate level (see, e.g., National Academies of Sciences, Engineering and Medicine Consensus Report Citation2018). Curricula may differ across institutions, possibly because there is still no consensus on the definition of data science. At the University of Massachusetts Dartmouth, we started offering a BS and an MS in data science in Fall 2015. Quite a few articles have been published that discuss data science courses (e.g., Tishkovskaya and Lancaster Citation2012; Baumer Citation2015; Escobedo-Land and Kim Citation2015; Hardin et al. Citation2015; Horton, Baumer, and Wickham Citation2015). Our discussion here is about undergraduate data science and, more specifically, a first course in data science (labeled “DSC101” at the University of Massachusetts Dartmouth). Such a course gives an overview of and brief introduction to the concepts and practices of data science, and serves three goals.

  • It introduces students to the notion that data entails value, thus helping motivate them to study data science.

  • It provides students with a big picture and basic concepts of data science, as well as the main ingredients of data science.

  • Students will learn some practical techniques and tools that they can apply later in more advanced courses or when they start work after their degree program.

Our curriculum design centers on the data science life cycle and is not simply a loose collection of topics in data science; it is based on a process model. The idea is that we view the individual steps or tasks in data science as forming a process in which some steps may depend on or interact with others, or may be repeated as more insights are gained along the way, until a conclusion is reached or a data product is born.1 A brief introduction to each piece of the process then forms an individual part of DSC101, with details to be covered in more specialized or advanced courses. The design of a data science course could also be based on case studies. There are courses in statistics designed with this approach (e.g., Nolan and Speed Citation2000). However, we have not seen many data science courses designed this way; exceptions are Hardin et al. (Citation2015) and Nolan and Temple Lang (Citation2015). A case-study-based approach would require a careful selection of cases, each emphasizing a different aspect of data science, so as to ensure the course’s coverage of data science topics; this is far from easy and requires regular updating. Other alternative course structures include the Berkeley Data 8 “Foundations of Data Science” course (see data8.org).

Another feature that distinguishes our DSC101 from similar courses is its practical flavor. Apart from its traditional statistical rigor, DSC101 also has a strong industry flavor: it emphasizes practical aspects, and many examples are taken from applications in industry; the idea is to provide students with authentic data experiences (Grimshaw Citation2015). The first author has prior data science experience in industry, and in designing this course we used examples from data science in industry and carried out some reverse engineering to decide which topics, projects, and other components to include so that students can gain experience with the practical demands of industry. For example, we chose R as the programing language for this course because of its increasing popularity in industry. Similarly, given that a data scientist typically spends about 60–70% of their daily work preprocessing data, including collecting, cleaning, and transforming it, we have a project that requires students to collect and process unstructured auto sales data from the web, and students are encouraged to use Python for this purpose.
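The kind of preprocessing this project calls for can be sketched in R, the course language (students may equally use Python). The ad strings and field names below are hypothetical, purely for illustration:

```r
# Hypothetical raw ad text, turned into a structured table with
# base-R regular expressions (the kind of cleaning the lab requires).
ads <- c("2015 Honda Civic - $18,500", "2019 Toyota Camry - $24,900")

# Extract the leading four-digit model year.
year <- as.integer(sub("^([0-9]{4}).*", "\\1", ads))

# Extract the price after "$" and strip the thousands separator.
price <- as.numeric(gsub(",", "", sub(".*\\$([0-9,]+).*", "\\1", ads)))

data.frame(year = year, price = price)
```

In a real lab the raw text would come from scraped web pages and be far messier; the point is that a few regular expressions already take the data from unstructured strings to an analyzable data frame.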

The remainder of this paper is structured as follows. First we present two examples of data science applications to motivate the concept of the data science life cycle in Section 4. This is followed by a discussion of the theoretical basis for student activity in Section 5. Then we discuss philosophies of the course design in Section 6. This is followed by an introduction in Section 7 of individual pieces in the data science life cycle, namely, the generation of questions, data collection, various topics in exploratory data analysis (EDA), and then linear regression and hypothesis testing. Finally, we conclude with remarks.

4 The Data Science Process and Life Cycle

As stated earlier, our design of DSC101 centers on the data science life cycle. In this section, we explain the data science life cycle in detail through two examples. One is a large-scale study untangling the relationship among smoking, low birthweight, and infant mortality. The second is about how an e-commerce website may use historical transaction records to build an item recommendation engine. As will become clear shortly, these represent two different modes in which a data product can be built and, correspondingly, two different paths in the data science life cycle.

The first example is from a noted study—the Child Health and Development Studies, carried out by Yerushalmy (Citation1964, Citation1971) in the 1960s—on how a mother’s smoking, low birthweight of infants, and infant mortality are related (Fig. 1). Several prior studies, for example, Simpson (Citation1957), suggested a much greater proportion of low birthweights (i.e., < 2500 g for newborns in the US) among smoking mothers than among nonsmokers. Meanwhile, low birthweight was a strong predictor of infant mortality. Is smoking related to infant mortality? Data were collected for all pregnancies (about 10,000 cases before 1964, later increased to about 15,000) between 1960 and 1967 among women in the Kaiser Foundation Health Plan in Oakland, California (Nolan and Speed Citation2000). The data include the baby’s length, weight, and head circumference; the length of pregnancy; whether the baby is first-born; the age, height, weight, education, and smoking status of the mother; as well as similar information about the father. Yerushalmy’s (Citation1964) study confirmed prior claims of a greater proportion of low-weight births but found no higher mortality rate for smoking mothers. Yerushalmy collected more data, for about 13,000 pregnancies, and refined his research to focus on low birthweight infants. This led to the unexpected finding that, among low birthweight infants, those born to smoking mothers actually survived considerably better than the rest. A later study (Wilcox Citation2001), directed by Allen Wilcox on a much larger dataset of about 260,000 births in the state of Missouri (1980–1984), resolved the low birthweight paradox, finding that infant mortality was primarily driven by other factors, such as preterm birth. Wilcox writes:

Fig. 1 Smoking, low birthweight, and infant mortality. The link between nodes indicates association instead of causation.


“the mortality difference must be due either to a difference in small pre-term births or to differences in weight-specific mortality that are independent of birthweight. This demonstrates the central importance of pre-term delivery in infant mortality, and the unimportance of birthweight” (Wilcox Citation2001, p. 1239).

The second example is about item recommendation on an e-commerce website. An e-commerce vendor typically collects traces of every “mouse click” when a user visits its site, including the items a user clicks, views, or purchases. Such data are often called clickstream data, and they contain fairly rich information about users’ purchase behavior: for example, the most popular items, items a user typically buys together (called “co-bought” items), and geographical patterns in users’ purchase behavior. Such user behavior profiles can be used to recommend selected items to a user, or to select appropriate content to show the user when they enter a new page. This is called item recommendation or personalization. For example, in Fig. 2, a user has clicked on a Nikon camera. The co-bought statistics from historical data, taking into account item prices, can be used to decide which items to display so as to generate the most user clicks or the most profit for the vendor.
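As a rough illustration (not the vendor’s actual algorithm), co-bought statistics can be derived from transaction baskets by simple counting; the baskets below are made up:

```r
# Toy transaction baskets; each element is one purchase basket.
baskets <- list(
  c("camera", "sd_card", "bag"),
  c("camera", "sd_card"),
  c("camera", "tripod")
)

# For a target item, count how often every other item co-occurs with it.
co_bought <- function(baskets, target) {
  others <- unlist(lapply(baskets, function(b) {
    if (target %in% b) setdiff(b, target) else character(0)
  }))
  sort(table(others), decreasing = TRUE)
}

co_bought(baskets, "camera")  # sd_card co-occurs twice; bag and tripod once each
```

Ranking these counts, optionally weighted by price or margin, gives a first cut at which items to display.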

Fig. 2 Items recommended when a camera is clicked. Courtesy walmart.com.


The first example describes the path taken by traditional statistical analysis: one starts with a question in mind, then collects data, followed by data analysis, more data, a refined question, and then a conclusion. The second example describes an alternative path, where a large amount of data has been collected (e.g., as a by-product of normal business operations) but it is not clear what to do with it, so one needs to come up with a relevant question (such as “what behaviors predict purchases?”) through some preliminary analysis of the data, and then conduct data analysis until reaching a conclusion or outcome. One thing the two examples have in common is that both consist of the same set of tasks: data collection (including data cleaning and preprocessing), questions, analysis, and outcome (a conclusion, a model, or a data product). Data analysis can be either EDA, in which one explores the data and constructs hypotheses, or confirmatory data analysis (CDA), in which one tests prespecified hypotheses via a model on variables of interest. One reaches a conclusion or outcome by either EDA or CDA. Note that some steps may be repeated multiple times. Among these tasks there is a dependency: some tasks start only upon the completion of preceding ones. Each data science application has a start, is followed by a series of tasks, and finishes with an end. We use a concept called a process to describe this, in analogy with the process concept used in computer operating systems. Interdependent tasks are linked by a (directed) arrow—the task pointed to by the arrow starts only when the task at the source of the arrow completes. Putting these together, we arrive at a directed graph. This is the data science life cycle, similar to the software life cycle (Langer Citation2012) used in software engineering.
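The directed-graph view described above can be encoded directly; the sketch below uses our own task labels and a simple adjacency list, purely for illustration:

```r
# Task dependencies as an adjacency list: an arrow from task A to task B
# means B can start only after A completes (analysis may loop back to
# refine the question).
life_cycle <- list(
  question   = c("collection"),
  collection = c("analysis"),
  analysis   = c("question", "outcome")
)

# All tasks reachable from a starting task, following the arrows.
reachable <- function(graph, start) {
  seen <- character(0)
  frontier <- start
  while (length(frontier) > 0) {
    node <- frontier[1]
    frontier <- frontier[-1]
    if (!(node %in% seen)) {
      seen <- c(seen, node)
      frontier <- c(frontier, graph[[node]])  # NULL for terminal tasks
    }
  }
  seen
}

reachable(life_cycle, "question")  # every task is reachable from "question"
```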

Fig. 3 shows our proposed diagram for the data science life cycle. “Data & Q” indicates a state in the data science life cycle at which one has collected the data and formulated a study question.

Fig. 3 The data science life cycle.


Similar models or diagrams have been proposed for data science during the last few years, for example, Schutt and O’Neil’s data science process diagram (O’Neil and Schutt Citation2013), Philip Guo’s data science workflow (Guo Citation2012), the PPDAC cycle (Wild and Pfannkuch Citation1999), and the Wickham–Grolemund data science cycle (Wickham and Grolemund Citation2016). These are illustrated in Fig. 4. However, there are major differences from our model. Schutt and O’Neil’s diagram focuses on the data and describes intermediate stages in the building of a data product, so it is essentially a data cycle. Guo’s workflow model describes the dependency of various tasks in a data science project setting; it includes many details and may not be general enough. The PPDAC cycle is the closest to ours, in the sense that it constitutes one possible path in our diagram. The Wickham–Grolemund data science cycle takes a data-centered approach and forms a cycle by including various operations on the data. Our model focuses on the tasks in data science, and allows interaction between tasks and their repetition, as well as the possibility of having a clearly defined question in mind at the start.

Fig. 4 Alternative diagrams related to the analysis of data. (a) The data science process diagram of Schutt and O’Neil; (b) the data science workflow of Guo; (c) the PPDAC cycle; (d) the Wickham–Grolemund data science cycle.


5 Theoretical Basis for Student Activity

Semester-long undergraduate courses are designed with specific aims and learning outcomes in mind, and college or university administrators require these to be explicitly articulated. Additionally, instructors bring with them a theoretical stance on how a course will be executed over a semester. This theoretical aspect of course design and implementation is rarely explicitly articulated (sometimes not even by instructors to themselves!). A clearly articulated theoretical basis for the design of an introductory data science course is of great importance because it sets the scene for how the course unfolds throughout a semester. Of the many differing educational theories that might productively apply to designing data science courses, activity theory (Leontiev Citation1978; Davydov, Zinchenko, and Talyzina Citation1983; Raeithel Citation1991) provides a coherent and productive foundation, and we discuss how aspects of activity theory interact with the data science life cycle.

As this is only a first course in data science, usually offered at the beginning of the first year, students typically have not acquired a strong background in calculus or statistics, so we need to begin by working with the intellectual tools they do have and then introduce them to new analytical and computational tools. At the beginning we focus mainly on EDA and on concepts related to the various parts of the data science life cycle. This includes an introduction to concepts and tools such as sampling, descriptive and summary statistics, and data visualization and graphical tools, moving on progressively to tools such as principal component analysis, clustering, linear regression, and hypothesis testing.

A major point is the following: the tools are introduced to enable students to effectively transform raw data into something more useful. The focus is on the raw data, the motivation to transform them—the objective—and the tools used to effect those transformations. This is the opposite of a scenario in which techniques of data analysis are taught with artificially designed and relatively simple toy data (i.e., students practice tool use in the absence of appropriate or realistic data). Becoming a useful and skillful data scientist requires addressing the full complexity of data, and finding appropriate tools to effect insightful transformations of those data. This is the central reason why activity theory drives so much of our thinking in course design for data science: from an activity theory perspective, the context of data science for these beginning students is the raw data, questions posed about those data, agreed objectives, and transformation of the data through activity, utilizing analytical tools, in a cyclic process. From this perspective, introductory data science is contextualized for the students as a meaningful, empowering process. In many academic courses, students practice on toy datasets to complete homework exercises and study for an examination to get a satisfactory grade. The reality of the context makes the DSC101 course quite different.

Specific curriculum instances of activity theory are often described in terms of an “activity triangle” (see, e.g., Engestrom Citation1991, Citation1999, Citation2000; Price, De Leone, and Lasry Citation2010). Typically, these activity triangles have a structure as illustrated in Fig. 5.

Fig. 5 A generic activity triangle.


An activity triangle encapsulates the interrelationship between the main constituents of a curriculum activity as conceived by activity theory.

As a specific example, consider the traffic data example (see Section 7.1 for more details), which consists of the starting point and destination of each trip and a timestamp at each road during the trip. The activity starts with raw material: real traffic data.

The community is the class of students and the instructor, but may also include an audience, other than the instructor, for whom the students are to build a data product such as a predictive model, and write a report. For example, the traffic data may well have come from someone who wants to know certain things about the data, so in this case students write reports for that person, who is also part of the community.

The subject or subjects consist of an individual student or small groups of students working together to produce an outcome, typically a written report of an analysis or a data product.

The tools are usually the software tools, such as the R programing language, and conceptual tools, such as regression or clustering techniques, that students can bring to bear on the objective.

The object, or objective, is determined through discussion by students, the instructor and any external client, and in the case of the traffic data example may involve such things as determining traffic bottlenecks at particular times of day.

The rules vary from activity to activity, and may include such general things as avoidance of plagiarism, appropriate referencing of sources, cooperation within and between student teams, sharing of findings, ethical behavior, and responsibility for meeting deadlines.

Division of labor can work in several ways including different students within a team taking charge of different aspects of analysis, or different teams focusing on different aspects of an objective with the aim of pooling findings.

The same dataset may—and usually does—generate a number of different activities and objectives as students ask further questions about the data, and set out to examine their determined objectives. When this happens the objective will change, the subjects may change in that students may form new groups, spontaneously or at the instructor’s direction, the division of labor may change, and the tools will most likely need to be modified and new tools brought to bear on achieving the objective.

An activity triangle, as realized in a specific curriculum module, is coordinated with the data science life cycle. Although many variations are possible from activity to activity, as described above, it is common that certain aspects of an activity triangle stay fixed throughout a semester: typically the subjects are the students; the community is the class of students and the instructor; the rules are articulated at the beginning of semester and stay more or less fixed; and commonly the division of labor, either within or between groups, stays much the same. The data science life cycle impacts the activity triangle, and vice versa, from question to objective, analysis to tools, and outcome to conclusion.

As students are engaged in a specific project—some of which are detailed below—and cycle through the data science life cycle, a new activity triangle emerges in which new questions inform new objectives, new analyses require new tools, and new outcomes provide new conclusions. Thus one sees a dynamic sequence of activity triangles as progress on a project involves cycling through the data science life cycle. The activity triangles inform the data science life cycle in that they describe how the various aspects of the data science life cycle are implemented through activity.

We focus on the practice and craft of data science—part of what it means to become apprenticed as a beginning data scientist. This does not mean, however, that something akin to Lave’s situated action model (Lave Citation1988; Lave and Wenger Citation1991), in which one learns by self-directed, novice participation in a communal activity, provides a better theoretical model for designing a data science course than does activity theory. The essential feature of activity theory that is helpful in this regard is that an object comes before an activity based on that object, and motivates the activity (Nardi Citation1996). While learning to become a data scientist through behaving as if one were an apprentice, thrown into an ongoing field of activity, can be a positive and highly educative experience—and is the motivation for many student internships—our focus in beginning data science courses is on activity motivated by student desire to transform data, using tools they have at hand, or are capable of developing: this constitutes an “object” (or “objective”) in activity theory. Data is transformed through activity that relates to an objective usually coming from a naturally arising question about the data.

6 Course Design

Activity theory helps us focus on two aspects of DSC101 that are important to its success. The first is that the data are real-world data for which a question—sometimes rather vague—is naturally proposed. For example (see also Section 7.1), given a collection of traffic data for many trips, including the start and end point of each trip as well as a timestamp at each road during the trip, what questions could students ask that have the potential of leading to a data product? This aspect of the DSC101 course is important in focusing students on an end product of their studies in data science: a rewarding and satisfying career. From the beginning, students in DSC101 gain a lived experience of what constitutes both the practical and conceptual aspects of the working life of a data scientist. The second aspect of DSC101 highlighted by an activity theory perspective is empowerment: the extent to which the activities, and the tools used in those activities, actually empower students to do something satisfying. Students in DSC101 should never have to complain: “When will we ever use this?” The answer is obvious from the nature of their activity in attempting to answer questions about real-world data with tools provided to them, or built by them.

The design of our DSC101 course centers on the data science life cycle and the activities it involves. The course starts with an introductory lecture on data science with two goals in mind. One is to give students a sense that data entails value; the other, that it is possible to make a difference, to influence outcomes, by leveraging value from data. We introduce numerous interesting stories from a variety of fields, ranging from science, finance, metrology, and sports to the Internet and e-commerce, on how insights can be obtained from data through models and analytical tools. Of course, these stories also convey to students an idea of what constitutes data science, and how their activity on raw data, with specific objectives, can transform that raw data into insightful outcomes through the use of appropriate tools. The data science life cycle is then introduced, followed by the various parts of the cycle, including asking interesting questions of data, data collection, EDA, modeling, and CDA.

To reflect the practical aspect of this course, and also because of its growing popularity in the data science community, we dedicate two weeks of lectures to R programing (Verzani Citation2008), which is the programing language used for instruction and student projects. There are many alternatives to R, but its free and open-source nature, together with a very large and diverse user community, makes a relatively compelling case for including R as a basic programing language and data analysis tool. By being inducted into the R ecosystem, students are exposed to a huge network of open data-analytic resources and tools while learning the basics of R programing: it is not simply a useful and widely used tool they learn; it is also a huge and diverse community of potential support. People who use R, write R packages, and provide instruction in and support for R come from widely diverse backgrounds, exposing beginning data science students to a vision of data science that cuts across numerous disciplines.

7 Topics Covered in the Course

As described earlier, the topics covered in our DSC101 are the individual parts of our data science life cycle. In particular, Section 7.1 corresponds to “question,” Section 7.3 to “data,” and Sections 7.4–7.6 to the “analysis” part of the data science life cycle. In this section, we describe each of these topics in detail.

7.1 Asking Interesting Questions

Asking informed questions, from data or given evidence, is one of the most crucial parts of the traditional sciences: it forms the start of a scientific investigation. It is also one of the primary driving forces behind the recent explosive growth in data science applications. Imagine that an e-commerce vendor has collected a huge amount of user access data; what new business models can it generate? If a search engine has collected a large collection of searched keywords, how could such data be utilized? It is possible to use such data to optimize the selection of advertisements and their placement on a page, or even to improve the design of the search engine itself.

To paraphrase Browne and Keeley (Citation2007, p. 3) in the context of a DSC101 course: questions about data require the person asking the question to act in response. By our questions, we are saying: I am curious; I want to know more. The questions exist to inform and provide direction—an objective—for all who hear them. The point of questions is that one needs help and focus in obtaining a deeper understanding and appreciation of what might be in the data. To inspire students to think, to appreciate the value of data, and to ask good questions, students are encouraged to ask questions about any data to which they may have access. As an example, in-class groups are formed among students to discuss what one could potentially do with large traffic data.

Suppose one is given the traffic data of a city. The data comprise about 30 million vehicle records, each consisting of the starting point and destination of a trip, and a time stamp for each road traversed during the trip. The same car may have multiple entries in the records. There are two cases: the license plate is either known or unknown. What can one do with such data?

7.2 Details of R Programing

R is chosen as the programing language for the course, in recognition of the growing importance of R programing in data science as well as its great utility in modeling (modeling is offered as a senior-level undergraduate data science course at the University of Massachusetts Dartmouth). The topics covered fall into three parts.

  • The first is programing language features. This includes data structures such as lists, vectors, arrays and matrices, and data frames; structured programing constructs such as loops, conditional statements, and functions; data and text manipulation tools (including regular expressions); file I/O (including Excel spreadsheets); etc.

  • The second is the statistical aspect of R, which covers R functions for generating data from various distributions and R functions for statistical tests.

  • The third is R functions for graphics and visualization. As this is an elementary course in data science, only R functions and simple graphical tools related to basic plotting functionality are discussed.

To sharpen the programing skills of the students, very simple algorithms related to searching and text manipulation are introduced. Programing exercises are assigned as labs, and programing questions, such as analyzing the program output and implementing a simple function, are included in the exams. Sample R code is provided for most of the examples, so that students can try R programing on their own and gain hands-on experience.
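To give a flavor of the level of these exercises, a simple search routine and a small text-manipulation task might look as follows. The course itself uses R; the sketch below is in Python (which, as noted later, students may also use for projects), and the function names and price format are our own illustrative choices rather than actual course material.

```python
import re

def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if absent."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        elif sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

def extract_prices(text):
    """Pull dollar amounts such as $12,995 out of free text via a regular expression."""
    return [float(m.replace(",", "")) for m in re.findall(r"\$(\d{1,3}(?:,\d{3})*)", text)]
```

For example, `binary_search([2, 5, 8, 13], 8)` returns `2`, and `extract_prices("listed at $12,995, down from $14,500")` returns `[12995.0, 14500.0]`.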

7.3 Sampling and Data Collection

Data collection is an important aspect of data science. In DSC101, the idea of random sampling and sampling techniques such as simple random sampling and stratified sampling are introduced. To better appreciate the idea of random sampling, several types of misuses of sampling are discussed, including sampling from the wrong population, convenience sampling, judgment sampling, data cherry-picking, self-selection, and anecdotal examples. Each of these is discussed with a story, selected from the news or from the instructor’s experience. Before a formal analysis of each story, time is allocated for students to think and to form group discussions to see if there is anything potentially wrong in the story. Students are also encouraged to share their own examples. As a practice, students are assigned a lab to collect auto sales data, including sales prices and ages for their favorite car model, and to judge whether their data collection suffers from any sampling bias. Such learning-by-doing practice may improve students’ interest in the course.
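To illustrate why stratified sampling can pay off, the following sketch (in Python rather than R, with hypothetical two-stratum auto-price data of our own construction) repeats each sampling scheme 100 times and compares the spread of the resulting sample means:

```python
import random
from statistics import mean, pstdev

random.seed(7)

# Hypothetical heterogeneous population: two strata with very different means
# (e.g., sale prices of an older and a newer model year of the same car).
strata = {"older": [random.gauss(5000, 500) for _ in range(300)],
          "newer": [random.gauss(20000, 500) for _ in range(700)]}
population = strata["older"] + strata["newer"]

def srs_mean(n):
    """Simple random sample of size n; return the sample mean."""
    return mean(random.sample(population, n))

def stratified_mean(n):
    """Proportional stratified sample of size n; return the sample mean."""
    picks = []
    for values in strata.values():
        k = round(n * len(values) / len(population))
        picks += random.sample(values, k)
    return mean(picks)

# Repeat each scheme 100 times and compare the variation in the sample means.
srs_means = [srs_mean(50) for _ in range(100)]
strat_means = [stratified_mean(50) for _ in range(100)]
print(pstdev(srs_means), pstdev(strat_means))
```

Because stratification removes the between-stratum component of the variance, the spread of the stratified sample means comes out markedly smaller, which is exactly the comparison students carry out in the third lab.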

7.4 Exploratory Data Analysis

EDA was pioneered by Tukey in the 1960s (Tukey Citation1977). It refers to the various things one would try before a formal and often complicated data analysis, and is therefore often viewed as a preliminary data analysis. It is typically applied when one wishes to know more about the application domain, and EDA often helps one gain a better sense of what the data look like, which may be suggestive in the choice of a model or data transformation. Similarly, when one has data but does not have a well-defined question, exploring the data to discover patterns or regularities may inspire interesting questions. Of course, sometimes EDA may be sufficient if the question of interest is rather simple or the underlying pattern is salient enough. Common tasks in EDA include the following: descriptive and summary statistics, graphical visualization, data transformations, clustering, etc. We discuss each of these in the following.

7.4.1 Descriptive and Summary Statistics

Descriptive and summary statistics are very helpful in data analysis. From such statistics, one can often get a ballpark idea of the data distribution. They are also useful in presenting data or communicating results to other people, especially when graphical visualization is not possible. Three types of descriptive or summary statistics are introduced in DSC101. The first comprises measures of location in the distribution, including the mean, median, mode, and the more general quantiles and percentiles. The second comprises measures of dispersion, including the variance and standard deviation. The third concerns the shape of the data distribution: a measure of asymmetry of the data, skewness, and a measure of the peakedness of the data, kurtosis.
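All three types of statistics can be computed in a few lines. The sketch below (in Python; in class the corresponding R functions are used) computes location, dispersion, and shape measures for a small made-up dataset containing one outlier:

```python
from statistics import mean, median, pstdev, quantiles

data = [2, 3, 3, 4, 5, 5, 5, 6, 7, 30]  # note the outlier at 30

# Measures of location.
m = mean(data)
md = median(data)
q1, q2, q3 = quantiles(data, n=4)   # quartiles (q2 coincides with the median)

# Measure of dispersion: population standard deviation.
s = pstdev(data)

# Shape: skewness and excess kurtosis from standardized central moments.
n = len(data)
skewness = sum((x - m) ** 3 for x in data) / n / s ** 3
kurtosis = sum((x - m) ** 4 for x in data) / n / s ** 4 - 3
```

Here the single outlier pulls the mean (7.0) well above the median (5.0) and makes the skewness strongly positive, which is exactly the kind of ballpark reading of a distribution these statistics are meant to provide.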

7.4.2 Graphics and Data Visualization

Data visualization is an important part of EDA, and also a useful tool for communicating results. It is being used more and more in the practice of data science, for example, one may see plots or charts in almost every issue of the New York Times, and the Guardian newspaper, in its various country and international editions.

This part of the course starts with guidelines, or rules of thumb, for a useful visualization. Note that our focus is the visualization of data rather than of abstract concepts (Yan and Davis Citation2018); here one seeks to understand the data, or the information behind it, by displaying aspects of the data. Then a collection of graphical tools is introduced, including basic tools such as bar, pie, and Pareto charts and their stacked or grouped versions; statistical graphing tools such as histograms, boxplots, and stem-and-leaf plots; as well as tools suitable for the visualization of multivariate data. Some interesting datasets are used in introducing the graphical tools, for example, US crime and arrest data, and the US statewide mean January temperature for a given year alongside the mean over the last century. Students use the tools and example R code to visualize the data, then share what they observe from the graphs or other visualizations, and give interpretations. To better appreciate the effect of graphical visualization (Nolan and Perrett Citation2016), in-class discussion groups are formed in which students are given a dataset, such as a multiway contingency table, and tasked with designing their own visualization; designs from different groups are then compared. This is a good opportunity for students to apply what they learn with creativity, and it greatly stimulates students’ interest in the course. Indeed, quite a few students view this as the best part of the course.

For the visualization of multivariate data, tools such as bubble plots, Chernoff faces (Chernoff Citation1973), and radial plots are introduced. In particular, students find Chernoff faces interesting and intuitive, and this helps them gain insights, for example, into US crime or political ideology by state. Principal component analysis is another tool introduced for visualizing multivariate data and for dimension reduction.

7.4.3 Data Transformation and Feature Engineering

Feature engineering refers to the creation of new features from the data, or the combining or transforming of existing features into new ones that suitably represent or reveal interesting structures or patterns in the data. It is a task to which data scientists typically dedicate a major portion of their time, and it is crucial to the success of many modeling tasks: better features often lead to better results, more flexibility, and better interpretation of the results. While the world has been excited about the success of an emerging machine learning paradigm, deep learning (Hinton and Salakhutdinov Citation2006; LeCun, Bengio, and Hinton Citation2015), at the automatic discovery of useful features from data, applications beyond image, speech, and natural language processing still rely heavily on feature engineering. As students in DSC101 are unlikely to have any prior data science experience, we only introduce the concept of feature engineering and focus on the easiest part—data transformation. Data transformation is needed when different features have drastically different numerical scales, or when the underlying pattern or regularity in the data becomes more salient after transformation. Topics discussed include Tukey’s idea of “straightening the plot” (an idea that guides data transformation from human perception) (Tukey Citation1977) and the Box–Cox power transformation (Box and Cox Citation1964). Several transformations frequently used in practice are discussed. These include logarithmic and square root transformations, standardization of the data to mean 0 and variance 1, linear scaling of the data to a range [a, b], and nonlinear bucketing of the data (e.g., assigning the value 1 to incomes lower than 20,000, 2 to the range [20,000, 50,000), and so on).
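The transformations named above each take only a line or two. The following sketch (in Python, with hypothetical income data of our own; the bucket thresholds follow the example in the text) illustrates the logarithmic transform, standardization, linear scaling to [a, b], and nonlinear bucketing:

```python
import math

incomes = [12000, 18000, 35000, 52000, 250000]  # hypothetical, long right tail

# Logarithmic transform: compresses the long right tail.
# (The Box-Cox family (x**lam - 1)/lam generalizes this; it reduces to log(x) as lam -> 0.)
log_inc = [math.log(x) for x in incomes]

# Standardization to mean 0 and variance 1.
m = sum(incomes) / len(incomes)
s = (sum((x - m) ** 2 for x in incomes) / len(incomes)) ** 0.5
standardized = [(x - m) / s for x in incomes]

# Linear scaling to a target range [a, b].
def rescale(xs, a, b):
    lo, hi = min(xs), max(xs)
    return [a + (x - lo) * (b - a) / (hi - lo) for x in xs]

# Nonlinear bucketing: 1 below 20,000; 2 for [20,000, 50,000); 3 above.
def bucket(x):
    return 1 if x < 20000 else (2 if x < 50000 else 3)

buckets = [bucket(x) for x in incomes]
```

On this toy data, `rescale(incomes, 0, 1)` maps the smallest income to 0 and the largest to 1, and `buckets` comes out as `[1, 1, 2, 3, 3]`.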

7.4.4 Clustering

In practice, data are often heterogeneous, due possibly to spatial or temporal effects, or to differences in other characteristics (e.g., males and females often have very different lifestyles or shopping behaviors). Heterogeneity is especially common for big data. It is often desirable to divide the data so that data in the same subgroup are of a similar nature. One way to achieve this is via clustering. Three classical clustering algorithms are introduced: agglomerative, divisive, and K-means clustering (Aggarwal and Reddy Citation2013). The ideas behind the algorithms and their important properties are discussed. More advanced and modern clustering methods such as model-based clustering (Fraley and Raftery Citation2002), spectral clustering (von Luxburg Citation2007), cluster ensembles (Strehl and Ghosh Citation2003; Yan, Chen, and Jordan Citation2013), etc. are not discussed in lecture but may be used for course projects by students with adequate preparation in calculus and linear algebra.
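To make the K-means idea concrete, a bare-bones one-dimensional version (our own illustrative Python implementation; in class, R's built-in clustering functions are used) alternates between assigning each point to its nearest center and recomputing each center as its cluster mean:

```python
import random
from statistics import mean

def kmeans_1d(data, k, iters=20, seed=0):
    """A bare-bones K-means for one-dimensional data; returns sorted centers."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)       # initialize centers at k random points
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for x in data:
            idx = min(range(k), key=lambda i: abs(x - centers[i]))
            clusters[idx].append(x)
        # Update step: recompute each center as its cluster mean.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return sorted(centers)

# Two well-separated groups; the fitted centers land at their means, 2 and 10.
data = [1.0, 1.5, 2.0, 2.5, 3.0, 9.0, 9.5, 10.0, 10.5, 11.0]
centers = kmeans_1d(data, 2)
```

On this toy data the algorithm converges within a few iterations to centers 2.0 and 10.0 regardless of the random initialization, which lets students see the assignment/update loop at work before moving to multidimensional data in R.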

7.5 Simple Modeling With Linear Regression

Simple linear regression is introduced both as a continuation of visualization, in the sense that the regression line is the line that is “close” to most of the data points, and as a way to summarize data with a simple function. This leads to the concept of modeling. Example models are given that students are likely to have encountered in their high school texts or other courses. For a better appreciation of the concept, students are asked to give their own examples of models. Simple linear regression is formulated as a least squares optimization problem, and the concept of R2 is introduced as an indicator of the amount of variance explained by the model. The term regression is discussed using the classical father–son height data. Simple linear regression is naturally extended to multiple regression, using the auto mileage per gallon (MPG) data from the UC Irvine Machine Learning Repository. Before discussing this example, students are asked to guess which factors are important to the gas mileage of a car; after seeing the regression analysis results, students better appreciate the value of data analysis. Relevant R functions for linear regression are introduced, along with discussion of how to read the regression output. Depending on the preparation of the students, the discussion of multiple linear regression can be extended further, in line with the revised American Statistical Association guidelines (2016).
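The least-squares formulation admits a simple closed form for one predictor. The sketch below (in Python; the weight/MPG pairs are made-up values merely echoing the flavor of the UCI auto MPG data, and `fit_line` is our own illustrative helper) computes the fitted intercept, slope, and R2:

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, r_squared)."""
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b = sxy / sxx                 # slope
    a = ybar - b * xbar           # intercept
    # R^2: fraction of the variance in y explained by the fitted line.
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - ybar) ** 2 for y in ys)
    return a, b, 1 - ss_res / ss_tot

# Hypothetical car weights (in 1000 lb) and gas mileage (MPG).
weights = [2.0, 2.5, 3.0, 3.5, 4.0]
mpg = [34.0, 30.5, 26.0, 22.5, 18.0]
a, b, r2 = fit_line(weights, mpg)
```

Here the fitted slope is about -8 MPG per 1000 lb, matching students' usual guess that heavier cars get worse mileage; reading off the slope, intercept, and R2 mirrors reading the corresponding entries of R's regression output in class.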

7.6 Confirmatory Data Analysis and Hypothesis Testing

In the CDA part of DSC101, the statistical framework of hypothesis testing is introduced. There has been much controversy over the use of p-values in recent years (see, e.g., Cumming Citation2013), but they are still widely used in industry; for example, many vendors use A/B testing2 and p-values to compare alternative models or strategies. The concept of hypothesis testing is often challenging to students, as it represents a different way of reasoning from logical deduction, with which they are likely more familiar. To help students, two analogies are introduced and analyzed, one being a court trial and the other proof by contradiction; this greatly helps students’ understanding. An example from industry is used to explain why hypothesis testing is useful, e.g., an A/B test deciding whether a new strategy or model does better than an existing one. Several students expressed the view that they liked this part of the course, as it proves surprisingly useful for many real-world problems.
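One simple way to convey the logic of an A/B test without distributional formulas is a permutation test: if the two designs were equivalent, shuffling the group labels should produce differences as large as the observed one fairly often. The sketch below (in Python, with hypothetical per-user conversion data; a vendor's actual test might instead use a two-proportion z-test) estimates a two-sided p-value this way:

```python
import random
from statistics import mean

def permutation_pvalue(a, b, n_perm=2000, seed=1):
    """Two-sided permutation test for a difference in means between groups a and b."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                      # relabel users at random
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(mean(perm_a) - mean(perm_b)) >= observed:
            count += 1
    return count / n_perm

# Hypothetical A/B test: per-user conversions (1 = converted) under two page designs.
design_a = [1] * 30 + [0] * 70   # 30% conversion rate
design_b = [1] * 50 + [0] * 50   # 50% conversion rate
p = permutation_pvalue(design_a, design_b)
```

With a 20-percentage-point gap on 100 users per arm, the estimated p-value comes out well below 0.05, so the new design would be declared significantly better; the court-trial analogy maps onto the same logic, with the null hypothesis playing the role of the presumption of innocence.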

7.6.1 Difference From a Statistics Course at Similar Level

As can be seen, a big part of the course overlaps with a typical statistics course at a similar level. We attribute this to the intimate relationship between data science and statistics; we would not expect a data science course to be very different from a statistics course. That said, compared with related statistics courses at institutions with which the authors are familiar (there is no similar statistics course at our institution), there are several major differences, apart from topics apparently missing from those statistics courses (i.e., biases in sampling, feature engineering, visualization of multivariate data, PCA, and clustering). Similar statistics courses are not structured around the (data science) life cycle, whereas the main theme of our course is leveraging data for insights, conclusions, models, or data products. In a similar statistics course, there would not be motivating lectures on extracting value from data, nor any discussion of the data science life cycle in the form of carefully chosen examples or in-class discussions. There would not be as much discussion of visualization in a typical statistics course. Also, the data for projects are likely given, instead of students being asked to find or scrape data by themselves. There may also be differences in execution even when the schedules look similar; for example, we use many examples from industry (including some from the authors’ past work), which may not be the case for a typical statistics course.

8 Other Course Components

Section 7 discussed topics for the lectures, yet there are other components of the course not yet touched on, namely labs or course projects, and presentations. We briefly discuss these here; for more details, we refer the reader to the sample syllabus in the appendix.

8.1 Labs and Course Projects

An important part of a data science course is projects. As DSC101 is offered mostly to first-year students, who typically have no prior exposure to any programing language, the course project takes the form of several small labs. Each lab touches on a major topic in the course, and students are typically given two weeks to work on each project. Students write a lab report describing the project; where and how the data were collected; the data analysis procedure; and the conclusions, if any. The R code is required to be submitted with the lab report. This is a critically important part of DSC101 because it introduces students to an essential characteristic of a data science professional: the ability to clearly communicate the results of data analysis (see, e.g., Sisto Citation2009; O’Neil and Schutt Citation2013).

The first project is mainly on data collection. Students are required to find data online or from other sources, and then conduct some simple exploratory analysis. One example is to download and extract auto sales prices for a particular car model from a popular auto sales website, cars.com, for cars of different years. The average price is calculated for cars of the same year, and a price-by-year plot is then produced. The second example is from kaggle.com and consists of historical records of airplane crashes since 1908. Students download and process the data, then visualize airplane crashes by year, airline, and aircraft model. In terms of empowerment, some students became very excited about the notion of data analysis for insights and started analyzing data related to their own interests. For example, one student chose to analyze data on basketball games and observed the rise of 3-point shooting in recent years; he also made interesting predictions about the strategy of future basketball games.

The second project is to read an article on data analysis. One example concerns the swimming competitions at the Rio Olympics, in which two interesting phenomena were observed: a notable difference in times between the outward and return laps, and an apparent disadvantage for athletes assigned to lower-numbered lanes. Students are required to write a report on how the author used the data and carried out the analysis to reach the conclusions, and are asked whether there are any biases in the way the study was designed. The second part of the project has students find two examples of misused sampling techniques in data collection, from recent news or articles.

The third project is about descriptive statistics and sampling techniques. Several datasets are given, and students are asked to compute the skewness and kurtosis. The second part concerns sampling techniques, comparing simple random sampling (SRS) and stratified sampling: students find or generate their own “heterogeneous” dataset, and then compare SRS and stratified sampling on the variation in the sample means when the sampling is repeated 100 times.

The fourth project is the visualization of the US population by state, for the 2000 and 2010 Censuses, respectively. In particular, students are required to produce an appropriate heatmap on the US map, and then overlay a bubble plot of the rate of change in population.

The last project is about the application of different clustering methods, including K-means, agglomerative, and divisive clustering. Students produce dendrograms and compare the results. For this project, students are required to give a short presentation of about 10 min, including questions and answers. As stated above, an important part of a data scientist’s job is to communicate a problem of interest, or to present analyses, to other people; we therefore make the presentation of projects and in-class discussion, in addition to written reports, an important part of the course.

8.2 Assessment

The students’ performance is assessed across all course components, including quizzes, labs, in-class discussion, a midterm, and a final exam. There are also two in-class practice sessions, allocated to two key topics of the course, R programing and hypothesis testing. The idea is to ensure that students work through the relevant course materials and apply them to problem solving, while the instructor observes students’ performance and helps with any issues they may have. The grade breakdown in a typical semester is as follows: quizzes, 10%; in-class discussion, practice, or presentation, 20%; labs, 20%; midterm, 20%; final, 30%. Team-based learning is incorporated into the in-class discussions, presentations, and labs (students can choose to work individually or as a team).

8.3 The Students, Engagement, and Feedback

We have been teaching this course since Fall 2015 (this course is offered every Fall). Typically about 40% of the students are data science majors, with others from a very diverse list of majors, such as mathematics, computer science, biology, electrical engineering, mechanical engineering, accounting, management information systems (MIS), etc. This is not a service course.

We do not offer a similar introductory statistics course at the University of Massachusetts Dartmouth; at another institution, one author taught a similar course, Elementary Statistics. In DSC101, the students are more engaged, which we attribute to the following, based on our observations and feedback from students. The course is better motivated, with many realistic applications. It requires more hands-on work from students; for example, students try out simple examples using R programing during class. The in-class discussions use topics students are familiar with and to which they can apply their creativity. Finally, students have more freedom in choosing their projects using real data.

Feedback from students suggests that they generally like the in-class discussion, the hands-on exercise on examples discussed in class, the exam problem on data visualization, and also the freedom in choosing problems for their projects.

9 Conclusion

We have briefly introduced a first course in data science offered at the University of Massachusetts Dartmouth since Fall 2015. To facilitate our discussion, we clarified our viewpoint on what data science is, and introduced the notion that data entail value yet to be explored. Our design of the course is both principled and practical. The design centers around the data science life cycle—topics covered in the course correspond roughly to individual pieces of the life cycle: data collection, the generation of a study question, data analysis, drawing conclusions, and communicating results. As a first course in data science, our focus is on motivation and concepts, and the formal analysis part is limited to EDA, linear regression, and hypothesis testing. The practical aspect of the course is reflected in several ways. Our design incorporates many elements from current data science practice. We use the popular R programing language for instruction, students’ hands-on exercises, and projects (but we also encourage the use of Python for projects). Our examples and the data used for course projects are mostly from real-world applications. In terms of empowerment, the course has been fairly successful in that, at its conclusion, students can comfortably carry out elementary data analysis using R and the tools introduced in class, on a variety of realistic, and real, datasets. Some students even started analyzing datasets related to their own interests, for example, basketball and baseball game data, Zillow.com real estate data, etc. One thing worth noting is that this course has attracted several students from other majors to our data science program. We hope that our DSC101 course can benefit educators who are new to the field, and students who are interested in data science.

Supplementary Materials

The supplemental materials consist of a sample course schedule for DSC101 taught at UMass Dartmouth, as well as a brief discussion on how this course could be taught by a Computer Science faculty.

Acknowledgments

We thank the editors, the associate editors, and anonymous reviewers for their helpful comments and suggestions.

Notes

1 A data product is any product built from data. It can be a piece of software (such as a recommendation system on an e-commerce website), a collection of data that some vendors use for profit (e.g., personal data processed from data crawled from many different sources and arranged in tabular format, such as https://www.truthfinder.com), or a software tool that one can use to carry out the analysis for a specific application.

2 An A/B test is the application of hypothesis testing to compare the effectiveness of two alternatives (one termed the “A” version and the other the “B” version). It is used widely in industry to compare alternative models or strategies.

References

  • Aggarwal, C. C., and Reddy, C. K. (2013), Data Clustering: Algorithms and Applications, Boca Raton, FL: Chapman and Hall.
  • American Statistical Association (2016), “Guidelines for Assessment and Instruction in Statistics Education (GAISE) College Report,” available at https://www.amstat.org/asa/files/pdfs/GAISE/GaiseCollege_Full.pdf.
  • Baumer, B. (2015), “A Data Science Course for Undergraduates: Thinking With Data,” American Statistician, 69, 334–342. DOI:10.1080/00031305.2015.1081105.
  • Box, G., and Cox, D. R. (1964), “An Analysis of Transformations,” Journal of the Royal Statistical Society, Series B, 26, 211–252. DOI:10.1111/j.2517-6161.1964.tb00553.x.
  • Breiman, L. (2001), “Statistical Modeling: The Two Cultures,” Statistical Science, 16, 199–231. DOI:10.1214/ss/1009213726.
  • Browne, M. N., and Keeley, S. M. (2007), Asking the Right Questions (11th ed.), Upper Saddle River, NJ: Pearson/Prentice Hall.
  • Chernoff, H. (1973), “The Use of Faces to Represent Points in k-Dimensional Space Graphically,” Journal of the American Statistical Association, 68, 361–368. DOI:10.1080/01621459.1973.10482434.
  • Cleveland, W. S. (2001), “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics,” International Statistical Review, 69, 21–26. DOI:10.1111/j.1751-5823.2001.tb00477.x.
  • Columbus, L. (2017), “IBM Predicts Demand for Data Scientists Will Soar 28% by 2020,” available at https://www.forbes.com.
  • Columbus, L. (2018), “Data Scientist Is the Best Job in America According to Glassdoor’s 2018 Rankings,” available at https://www.forbes.com.
  • Cumming, G. (2013), Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis, New York: Routledge.
  • Davenport, T. H., and Patil, D. J. (2012), “Data Scientist: The Sexiest Job of the 21st Century”, Harvard Business Review, 90, 70–76.
  • Davydov, V., Zinchenko, V., and Talyzina, N. (1983), “The problem of activity in the works of A. N. Leontiev,” Soviet Psychology, 21, 31–42. DOI:10.2753/RPO1061-0405210431.
  • Donoho, D. (2015), “50 Years of Data Science”, in Tukey Centennial Workshop, Princeton, NJ, available at http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf.
  • Donoho, D. (2017), “50 Years of Data Science,” Journal of Computational and Graphical Statistics, 26, 745–766. DOI:10.1080/10618600.2017.1384734.
  • Engestrom, Y. (1991), “Activity Theory and Individual and Social Transformation,” Multidisciplinary Newsletter for Activity Theory, 7/8, 6–17.
  • Engestrom, Y. (1999), “Activity Theory and Individual and Social Transformation,” in Perspectives on Activity Theory, eds. Y. Engestrom, R. Miettinen, and R.-L. Punamaki,Cambridge: Cambridge University Press, pp. 19–38.
  • Engestrom, Y. (2000), “Activity Theory as a Framework for Analyzing and Redesigning Work,” Ergonomics, 43, 960–974. DOI:10.1080/001401300409143.
  • Escobedo-Land, A., and Kim, A. Y. (2015), “OKCupid Data for Introductory Statistics and Data Science Courses,” Journal of Statistics Education, 23, 1–25. DOI:10.1080/10691898.2015.11889737.
  • Fraley, C., and Raftery, A. (2002), “Model-Based Clustering, Discriminant Analysis, and Density Estimation,” Journal of the American Statistical Association, 97, 611–631. DOI:10.1198/016214502760047131.
  • Grimshaw, S. (2015), “A Framework for Infusing Authentic Data Experiences Within Statistics Courses,” The American Statistician, 69, 307–314. DOI:10.1080/00031305.2015.1081106.
  • Guo, P. J. (2012), “Software Tools to Facilitate Research Programming,” Ph.D. dissertation, Stanford University.
  • Hardin, J., Hoerl, R., Horton, N. J., Nolan, D., Baumer, B., Hall-Holt, O., Murrell, P., Peng, R., Roback, P., Temple Lang, D., and Ward, M. D. (2015), “Data Science in Statistics Curricula: Preparing Students to ‘think with data’,” The American Statistician, 69, 343–353. DOI:10.1080/00031305.2015.1077729.
  • Hinton, G., and Salakhutdinov, R. (2006), “Reducing the Dimensionality of Data With Neural Networks,” Science, 313, 504–507. DOI:10.1126/science.1127647.
  • Horton, N. J., Baumer, B., and Wickham, H. (2015), “Setting the Stage for Data Science: Integration of Data Management Skills in Introductory and Second Courses in Statistics,” CHANCE, 28, 40–50. DOI:10.1080/09332480.2015.1042739.
  • Langer, A. M. (2012), Guide to Software Development: Designing and Managing the Life Cycle, London: Springer.
  • Lave, J. (1988), Cognition in Practice: Mind, Mathematics, and Culture in Everyday Life, New York: Cambridge University Press.
  • Lave, J., and Wenger, E. (1991), Situated Learning: Legitimate Peripheral Participation, New York: Cambridge University Press.
  • LeCun, Y., Bengio, Y., and Hinton, G. (2015), “Deep Learning,” Nature, 521, 436–444. DOI:10.1038/nature14539.
  • Leontiev, A. N. (1978), Activity, Consciousness, and Personality (originally published in Russian in 1975), Englewood Cliffs, NJ: Prentice-Hall.
  • Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., and Byers, A. H. (2011), Big Data: The Next Frontier for Innovation, Competition, and Productivity, New York: McKinsey Global Institute.
  • Nardi, B. (1996), Context and Consciousness: Activity Theory and Human-Computer Interaction, Cambridge, MA: MIT Press.
  • Nolan, D., and Perrett, J. (2016), “Teaching and Learning Data Visualization: Ideas and Assignments,” American Statistician, 70, 260–269. DOI:10.1080/00031305.2015.1123651.
  • Nolan, D., and Speed, T. (2000), Stat Labs: Mathematical Statistics through Applications, New York: Springer-Verlag.
  • Nolan, D., and Temple Lang, D. (2015), Data Science Case Studies in R: A Case Studies Approach to Computational Reasoning and Problem Solving, Boca Raton, FL: Chapman and Hall/CRC.
  • O’Neil, C., and Schutt, R. (2013), Doing Data Science: Straight Talk From the Frontline, Sebastopol, CA: O’Reilly Media.
  • Price, E., De Leone, C., and Lasry, N. (2010), “Comparing Educational Tools Using Activity Theory: Clickers and Flashcards,” in AIP Conference Proceedings (Vol. 1289), AIP, pp. 265–268.
  • PwC (2015), “What’s Next for the Data Science and Analytics Job Market?,” available at https://pwc.to/2FL8GEG.
  • Raeithel, A. (1991), “Semiotic Self-Regularization and Work: An Activity Theoretical Foundation of Design,” in Software Development and Reality Construction, ed. R. Floyd, New York: Springer-Verlag.
  • Simpson, W. J. (1957), “A Preliminary Report on Cigarette Smoking and the Incidence of Prematurity,” American Journal of Obstetrics and Gynecology, 73, 808–815. DOI:10.1016/0002-9378(57)90391-5.
  • Sisto, M. (2009), “Can You Explain That in Plain English? Making Statistics Group Projects Work in a Multicultural Setting,” Journal of Statistics Education, 17, 1–11. DOI:10.1080/10691898.2009.11889522.
  • Strehl, A., and Ghosh, J. (2003), “Cluster Ensembles—A Knowledge Reuse Framework for Combining Multiple Partitions,” The Journal of Machine Learning Research, 3, 583–617.
  • National Academies of Sciences, Engineering and Medicine Consensus Report (2018), “Data Science for Undergraduates: Opportunities and Options,” available at https://nas.edu/envisioningds.
  • Tishkovskaya, S., and Lancaster, G. A. (2012), “Statistical Education in the 21st Century: A Review of Challenges, Teaching Innovations and Strategies for Reform,” Journal of Statistics Education, 23, 1–56. DOI:10.1080/10691898.2012.11889641.
  • Tukey, J. W. (1977), Exploratory Data Analysis, Reading, MA: Addison-Wesley.
  • Verzani, J. (2008), “Using R in Introductory Statistics Courses With the pmg Graphical User Interface,” Journal of Statistics Education, 16, 1–17. DOI:10.1080/10691898.2008.11889558.
  • von Luxburg, U. (2007), “A Tutorial on Spectral Clustering,” Statistics and Computing, 17, 395–416. DOI:10.1007/s11222-007-9033-z.
  • Wickham, H., and Grolemund, G. (2016), R for Data Science: Import, Tidy, Transform, Visualize, and Model Data, Sebastopol, CA: O’Reilly Media.
  • Wilcox, A. (2001), “On the Importance—and the Unimportance—of Birthweight,” International Journal of Epidemiology, 30, 1233–1241. DOI:10.1093/ije/30.6.1233.
  • Wild, C. J., and Pfannkuch, M. (1999), “Statistical Thinking in Empirical Enquiry,” International Statistical Review, 67, 223–265. DOI:10.1111/j.1751-5823.1999.tb00442.x.
  • Wu, C.-F. J. (1997), “Statistics = Data Science?,” in H. C. Carver Professorship Lecture, Ann Arbor, MI: The University of Michigan, available at http://www2.isye.gatech.edu/∼jeffwu/presentations/datascience.pdf.
  • Wu, C.-F. J. (1998), “Statistics = Data Science?,” in P. C. Mahalanobis Memorial Lecture, Kolkata: The Indian Statistical Institute.
  • Yan, D., Chen, A., and Jordan, M. I. (2013), “Cluster Forests,” Computational Statistics and Data Analysis, 66, 178–192. DOI:10.1016/j.csda.2013.04.010.
  • Yan, D., and Davis, G. E. (2018), “The Turtleback Diagram for Conditional Probability”, The Open Journal of Statistics, 8, 684–705. DOI:10.4236/ojs.2018.84045.
  • Yerushalmy, J. (1964), “Mother’s Cigarette Smoking and Survival of Infant,” American Journal of Obstetrics and Gynecology, 88, 505–518. DOI:10.1016/0002-9378(64)90509-5.
  • Yerushalmy, J. (1971), “The Relationship of Parents’ Cigarette Smoking to Outcome of Pregnancy—Implications as to the Problem of Inferring Causation From Observed Associations,” American Journal of Epidemiology, 93, 443–456. DOI:10.1093/oxfordjournals.aje.a121278.