Statistics
A Journal of Theoretical and Applied Statistics
Target Article

On the relation between data science and statistics

Received 30 May 2024, Accepted 31 May 2024, Published online: 18 Jun 2024

Abstract

There is an ongoing discussion on how data science relates to statistics, with a wide range of opinions on whether these areas are basically the same or whether they differ from each other in terms of history, interpretation of data, the techniques used, the types and size of data, etc. In this article, which has been written on the occasion of our special issue on Statistical Methods for xAI, I would like to state my view on these questions and potentially stimulate some further discussion.

2020 Mathematics Subject Classification:

Let me start my consideration by reflecting on what the word ‘data’ actually means. It originates from the past participle of the Latin word ‘dare’, which means ‘to give’. Thus it can simply be translated as ‘the given’. That, of course, allows for a very broad interpretation of the expression. Collect everything that is given and get clear about what is wanted: this is a standard strategy which has often been recommended at high school to tackle basically all sorts of problems in mathematics. However, in statistics, data usually stand for empirically gained observations, drawn from some survey or experiment or, more occasionally, simply collected. This applies to various fields of application such as econometrics and biometrics, and also to data from the internet. The word statistics has been known since the 18th century, when it was introduced for data which were useful for a state and its politics. At least since the early 20th century, when its rigorous mathematical theory was developed, statistics must be considered a field of science in its own right. It has close relations to information theory and probability theory, as empirically observed data are crucially affected by random errors or contaminating effects. The relevant information must be extracted from the random raw observations. The general claim of statistics is applicability to all types of data. A major question, then, is how statistics relates to data science, as this term surely makes the same claim.

One legitimate way to distinguish between these expressions is by the academic origin of the scientists who would rather identify themselves as either statisticians or data scientists. Statisticians are usually affiliated with fields such as economics (econometricians), medical science, genetics or biology (biometricians) or mathematics (mathematical/theoretical statisticians), although this list is certainly not exhaustive. The term data science is used rather by computer scientists. Of course, the academic origin of a scientist has a major impact on the aspects that they consider to be of particular interest. Although it is sometimes argued that data science is a much younger field than statistics and that many of its foundations are only now being worked out, this expression has also existed for many decades. I would like to refer to the paper 50 Years of Data Science, which David Donoho published in 2017 (see [Citation1]), and which could be considered, to some extent, a provocation. A famous quote, which denies crucial structural differences between data science and statistics, comes from Nate Silver, who is quoted as saying: ‘I think data-scientist is a sexed up term for a statistician’ (see [Citation2]).

Apart from historical issues, there are aspects of data which are not the main focus of statistical research. While statistics also contributes to the question of how surveys should be planned to obtain data which are as informative as possible (statistical design), it mainly concentrates on the analysis of empirical data, e.g., on how to approximately determine relevant parameters of the model based on the data (estimation), to check hypotheses on those parameters (testing) or to label observations as members of specific groups based on additional training data (classification and statistical learning). This latter aspect has become of particular interest in machine learning. By contrast, the contribution of statistics to the transmission or storage of data might be viewed as rather marginal. On the other hand, I do not have the impression that computer scientists would consider such issues first when talking about data science; they, too, would rather think of the analysis of empirical data.

When people try to explain differences between data science and statistics, one unfortunately often encounters arguments which are certainly not valid. One of them refers to the size and the types of the data under consideration. Data scientists are said to study ‘big data’ whereas statisticians are supposed to work with small datasets. Examples from medical statistics are provided, but people apparently do not understand that the occurrence of ‘small data’ in biometrics is due to, e.g., the limited number of patients from whom data can be collected. It is not because the statistical methods only work for small sample sizes. This viewpoint is also surprising because, in the past, statisticians have sometimes been criticized for living in ‘asymptopia’, i.e., for using methods with high-quality properties when the sample size is assumed to tend to infinity while caring too little about their performance in real life, when only finitely many data are observed. Asymptotic statistics has been a major field within mathematical statistics since the first half of the 20th century. The reason to switch to an asymptotic view is that the theory simplifies: for instance, in many settings, asymptotic optimality of specific procedures can be shown while finite-sample optimality has not been proved yet. So what exactly does ‘big’ mean when talking about big data? If it refers to the sample size, then it is certainly covered by classical statistical approaches. Apart from huge samples of independent and identically distributed random variables, ‘big’ could also refer to the complexity of the data, e.g., their dimensionality. Even then, however, there has been a great deal of statistical research on such types of data. As an example, I mention functional data analysis as a famous sub-field of statistics, in which observed random functions form the data. The data are then even located in an infinite-dimensional function space. We could also mention topological data analysis and statistical image analysis. People who claim major differences between statistics and data science seem to ignore major areas of statistics. Sometimes, statistics is also called a sub-branch of mathematics. Please note that this is certainly not true of statistics as a whole, although mathematical statistics is an important intersection, as it forms the theory of statistics.
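
The ‘asymptopia’ remark above can be made concrete with a small simulation. The following is only an illustrative sketch under assumptions of my own choosing, not a method taken from the cited literature: the data are simulated from an exponential distribution with true mean 1, the procedure is the usual normal-approximation confidence interval for the mean with nominal level 95%, and the function name `wald_coverage` is hypothetical. The interval is justified asymptotically, so one would expect its actual coverage to fall short of the nominal level for very small samples and to approach it as the sample size grows.

```python
import math
import random


def wald_coverage(n, reps=2000, seed=0):
    """Estimate by simulation how often the nominal 95% normal-approximation
    interval for the mean covers the true mean (1.0) of Exp(1) samples of size n."""
    rng = random.Random(seed)
    z = 1.959963984540054  # 97.5% quantile of the standard normal distribution
    hits = 0
    for _ in range(reps):
        # Assumed data-generating model: i.i.d. exponential with rate 1.
        sample = [rng.expovariate(1.0) for _ in range(n)]
        mean = sum(sample) / n
        var = sum((x - mean) ** 2 for x in sample) / (n - 1)
        half_width = z * math.sqrt(var / n)
        if abs(mean - 1.0) <= half_width:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    # Compare the simulated coverage for small and large sample sizes.
    for n in (5, 20, 100, 400):
        print(n, wald_coverage(n))
```

The same asymptotic method is used for every sample size here; only its finite-sample performance changes, which is exactly the distinction the criticism refers to.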

When dealing with datasets which may be considered as ‘big’, questions may arise such as ‘I have so many data. What shall I do with them?’. And data experts are supposed to provide the answer. However, the first question should not be what to do with the data but what one wants to learn from the data. As in basically all branches of science, the first step is to specify a model that relates the data to the underlying quantities of interest. In the second step, after fixing such a model, optimal procedures have to be found to extract the desired information. In the third step, the procedures have to be implemented and the findings have to be translated back in order to draw conclusions about the original real-life problem.
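
As a toy illustration of the three steps just described, consider the following minimal sketch. Everything in it is an assumption made for the sake of the example: the observations are simulated waiting times rather than real data, the model is an exponential distribution with unknown rate, the procedure is the maximum-likelihood estimator of that rate, and the function names are hypothetical.

```python
import math
import random


def simulate_waiting_times(n, rate, seed=1):
    """Stand-in for real observations: n simulated draws from the assumed
    exponential model (step 1 fixes this model)."""
    rng = random.Random(seed)
    return [rng.expovariate(rate) for _ in range(n)]


def estimate_rate(times):
    """Step 2: maximum-likelihood estimate of the rate under the exponential
    model, i.e. the reciprocal of the sample mean."""
    return len(times) / sum(times)


def prob_wait_longer_than(t, rate):
    """Step 3: translate the estimate back into a statement about the original
    problem, here the probability of waiting longer than t."""
    return math.exp(-rate * t)


if __name__ == "__main__":
    times = simulate_waiting_times(n=500, rate=0.5)
    rate_hat = estimate_rate(times)
    print("estimated rate:", rate_hat)
    print("estimated P(wait > 3):", prob_wait_longer_than(3.0, rate_hat))
```

The point of the sketch is the order of the steps: the question (how likely is a long wait?) and the model come first, and only then is a procedure chosen and applied to the data.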

To conclude, in my view, it is legitimate to use both expressions, data science and statistics, and to prefer one of them based on your scientific area. But one should avoid artificially constructing substantial differences between the fields which are based on an unacceptably narrow interpretation of the word statistics.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References
