Data Science Ethics: Concepts, Techniques and Cautionary Tales

David Martens, Oxford, UK: Oxford University Press, 2022, xii + 255 pp., $80.00(H), ISBN 978-0-19-284726-3.

Sabrina Giordano, Department of Economics, Statistics and Finance, University of Calabria, Cosenza, Italy

We live in an era of digital transformation, which has magnified the ability to access all types of information and has empowered data science to turn any kind of data into business, security, health, and economic advantages, and much more. However, this comes at the expense of an increasing invasion of privacy and a widespread experience of data-driven decisions that are often unexplained and sometimes discriminatory. The debate on what is right and what is wrong to do with data is still open, and this book makes an outstanding contribution to it.

Throughout the book, the cautionary tales educate readers on the consequences that overlooked ethical aspects have for people, companies, and society. In addition, several discussion exercises stimulate questions about the right balance between the usefulness of data science practice and its ethical implications.

The content of the book is structured in seven chapters. The introduction (Chapter 1) outlines the role of each chapter as consecutive steps in a data science project: from collecting and processing data to modeling it and making use of the results. Each stage entails specific ethical issues, which the author evaluates against three criteria: Fairness, Accountability, and Transparency (FAT). Within the FAT framework, every stage has to be fair, in terms of privacy and of discrimination against sensitive groups (defined, for example, by gender, race, or religion), and transparent, in relation to the way data is collected, used, and made accessible, and to the clear explanation of model predictions and of the consequences of their use in practice. Accountability relates to demonstrable measures of effective fairness and transparency. What is meant by fairness and transparency changes according to the perspective of the individual, depending on whether he or she is a manager, a data scientist, a person on whom the information is collected (data subject), or a person to whom the model is applied (model subject). The three ethical criteria and the four roles are the keys to reading all chapters of the book.

The ethical aspects of the data gathering process are the core topics of Chapter 2 and raise questions about which type of data to collect and use, for which purposes, and for how long to keep it available. Answers range from the legal principles that protect privacy as a human right to the use of key cryptographic mechanisms of data protection, such as encryption, hashing, and obfuscation, and the techniques of differential privacy, where noise is added to data before use. Different scenarios and discussion points present the thin line between what the user should know (with explicit informed consent) and what can be allowed for the legitimate interest of controllers in assessing, for example, health risk or security. The need for a balance between privacy and security is discussed by uncovering the potentially dangerous consequences of government backdoors to digital personal data, which are meant to increase citizens’ security but expose citizens to a high risk of abuse. Bias in data is also presented as a fairness issue (unfair representativeness of the population or of certain sensitive groups), whereas ethical issues, which historically and currently arise from the classical method of collecting data on individuals, are examined at length with ad hoc cautionary tales and case studies.
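
As a rough illustration of the noise-adding idea behind differential privacy, the sketch below applies the Laplace mechanism to a counting query. It is not taken from the book: the function names, the `ages` data, and the choice of `epsilon` are assumptions made for this example, and adding noise to a query result is only one of several ways such noise can be introduced.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centred at zero with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng=None):
    """Return a noisy count of the values satisfying `predicate`.

    A counting query changes by at most 1 when one person is added or
    removed (sensitivity 1), so the Laplace scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data: ages of five data subjects.
ages = [23, 35, 41, 52, 67]
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=random.Random(0))
print(noisy)
```

A smaller `epsilon` gives stronger privacy but a noisier answer, which is the same utility trade-off the review describes for other protection techniques.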

After gathering data, fairness issues of privacy violation and discrimination can still emerge, and ethical data preprocessing methods are therefore needed, as illustrated in Chapter 3. Removing personal identifiers from a dataset does not eliminate the risk of reidentifying persons or unveiling their sensitive (sexual, political, religious, etc.) information. On the one hand, grouping data by making continuous variables discrete, suppressing some values, or applying techniques like k-anonymity, l-diversity, and t-closeness helps to reduce the probability of linking a person to a specific data instance. On the other hand, these methods unfortunately reduce the informative content of the datasets, thereby diminishing the predictive performance of models. Highly appropriate real cases and cautionary tales are included to warn us about the problem of reidentifying a person, or revealing a sensitive attribute of a person, based on information that can potentially be obtained through additional sources such as external datasets, locations or webpages visited, and social media actions. Another objective of the proposed ethical data preprocessing is to measure and remove the bias against sensitive groups that may be present in the original dataset, as a means to prevent a prediction model trained on biased data from producing discriminatory results. For practical purposes, measures of dataset fairness and methods to remove such bias are provided and exemplified.
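
To make the grouping idea concrete, here is a minimal sketch of how k-anonymity could be measured before and after generalizing a continuous variable. The records, the field names (`age`, `zip`, `diagnosis`), and the ten-year band width are hypothetical choices for illustration and do not come from the book.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same values on all quasi-identifiers (0 for an empty dataset)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


def generalize_age(record, width=10):
    """Replace an exact age with a band such as '20-29'."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Hypothetical records; 'diagnosis' is the sensitive attribute.
records = [
    {"age": 23, "zip": "871", "diagnosis": "flu"},
    {"age": 27, "zip": "871", "diagnosis": "asthma"},
    {"age": 34, "zip": "871", "diagnosis": "flu"},
    {"age": 38, "zip": "871", "diagnosis": "diabetes"},
]

k_before = k_anonymity(records, ["age", "zip"])
generalized = [generalize_age(r) for r in records]
k_after = k_anonymity(generalized, ["age", "zip"])
print(k_before, k_after)
```

With exact ages every record is unique on the quasi-identifiers; after banding, each record shares its quasi-identifier values with at least one other, at the cost of losing the exact age.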

Chapter 4 delves into the ethical aspects of modeling related to fairness (privacy, discrimination) and transparency (explainability). The privacy-preserving methodologies mentioned in Chapter 4 consist of adding noise to the model outcomes, analyzing encrypted data directly in a cloud computing service, and performing either a joint data analysis among multiple parties without sharing data, or a deep learning algorithm with a centralized model that uses data from multiple clients. Discrimination-aware approaches both clarify how to measure potential discrimination against sensitive groups in the model predictions, and provide a range of solutions to detect and eliminate bias during model building by looking for a tradeoff between model accuracy and fairness. Cautionary tales on historical ethnic discrimination show that data modeling is not always free of unfair practices. The third part of the chapter is driven by the need to justify a decision based on a prediction model without having to say: “Computer algorithm says so!”. All subjects have the right to know why a decision has been made about them (explanation of instance-based prediction), and managers often want to understand how the prediction model comes to its decisions over a large dataset (global explanation). Therefore, a set of examples in the chapter highlights the need for explainable and comprehensible model predictions in order to avoid skepticism and reluctance to use the model to make a real decision, to identify errors in the model itself, and to provide further directions for improving its performance.
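
One common way to quantify discrimination in model predictions is to compare the rate of favorable predictions across groups. The sketch below computes a demographic parity difference and a disparate impact ratio; these are standard measures from the fairness literature and are not confirmed here to be the ones used in the book. The predictions and group labels are invented for the example.

```python
def positive_rate(predictions, groups, group):
    """Share of favorable (1) predictions among members of `group`."""
    selected = [p for p, g in zip(predictions, groups) if g == group]
    if not selected:
        raise ValueError(f"no records for group {group!r}")
    return sum(selected) / len(selected)


def demographic_parity_difference(predictions, groups, group_a, group_b):
    """Favorable rate of group_a minus favorable rate of group_b."""
    return (positive_rate(predictions, groups, group_a)
            - positive_rate(predictions, groups, group_b))


def disparate_impact_ratio(predictions, groups, privileged, unprivileged):
    """Favorable rate of the unprivileged group divided by that of the
    privileged group; values far below 1 suggest possible discrimination."""
    base = positive_rate(predictions, groups, privileged)
    if base == 0:
        raise ValueError("privileged group has no favorable predictions")
    return positive_rate(predictions, groups, unprivileged) / base


# Hypothetical binary predictions (1 = favorable) and group membership.
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

difference = demographic_parity_difference(predictions, groups, "a", "b")
ratio = disparate_impact_ratio(predictions, groups, "a", "b")
print(difference, ratio)
```

Pushing either measure toward parity during model building typically costs some accuracy, which is the accuracy versus fairness tradeoff mentioned above.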

Chapter 5 emphasizes the importance of an ethical evaluation of the model. It encourages the use of appropriate measures of predictive performance (e.g., showing misclassification rates as well as accuracy metrics), of fairness (e.g., assessing the privacy of the dataset and reporting transparently the sensitive groups involved), and of the extent to which an explanation of the model can be provided. The author’s criticism of the unethical use of data (data dredging) and of the unethical interpretation of results (p-value hacking, missed multiple comparisons) suggests that researchers should transparently report both good and bad outcomes and ensure reproducibility. With cautionary tales and discussion points, the author makes the reader reflect on ethical conduct in line with the principles of research integrity.

The last stage is model deployment, which is not exempt from ethical concerns, and some of these are discussed in Chapter 6. This part of the book draws the reader’s attention to the following issues: access to the data science system can be limited for various reasons, and this constraint can give power to those who have access; the predictions generated by the model may lead to different treatment of different people; and models may be vulnerable and lend themselves to dishonest use, thereby negatively affecting people and society. Such aspects highlight the need for a data science ethics policy and the advisability of creating an ad hoc committee to ensure its implementation.

The author composed the book as a sequence of questions, raised by the continuing need to derive knowledge from data while respecting its protection, and as a set of techniques and measures in response to them. Nevertheless, the book is not a cookbook on what to do to be ethically correct in each step of a data science practice; it reveals the ethical implications of data science applications and proposes a range of solutions, highlighting their merits and risks, in the ongoing search for a balance between the practical utility of data analysis and compliance with ethically right choices. Ethical data science is not a checklist but a way of thinking and acting when working with data.

The book highlights the author’s ability to draw out ethical concerns through real-life examples. In fact, the most striking ones set precedents that often led to milestone changes in company and governmental choices in the direction of an ethical practice of data science. Moreover, in all chapters, ethical concerns are introduced by an opening story, and the underlying concepts are presented by immersing the reader in existing cases linked to world-famous company names, but also to ordinary people with whom the reader can identify.

The book does not go into in-depth technical, mathematical, and computational details, but the bibliographical references are timely and give a fairly complete overview of the measures and techniques currently on offer. Furthermore, the references to European and other legislation highlight regulatory developments.

The book is suitable for students in data science and business, data scientists who transform data into knowledge and innovation, and managers who derive business and competitiveness from data. Yet, this is a book aimed at all those who want researchers, companies, and governments to be ethically responsible when making decisions that use their data for purposes that might involve them. Each of us can become a data subject and a model subject; thus, each of us should read this book to become aware of the ethical thinking that should guide data-driven decisions, which affect us much more closely than we might expect.

Sabrina Giordano
Department of Economics, Statistics and Finance,
University of Calabria, Cosenza, Italy
