Automating and utilising equal-distribution data classification

Gennady Andrienko, Natalia Andrienko, Ibad Kureshi, Kieran Lee, Ian Smith & Toni Staykova
Pages 100-115 | Received 16 Apr 2020, Accepted 09 Dec 2020, Published online: 05 Jan 2021
 

ABSTRACT

Data classification, i.e. organising data items into groups (classes), is a general technique widely used in data visualisation and cartography, in particular for creating choropleth maps. Conventionally, data are classified by dividing the data range into intervals and assigning the same symbol or colour to all data falling within an interval. For instance, the intervals may be of the same length or may include the same number of data items. We propose a method for defining intervals so that some quantity represented by values of another attribute is equally distributed among the classes. This kind of classification supports exploratory analysis of relationships between the attribute used for the classification and the distribution of the phenomenon whose quantity is represented by the additional attribute. The approach may be especially useful when the distribution of the phenomenon is very unequal, with many data items having zero or low quantities and comparatively few items having larger quantities. With such a distribution, standard statistical analysis of the relationships may be problematic. We demonstrate the potential of the approach by analysing data referring to a set of spatially distributed people (patients) in relation to characteristics of the areas in which the people live.

RÉSUMÉ

Data classification, i.e. the division of data into groups (classes), is a method widely used in data visualisation and cartography, in particular for creating choropleth maps. Usually, data are classified by dividing the data range into intervals and assigning the same symbol or colour to all data falling within the same interval. For example, the intervals may be of the same length or contain the same number of items. We propose a method for defining the intervals so that a quantity represented by the values of another attribute is equally distributed among the classes. One example is dividing a set of geographic regions into classes based on the attribute "Birth rate" so that the classes have approximately equal total values of the attribute "Population" or "Arable land". This type of classification facilitates exploratory analysis of relationships between the attribute used for the classification and the distribution of the phenomenon whose quantity is represented by the additional attribute. The approach can be particularly useful when the distribution of the phenomenon is very unequal, with many data items having zero or low values and a certain number having larger quantities. With such distributions, standard statistical analysis of the relationships can be problematic. We demonstrate the potential of our approach by analysing data referring to a set of spatially distributed people (patients) in relation to the characteristics of the areas in which they live.
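The equal-distribution idea summarised above can be illustrated with a minimal Python sketch. It is not the authors' V-Analytics implementation; the function names `equal_distribution_breaks` and `class_totals`, the greedy cumulative-sum strategy, and the sample birth-rate and population figures are all assumptions made for illustration. The sketch sorts the items by the classification attribute, accumulates the additional attribute, and places an upper-inclusive class break each time the running total reaches the next equal share, never splitting items that have the same classification value.

```python
from bisect import bisect_left
from typing import List, Sequence


def equal_distribution_breaks(
    values: Sequence[float], weights: Sequence[float], n_classes: int
) -> List[float]:
    """Return upper-inclusive break values of `values` such that the sum of
    `weights` is roughly equal in each class. Assumes non-negative weights.
    May return fewer than n_classes - 1 breaks if single items are very heavy."""
    if n_classes < 1:
        raise ValueError("n_classes must be at least 1")
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("total weight must be positive")

    pairs = sorted(zip(values, weights))  # order items by the classified attribute
    breaks: List[float] = []
    cumulative = 0
    next_class = 1  # index of the next equal share (k/n of the total) to reach
    for i, (value, weight) in enumerate(pairs):
        cumulative += weight
        if next_class >= n_classes or i == len(pairs) - 1:
            continue  # all breaks placed, or no items left for a further class
        reached = cumulative * n_classes >= next_class * total
        value_changes = pairs[i + 1][0] != value  # keep tied values together
        if reached and value_changes:
            breaks.append(value)
            # Skip any shares that a heavy item has already overshot.
            while next_class < n_classes and cumulative * n_classes >= next_class * total:
                next_class += 1
    return breaks


def class_totals(
    values: Sequence[float], weights: Sequence[float], breaks: Sequence[float]
) -> List[float]:
    """Sum `weights` per class; class k holds values <= breaks[k] (and > breaks[k-1])."""
    totals: List[float] = [0] * (len(breaks) + 1)
    for value, weight in zip(values, weights):
        totals[bisect_left(breaks, value)] += weight
    return totals


# Hypothetical regions: birth rate is classified, population is equalised.
birth_rate = [10, 12, 15, 18, 20, 25]
population = [50, 10, 40, 30, 20, 50]
breaks = equal_distribution_breaks(birth_rate, population, 2)
print(breaks)                                        # [15]
print(class_totals(birth_rate, population, breaks))  # [100, 100]
```

With a zero-heavy additional attribute, the low classes simply become wider: for values 1 to 6 with weights 0, 0, 0, 3, 3, 3 and three classes, this sketch places breaks at 4 and 5, so each class carries a total of 3.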

Acknowledgements

This research was supported by the Fraunhofer Center for Machine Learning within the Fraunhofer Cluster for Cognitive Internet Technologies, by the DFG within Priority Programme 1894 (SPP VGI), and by the EU in the projects Track&Know and SoBigData++.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes on contributors

Gennady Andrienko is a lead scientist at Fraunhofer Institute IAIS and professor at City, University of London. He has published two monographs and a textbook, co-edited several special issues of major journals, and co-authored numerous papers addressing theoretical aspects and practical applications of visual analytics. Further details can be found at http://www.geoanalytics.net.

Natalia Andrienko is a lead scientist at Fraunhofer Institute IAIS and professor at City, University of London. She has published two monographs, “Exploratory analysis of spatial and temporal data. A systematic approach” (Springer, 2005) and “Visual analytics of movement” (Springer, 2013), and a textbook entitled “Visual analytics for data scientists” (Springer, 2020). She received a Test of Time award at IEEE VIS 2018 and several best paper awards at major conferences.

Dr. Ibad Kureshi is a senior research scientist and heads the Data Science, Modelling and Visualisation team at Inlecom Systems. His current work focuses on the creation of digital twins and models spanning different application areas, e.g. urban management, manufacturing and maintenance, and transport and logistics.

Kieran Lee is a Research Practitioner in the Sleep Centre of Royal Papworth Hospital. His work involves enabling and conducting exploratory research on a variety of open themes related to the sleep-disordered breathing condition obstructive sleep apnoea. He has a particular interest in pragmatic and impactful healthcare science.

Ian Smith is the director of the Respiratory Support and Sleep Centre at Royal Papworth Hospital. His research portfolio includes projects to optimise pathways of care for people with sleep apnoea and to develop new diagnostics for the assessment of respiratory failure and sleepiness.

Dr. Toni Staykova is a research director at Ukemed Ltd. and a specialist geriatrician with 25 years of clinical experience and a special interest in clinical innovation. Toni has significant innovation project experience in many health care domains.

Notes

2 All illustrations in this paper are screenshots made with the software system V-Analytics, which is publicly available at URL http://geoanalytics.net/V-Analytics/.
