ABSTRACT
A bivariate discrete frequency table, one of the significant exploratory data analysis (EDA) tools, organizes data systematically. The existing frequency table is straightforward, but when the number of elements in the data is large, the table becomes long and difficult to handle. In this research, we propose a new bivariate discrete frequency table that groups the elements of each variable into classes. The table can be constructed using the R code provided with the article. We describe the table using simulations from the bivariate binomial and bivariate Poisson distributions. Real data, obtained from the English Premier League website, are also used to illustrate the new table. The findings indicate that the proposed bivariate frequency table provides a better alternative when the number of elements is substantial and reveals the essential data features.
1. Introduction
Data are more attractive and capture the minds of people if depicted in either tabular or graphical form. The tabular representations are precise and provide the reader with apparent features of the data; however, the graphical representations have more visual significance since they are useful in detecting patterns in a dataset (Beniger & Robyn, 1978; Davies, 1929; Gelman, 2011; Gelman et al., 2002; Kastellec & Leoni, 2007; Xu & Wang, 2020). The hidden features of raw data can only be uncovered if the data are organized in a meaningful form, such as a frequency table. A frequency table partitions raw data into classes of appropriate sizes, displaying observations and their respective number of occurrences (Kenney, 1939; Manikandan, 2011; Mohammed, Adam, Ali et al., 2020). Generally, the main reason for summarizing raw data is to explore the extra information therein. A summary also makes it easier to understand the underlying distribution and the features of the variables, and to choose the statistical tool to be used for inference.
Data obtained from measurements such as length, height, weight, or temperature assume values within an interval or range. Such measured observations are called continuous data. If are continuous observations, , and the . Continuous data take values within a given interval and are generally measured values such as the amount of rainfall, length, or area, whereas discrete data are whole numbers (Gardiner et al., 1979). A set of discrete data is often obtained by counting or enumeration, while continuous data are usually obtained through measurement (Fisher & Marshall, 2009; Kenney, 1939). Discrete data are countable, finite observations, and the table that summarizes them is a discrete frequency table. Its elements are the natural classes; there are no class limits or class boundaries (Gravetter et al., 2020; Kenney, 1939). Discrete frequency tables are classified into two types, based on the number of variables. A table that organizes data on a single discrete variable is known as a univariate discrete frequency table, whereas a bivariate discrete frequency table displays data on two joint discrete variables.
The existing bivariate discrete frequency table is straightforward and very useful. However, when the number of elements in the joint discrete data is large, it leads to a very long table that can be difficult to handle. In this research, we propose a new bivariate discrete frequency table for datasets with a large number of elements. The table is constructed by grouping the elements in the joint discrete data.
2. Bivariate discrete frequency table
Let be pairs of discrete observations of variables and , the existing frequency table is given as Table . The notations, and ,, respectively, denote the number of elements in the two joint discrete datasets, , denote the elements of variable displayed in the columns and , are the elements of the second variable presented in the rows, is the joint frequency of variables X and Y in cell .
Number of classes
The number of classes for continuous frequency tables depends mainly on the size of the data, and several rules, such as those proposed by Sturges (1926), Cochran (1954), Doane (1976), Scott (1979), and Freedman and Diaconis (1981), can be used to determine it. In contrast, the number of classes for discrete frequency tables depends on the number of distinct elements in the data. Thus, when the number of elements in a dataset is small, the frequency table has a small number of classes, no matter how large the dataset is. Conversely, when the number of elements is large, the frequency table has a large number of classes, irrespective of the size of the data (Mohammed, Adam, Zulkafli et al., 2020).
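The contrast above can be checked with a small R sketch. The data are made up for illustration, and `n_classes_discrete` is a hypothetical helper name, not part of the article's code. `nclass.Sturges` is the base R (grDevices) implementation of the Sturges rule.

```r
# Made-up discrete data: the same three elements (0, 1, 2) in a small
# and in a large dataset.
small <- c(0, 1, 1, 2, 2, 2)
large <- rep(small, 100)

# Hypothetical helper: a discrete table has one class per distinct element.
n_classes_discrete <- function(x) length(unique(x))

# The discrete class count ignores the sample size ...
k_small <- n_classes_discrete(small)
k_large <- n_classes_discrete(large)

# ... whereas the Sturges rule for continuous data grows with it.
s_small <- nclass.Sturges(small)
s_large <- nclass.Sturges(large)

print(c(discrete_small = k_small, discrete_large = k_large,
        sturges_small = s_small, sturges_large = s_large))
```

Both discrete counts equal the number of distinct elements, while the Sturges count increases with the number of observations.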
3. Proposed bivariate discrete frequency table
The proposed table can be constructed by grouping the elements into classes, as shown in Table . The simplest case is when the elements of the two variables are grouped into two. The table can be described using three different cases, as, respectively, illustrated using Tables , and Table . In the first case, the number of elements in both variables is even, and the table is complete. In the second case, the number of elements of the first variable is even and that of the second variable is odd; the table is incomplete because the last class of the second variable has a single element. In the third case, the number of elements of the first variable is odd while that of the second variable is even; the table is again incomplete, and the last class of the first variable has a single element. If and are both less than 10, and , the existing bivariate frequency table, Table , can be used; no modification is necessary.
The values and , are, respectively, the number of classes for variables X and Y, and are the number of elements in the two joint datasets, and , , are the number of elements in each class of the two variables. Also, and are the modes of the classes of the two variables; they represent the magnitude of observations in each class. The proposed bivariate frequency table is complete if and . This implies and . However, when either or , the table is incomplete. This condition results in either or . That is, the last class of either variable, or of both, has a different number of elements from the other classes.
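The complete and incomplete cases can be illustrated with a short R sketch. The names `k` (number of distinct elements of one variable) and `g` (number of elements per class) are assumptions introduced here for illustration, and `class_layout` is a hypothetical helper rather than part of the article's code.

```r
# Hypothetical helper: layout of the classes of one variable when its
# k distinct elements are grouped g at a time.
class_layout <- function(k, g) {
  remainder <- k %% g
  list(
    n_classes = ceiling(k / g),                     # number of classes
    last_size = if (remainder == 0) g else remainder, # size of the last class
    complete  = remainder == 0                      # TRUE if all classes have g elements
  )
}

even_case <- class_layout(k = 12, g = 2)  # even number of elements
odd_case  <- class_layout(k = 11, g = 2)  # odd number of elements

str(even_case)
str(odd_case)
```

Under these assumptions, grouping in twos gives a complete set of classes for an even number of elements, and a last class with a single element for an odd number.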
Mode of the proposed frequency table
When dealing with discrete data, the mode is the most suitable measure of location. In the proposed table, the mode () is used to represent the observations in each class. A class can have more than one mode when two or more observations share the highest frequency. The modes and , of the classes of the two variables, which represent the magnitude of observations in each class, are the elements in the classes that occurred the most. The existing class rules apply to continuous frequency tables, and no rules exist for the discrete case. Therefore, we derived a rule for grouping the elements into class intervals for the proposed discrete frequency table. The grouping criterion combines neighboring elements, in either ascending or descending order, since they have similar characteristics. The idea behind grouping the elements is to obtain a manageable table, since a substantial number of elements in the data results in a very long table that cannot be easily handled. In the proposed frequency table, all the classes can have an equal number of elements, but sometimes either the first or the last class may have a different number of elements. If the number of elements in both variables is less than 10, there is no need for grouping.
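As a minimal sketch of the class mode, the following R code groups the sorted distinct elements of a single variable and returns the most frequent element(s) of each class. The function name `class_modes`, the argument `g` (elements per class), and the data are assumptions made for illustration; the sketch lets the last class absorb any remainder and is not the appendix implementation.

```r
# Hypothetical helper: modes of the classes obtained by grouping the sorted
# distinct elements of x, g at a time (the last class may be smaller).
class_modes <- function(x, g) {
  elems <- sort(unique(x))
  class_id <- ceiling(seq_along(elems) / g)    # class index of each element
  counts <- table(factor(x, levels = elems))   # frequency of each element
  lapply(split(seq_along(elems), class_id), function(idx) {
    f <- as.integer(counts[idx])
    elems[idx][f == max(f)]                    # all elements tied for the maximum
  })
}

# Made-up data: classes are {0, 1}, {2, 3}, {4, 5}
x <- c(0, 0, 1, 2, 2, 2, 3, 4, 4, 5)
modes <- class_modes(x, g = 2)
print(modes)

# A class whose elements occur equally often has more than one mode
tied <- class_modes(c(7, 8), g = 2)
print(tied)
```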
Modifying the Cochran (1954) rule by using the number of elements instead of the sample size (), we derived a formula for grouping the elements () as
where and is the grouping number. The proposed bivariate discrete frequency table can be constructed in R using the code given in the appendix.
To describe the proposed table, we performed simulation studies using the bivariate binomial and bivariate Poisson distributions. Two joint discrete variables X and Y are said to have a bivariate binomial distribution if their joint probability mass function is given by
where and are, respectively, the first and second successes, and is the common number of trials. The , , , Var(Y) = . Meanwhile, the bivariate Poisson distribution is given by
The notations, , , , the parameters of the distribution, are positive real numbers, is an integer between and . The mean and variance of variable X are equal, that is, . Likewise, the mean and variance of variable Y are equal, . The covariance of X and Y is given as (Karlis & Ntzoufras, 2003).
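For reference, the standard form of the bivariate Poisson distribution discussed by Karlis and Ntzoufras (2003) can be written as follows. The symbols $\lambda_1$, $\lambda_2$, $\lambda_3$ and the summation index $k$ are assumed notation and may differ from the symbols used in the original article.

```latex
P(X = x, Y = y) = e^{-(\lambda_1 + \lambda_2 + \lambda_3)}
  \frac{\lambda_1^{x}}{x!} \frac{\lambda_2^{y}}{y!}
  \sum_{k=0}^{\min(x, y)} \binom{x}{k} \binom{y}{k} \, k!
  \left( \frac{\lambda_3}{\lambda_1 \lambda_2} \right)^{k},
\qquad x, y = 0, 1, 2, \ldots

E(X) = \operatorname{Var}(X) = \lambda_1 + \lambda_3, \qquad
E(Y) = \operatorname{Var}(Y) = \lambda_2 + \lambda_3, \qquad
\operatorname{Cov}(X, Y) = \lambda_3 .
```

In this parameterization, $\lambda_3$ controls the dependence between the two counts, and $\lambda_3 = 0$ reduces the distribution to the product of two independent Poisson distributions.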
4. Results and discussion
Simulation
In this study, to observe the pattern of the bivariate discrete frequency table and illustrate the proposed table, we performed simulations using the bivariate binomial and bivariate Poisson distributions. Two different studies, each using 100 samples from the bivariate binomial distribution with parameters , , but different , and were carried out. The third study used 100 samples of size 1000 from the bivariate Poisson distribution with parameters , , and , while the fourth used 100 samples of size 1000 from the bivariate Poisson distribution with parameters , , and .
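A simulation of this kind can be sketched in R as follows. The sketch draws bivariate Poisson samples by the common trivariate reduction method (X = W1 + W3, Y = W2 + W3 with independent Poisson W's), which is an assumed technique and not necessarily the generator used by the authors. The function name `rbivpois` and the parameter values 2, 3, and 1 are hypothetical, chosen only for illustration.

```r
# Hypothetical generator: bivariate Poisson via trivariate reduction.
rbivpois <- function(n, lambda1, lambda2, lambda3) {
  w1 <- rpois(n, lambda1)
  w2 <- rpois(n, lambda2)
  w3 <- rpois(n, lambda3)   # shared component that induces the covariance
  data.frame(x = w1 + w3, y = w2 + w3)
}

set.seed(123)
n_samples <- 100    # number of samples, as in the simulation design
sample_size <- 1000 # observations per sample

# For each sample, record the number of distinct elements of each variable
# and the largest joint frequency (illustrative parameter values).
summary_mat <- t(replicate(n_samples, {
  s <- rbivpois(sample_size, lambda1 = 2, lambda2 = 3, lambda3 = 1)
  c(kx = length(unique(s$x)),
    ky = length(unique(s$y)),
    max_freq = max(table(s$x, s$y)))
}))

# Range of each quantity over the 100 samples
print(apply(summary_mat, 2, range))

# One sample, kept for constructing a frequency table
one_sample <- rbivpois(sample_size, lambda1 = 2, lambda2 = 3, lambda3 = 1)
```

The ranges printed at the end correspond to the kind of summary reported for each study: how many elements each variable takes across samples and how large the joint frequencies become.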
The first study shows that the numbers of elements for variables X and Y are, respectively, within the ranges , and . The least joint frequency among the 100 samples is , while the maximum frequency is 89. One sample is used to construct the existing bivariate discrete frequency table, Table , and to describe the bi-element bivariate discrete frequency table, Table . The elements in the bivariate discrete data are partitioned into two, as suggested by Equation (1). The classes of variable have an equal number of elements; hence, the classes are complete. Meanwhile, all the variable classes have the same number of elements except the last class; therefore, the classes are incomplete. The class modes, and , represent the magnitude of observations in each class of the two variables.
For the second study from the bivariate binomial distribution, the pattern indicates that the numbers of elements for variables X and Y are both in the interval , the minimum joint frequency is , and the maximum frequency is 39. Again, a sample is used to depict the existing bivariate discrete frequency table, Table , and to illustrate the tri-element bivariate discrete frequency table, Table . As suggested by Equation (1), the elements in the sample data are grouped into three. Both variables' classes have an equal number of elements; hence, the proposed table is complete. The class modes, and , represent the magnitude of observations in each class of the two variables. The class modes are the elements in each class that occurred the most. A class could have more than one element as a mode if two or more elements occurred equally often.
The third study, using data from the bivariate Poisson distribution with parameters , , and , shows that the numbers of elements for variables X and Y are, respectively, in the intervals and . Meanwhile, the smallest frequency of the 100 samples is , whereas the highest frequency is . The existing and the bi-element bivariate discrete frequency tables, and , are both constructed using one of the samples. Equation (1) suggested partitioning the elements in the sample data into two. All the variable classes have an equal number of elements except the last class; hence, the classes are incomplete. In contrast, the variable classes have the same number of elements; hence, the classes are complete. Class modes , and , which represent the magnitude of observations in the variables' classes, are the elements in each class that occurred the most.
The fourth simulation study, using the bivariate Poisson distribution, shows that the numbers of elements for variables X and Y are, respectively, within the intervals and . The least joint frequency of the 100 samples is , and the maximum frequency is . One sample is used to construct the existing bivariate discrete frequency table, , and to depict the tri-element bivariate discrete frequency table, . As suggested by Equation (1), the elements in the sample data are grouped into three. The last class of variable has a different number of elements; hence, the variable classes are incomplete. All variable classes have an equal number of elements; therefore, the classes are complete.
Moreover, the proposed bivariate discrete frequency table is demonstrated with the first class having a different number of elements, using data from the bivariate Poisson distribution with parameters , , and . is the existing table, while is the proposed table with elements partitioned into three and the first class having a different number of elements.
Application
To illustrate the proposed table using real data, we used the English Premier League Team . The data, which cover seasons 2006/2007 to 2017/2018, were obtained from the English Premier League website and deposited on the Kaggle website. The data contain 41 variables and 240 observations, but only two variables, the number of wins and the number of clean sheets, are used in this study. presents the bivariate discrete frequency table of the number of wins and clean sheets for 12 English Premier League seasons. Meanwhile, displays the tri-element bivariate discrete frequency table. The variables and respectively represent the number of clean sheets and wins. Across all the seasons, Manchester City recorded the highest number of wins, 32, with 18 clean sheets, followed by Chelsea with 30 wins and 16 clean sheets. The worst-performing club across all the seasons is Derby County, with only one win and three clean sheets. Using Equation (1), the proposed bivariate frequency table, , is constructed by grouping the elements into three (tri-element). Both variables' classes have a different number of elements in the last class; hence, the proposed table is incomplete. The class modes, and , represent the magnitude of observations in each class of the two variables.
In the proposed table, , the numbers of clean sheets and wins are grouped into three. Only one club recorded wins in the interval 1, 3, 4 with clean sheets in the interval 2, 3, 4, and two clubs had a number of wins in the interval 1, 3, 4 with clean sheets in the interval 5, 6, 7. This continues up to the last wins class, where one club had a number of wins in the interval 30, 32 with a number of clean sheets in the interval 14, 15, 16. Likewise, only one club had numbers of wins and clean sheets in the intervals 30, 32 and 17, 18, 19, respectively. The proposed bivariate frequency table, , is more manageable than its existing counterpart, . Indeed, the existing table ceases to be practical when the numbers of elements in the two variables, and , are large.
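The grouped two-way table described above can be imitated on a handful of made-up records. The values below are invented for illustration and are not the Premier League data; `group_elements` is a hypothetical helper that labels each observation with the class of neighboring elements it belongs to. It is a compact sketch of the idea, not the appendix code.

```r
# Hypothetical helper: map each observation to its class label, where the
# sorted distinct elements are grouped g at a time.
group_elements <- function(x, g) {
  elems <- sort(unique(x))
  class_id <- ceiling(seq_along(elems) / g)
  # one label per class, e.g. "1,3,4"
  labels <- vapply(split(elems, class_id), paste, character(1), collapse = ",")
  factor(class_id[match(x, elems)],
         levels = seq_along(labels), labels = unname(labels))
}

# Made-up records (not the real data)
wins         <- c(1, 3, 4, 8, 9, 10, 30, 32, 8, 9)
clean_sheets <- c(3, 5, 6, 7, 9, 10, 16, 18, 7, 9)

# Tri-element table: rows are clean-sheet classes, columns are win classes
tab <- table(clean_sheets = group_elements(clean_sheets, 3),
             wins = group_elements(wins, 3))
print(tab)
```

With eight distinct elements per variable and three elements per class, the last class of each variable holds only two elements, which mirrors the incomplete case discussed in the text.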
5. Conclusion
The proposed bivariate discrete frequency table is more manageable and straightforward as compared with the existing counterpart. Indeed, the table provides a better option when the number of elements in the paired discrete data is substantial.
Public interest statement
Exploratory data analysis (EDA) plays a significant role in statistics. The existing bivariate discrete frequency table, one of the EDA tools, is straightforward, but when the number of elements in the data is large, the table can become complicated. This research proposes a new bivariate discrete frequency table that groups the elements in each variable. The new table is described using simulations from the bivariate binomial and bivariate Poisson distributions and real data obtained from the English Premier League website. The public will find the new table helpful, as it provides a better alternative when the number of elements is substantial and reveals the essential data features.
Acknowledgements
This research is partially funded by Universiti Putra Malaysia grant GP/2018/969400. The first author is supported by Tetfund, Federal government of Nigeria scholarship grant.
Disclosure statement
No potential conflict of interest was reported by the authors.
Notes on contributors
M. B. Mohammed
M. B. Mohammed is a Lecturer in the Department of Mathematics and Computer Science, Federal University of Kashere, Gombe State, Nigeria. His research interest is in exploratory data analysis, extreme value theory, and circular statistics, among others.
M. B. Mohammed is a lecturer at the Department of Mathematics and Statistics, Federal University of Kashere, Gombe State, Nigeria. He was born on 16 May 1982 in Gombe, Gombe State, Nigeria. He, respectively, obtained his national diploma and bachelor’s degree in statistics from Federal Polytechnic Damaturu and Modibbo Adama University of Technology Yola, Nigeria. He finished his Master of Sciences in Statistics from the University of Ilorin, Nigeria, in 2015. He obtained his Ph.D. in statistics in the area of Exploratory Data Analysis (EDA).
His research interest is in Exploratory Data Analysis, Extreme Value, Circular Statistics, and Survival Analysis.
H. S. Zulkafli
H. S. Zulkafli is a senior lecturer in the Institute for Mathematical Research, Universiti Putra Malaysia. She has vast experience in Bayesian statistics.
N. Ali
N. Ali is a senior lecturer in the Institute for Mathematical Research, Universiti Putra Malaysia. Her research interest is in extreme value theory.
O. R. Olaniran
O. R. Olaniran is a lecturer in the statistics department, University of Ilorin, Kwara State, Nigeria. His research areas are biostatistics and data science. He has experience in Bayesian statistics, mathematical statistics and inference, statistical computing, biostatistics and survival analysis, and time series and econometrics.
References
- Beniger, J. R., & Robyn, D. L. (1978). Quantitative graphics in statistics: A brief history. The American Statistician, 32(1), 1–11. https://www.tandfonline.com/doi/abs/10.1080/00031305.1978.10479235
- Cochran, W. (1954). Some methods for strengthening the common chi-square test. Biometrics, 10(4), 417–451. https://doi.org/10.2307/3001616
- Davies, G. R. (1929). The analysis of frequency distributions. Journal of the American Statistical Association, 24(168), 349–366. https://doi.org/10.1080/01621459.1929.10502532
- Doane, D. P. (1976). Aesthetic frequency classifications. The American Statistician, 30(4), 181–183. https://www.tandfonline.com/doi/abs/10.1080/00031305.1976.10479172
- Fisher, M. J., & Marshall, A. P. (2009). Understanding descriptive statistics. Australian Critical Care, 22(2), 93–97. https://doi.org/10.1016/j.aucc.2008.11.003
- Freedman, D., & Diaconis, P. (1981). On the histogram as a density estimator: L2 theory. Probability Theory and Related Fields, 57(4), 453–476. https://doi.org/10.1007/BF01025868
- Gardiner, V., Gardiner, G., & Catmog, G. G. (1979). Analysis of frequency distributions.
- Gelman, A. (2011). Why tables are really much better than graphs. Journal of Computational and Graphical Statistics, 20(1), 3–7. https://doi.org/10.1198/jcgs.2011.09166
- Gelman, A., Pasarica, C., & Dodhia, R. (2002). Let’s practice what we preach: Using graphs instead of tables. The American Statistician, 56(2), 121–130. https://doi.org/10.1198/000313002317572790
- Gravetter, F. J., Wallnau, L. B., Forzano, L.-A. B., & Witnauer, J. E. (2020). Essentials of statistics for the behavioral sciences. Cengage Learning.
- Karlis, D., & Ntzoufras, I. (2003). Analysis of sports data by using bivariate Poisson models. Journal of the Royal Statistical Society: Series D (The Statistician), 52(3), 381–393. https://doi.org/10.1111/1467-9884.00366
- Kastellec, J. P., & Leoni, E. L. (2007). Using graphs instead of tables in political science. Perspectives on Politics, 5(4), 755–771. https://doi.org/10.1017/S1537592707072209
- Kenney, J. F. (1939). Mathematics of statistics. D. Van Nostrand.
- Manikandan, S. (2011). Frequency distribution. Journal of Pharmacology & Pharmacotherapeutics, 2(1), 54. https://doi.org/10.4103/0976-500X.77120
- Mohammed, M. B., Adam, M. B., Ali, N., & Zulkafli, H. S. (2020). Improved frequency table’s measures of skewness and kurtosis with application to weather data. Communications in Statistics - Theory and Methods, 1–18. https://doi.org/10.1080/03610926.2020.1752386
- Mohammed, M. B., Adam, M. B., Zulkafli, H. S., & Ali, N. (2020). Improved frequency table with application to environmental data. Mathematics and Statistics, 8(2), 201–210. https://doi.org/10.13189/ms.2020.080216
- Scott, D. W. (1979). On optimal and data-based histograms. Biometrika, 66(3), 605–610. https://doi.org/10.1093/biomet/66.3.605
- Sturges, H. A. (1926). The choice of a class interval. Journal of the American Statistical Association, 21(153), 65–66. https://doi.org/10.1080/01621459.1926.10502161
- Xu, D., & Wang, Y. (2020). Area-proportional visualization for circular data. Journal of Computational and Graphical Statistics, 29(2), 351–357. https://doi.org/10.1080/10618600.2019.1654881
6. Appendix
Bitable <- function(data, group) {
  # data: the bivariate discrete data (two columns)
  # group: the number of elements in each class,
  #        which can be determined using Equation (1).

  ## Creating the univariate frequency table
  Unitable <- function(data, colNum) {
    freq <- data[, colNum]
    n <- length(freq)
    # count values that contain a decimal point (non-integer observations)
    id <- sum(grepl("\\.", freq))
    sorted_data <- sort(freq)
    uni_freq <- unique(sorted_data)
    n_freq <- length(uni_freq)
    if (id == 0) {
      freq2 <- table(freq)
      freq_discrete1 <- as.data.frame(freq2)
      col_1 <- c(1:n_freq)
      freq_discrete2 <- cbind(col_1, freq_discrete1)
      colnames(freq_discrete2)[1:3] <- c("class", "xi", "f")
      freq_discrete2
    } else {
      print("Not Discrete Data")
    }
  }
  ## Splits x either into n groups or into groups of size n plus an overflow group
  spliter <- function(x, n, force.number.of.groups = TRUE,
                      len = length(x), groups = trunc(len / n), overflow = len %% n) {
    if (force.number.of.groups) {
      f1 <- as.character(sort(rep(1:n, groups)))
      f <- as.character(c(f1, rep(n, overflow)))
    } else {
      f1 <- as.character(sort(rep(1:groups, n)))
      f <- as.character(c(f1, rep("overflow", overflow)))
    }
    g <- split(x, f)
    if (force.number.of.groups) {
      g.names <- names(g)
      g.names.ordered <- as.character(sort(as.numeric(g.names)))
    } else {
      g.names <- names(g[-length(g)])
      g.names.ordered <- as.character(sort(as.numeric(g.names)))
      g.names.ordered <- c(g.names.ordered, "overflow")
    }
    return(g[g.names.ordered])
  }
  ## Mode of x; returns all tied modes when return_multiple = TRUE
  stat_mode <- function(x, return_multiple = TRUE, na.rm = FALSE) {
    if (na.rm) {
      x <- na.omit(x)
    }
    ux <- unique(x)
    freq <- tabulate(match(x, ux))
    mode_loc <- if (return_multiple) which(freq == max(freq)) else which.max(freq)
    return(ux[mode_loc])
  }
  ## Collapses the elements of class `ind` into one comma-separated label
  split_list_into_single = function(LIST, ind) {
    xlist = LIST[[ind]]
    paste0(xlist[1:length(xlist)], ",", collapse = "")
  }
  ## Mode(s) of class `ind`, given the element frequencies ff
  get_mode = function(xx, ff, ind) {
    xm = rep((xx[[ind]]), ff[as.numeric(xx[[ind]])])
    paste0(stat_mode(xm), ",", collapse = "")
  }
  create_tab = function(data, colNum, nvar) {
    # nvar is the number of elements intended in each class
    # colNum is the column number
    Freq_table = Unitable(data, colNum) ## generate frequency table
    x = Freq_table[, 2] # subset x
    freq = Freq_table[, 3]
    if (length(x) %% nvar == 0) {
      splits = spliter(x, n = length(x) / nvar, force.number.of.groups = TRUE)
    } else {
      splits = spliter(x, nvar, force.number.of.groups = FALSE)
    }
    sumf = as.numeric(lapply(splits, function(i) sum(freq[i])))
    tab = matrix(NA, nrow = length(splits), ncol = 3)
    for (i in 1:length(splits)) {
      tab[i, ] = c(split_list_into_single(splits, i), sumf[i], get_mode(splits, freq, i))
    }
    if (sumf[length(splits)] != 0) {
      tabf = data.frame(tab)
      colnames(tabf) = c(paste0("xi", ",", "xi+1", ",", "xi+2", " … "), "freq",
                         paste0("Mode", "(", "xi", ",", "xi+1", ",", "xi+2", " … ", ")"))
      tabf
    } else {
      tabf = data.frame(tab)
      colnames(tabf) = c(paste0("xi", ",", "xi+1", ",", "xi+2", " … "), "freq",
                         paste0("Mode", "(", "xi", ",", "xi+1", ",", "xi+2", " … ", ")"))
      tabf[-length(splits), ]
    }
  }
  #### Creating the bivariate frequency table
  ## Function that builds the class labels from the sorted distinct elements
  cut.unique = function(x, group) {
    uniquevalues = unique(x)
    sort.uni = sort(uniquevalues)
    ngroups = ceiling(length(sort.uni) / group)
    em = NULL
    if (length(sort.uni) %% group == 0) {
      for (i in 1:ngroups) {
        emf = NULL
        for (j in (group - 1):0) {
          emf = c(emf, paste0(sort.uni[i * group - j], ","))
        }
        em = c(em, paste0(emf, collapse = ""))
      }
    } else {
      ngroups2 = ceiling((length(sort.uni) - (length(sort.uni) %% group)) / group)
      em. = NULL
      for (i in seq_len(ngroups2)) {
        emf. = NULL
        for (j in (group - 1):0) {
          emf. = c(emf., paste0(sort.uni[i * group - j], ","))
        }
        em. = c(em., paste0(emf., collapse = ""))
      }
      # The remaining elements form the last, smaller class. This statement is
      # truncated in the source; it is completed here by analogy with the loop above.
      em. = c(em., paste0(sort.uni[((length(sort.uni) - (length(sort.uni) %% group)) + 1):length(sort.uni)],
                          ",", collapse = ""))
      em = em.
    }
    em
  }
  # Function that assigns each observation to its class of g elements
  # install.packages("stringr")
  library(stringr)
  cutdiscrete = function(x, g) {
    # x: discrete values
    # g: number of elements in each class
    uniqueg = cut.unique(x, g)
    fvec = matrix(NA, nrow = length(x), ncol = length(uniqueg))
    for (i in 1:length(uniqueg)) {
      # recover the numeric elements of class i from its label
      classi = as.numeric(stringr::str_extract_all(uniqueg[i], "\\d+")[[1]])
      if (length(classi) != 1) {
        begin = classi[1]
        endin = classi[length(classi)]
        fvec[, i] = ifelse((x >= begin) & (x <= endin), i, 0)
      } else {
        fvec[, i] = ifelse(x == classi, i, 0)
      }
    }
    ffvec = rowSums(fvec)
    ffvec2 = factor(ffvec, labels = uniqueg)
    ffvec2
  }
  ## A function that constructs the two-way frequency table
  bivfreqtab = function(data, group) {
    if (max(length(unique(data[[1]])), length(unique(data[[2]]))) <= 10) {
      # few elements: the existing (ungrouped) table is used
      tab = table(data[[1]], data[[2]])
    } else {
      s1 = cutdiscrete(data[[1]], group)
      s2 = cutdiscrete(data[[2]], group)
      tab = table(s1, s2)
    }
    return(tab)
  }
  # get the modes using create_tab function
  data1 = data.frame(data[[1]])
  data2 = data.frame(data[[2]])
  mod1 = create_tab(data1, 1, nvar = group)[, 3] # modes of the first variable
  mod2 = create_tab(data2, 1, nvar = group)[, 3] # modes of the second variable
  tab = bivfreqtab(data = data, group = group)
  tabc = rbind(colmode = paste0(mod2), tab, deparse.level = 0)
  tabf = data.frame(cbind(rowmode = c("", paste0(mod1)), tabc))
  colnames(tabf) = c("rowmode", colnames(tab))
  tabf
}