
A Survey of Big Data Issues in Electronic Health Record Analysis


ABSTRACT

The Electronic Health Record (EHR) groups all digital documents related to a given patient, such as the anamnesis, results of laboratory tests, prescriptions, and recorded medical signals such as ECG or images. Dealing with such a data representation incurs a plethora of problems: different data types, unstructured data (i.e., doctor’s notes), huge and fast-growing volume, etc. Therefore, the EHR should be considered one of the most complex data objects in the information processing industry. Accordingly, taking into consideration its complexity, heterogeneity, fast growth, and size, the analysis of EHR data increasingly needs big data tools. Such tools should be able to analyze datasets characterized by the so-called 4Vs (volume, velocity, variety, and veracity). Notwithstanding these, we should also add a fifth V, value, because analytics tool deployment makes sense only if it leads to health-care improvement (such as personalized patient care, decreasing unnecessary hospitalization, or reducing patient readmissions). In this study, we focus on selected aspects of EHR analysis from the big data perspective.

Introduction

The Electronic Health Record (EHR) collects a digital representation of individual patient documents, such as results of medical tests, doctors’ descriptions, prescriptions, images, etc. Its implementation is expected to improve the procedures and quality of health care, help to control costs, and reduce fraud (Graña and Jackwoski Citation2015). Its implementation induced changes in the way physicians deal with information and knowledge, forcing documentation procedures that are resented by some care providers as cumbersome and a waste of effort (Reich Citation2012), producing resistance to its implementation. A paradigmatic example is its recent implementation in Greece in the middle of a significant economic crisis (Vijayakrishnan et al. Citation2014). It has also been proposed that the EHR is a key instrument to support public health (Kukafka et al. Citation2007), which requires extending its contents to nonmedical environmental information and providing standard protocols and processing pipelines to make the EHR information broadly available. The health information exchange system connects primary care, hospitals, pharmacies, and laboratories to the health-care information management network, so that the EHR can play a central role in communications between health institutions and companies, as well as inside smart hospitals (Mertz Citation2014). Nevertheless, working EHR implementations produce massive interactions and generation of data. Therefore, EHR data management should be seen as large-scale data processing of complex data representations. In 2012, worldwide digital health-care data was estimated at 500 petabytes and is expected to reach 25,000 petabytes in 2020.Footnote1 These huge quantities of data require specific computational methods, such as distributed processing architectures (El-Sappagh and El-Masri Citation2014). Accordingly, taking into consideration its complexity, heterogeneity, fast growth, and size, health-care data can be categorized as a big data problem, requiring special tools to analyze it (Mathew and Pillai Citation2015) in order to deal efficiently with the 5Vs:

  • Volume has been identified as the main feature of big data problems, because the data scale requires innovative tools to store and access it. A doctor’s note could be stored as a text file of a few kilobytes, but a raw image requires a few megabytes, and more sophisticated diagnostic tools, such as MRI, might easily produce a few gigabytes. If we multiply such volumes by the number of tests carried out in a hospital, we should be ready to deal with tera- and petabytes, and even with exabyte volumes, because each reasonable analysis should have access to entire patient histories. Taking into consideration medical data, we should also address the following issues:

    • Data distribution versus privacy. Dealing with distributed data could potentially cause numerous problems, such as diversity in the quality of data sources; we should also notice that using distributed databases might easily clash with legal or commercial limitations that do not allow sharing raw data among databases or merging them into a common repository. Therefore, developing privacy-preserving versions of existing data analysis techniques has been the subject of intense research efforts, primarily aiming to offer the possibility of distributed data analysis with a minimum risk of information disclosure.

    • Response time of analytic systems. The problem is not the volume to store (prices are going down), but the response time. Users will not accept waiting too long for the answer to a typical query.

    • Special structures for data, i.e., migrating from operational database systems to analytic databases (data warehouses).

    • Standards for data collection (such as Digital Imaging and Communications in Medicine (DICOM)Footnote2, used by picture archiving and communication systems (PACS)) and transfer (such as HL7Footnote3).

  • Velocity. Data is in motion: new information about patients is added and some medical records are updated. Therefore, smart analytic tools applied to EHR analysis require that models need not be rebuilt from scratch when new data come, but only improved. Thus, the analysis of data streams has risen as one of the crucial topics for data scientists.

  • Variety. The EHR groups inhomogeneous data: on the one hand, structured data in the form of standardized medical information, such as DICOM or International Classification of Diseases (ICD) codes; on the other hand, highly valuable information buried in doctors’ notes, which are written in natural language, waiting to be mined and assigned value. Therefore, methods of data structuralization, natural language processing, and automatic image/video analysis are the focus of intense research. The following problems should also be considered:

    • Data privacy versus data integration. The EHR inhomogeneity extends to the personal nature of the data, which is protected by law in most civilized countries. Therefore, security and privacy are mandatory in EHR processing and transmission through communication channels.

    • Cost of data integration and structuralization versus profits.

  • Veracity. Data quality is a crucial issue, because only high-quality data can produce credible models. Unfortunately, most data is biased and noisy. Often, we may find anomalies, such as outliers in the data or missing items (see the sketch after this list). Human errors are an important issue as well. Kohn et al. (Kohn, Corrigan, and Donaldson Citation2000) report that, each year, 44,000–98,000 patients die in the United States due to avoidable clinical mistakes. Therefore, the development of methods that are able to detect wrong treatments and/or diagnoses and clean the EHR of errors is strongly required.

  • Value. Analytic tool deployment makes sense only if it leads to improvement in health care (such as personalized patient care, decreasing unnecessary hospitalizations, or reducing patient readmissions). Apart from the economic aspects concerned with ways of reducing health-care costs while assuring the same or higher quality level of service, we should also remember that such methods should focus on the issue of how to improve patient well-being. For instance, the detection of disease outbreaks (Dawson, Gailis, and Meehan Citation2015) or general/specific population conditions, such as prevalence of infections (Shang et al. Citation2015) in specific populations, has a high public-health value and can readily be computed from EHR data.
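As a toy illustration of the veracity checks mentioned above, the following sketch flags missing items and physiologically implausible values in a small table of measurements. Column names and thresholds are hypothetical assumptions; a real system would rely on clinical reference ranges.

```python
# A minimal sketch of basic veracity checks on a toy lab-results table.
import pandas as pd

records = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "systolic_bp": [120, 135, None, 460, 118],   # None = missing, 460 = implausible
    "heart_rate":  [72, 80, 67, 71, 500],        # 500 bpm is an entry error
})

# Flag missing items.
missing = records[records["systolic_bp"].isna()]

# Flag implausible values with simple range rules (hypothetical thresholds).
rules = {"systolic_bp": (60, 260), "heart_rate": (20, 250)}
for col, (lo, hi) in rules.items():
    bad = records[(records[col] < lo) | (records[col] > hi)]
    print(f"{col}: {len(bad)} out-of-range record(s)")
print(f"missing systolic_bp: {len(missing)} record(s)")
```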

In this work, we focus on selected aspects of EHR analysis from the big data perspective (Cyganek et al. Citation2015), such as efficient computer system structures, data quality and data privacy, data mining, cost-sensitive approaches, and, especially, medical data stream analysis. The analysis of huge volumes of fast-arriving data has recently been the focus of intense research, because such methods could provide a competitive advantage to a given company. One of the most promising approaches is data stream classification, which is employed to solve nonstationary problems related to the discovery of client preference changes, spam filtering, fraud detection, and medical diagnosis, to enumerate only a few.

Efficient computing system structure

With regard to the volume dimension of big data, an important performance metric is the response time of the computer system utilized to process the data. Some big data applications are local oriented, i.e., the results of data processing are needed only locally. In the local-oriented scenario, the computer system required to process the data includes local servers that are responsible for running the algorithms that analyze the data, storage devices that store the data, and a local area network (LAN) connecting all elements of the system. In turn, some big data applications require computational resources of a scale greater than a local system to store, analyze, and deliver results. For ease of reference, we call these scenarios global-oriented. In particular, in the mainstream of global-oriented big data applications, the system produces results that are requested by a number of users distributed over a geographical area exceeding the local system. Another scenario is when big data analysis needs huge computational power that cannot be provided locally, and consequently, specialized data centers are utilized to process the data. Moreover, in order to reduce both capital expenditures (CAPEX) and operational expenditures (OPEX), global-oriented big data analysis can be carried out using the concept of cloud computing. More specifically, data centers that provide various processing and computing services are usually of very large scale, which facilitates the deployment process in terms of investment cost reduction. For instance, purchasing thousands of pieces of equipment for the data center (computers, storage, switches, cables, etc.) significantly reduces the required CAPEX. Furthermore, data centers are commonly deployed in places that allow low operational expenditures, which follow mostly from energy consumption: for example, places with relatively inexpensive energy (e.g., close to big power plants or sources of renewable energy) and with low power usage effectiveness (PUE) values resulting from reduced cooling needs (e.g., a cold climate).

Note that the common element of all global-oriented big data applications is the requirement of data transmission across large distances via wide area networks (WAN). Nowadays, the state-of-the-art WAN solution is given by optical networks, including technologies such as wavelength division multiplexing (WDM) and elastic optical networks (EONs). Usually, it is a two-layer network architecture with an optical network in the lower layer and a packet-switched network in the upper layer. In the case of big data applications, two scenarios are possible to provision big data network traffic: service network provisioning by the packet-switched network layer, which ensures delivery of the big data traffic, and optical network provisioning, in which big data traffic is transmitted directly over the optical layer with the use of short-lived lightpaths. In more detail, in the service network provisioning scenario, the traffic related to big data analysis is carried in the packet-switched network layer using the existing virtual topology of lightpaths, i.e., the underlying optical network is assumed to be overprovisioned, so that extra capacity is left to accommodate big data traffic according to new requests. However, limited network capacity can increase the system response time to values unacceptable for big data applications that require very fast reaction. The optical network provisioning scenario allows establishing dynamic short-lived lightpaths directly in the optical layer, fully dedicated to carrying the big data traffic. More specifically, each new data transfer request related to the big data analysis is provisioned by a new lightpath that directly connects the client node and the computing resources. This strategy supports transmissions with very high bandwidths (for instance, up to 1 Tb/s in EONs), ensuring a very short response time of the system. Besides the response time, other important performance metrics to be addressed in the context of big data processing are scalability, resilience, security, low cost (both CAPEX and OPEX), and energy efficiency (Erl, Puttini, and Mahmood Citation2013; Klinkowski and Walkowiak Citation2013; Venters and Whitley Citation2012; Walkowiak et al. Citation2015).

Today, big data analytic solutions are becoming widely accessible (Asri et al. Citation2015; Cyganek et al. Citation2015; Mathew and Pillai Citation2015), including the well-known Hadoop and MapReduce for massive job execution in scalable clusters of servers, and many other free, open source tools for distributed data processing, data visualization, and data mining of large-scale datasets, allowing applications such as epidemic prediction, personalized health care, minimization of medication errors (Nguyen et al. Citation2013), and evidence-based medicine. Likewise, a big investment in computational power can be avoided by using commercial cloud services, which allow a smooth scale-up of the system, even to the point of mobile app development (Martínez-Pérez et al. 2015). However, cloud-based implementations using third-party services raise strong security and privacy issues (Katt Citation2014; Mohammed et al. Citation2015), and their use has been found to be still immature (Hu and Bai Citation2014). Development of secret sharing schemes and algorithms will improve the opportunities to benefit from cloud services in EHR systems (Ermakova and Fabian Citation2013).
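To make the MapReduce programming model mentioned above concrete, the following minimal sketch counts diagnosis codes over toy visit records in plain Python. The field names and codes are hypothetical; a real deployment would run the same map/shuffle/reduce phases in parallel on Hadoop or Spark.

```python
# A minimal sketch of the MapReduce model on toy EHR visit data:
# map emits (ICD code, 1) pairs, shuffle groups by key, reduce sums each group.
from collections import defaultdict
from functools import reduce

visits = [
    {"patient": "p1", "icd": "E11"},   # type 2 diabetes (hypothetical records)
    {"patient": "p2", "icd": "I10"},   # hypertension
    {"patient": "p3", "icd": "E11"},
]

# Map phase: one (key, value) pair per record.
mapped = [(v["icd"], 1) for v in visits]

# Shuffle phase: group values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each group independently (hence trivially parallel).
counts = {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}
print(counts)  # {'E11': 2, 'I10': 1}
```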

Data quality

We also have to keep in mind that the quality of the results of any analytical tool will depend on the quality of the data used for the model-building process and/or the a priori knowledge that we have at our disposal. It is obvious that an unrepresentative learning dataset, or incomplete (or inconsistent) rules, will likely lead to the construction of a low-quality classifier. To boost the quality of the learning system input information, it is beneficial to use knowledge and data from different sources, e.g., data from numerous databases. In addition, we also have to answer the following questions:

  1. What is the quality of each source? To what extent does merging data or knowledge from unreliable sources (such as a low-quality expert, an unrepresentative dataset, or a database that includes numerous mistakes) decrease the quality of classification?

  2. How do we merge learning materials of different quality? We usually have a learning set and rules at our disposal, so we have to decide: do we use each source separately, e.g., train individual classifiers on the basis of each type of learning material, and then find a method to combine their decisions? Alternatively, we can transform the various forms of learning material into one unified form (meaning that datasets are transformed into rules using, e.g., well-known machine learning methods), or generate a set of learning examples on the basis of rules. Regardless of the choice, there still exist problems involving the quality of the learning material.

  3. Is the learning material consistent? We usually meet the problem of knowledge consistency: if we would like to merge learning materials from different sources, the merged material could be inconsistent.

The quality of knowledge plays a key role in machine learning systems. In designing a decision support system, we are able to gain knowledge from different sources (i.e., from different experts or on the basis of different databases), and we also have to take their differing qualities into consideration.
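One simple way to act on source quality, in the spirit of the combination strategy from question 2 above, is to train one classifier per source and weight each vote by an estimate of that source’s reliability. The sketch below is a minimal illustration under assumed weights and labels, e.g., accuracies measured on a common validation set.

```python
# A minimal sketch of merging decisions from per-source classifiers,
# weighting each vote by an estimate of the source's quality.
import numpy as np

def weighted_vote(predictions, weights, n_classes):
    """Combine class labels from several models by quality-weighted voting."""
    support = np.zeros(n_classes)
    for label, w in zip(predictions, weights):
        support[label] += w
    return int(np.argmax(support))

# Three classifiers trained on three databases of differing reliability.
preds = [1, 0, 1]            # class labels proposed for one patient (illustrative)
quality = [0.9, 0.55, 0.7]   # e.g., accuracy on a common validation set
print(weighted_vote(preds, quality, n_classes=2))  # -> 1
```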

This problem was partly described for induction learning (Bruha and Kocková Citation1993; Dean and Famili Citation1997; Gur-Ali and Wallace Citation1993) and concept description (An and Cercone Citation1999; Bergadano et al. Citation1988), and we can find several works dealing with decision making on the basis of learning set quality; for instance, Schapire (Citation2001) aims to construct classifiers based on a very small learning set. Such problems might be regarded as decision making based on unreliable (nonrepresentative) sources.

Let’s consider the problem of how to assess the quality of a rule for probabilistic reasoning; in the next section, we propose a kind of measure that can be applied to the knowledge acquisition process for diverse forms of rule-based models. The use of semantic resources, such as ad hoc ontologies, and data correctness verification based on semantic processing have been proposed to achieve data quality in the big data scenario of EHR (Gai et al. Citation2015). The alignment of semantic expectations with the actual data extracted from the EHR annotations allows detection of errors at a much more abstract level, before their exploitation by decision support systems (Kashfi Citation2011) and data mining. Also, paying special attention to usability issues has been pointed out as a way to achieve much better data quality (Villa and Cabezas Citation2014). If the health-care provider is not overwhelmed by the data input process, the probability of errors will be much smaller (Vedanthan et al. Citation2015).

Cost and benefit

Today, for most of the practical decision problems coming from different areas of human activity, we can choose an appropriate analytical tool, such as a classification method, to make a high-quality decision. However, the choice is usually limited by several constraints related to the costs of data gathering and of classification computation. We can frame this problem in terms of the training and exploitation costs of classifiers. The first, and probably best-known, criterion is the misclassification cost, which is widely discussed in the literature (Duda, Hart, and Stork Citation2001) and is the key criterion in Bayesian decision theory. It proposes to define a so-called loss function, which gives the misclassification cost for each pair of classes. According to this theory, the optimal classifier makes decisions minimizing the expected value of the loss function. We can also find several propositions on how to deal with it. For instance, Peng et al. (Citation2005) discussed how to create a cost-sensitive classifier ensemble for a medical decision support system. Of course, the misclassification cost is important, but we can find other sources that might generate additional costs during training or exploitation of classifiers:

  • Cost of training is usually high for classifiers that build structures used during classification, e.g., decision trees (Alpaydin Citation2010).

  • Cost of testing.

  • Cost of decision making usually takes the cost of necessary feature acquisition into consideration.

Let us focus here on the last type of cost source; the discussion will be continued in the next sections as well, where we concentrate on the cost-sensitive approach applied to combined classifiers. The cost may be measured as the price of medical tests, or as time in the case of measurements whose repetition requires significant time. Such a meaning of cost-sensitive classification arises frequently in many fields of human activity, such as industrial production processes (Verdenius Citation1991), robotics (Tan and Schlimmer Citation1989), and technological diagnosis (Lirov and Yue Citation1991), to enumerate only a few. However, this problem is most clearly visible in medical diagnosis (Núñez Citation1988). As mentioned, for most decision problems we have the necessary tools to make a high-quality decision, but, from a practical point of view, we have to notice that a physician has a limited budget within which to make tests for a diagnosis. Therefore, in real cases, physicians have to balance the costs of various tests with the expected benefits. Physicians usually have to make the diagnosis quickly on the basis of (low-cost) features acquired from measurements that do not require much time to conduct, because the therapeutic action has to be taken without any delay, e.g., in the case of a rescue operation.
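The Bayesian minimum expected loss rule mentioned above can be stated in a few lines: given posterior class probabilities and a loss matrix L[i][j] (the cost of deciding class i when the true class is j), choose the decision with the lowest expected loss. The numbers in this sketch are illustrative, not clinical.

```python
# A minimal sketch of the Bayesian minimum expected loss decision rule.
import numpy as np

posterior = np.array([0.7, 0.3])    # P(healthy|x), P(ill|x) (illustrative)
loss = np.array([[0.0, 100.0],      # deciding "healthy": costly if patient is ill
                 [5.0,   0.0]])     # deciding "ill": cheap follow-up test if wrong

expected_loss = loss @ posterior    # expected loss of each possible decision
decision = int(np.argmin(expected_loss))
print(expected_loss, "-> decide class", decision)
# [30.  3.5] -> decide class 1: asymmetric costs overrule the higher posterior
```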

In addition to the benefits that might be obtained by the patient, the practitioner, and the general public from the big data processing of EHR information (Michel-Verkerke, Stegwee, and Spil Citation2015), there is strong interest in the health-care industry, where EHR-related products are becoming a big economic sector. EHR implementations alone were estimated at $15.5 billion in 2010 and are projected to grow to $19.7 billion (Nguyen, Bellucci, and Nguyen Citation2014). The global mobile health-care market was valued at $6.6 billion in 2013 (Ben-Assuli Citation2015). The pharmaceutical industry, as well as other health-related industries, might be directly interested in EHR data mining (Kim, Labkoff, and Holliday Citation2008; Moor et al. Citation2015), looking for innovations, assessment of the effect of drugs, identification of target health-care markets, and real data for the development of medicines (Berger et al. Citation2015). Also, government agencies are interested in expense prediction in order to allocate resources (Vivas-Consuelo et al. Citation2014). All of them need consistent, updated, anonymous, and error-free data.

Data mining

The availability of extensive EHR databases allows data mining analyses from a wide variety of applied viewpoints. On-the-fly construction of probabilistic models from the stored EHR data has allowed building detectors of epidemic outbreaks (Dawson, Gailis, and Meehan Citation2015), using Bayesian modeling and particle filtering. The evaluation of the prevalence of infection features in home health-care data (Shang et al. Citation2015) allowed identifying an increasingly serious problem due to population aging. Similarly feasible are longitudinal studies using past EHR information (Oakkar et al. Citation2015), such as the one covering the acculturation of Asian American men and its correlation with body mass index. Also, retrospective analysis of EHR data allows improved definition of new clinical trials or observational studies (Kreuzthaler, Schulz, and Berghold Citation2015), for an optimal design of the trial. Natural language processing allows extracting meaningful information from semistructured data, wherein some information is written down by the clinician in the form of an informal report. This has been exploited more in some specific areas, such as digestive diseases (Hou, Imler, and Imperiale Citation2014), or in discovering surgical site infection risk (Michelson, Pariseau, and Paganelli Citation2014). A mixture of natural language processing of free text written in the EHR and statistical analysis of some of the data fields allowed for the detection of past histories of incidents leading to heart failure in a population of 50,000 primary care patients (Vijayakrishnan et al. Citation2014). Data mining can be extended to the relationships among physicians, nurses, and other roles in the health-care system, discovering the underlying social network structure (Malin, Nyemba, and Paulett Citation2011), which can be useful to set proper access rights and channel information in a proactive way. EHR-derived phenotyping allows for mapping of the EHR data into specific and meaningful medical concepts. Ho, Ghosh, and Sun (Citation2014), as well as Wang et al. (Citation2015), propose to use nonnegative tensor factorization for extraction of phenotypes from patient claim records. The obtained tensor factors can be used as phenotype candidates that automatically indicate patient clusters on specific diagnoses or treatments. Finally, the use of the EHR allows building patient summaries, such as notes on frequent medication, which allow treatment in emergency situations when the patient has little cognitive ability and cannot answer questions (Remen and Grimsmo Citation2011).

Data mining from EHR has been done in some specific areas, often related to highly prevalent diseases that call for some public governance understanding of the facts in order to define appropriate health policies. This is the case of childhood obesity, which is a growing concern in both developed and underdeveloped countries. The analysis of EHR data allows determining demographic factors with heavy influence on the rise of this abnormal condition (Baer et al. Citation2013; Brown et al. Citation2015; Flood et al. Citation2015). Another general public concern is cardiac conditions, which are a major cause of death. There have been a number of data mining studies based on EHR data aiming to identify risk factors (Green et al. Citation2012; Rubbo et al. Citation2015). A final example is diabetes, whose prevalence is steadily increasing due to eating and exercise habits and aging populations. EHR data mining looks for risk factors as well as comorbidities and cost analyses of effective treatment (Chung et al. Citation2015; Mashayekhi et al. Citation2015; Zhong et al. Citation2015).

Data stream analysis

The analysis of huge volumes of fast-arriving data has recently been the focus of intense research, because such methods allow building a competitive advantage for a given company. There are some works devoted to data visualization oriented toward interactive manipulation of the data (Gotz, Wang, and Perer Citation2014), with some emphasis on the search for patterns of clinical events. One useful approach is data stream classification, which is employed to solve problems related to the discovery of customer preference changes, spam filtering, fraud detection, and medical diagnosis, to enumerate only a few. Basically, there are two classifier design approaches:

  • build and use, which first focuses on training a model quickly and then using the trained classifier to make decisions;

  • after-model-construction, which tries to tune the model parameters continuously while making decisions.

The first approach is very traditional and is useful only under the assumptions that the data used for training and the data being recognized come from the same distribution, and that the number of training examples ensures that the model is well trained (i.e., it is not undertrained). Of course, for many practical tasks such assumptions can be accepted; nevertheless, many contemporary problems do not allow their acceptance, and we should take into consideration that the statistical dependencies describing the classification task, such as prior probabilities and conditional probability density functions, could change. As a result of these changes, the posterior probabilities (responsible for the decision boundary shapes) could change as well. Additionally, we should take into account the fact that data come very fast; therefore, it is impossible for a human expert to label arriving examples manually, and each object should instead be labeled by a classifier.

The first problem is called concept drift (Widmer and Kubat Citation1996), and efficient methods that are able to deal with it are still the focus of intense research, because the appearance of this phenomenon could potentially cause a significant accuracy deterioration of a classifier under exploitation (Wozniak, Kasprzak, and Cal Citation2013).

Therefore, developing methods that are able to deal effectively with this phenomenon has become a vital issue in pattern classification. Basically, the following approaches may be considered to deal with the problem:

  1. detecting the drift and retraining the classifier,

  2. rebuilding the model frequently, or

  3. adapting the classification model to data population (distribution) changes.

The last approach, “build, use, and improve,” is usually dedicated to data stream analysis, in which the apparent changes are rather slow and gradual. The methods related to this topic usually use continuous model updating after each incoming example (online learners) or collect a batch of data to update the model (sliding windows).
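A minimal sketch of this “build, use, and improve” strategy is given below: an online linear classifier is updated example by example, and a sliding window of recent errors serves as a crude drift signal that triggers rebuilding the model. The synthetic stream, the error threshold, and the use of scikit-learn’s SGDClassifier are illustrative assumptions, not a production drift detector.

```python
# A minimal online-learning sketch with a sliding window as a drift signal.
from collections import deque

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
classes = np.array([0, 1])
model = SGDClassifier(loss="log_loss")   # "log_loss" in recent scikit-learn versions
fitted = False
window = deque(maxlen=100)               # sliding window of recent 0/1 prediction errors

for t in range(2000):
    x = rng.normal(size=(1, 5))
    label = int(x.sum() > 0)
    y = np.array([1 - label if t >= 1000 else label])   # concept flips at t = 1000

    if fitted:                           # test-then-train: predict before updating
        window.append(int(model.predict(x)[0] != y[0]))
        if len(window) == window.maxlen and np.mean(window) > 0.35:
            model = SGDClassifier(loss="log_loss")      # drift suspected: rebuild model
            window.clear()
            fitted = False
    model.partial_fit(x, y, classes=classes)            # incremental, per-example update
    fitted = True
```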

In many pattern classification tasks there exist dependencies among the patterns to be classified. This is typical, e.g., of chronic disease diagnosis, for which historical data plays a crucial role in high-quality medical assessment.

The formalization of such a classification task must capture that successively classified objects are related to one another, but it could also take into consideration the occurrence of external factors changing the character of these relationships. Let’s illustrate this task using the example of medical diagnosis. The aim of this task is to classify the successive states of a patient. Each diagnosis could be the basis for a certain therapeutic procedure. Thus, we obtain a closed-loop system in which the object under observation is simultaneously subjected to control (treatment) dependent on the classification. The mentioned control could be recognized as the cause of the concept drift. In contrast to the traditional concept drift model, where drift appears randomly, in this case the drift has a deterministic nature.

Direct learning from EHR data is a challenge, because both the heterogeneous nature and the mixture of structured and nonstructured information of the EHR data are quite inappropriate for conventional machine learning. The current trend of applying deep learning approaches to many signal processing problems (face recognition, object recognition, strategy in games) finds difficulties when applied to EHR data, due to the difficulty of coding it as a real-valued vector. As an example, a predictive study about suicide tried to apply restricted Boltzmann machines (Tran et al. Citation2015) to embed the EHR data into a vector space that might be subject to further analytical processing; it required ad hoc modifications in the form of rules for the preservation of data structure and the constraint of preserving nonnegative weights in order to produce meaningful results. Another pressing issue that hinders machine learning application is that of privacy. The data must be freed from variable values that might allow user identification or their attribution to some group of people with negative implications (e.g., HIV infected), while the data might still be useful for applications such as recommendation of treatment based on similarity with other patients. The approach of Chignell et al. (Citation2013) performs clustering of some features, retaining the residuals after removing the cluster centroids, thus obtaining a shifted representation that is quite local and difficult to invert. Residual data can effectively be used for recommendations based on patient similarity.
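The residual representation just described can be sketched in a few lines: cluster the numeric feature vectors and keep only each record’s offset from its cluster centroid, which still supports similarity queries but is harder to map back to an individual. The synthetic data and the choice of k-means below are illustrative assumptions.

```python
# A minimal sketch of a residual (cluster-centered) representation.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))                  # stand-in for numeric EHR features

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
residuals = X - km.cluster_centers_[km.labels_]   # shifted, locally centered representation

# Patient similarity can still be computed on residuals:
i, j = 0, 1
print(np.linalg.norm(residuals[i] - residuals[j]))
```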

Multidimensional tools for EHR data representation and analysis

As previously alluded to, the EHR can contain different types of data, represented in various formats. However, what is most important for EHR data are ways of extracting vital information to facilitate patient health-care processes. Frequently, EHR data depend on a multitude of factors that render their analysis even more difficult. However, there are recently developed methods for the representation and analysis of multidimensional data with the help of tensor methods (Kolda and Bader Citation2009; Lathauwer et al. Citation2000b). These can be used for the representation and analysis of some types of EHR data, as will be discussed in this section.

A tensor is a multilinear real function operating on a vector space and its dual. However, for data mining, a tensor can be perceived as a multidimensional cube of numeric values that extends such well-known objects as vectors and matrices. For a short introduction to tensors for data analysis and software implementations, see, e.g., Cyganek (Citation2013). Examples of multidimensional medical data are ample, such as magnetic resonance imaging (MRI) data, X-ray images, or EHR-derived phenotyping, whose goal is to map EHR data to some medical concepts (Ho, Ghosh, and Sun Citation2014; Wang et al. Citation2015), to name a few. Interestingly enough, diffusion MRI, which aims at measuring the diffusion of water molecules in biological tissues, as well as diffusion tensor imaging (DTI), heavily rely on tensor analysis (Lauterbur Citation1973; Le Bihan et al. Citation1986).

In the case of EHR data, tensors can be used mostly for data compression and data analysis. For both applications, a proper tensor decomposition method should be employed. In the following, we provide a short overview of some principal tensor decomposition methods with information on potential application to EHR data processing (Cyganek and Wozniak Citation2015).

  1. Higher Order Singular Value Decomposition (HOSVD) is a multidimensional extension of the well-known SVD matrix decomposition to tensors. It allows tensor representation as a product of a core tensor and mode matrices (Kolda and Bader Citation2009; Lathauwer et al. Citation2000a); see the numerical sketch after this list. HOSVD is related to the Tucker decomposition, which was introduced into psychometrics by the pioneering work of Ledyard Tucker in the 1960s (Tucker Citation1966). HOSVD has a number of interesting properties. For instance, it builds orthogonal tensor subspaces, which can be used for pattern recognition (Cyganek Citation2013). It can also be used for data compression, although there are better methods (Wang and Ahuja Citation2007).

  2. Best rank-1 tensor decomposition, also known as CANDECOMP/PARAFAC (CP). This decomposition aims at approximating a tensor as a sum of rank-1 vector outer products (Lathauwer, Moor, and Vandewalle Citation2000b). For instance, in a phenotyping problem, an initial tensor is composed of the counts of co-occurrences between a diagnosis, a chosen medical procedure, and a given patient. This way, a 3D tensor is composed whose element at position (a,b,c) is a simple counter of the number of times a procedure “a” was used on a patient “b” having symptoms “c.” Finally, the nonnegative rank-1 decomposition allows extraction of the procedure, patient, and symptom factors, respectively.

  3. The best all-rank decomposition, also called the best rank-(R1, R2, …, RP) decomposition (Lathauwer, Moor, and Vandewalle Citation2000b). For a P-dimensional tensor, this decomposition aims at constructing an approximating tensor that fulfills the P-rank constraint. The best-rank tensor decomposition can be used for data compression and classification (Cyganek and Wozniak Citation2015; Wang and Ahuja Citation2007). However, the algorithm is iterative and frequently initialized with the HOSVD method.

  4. Nonnegative tensor decomposition with optional sparsity and orthogonality constraints (Hazan, Polak, and Shashua Citation2005). These types of tensor decompositions assume decomposition into only nonnegative factors. They have found applications in biology and medicine due to the clear interpretation of the decomposing factors, fitting well with the idea of a purely additive influence of each of the components discovered in the data (Cichocki, Zdunek, and Amari Citation2008).

  5. Tensor optimal subspace projections aim at decompositions into tensor subspaces that exhibit properties suitable for optimal classification. The optimal conditions are obtained by embedding proper constraints, such as sparsity or locality preserving, during the optimization process (Liu, Liu, and Chan Citation2010).

Apart from these, there are many other methods and variants of tensor decomposition, which can also be considered for efficient processing of various EHR data (Chi and Kolda Citation2012; Kolda and Bader Citation2009).
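As a minimal numerical sketch of the HOSVD from item 1 above, the following NumPy code decomposes a toy procedure × patient × symptom count tensor of the kind described in item 2: mode matrices come from the SVD of each mode-n unfolding, and the core tensor follows by projecting onto them. The tensor sizes and Poisson counts are illustrative assumptions.

```python
# A minimal HOSVD sketch on a toy 3D count tensor, using plain NumPy.
import numpy as np

T = np.random.default_rng(2).poisson(1.0, size=(4, 6, 5)).astype(float)

def unfold(tensor, mode):
    """Mode-n unfolding: mode-n fibers become the columns of a matrix."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """Multiply a tensor by a matrix along the given mode."""
    return np.moveaxis(np.tensordot(matrix, tensor, axes=(1, mode)), 0, mode)

# Mode matrices: left singular vectors of each mode-n unfolding.
U = [np.linalg.svd(unfold(T, n), full_matrices=False)[0] for n in range(T.ndim)]

# Core tensor: project T onto the transposed mode matrices.
core = T
for n, Un in enumerate(U):
    core = mode_product(core, Un.T, n)

# Check: multiplying the core back by the mode matrices recovers T exactly.
R = core
for n, Un in enumerate(U):
    R = mode_product(R, Un, n)
print(np.allclose(R, T))   # True
```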

Access and medical data protection

An important aspect of the health record system is to ensure the confidentiality of patients’ personal data (Fernandez-Aleman et al. Citation2013). These data today have a more virtual than physical existence; therefore, protecting them is difficult. The most basic security technological requirements are related to data encryption and certification, as well as identity authentication for data access. From a practical point of view, protection is provided by multilevel electronic technologies and computer science techniques. Integration of these services is governed in many countries (e.g., Australia, Japan, United States) by rule-based acts, which describe, among other issues, patients’ rights, obligatory information-protection steps for the medical staff, and policies of data storage. Because patients’ medical data are read, entered, and collected by many people (e.g., nurses, medical technicians, and doctors), multilevel protection is required:

  • (1) The first, simplest level of protection:

    • (a) installation of specialized management software with username and PIN or password according to personalized access rules,

    • (b) enforcement of security measures (periodic password or PIN changes, a minimum number of characters in the password, etc.).

  • (2) The second level of protection is based on biometric verification systems, where face, signature, retina, finger, or palm recognition increases the level of security of the system (Doroz, Porwik, and Orczyk Citation2016).

Unfortunately, the aforementioned access restrictions are still insufficient. A serious data security issue is the one-time authorization paradigm, in which authorization is usually performed only when starting work with a device or an IT system. This type of user authentication can result in a serious threat, especially in open-type areas such as medical facilities, leaving systems and EHRs vulnerable to intruder attacks. The nature of the work of medical staff is such that workstations with appropriate equipment are, from time to time, unsupervised, although earlier access to them had been granted correctly. At such moments, access to a workstation can be taken over by illegitimate persons. To ensure a high level of security, especially of medical data, information systems should be continuously monitored. For this reason, the next security level is strongly recommended. Flexible and fine-grained data access rights for individual users can be achieved based on semantic analysis of the EHR semistructured data (Amato et al. Citation2015). Some works aim to assess the appropriate access control policies; in other words, access rights must be tailored to the actual job profile, which can be ascertained from the analysis of the social network structure derived from actual logs of EHR system access (Malin, Nyemba, and Paulett Citation2011).

  • (3) The third level of protection involves continuously monitoring users’ activities associated with keyboard-based interfaces. Using these interfaces, intruders can break into a system and gain unauthorized access to protected data. This can be averted by computer-user profiling. Such profiling is performed by intrusion detection systems that constantly monitor all operations performed by users and then try to verify a user’s identity by comparing his/her activity to his/her profile. In this approach, a computer user’s activity is constantly analyzed by performing real-time keyboard monitoring. The process of collecting a user’s activity data is carried out in the background, practically without involving any user attention. The alphanumeric characters entered via the keyboard are encrypted in order to prevent access to the user’s private and sensitive information (passwords, PIN codes). This means that unauthorized access is quickly detected and an alarm is raised (Kudlacik, Porwik, and Wesolowski Citation2016).
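A toy version of such keystroke-based profiling is sketched below: a session’s inter-key timing statistics are compared against an enrolled per-user profile, and a large deviation raises an alert. The profile values, timings, and z-score threshold are hypothetical; deployed systems such as the one cited use much richer behavioral features.

```python
# A minimal sketch of keystroke-timing user profiling.
import numpy as np

profile = {"mean_ms": 180.0, "std_ms": 40.0}   # learned during enrollment (hypothetical)

def session_suspicious(inter_key_ms, profile, z_threshold=3.0):
    """Flag a session whose mean inter-key interval deviates strongly from the profile."""
    session_mean = np.mean(inter_key_ms)
    # z-score of the session mean under the enrolled profile
    stderr = profile["std_ms"] / np.sqrt(len(inter_key_ms))
    return abs(session_mean - profile["mean_ms"]) / stderr > z_threshold

legit = np.random.default_rng(3).normal(180, 40, size=200)      # same typing rhythm
intruder = np.random.default_rng(4).normal(260, 40, size=200)   # markedly slower typist
print(session_suspicious(legit, profile), session_suspicious(intruder, profile))  # False True
```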

Smartphones bring new challenges to the security and privacy of health-care data. It has been reported that 80% of health-care personnel use their smartphones for work-related tasks (Burns and Johnson Citation2015), so that the health information exchange system has become a bring-your-own-device (BYOD) environment, where a whole ecology of applications might coexist on the devices used for EHR data visualization and manipulation, opening the door to many unintentional but malicious threats. System managers must implement new policies to minimize these threats, such as installing approved security applications on personal devices, which implies intruding on those personal devices. Moreover, widespread mobile health-care applications lack security certifications, although standards are hurriedly being set up by the American National Institute of Standards and Technology (NIST) and companion institutions. Some works are directed toward building into the mobile ecosystems a secure layer based on the trustworthiness of entities, applying trust-assessment systems imported from e-commerce (Bahtiyar and Çağlayan Citation2014). Trust can be measured by direct communication and through indirect reputation measures of third parties. Trust assessment allows secure identity management systems without a central authority, overcoming the difficulties found in establishing such a central authority in health information ecosystems (Dolera Tormo et al. Citation2013). Solutions for security enhancement in mobile health-care ecosystems (Simplicio et al. Citation2015) must contemplate loss of connectivity or transmission delays, as well as theft and device sharing. Hence, offline authentication must be added, as well as secure storage and transmission by encryption with authenticated keys. Mobile information ecosystems can take the form of mobile health social networks, i.e., among patients suffering from the same disease (Zhou et al. Citation2013), offering benefits such as opportunistic computing to carry out some big data operations and communication link sharing, as well as social comfort. These social networks require new cryptographic security strategies as well as trust-assessment methods to avoid attacks at a diversity of levels, from the body area network to the communication grid. To this already complicated scenario, the advent of pervasive computing and the Internet of Things (IoT) adds new factors in a health-care system’s ecology (Trcek and Brodnik Citation2013), requiring new distributed authentication protocols that must be robust even to power failures in the IoT devices.

A key technological element involved in the quest for achieving data privacy, while granting access for research and public-interest information processing, is the de-identification (a.k.a. anonymization) of patient records. The National Institutes of Health (NIH) classifies research as Human Subjects Research (HSR) or Not Human Subjects Research (NHSR). NHSR has much less administrative oversight and can be exploited more effectively. Privacy threats come from the ability to deduce direct identifiers, quasi-identifiers, and sensitive attributes from the data (Gkoulalas-Divanis, Loukides, and Sun Citation2014), which can be subject to identity, membership, or attribute disclosures. Privacy models, such as k-anonymity, allow designing algorithms and assessing the privacy risks of already anonymized data. Some processes achieve anonymization by statistical analysis, i.e., removing high-entropy variables (Chignell et al. Citation2013). As well as structured information in fields, the EHR might also contain free-form text, which, to some extent, might be cleared of identity information by basic natural language processing (NLP) analysis, such as syntactic analysis (Huang et al. Citation2009). However, stronger guarantees need more sophisticated language-dependent NLP tools in addition to pattern recognition (Zuccon et al. Citation2014). For example, Chazard et al. (Citation2014) build a list of forbidden words in French following an incremental process without a prior dictionary. Similarly, Hanauer et al. (Citation2013) use an open source solution for statistical identity scrubbing with little human effort. Disassociation (Loukides et al. Citation2014) is another approach to impede the identification of the patient; it is achieved by partitioning the EHR into several pieces, allowing separate processing. When considering image data, anonymization could require image processing to ensure that no identity information slips into printed images (Newhauser et al. Citation2014). Nevertheless, anonymous sharing and cooperative processing of clinical and signal data via a web service has been demonstrated in a multicenter study on deep brain stimulation for Parkinson’s disease (Rossi et al. Citation2014). Most anonymization approaches scale badly, failing for large datasets. Recent works are achieving a big data scale of anonymization techniques (Zhang et al. Citation2015) by the use of standard big data tools.
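To fix ideas, the following sketch checks the k-anonymity property mentioned above: before release, every combination of quasi-identifier values must be shared by at least k records. The field names are hypothetical, and real privacy models also address sensitive-attribute diversity and other disclosure risks.

```python
# A minimal sketch of a k-anonymity check over quasi-identifiers.
from collections import Counter

records = [
    {"age_band": "60-70", "zip3": "537", "sex": "F", "dx": "E11"},
    {"age_band": "60-70", "zip3": "537", "sex": "F", "dx": "I10"},
    {"age_band": "30-40", "zip3": "981", "sex": "M", "dx": "J45"},
]
quasi_identifiers = ("age_band", "zip3", "sex")

def is_k_anonymous(rows, qi, k):
    """True if every quasi-identifier group contains at least k records."""
    groups = Counter(tuple(r[a] for a in qi) for r in rows)
    return min(groups.values()) >= k

print(is_k_anonymous(records, quasi_identifiers, k=2))  # False: one group is a single record
```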

Finally, we should discuss a scenario in which valuable EHR data is distributed among a number of servers or computing systems. In modern health care, such databases might belong to private medical companies that are not interested in sharing their resources. However, using information stored in multiple EHR warehouses might be much more efficient for data mining purposes, offering a wider perspective on analyzed trends and phenomena on a global scale. Therefore, privacy issues in EHR are directly related to distributed data mining from heterogeneous sources. The aim of such procedures is to extract knowledge and value from distributed data without releasing any information with regard to the nature of data stored in any of the sources. This is known as privacy-protecting data mining (Xu et al. Citation2014) and is of crucial importance in the big data scenarios of EHRs.
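One cryptographic building block for such privacy-protecting distributed analysis is additive secret sharing, already mentioned in the cloud context (Ermakova and Fabian Citation2013): each site splits its local statistic into random shares so that only the global aggregate is ever reconstructed. The sketch below computes a joint count across three hypothetical hospitals; it is a minimal illustration, not a complete protocol.

```python
# A minimal sketch of additive secret sharing for a distributed sum.
import secrets

PRIME = 2**61 - 1   # all arithmetic modulo a large prime

def share(value, n_parties):
    """Split a value into n additive shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

local_counts = [120, 45, 310]               # per-hospital case counts (illustrative)
all_shares = [share(c, 3) for c in local_counts]

# Each party sums the shares it received; no party ever sees another's raw count.
partial_sums = [sum(col) % PRIME for col in zip(*all_shares)]
print(sum(partial_sums) % PRIME)            # 475 == 120 + 45 + 310
```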

As EHR data mining resources should be oriented toward medical experts, machine learning methods that offer interpretable decisions and can explain their predictions are the primary focus. Here, one must mention privacy-preserving decision trees (Vaidya et al. Citation2014) and rule-based methods (Alwatban and Emam Citation2014). They offer a good trade-off between accuracy and ease of interpretation while offering advanced privacy protection techniques. However, these models have not been discussed in the context of big data. Therefore, there is a need to combine these techniques with efficient computing environments that would allow for fast induction of these classifiers when faced with massive, evolving, and nonstationary EHR data (del Río et al. Citation2015).

Here also, one should mention other privacy-preserving machine learning techniques, such as nearest neighbor methods (Krawczyk and Woźniak 2013), Bayesian learning (Liu et al. Citation2016), and ensemble solutions (Li, Bai, and Reddy Citation2016). They all, however, still lack truly efficient implementations for big data that would allow for real-time analytics, online learning, and constant model updating. New proposals on how to run privacy-preserving EHR mining tools in environments such as MapReduce or Spark (Fernández et al. 2014) are crucial for taking a step toward efficient and secure EHR processing.

Final remarks

Though EHR implementation efforts started a couple of decades ago, their use is still problematic and questioned in some areas of health care, whereas other areas are very keen on it and use EHR data positively. One of the areas of work that is more promising is the creation of intelligent systems based on the EHR stored information, i.e., the realization of retrospective studies, once legal and ethical problems are solved by proper anonymization algorithms. In this area the following are the main challenges:

  • understanding doctors’ notes (unstructured text analysis);

  • handling the huge volumes of medical images that are part of the EHR and steadily increase storage requirements; medical imaging is a big data problem in itself, requiring sophisticated mathematical modeling and numerical algorithms at many levels;

  • backtracking the effect of medical decisions, i.e., treatments, so that physicians can improve, over time, their clinical guidelines and procedures. Up to now, data analysis has been limited to statistical inference over the data and/or the construction of prognosis-predictive systems, which do not take into account medical decisions and how they interact with patient evolution.

Funding

This work was supported by the Polish National Science Center under the grant no. DEC-2013/09/B/ST6/02264 and by EC under FP7, Coordination and Support Action, Grant Agreement Number 316097, ENGINE - European Research Centre of Network Intelligence for Innovation Enhancement (http://engine.pwr.wroc.pl/).


Notes

1. Jimeng Sun and Chandan K. Reddy, Big Data Analytics for Healthcare, Tutorial presentation at the SIAM International Conference on Data Mining, Austin, TX, 2013. http://dmkd.cs.wayne.edu/TUTORIAL/Healthcare/

References

  • Alpaydin, E. 2010. Introduction to machine learning (2nd ed). Cambridge, MA: The MIT Press.
  • Alwatban, I. S., and A. Z. Emam. 2014. Comprehensive survey on privacy preserving association rule mining: Models, approaches, techniques and algorithms. International Journal on Artificial Intelligence Tools 23:5. doi:10.1142/S0218213014500043.
  • Amato, F., G. D. Pietro, M. Esposito, and N. Mazzocca. 2015. An integrated framework for securing semi-structured health records. Knowledge-Based Systems 79:99–117. doi:10.1016/j.knosys.2015.02.004.
  • An, A., and N. Cercone, 1999. An empirical study on rule quality measures. In Proceedings of the 7th international workshop on new directions in rough sets, data mining, and granular-soft computing. RSFDGrC ’99, pp. 482–491. London, UK: Springer-Verlag.
  • Asri, H., H. Mousannif, H. A. Moatassime, and T. Noel. June 2015. Big data in healthcare: challenges and opportunities. Paper presented at the 2015 International Conference on Cloud Technologies and Applications (CloudTech), Marrakesh, Morocco, June 2–4, 2015, pp. 1–7.
  • Baer, H. J., I. Cho, R. A. Walmer, P. A. Bain, and D. W. Bates. 2013. Using electronic health records to address overweight and obesity: A systematic review. American Journal of Preventive Medicine 45 (4):494–500. doi:10.1016/j.amepre.2013.05.015.
  • Bahtiyar, Ş., and M. U. Çağlayan. 2014. Trust assessment of security for e-health systems. Electronic Commerce Research and Applications 13 (3):164–177. doi:10.1016/j.elerap.2013.10.003.
  • Ben-Assuli, O. 2015. Electronic health records, adoption, quality of care, legal and privacy issues and their implementation in emergency departments. Health Policy 119 (3):287–297. doi:10.1016/j.healthpol.2014.11.014.
  • Bergadano, F., S. Matwin, R. S. Michalski, and J. Zhang. 1988. Measuring quality of concept descriptions. In Proceedings of the 3rd european working session on learning, pp. 1–14. Glasgow, UK, October 1988.
  • Berger, M. L., C. Lipset, A. Gutteridge, K. Axelsen, P. Subedi, and D. Madigan. 2015. Optimizing the leveraging of real-world data to improve the development and use of medicines. Value in Health 18 (1):127–130. doi:10.1016/j.jval.2014.10.009.
  • Brown, C. L., M. B. Irby, T. T. Houle, and J. A. Skelton. 2015. Family-based obesity treatment in children with disabilities. Academic Pediatrics 15 (2):197–203. doi:10.1016/j.acap.2014.11.004.
  • Bruha, I., and S. Kocková. 1993. Quality of decision rules: Empirical and statistical approaches. Informatica (Slovenia) 17:3.
  • Burns, A., and M. Johnson. 2015. Securing health information. IT Professional 17 (1):23–29. doi:10.1109/MITP.2015.13.
  • Chazard, E., C. Mouret, G. Ficheur, A. Schaffar, J.-B. Beuscart, and R. Beuscart. 2014. Proposal and evaluation of fasdim, a fast and simple de-identification method for unstructured free-text clinical records. International Journal of Medical Informatics 83 (4):303–312. doi:10.1016/j.ijmedinf.2013.11.005.
  • Chi, E. C., and T. G. Kolda. 2012. On tensors, sparsity, and nonnegative factorizations. SIAM Journal on Matrix Analysis and Applications 33 (4):1272–1299. doi:10.1137/110859063.
  • Chignell, M., M. Rouzbahman, R. Kealey, R. Samavi, E. Yu, and T. Sieminowski. 2013. Nonconfidential patient types in emergency clinical decision support. Security Privacy IEEE 11 (6):12–18.
  • Chung, S., B. Zhao, D. Lauderdale, R. Linde, R. Stafford, and L. Palaniappan. 2015. Initiation of treatment for incident diabetes: Evidence from the electronic health records in an ambulatory care setting. Primary Care Diabetes 9 (1):23–30. doi:10.1016/j.pcd.2014.04.005.
  • Cichocki, A., R. I. Zdunek, and S. Amari. 2008. Nonnegative matrix and tensor factorization [Lecture Notes]. IEEE Signal Processing Magazine 25 (1):142–145. doi:10.1109/MSP.2008.4408452.
  • Cyganek, B. 2013. Object detection and recognition in digital images. Theory and practice. Chichester, United Kingdom: John Wiley & Sons Ltd.
  • Cyganek, B., M. Graña, A. Kasprzak, K. Walkowiak, and M. Wozniak. 2015. Selected aspects of electronic health record analysis from the big data perspective. In 2015 IEEE international conference on bioinformatics and biomedicine (BIBM), pp. 1391–1396. IEEE.
  • Cyganek, B., and M. Wozniak. 2015. Tensor based representation and analysis of the electronic healthcare record data. In 2015 ieee international conference on bioinformatics and biomedicine (BIBM), pp. 1383–1390. IEEE.
  • Dawson, P., R. Gailis, and A. Meehan. 2015. Detecting disease outbreaks using a combined Bayesian network and particle filter approach. Journal of Theoretical Biology 370:171–183. doi:10.1016/j.jtbi.2015.01.023.
  • Dean, P., and A. Famili. 1997. Comparative performance of rule quality measures in an induction system. Applied Intelligence 7 (2):113–124. doi:10.1023/A:1008293727412.
  • del Río, S., V. López, J. M. Benítez, and F. Herrera. 2015. A MapReduce approach to address big data classification problems based on the fusion of linguistic fuzzy rules. International Journal of Computational Intelligence Systems 8 (3):422–437. doi:10.1080/18756891.2015.1017377.
  • Dolera Tormo, G., F. Gomez Marmol, J. Girao, and G. Martinez Perez. 2013. Identity management–in privacy we trust: Bridging the trust gap in ehealth environments. IEEE Security & Privacy 11 (6):34–41. doi:10.1109/MSP.2013.80.
  • Doroz, R., P. Porwik, and T. Orczyk. 2016. Dynamic signature verification method based on association of features with similarity measures. Neurocomputing 171:921–931. doi:10.1016/j.neucom.2015.07.026.
  • Duda, R. O., P. E. Hart, and D. G. Stork. 2001. Pattern classification (2nd ed.). New York, NY: Wiley.
  • El-Sappagh, S. H., and S. El-Masri. 2014. A distributed clinical decision support system architecture. Journal of King Saud University - Computer and Information Sciences 26 (1):69–78. doi:10.1016/j.jksuci.2013.03.005.
  • Erl, T., R. Puttini, and Z. Mahmood. 2013. Cloud computing: Concepts, technology & architecture. Upper Saddle River, NJ, USA: Prentice Hall Press.
  • Ermakova, T., and B. Fabian 2013. Secret sharing for health data in multi-provider clouds. In 2013 IEEE 15th conference on business informatics (CBI), pp. 93–100. IEEE.
  • Fernández, A., S. del Río, V. López, A. Bawakid, M. J. Del Jesús, J. M. Benítez, and F. Herrera. 2014. Big data with cloud computing: An insight on the computing environment, MapReduce, and programming frameworks. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4 (5):380–409.
  • Fernandez-Aleman, J. L., I. C. Senor, P. Á. O. Lozoya, and A. Toval. 2013. Security and privacy in electronic health records: A systematic literature review. Journal of Biomedical Informatics 46 (3):541–562. doi:10.1016/j.jbi.2012.12.003.
  • Flood, T. L., Y.-Q. Zhao, E. J. Tomayko, A. Tandias, A. L. Carrel, and L. P. Hanrahan. 2015. Electronic health records and community health surveillance of childhood obesity. American Journal of Preventive Medicine 48 (2):234–240. doi:10.1016/j.amepre.2014.10.020.
  • Gai, K., M. Qiu, L. C. Chen, and M. Liu 2015. Electronic health record error prevention approach using ontology in big data. In High performance computing and communications (hpcc), 2015 IEEE 7th international symposium on cyberspace safety and security (css), 2015 IEEE 12th international conference on embedded software and systems (icess), pp. 752–757. IEEE.
  • Gkoulalas-Divanis, A., G. Loukides, and J. Sun. 2014. Publishing data from electronic health records while preserving privacy: A survey of algorithms. Journal of Biomedical Informatics 50:4–19. doi:10.1016/j.jbi.2014.06.002.
  • Gotz, D., F. Wang, and A. Perer. 2014. A methodology for interactive mining and visual analysis of clinical event patterns using electronic health record data. Journal of Biomedical Informatics 48:148–159. doi:10.1016/j.jbi.2014.01.007.
  • Graña, M., and K. Jackwoski. 2015. Electronic health record: A review. In 2015 IEEE international conference on bioinformatics and biomedicine (BIBM), pp. 1375–1382. IEEE.
  • Green, B. B., M. L. Anderson, A. J. Cook, S. Catz, P. A. Fishman, J. B. McClure, and R. Reid. 2012. Using body mass index data in the electronic health record to calculate cardiovascular risk. American Journal of Preventive Medicine 42 (4):342–347. doi:10.1016/j.amepre.2011.12.009.
  • Gur-Ali, O., and W. A. Wallace. 1993. Induction of rules subject to a quality constraint: Probabilistic inductive learning. IEEE Transactions on Knowledge and Data Engineering 5 (6):979–984. doi:10.1109/69.250081.
  • Hanauer, D., J. Aberdeen, S. Bayer, B. Wellner, C. Clark, K. Zheng, and L. Hirschman. 2013. Bootstrapping a de-identification system for narrative patient records: Cost-performance tradeoffs. International Journal of Medical Informatics 82 (9):821–831. doi:10.1016/j.ijmedinf.2013.03.005.
  • Hazan, T., S. Polak, and A. Shashua. 2005. Sparse image coding using a 3D non-negative tensor factorization. In Tenth IEEE international conference on computer vision (ICCV’05), vol. 1, pp. 50–57.
  • Ho, J. C., J. Ghosh, and J. Sun. 2014. Marble: High-throughput phenotyping from electronic health records via sparse nonnegative tensor factorization. In Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’14), pp. 115–124. New York, NY, USA: ACM. doi:10.1145/2623330.2623658.
  • Hou, J. K., T. D. Imler, and T. F. Imperiale. 2014. Current and future applications of natural language processing in the field of digestive diseases. Clinical Gastroenterology and Hepatology 12 (8):1257–1261. doi:10.1016/j.cgh.2014.05.013.
  • Hu, Y., and G. Bai. 2014. A systematic literature review of cloud computing in ehealth. Health Informatics: An International Journal 3 (4). http://arxiv.org/abs/1412.2494.
  • Huang, L.-C., H.-C. Chu, C.-Y. Lien, C.-H. Hsiao, and T. Kao. 2009. Privacy preservation and information security protection for patients’ portable electronic health records. Computers in Biology and Medicine 39 (9):743–750. doi:10.1016/j.compbiomed.2009.06.004.
  • Kashfi, H. 2011. The intersection of clinical decision support and electronic health record: A literature review. In Proceedings of the federated conference on computer science and information systems (FedCSIS) 2011, Szczecin, Poland, September 18–21, 2011, pp. 347–353.
  • Katt, B. 2014. A comprehensive overview of security monitoring solutions for e-health systems. In 2014 IEEE international conference on healthcare informatics (ICHI), pp. 364–364. IEEE.
  • Kim, D., S. Labkoff, and S. H. Holliday. 2008. Opportunities for electronic health record data to support business functions in the pharmaceutical industry—A case study from Pfizer, inc. Journal of the American Medical Informatics Association 15 (5):581–584. doi:10.1197/jamia.M2605.
  • Klinkowski, M., and K. Walkowiak. 2013. On the advantages of elastic optical networks for provisioning of cloud computing traffic. IEEE Network 27 (6):44–51. doi:10.1109/MNET.2013.6678926.
  • Kohn, L., J. Corrigan, and M. Donaldson. 2000. To err is human: Building a safer health system. Washington, DC: National Academy Press.
  • Kolda, T. G., and B. W. Bader. 2009. Tensor decompositions and applications. SIAM Review 51 (3):455–500. doi:10.1137/07070111X.
  • Krawczyk, B., and M. Woźniak. 2013. Distributed privacy-preserving minimal distance classification. In Proceedings of the 8th international conference on hybrid artificial intelligent systems (HAIS 2013), Salamanca, Spain, September 11–13, 2013.
  • Kreuzthaler, M., S. Schulz, and A. Berghold. 2015. Secondary use of electronic health records for building cohort studies through top-down information extraction. Journal of Biomedical Informatics 53:188–195. doi:10.1016/j.jbi.2014.10.010.
  • Kudlacik, P., P. Porwik, and T. Wesolowski. 2016. Fuzzy approach for intrusion detection based on user’s commands. Soft Computing 20 (7):2705–2719. http://link.springer.com/article/10.1007/s00500-015-1669-6.
  • Kukafka, R., J. S. Ancker, C. Chan, J. Chelico, S. Khan, S. Mortoti, K. Natarajan, K. Presley, and K. Stephens. 2007. Redesigning electronic health record systems to support public health. Journal of Biomedical Informatics 40 (4):398–409. doi:10.1016/j.jbi.2007.07.001.
  • Lathauwer, L. D., B. D. Moor, and J. Vandewalle. 2000a. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications 21 (4):1253–1278. doi:10.1137/S0895479896305696.
  • Lathauwer, L. D., B. D. Moor, and J. Vandewalle. 2000b. On the best rank-1 and rank-(r1, r2, …, rn) approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications 21 (4):1324–1342. doi:10.1137/S0895479898346995.
  • Lauterbur, P. C. 1973. Image formation by induced local interactions: Examples employing nuclear magnetic resonance. Nature 242:190–191. doi:10.1038/242190a0.
  • Le Bihan, D., E. Breton, D. Lallemand, P. Grenier, E. Cabanis, and M. Laval-Jeantet. 1986. MR imaging of intravoxel incoherent motions: Application to diffusion and perfusion in neurologic disorders. Radiology 161 (2):401–407. doi:10.1148/radiology.161.2.3763909.
  • Li, Y., C. Bai, and C. K. Reddy. 2016. A distributed ensemble approach for mining healthcare data under privacy constraints. Information Sciences 330:245–259. doi:10.1016/j.ins.2015.10.011.
  • Lirov, Y., and O.-C. Yue. 1991. Automated network troubleshooting knowledge acquisition. Applied Intelligence 1:121–132. doi:10.1007/BF00058878.
  • Liu, X., R. Lu, J. Ma, L. Chen, and B. Qin. 2016. Privacy-preserving patient-centric clinical decision support system on naïve Bayesian classification. IEEE Journal of Biomedical and Health Informatics 20 (2):655–668. doi:10.1109/JBHI.2015.2407157.
  • Liu, Y., Y. Liu, and K. C. C. Chan. 2010. Tensor distance based multilinear locality-preserved maximum information embedding. IEEE Transactions on Neural Networks 21 (11):1848–1854. doi:10.1109/TNN.2010.2066574.
  • Loukides, G., J. Liagouris, A. Gkoulalas-Divanis, and M. Terrovitis. 2014. Disassociation for electronic health record privacy. Journal of Biomedical Informatics 50:46–61. doi:10.1016/j.jbi.2014.05.009.
  • Malin, B., S. Nyemba, and J. Paulett. 2011. Learning relational policies from electronic health record access logs. Journal of Biomedical Informatics 44 (2):333–342. doi:10.1016/j.jbi.2011.01.007.
  • Martínez-Pérez, B., I. de la Torre-Díez, M. López-Coronado, and J. J. P. C. Rodrigues. 2015. Are mobile health cloud apps better than native? In 2015 IEEE international conference on communications (ICC), pp. 518–523. IEEE.
  • Mashayekhi, M., F. Prescod, B. Shah, L. Dong, K. Keshavjee, and A. Guergachi. 2015. Evaluating the performance of the Framingham diabetes risk scoring model in Canadian electronic medical records. Canadian Journal of Diabetes 39 (2):152–156. doi:10.1016/j.jcjd.2014.10.006.
  • Mathew, P. S., and A. S. Pillai. 2015. Big data solutions in healthcare: Problems and perspectives. Paper presented at the 2015 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS), Karpagam College of Engineering, Coimbatore, Tamilnadu, India, March 17–18, 2015, pp. 1–6.
  • Mertz, L. 2014. Saving lives and money with smarter hospitals: Streaming analytics, other new tech help to balance costs and benefits. IEEE Pulse 5 (6):33–36. doi:10.1109/MPUL.2014.2355306.
  • Michelson, J. D., J. S. Pariseau, and W. C. Paganelli. 2014. Assessing surgical site infection risk factors using electronic medical records and text mining. American Journal of Infection Control 42 (3):333–336. doi:10.1016/j.ajic.2013.09.007.
  • Michel-Verkerke, M. B., R. A. Stegwee, and T. A. Spil. 2015. The six P’s of the next step in electronic patient records in the Netherlands. Health Policy and Technology 4:137–143. doi:10.1016/j.hlpt.2015.02.011.
  • Mohammed, N., S. Barouti, D. Alhadidi, and R. Chen. 2015. Secure and private management of healthcare databases for data mining. In 2015 IEEE 28th international symposium on computer-based medical systems (CBMS), pp. 191–196. IEEE.
  • Moor, G. D., M. Sundgren, D. Kalra, A. Schmidt, M. Dugas, B. Claerhout, T. Karakoyun, et al. 2015. Using electronic health records for clinical research: The case of the EHR4CR project. Journal of Biomedical Informatics 53:162–173. doi:10.1016/j.jbi.2014.10.006.
  • Newhauser, W., T. Jones, S. Swerdloff, W. Newhauser, M. Cilia, R. Carver, A. Halloran, and R. Zhang. 2014. Anonymization of DICOM electronic medical records for radiation therapy. Computers in Biology and Medicine 53:134–140. doi:10.1016/j.compbiomed.2014.07.010.
  • Nguyen, L., E. Bellucci, and L. T. Nguyen. 2014. Electronic health records implementation: An evaluation of information system impact and contingency factors. International Journal of Medical Informatics 83 (11):779–796. doi:10.1016/j.ijmedinf.2014.06.011.
  • Nguyen, P. A., S. Syed-Abdul, U. Iqbal, M.-H. Hsu, C.-L. Huang, H.-C. Li, D. L. Clinciu, W.-S. Jian, and Y.-C. J. Li. 2013. A probabilistic model for reducing medication errors. PLoS ONE 8 (12):1–7. doi:10.1371/journal.pone.0082401.
  • Núñez, M. 1988. Economic induction: A case study. In Proceedings of the 3rd European working session on learning, pp. 139–145. Glasgow, UK, October 1988.
  • Oakkar, E. E., J. Stevens, P. T. Bradshaw, J. Cai, K. M. Perreira, B. M. Popkin, P. Gordon-Larsen, D. R. Young, N. R. Ghai, B. Caan, and V. P. Quinn. 2015. Longitudinal study of acculturation and BMI change among Asian American men. Preventive Medicine 73:15–21. doi:10.1016/j.ypmed.2015.01.009.
  • Peng, Y., Q. Huang, P. Jiang, and J. Jiang. 2005. Cost-sensitive ensemble of support vector machines for effective detection of microcalcification in breast cancer diagnosis. In Fuzzy systems and knowledge discovery, ed. L. Wang and Y. Jin, 483–483. Lecture Notes in Computer Science, 3614. Berlin/Heidelberg: Springer.
  • Reich, A. 2012. Disciplined doctors: The electronic medical record and physicians’ changing relationship to medical knowledge. Social Science & Medicine 74 (7):1021–1028. doi:10.1016/j.socscimed.2011.12.032.
  • Remen, V. M., and A. Grimsmo. 2011. Closing information gaps with shared electronic patient summaries—How much will it matter? International Journal of Medical Informatics 80 (11):775–781. doi:10.1016/j.ijmedinf.2011.08.008.
  • Rossi, E., M. Rosa, L. Rossi, A. Priori, and S. Marceglia. 2014. WebBioBank: A new platform for integrating clinical forms and shared neurosignal analyses to support multi-centre studies in Parkinson’s disease. Journal of Biomedical Informatics 52:92–104. doi:10.1016/j.jbi.2014.08.014.
  • Rubbo, B., N. K. Fitzpatrick, S. Denaxas, M. Daskalopoulou, N. Yu, R. S. Patel, and H. Hemingway. 2015. Use of electronic health records to ascertain, validate and phenotype acute myocardial infarction: A systematic review and recommendations. International Journal of Cardiology 187:705–711. doi:10.1016/j.ijcard.2015.03.075.
  • Schapire, R. E. 2001. The boosting approach to machine learning: An overview. Paper presented at the MSRI Workshop on Nonlinear Estimation and Classification, Berkeley, CA, USA, March 19–29, 2001.
  • Shang, J., E. Larson, J. Liu, and P. Stone. 2015. Infection in home health care: Results from national outcome and assessment information set data. American Journal of Infection Control 43 (5):454–459. doi:10.1016/j.ajic.2014.12.017.
  • Simplicio, M., L. Iwaya, B. Barros, T. Carvalho, and M. Naslund. 2015. Secourhealth: A delay-tolerant security framework for mobile health data collection. IEEE Journal of Biomedical and Health Informatics 19 (2):761–772. doi:10.1109/JBHI.2014.2320444.
  • Tan, M., and J. C. Schlimmer. 1989. Cost-sensitive concept learning of sensor use in approach and recognition. In Proceedings of the sixth international workshop on machine learning. San Francisco, CA, USA: Morgan Kaufmann.
  • Tran, T., T. D. Nguyen, D. Phung, and S. Venkatesh. 2015. Learning vector representation of medical objects via EMR-driven nonnegative restricted Boltzmann machines (eNRBM). Journal of Biomedical Informatics 54:96–105. doi:10.1016/j.jbi.2015.01.012.
  • Trcek, D., and A. Brodnik. 2013. Hard and soft security provisioning for computationally weak pervasive computing systems in e-health. IEEE Wireless Communications 20 (4):22–29. doi:10.1109/MWC.2013.6590047.
  • Tucker, L. R. 1966. Some mathematical notes on three-mode factor analysis. Psychometrika 31 (3):279–311. doi:10.1007/BF02289464.
  • Vaidya, J., B. Shafiq, W. Fan, D. Mehmood, and D. Lorenzi. 2014. A random decision tree framework for privacy-preserving data mining. IEEE Transactions on Dependable and Secure Computing 11 (5):399–411. doi:10.1109/TDSC.2013.43.
  • Vedanthan, R., E. Blank, N. Tuikong, J. Kamano, L. Misoi, D. Tulienge, C. Hutchinson, et al. 2015. Usability and feasibility of a tablet-based decision-support and integrated record-keeping (desire) tool in the nurse management of hypertension in rural western Kenya. International Journal of Medical Informatics 84 (3):207–219. doi:10.1016/j.ijmedinf.2014.12.005.
  • Venters, W., and E. Whitley. 2012. A critical review of cloud computing: Researching desires and realities. Journal of Information Technology 27 (3):179–197. doi:10.1057/jit.2012.17.
  • Verdenius, F. 1991. A method for inductive cost optimization. In Proceedings of the European working session on machine learning. London, UK: Springer-Verlag.
  • Vijayakrishnan, R., S. R. Steinhubl, K. Ng, J. Sun, R. J. Byrd, Z. Daar, B. A. Williams, et al. 2014. Prevalence of heart failure signs and symptoms in a large primary care population identified through the use of text and data mining of the electronic health record. Journal of Cardiac Failure 20 (7):459–464. doi:10.1016/j.cardfail.2014.03.008.
  • Villa, L. B., and I. Cabezas. 2014. A review on usability features for designing electronic health records. In 2014 IEEE 16th international conference on e-health networking, applications and services (Healthcom), pp. 49–54. IEEE.
  • Vivas-Consuelo, D., R. Usó-Talamantes, J. L. Trillo-Mata, M. Caballer-Tarazona, I. Barrachina-Martínez, and L. Buigues-Pastor. 2014. Predictability of pharmaceutical spending in primary health services using clinical risk groups. Health Policy 116 (2–3):188–195. doi:10.1016/j.healthpol.2014.01.012.
  • Walkowiak, K., M. Woźniak, M. Klinkowski, and W. Kmiecik. 2015. Optical networks for cost-efficient and scalable provisioning of big data traffic. International Journal of Parallel, Emergent and Distributed Systems 30 (1):15–28. doi:10.1080/17445760.2014.924516.
  • Wang, H., and N. Ahuja. 2007. A tensor approximation approach to dimensionality reduction. International Journal of Computer Vision 76 (3):217–229. doi:10.1007/s11263-007-0053-0.
  • Wang, Y., R. Chen, J. Ghosh, J. C. Denny, A. Kho, Y. Chen, B. A. Malin, and J. Sun. 2015. Rubik: Knowledge guided tensor factorization and completion for health data analytics. In Proceedings of the 21st ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’15). New York, NY, USA: ACM. doi:10.1145/2783258.2783395.
  • Widmer, G., and M. Kubat. 1996. Learning in the presence of concept drift and hidden contexts. Machine Learning 23 (1):69–101. doi:10.1007/BF00116900.
  • Wozniak, M., A. Kasprzak, and P. Cal. 2013. Application of combined classifiers to data stream classification. In Proceedings of the 10th international conference on flexible query answering systems (FQAS 2013), LNCS. Berlin, Heidelberg: Springer-Verlag. In press.
  • Xu, L., C. Jiang, J. Wang, J. Yuan, and Y. Ren. 2014. Information security in big data: Privacy and data mining. IEEE Access 2:1149–1176. doi:10.1109/ACCESS.2014.2362522.
  • Zhang, X., W. Dou, J. Pei, S. Nepal, C. Yang, C. Liu, and J. Chen. 2015. Proximity-aware local-recoding anonymization with MapReduce for scalable big data privacy preservation in cloud. IEEE Transactions on Computers 64 (8):2293–2307. doi:10.1109/TC.2014.2360516.
  • Zhong, Y., P.-J. Lin, J. T. Cohen, A. N. Winn, and P. J. Neumann. 2015. Cost-utility analyses in diabetes: A systematic review and implications from real-world evidence. Value in Health 18 (2):308–314. doi:10.1016/j.jval.2014.12.004.
  • Zhou, J., Z. Cao, X. Dong, X. Lin, and A. Vasilakos. 2013. Securing m-healthcare social networks: Challenges, countermeasures and future directions. IEEE Wireless Communications 20 (4):12–21. doi:10.1109/MWC.2013.6590046.
  • Zuccon, G., D. Kotzur, A. Nguyen, and A. Bergheim. 2014. De-identification of health records using anonym: Effectiveness and robustness across datasets. Artificial Intelligence in Medicine 61 (3):145–151. doi:10.1016/j.artmed.2014.03.006.
