Editorial

Special issue on massive data processing by using machine learning

Pages 351-354 | Published online: 10 Mar 2011

Computing technology has developed rapidly, in both CPU frequency and data storage. Higher CPU frequencies improve the efficiency of data processing and allow computers to be widely used for handling massive data, while huge storage capacity lets us easily collect, back up, and explore data. Under such circumstances, data are increasing dramatically in many scientific fields. Web data are one example: since 1995, the size of the average web page has grown by 22 times, and the number of objects per page by 21.7 times. Massive data processing is therefore becoming a hot and urgent topic in machine learning, and developing novel techniques to model massive data is a challenge as well as an opportunity. Staying with the example of web search, Page and Brin invented PageRank to rank web pages and founded Google, which has become one of the biggest companies in the world and has had a great impact on people's lives. Other novel machine learning techniques are also emerging from real-world problems, laying the technical foundation of massive data processing.

The overall aim of this special issue is to bring together the latest innovative machine learning techniques and advances in handling massive data, which arise in fields such as data mining, pattern recognition, bioinformatics, chemometrics, medical data analysis, information retrieval (IR), and sensor networks. After a rigorous review process, seven papers were selected for this special issue. They cover techniques including semi-supervised learning, embedded feature selection, dimensionality reduction, support vector machines, genetic algorithms, neural networks, k-nearest neighbours, and subspace classification, and novel applications including medical data analysis, microarray data analysis, IR, text classification, protein structure recognition, protein type recognition, speech emotion recognition, virtual reality, and sensor networks.

Many novel machine learning techniques have been proposed to meet the challenge of massive data processing, among which semi-supervised learning is a hot topic and one of the most important. Labelling data is costly because it requires expert experience or experiments, so often only part of the data is labelled; semi-supervised learning can exploit the unlabelled data as well. Among the methods for doing so, tri-training is a state-of-the-art approach, but it does not scale to huge data sets. The first paper, ‘Tri-training and MapReduce-based massive data learning’ by Guo et al., solves this problem by integrating Google's MapReduce parallel pattern with tri-training, so that tri-training can run on clusters of commodity PCs and process huge data sets. The authors also propose a data-editing operation to remove newly mislabelled data during the learning process. Experiments on a UCI data set show that the solution not only performs better than the original tri-training, but also scales better to massive data. The solution has been applied to a real-world problem, detecting small pulmonary nodules in chest CT images, which demonstrates its practical value. The development of semi-supervised learning shows that challenges are also opportunities for researchers in machine learning, and we believe more such novel techniques will appear.
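For readers unfamiliar with tri-training, the following is a minimal single-machine sketch of the idea in Python, assuming scikit-learn and NumPy. It omits the error-rate conditions of the full algorithm, as well as the MapReduce parallelisation and data editing that are the paper's contributions.

```python
# Minimal sketch of tri-training: three classifiers are bootstrapped from
# the labelled data, and each is retrained on unlabelled examples that the
# other two classifiers agree on. The full algorithm adds error-rate
# conditions; Guo et al. further parallelise it with MapReduce and add
# data editing. Both refinements are omitted here.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tri_train(X_lab, y_lab, X_unlab, rounds=5, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise three diverse classifiers from bootstrap samples.
    clfs = []
    for _ in range(3):
        idx = rng.integers(0, len(X_lab), size=len(X_lab))
        clfs.append(DecisionTreeClassifier().fit(X_lab[idx], y_lab[idx]))
    for _ in range(rounds):
        new_clfs = []
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            pj = clfs[j].predict(X_unlab)
            pk = clfs[k].predict(X_unlab)
            agree = pj == pk  # pseudo-label where the other two agree
            X_aug = np.vstack([X_lab, X_unlab[agree]])
            y_aug = np.concatenate([y_lab, pj[agree]])
            new_clfs.append(DecisionTreeClassifier().fit(X_aug, y_aug))
        clfs = new_clfs
    return clfs

def tri_predict(clfs, X):
    # Final prediction is the majority vote of the three classifiers.
    votes = np.stack([clf.predict(X) for clf in clfs])  # shape (3, n)
    out = []
    for column in votes.T:
        values, counts = np.unique(column, return_counts=True)
        out.append(values[np.argmax(counts)])
    return np.array(out)
```

The attraction for massive data is that the expensive steps, predicting on the unlabelled pool and retraining, decompose naturally over data partitions, which is what makes the MapReduce formulation possible.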

Feature selection is a critical technique for massive data processing: it improves the generalization performance of learning techniques, speeds up the learning process, and makes the results easier to interpret. Feature selection has changed greatly to meet the challenge of huge data. Time-consuming techniques such as the wrapper model are rarely used, while filter techniques are widely studied; furthermore, embedded feature selection, which incorporates feature evaluation into the training of learning machines, has been proposed in recent years. Another great change is selecting features for complex problems such as multi-class, multi-label, and semi-supervised learning problems. The second paper, ‘Feature selection for multi-class problems by using pairwise-class and all-class techniques’ by You and Li, proposes a novel framework, the framework on pairwise-class and all-class, for feature selection in multi-class classification problems. A round-robin strategy is embedded in the framework to select the final features from the different rank lists, as sketched below. Experimental results on microarray data sets show that the framework helps to improve classification accuracy and balances performance among classes. In contrast to dimension reduction, feature selection picks out the features relevant to the task; it can reveal the relationship between features and task and has received much attention in fields such as bioinformatics and chemometrics.
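To make the round-robin idea concrete, here is an illustrative merge of per-class rank lists in Python; the function name, details, and example data are our own, not the paper's exact procedure.

```python
# Illustrative round-robin merge of per-class feature rank lists, the kind
# of strategy You and Li embed to combine rank lists produced for different
# classes (or class pairs) into one final feature subset.
def round_robin_select(rank_lists, n_features):
    """rank_lists: one ranked list of feature indices per class (best
    first). Cycle over the lists, taking each list's next not-yet-selected
    feature, until n_features distinct features are collected."""
    selected, seen = [], set()
    pointers = [0] * len(rank_lists)
    progress = True
    while len(selected) < n_features and progress:
        progress = False
        for i, ranks in enumerate(rank_lists):
            # Skip features this list ranks highly but that are taken.
            while pointers[i] < len(ranks) and ranks[pointers[i]] in seen:
                pointers[i] += 1
            if pointers[i] < len(ranks):
                feature = ranks[pointers[i]]
                selected.append(feature)
                seen.add(feature)
                pointers[i] += 1
                progress = True
            if len(selected) == n_features:
                break
    return selected

# Example: merge three per-class rank lists, keeping five features.
print(round_robin_select([[3, 0, 5], [0, 2, 7], [1, 3, 6]], 5))
# prints [3, 0, 1, 5, 2]
```

Because every class contributes its top-ranked features in turn, no single class dominates the final subset, which is one way to balance performance among classes.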

Efficient retrieval is essential for massive data, especially unindexed data. IR techniques have developed quickly in recent years: besides text search, image and multimedia search engines are becoming available, and state-of-the-art machine learning techniques are used in retrieval systems to find what users actually want. The third paper, ‘Intelligent information retrieval system using automatic thesaurus construction’ by Song et al., presents an intelligent IR system based on automatic thesaurus construction for the classification and clustering of documents. The authors employ a fuzzy-logic-controlled genetic algorithm and an adaptive back-propagation neural network to overcome weaknesses of the standard versions, such as slow evolution and a tendency to become trapped in local optima. A well-constructed thesaurus has long been recognised as a valuable tool for effective clustering and classification. The authors implement the system and run benchmark experiments on the Reuters-21578 document collection and the 20-newsgroups corpus; the results show that the proposed IR system outperforms previous work. As mentioned in the first paragraph of this editorial, more and more data are stored across different computers and other storage devices. Efficient and effective search is necessary to our lives and poses a great challenge to researchers. Although text search engines such as Google already work well, image and multimedia search remains an open problem.
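As a rough illustration of what automatic thesaurus construction means, the following Python sketch builds a thesaurus from term co-occurrence, a common generic approach; Song et al.'s actual construction, tuned by the fuzzy-logic-controlled genetic algorithm and the adaptive back-propagation network, is considerably more sophisticated.

```python
# Generic co-occurrence-based thesaurus sketch: terms that frequently
# appear in the same documents are treated as related. This stands in for
# the idea only, not for the paper's construction method.
from collections import Counter, defaultdict
from itertools import combinations

def build_thesaurus(documents, top_k=5):
    cooccurrence = defaultdict(Counter)
    for document in documents:
        terms = set(document.lower().split())
        for a, b in combinations(sorted(terms), 2):
            cooccurrence[a][b] += 1
            cooccurrence[b][a] += 1
    # Related terms for each term: its most frequent co-occurring terms.
    return {term: [w for w, _ in counts.most_common(top_k)]
            for term, counts in cooccurrence.items()}

documents = ["machine learning handles massive data",
             "massive data needs machine learning",
             "retrieval of massive data collections"]
print(build_thesaurus(documents)["massive"])
```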

Bioinformatics is becoming an active field as data collection capabilities increase. As more and more data are collected, traditional statistical techniques cannot process them effectively and efficiently, so machine learning techniques such as artificial neural networks, support vector machines, and ensemble learning are increasingly applied. The fourth paper, ‘A machine learning-based method for protein global model quality assessment’ by Dong et al., employs support vector regression to solve an important problem in bioinformatics: model quality assessment in protein structure prediction. Their method, global model quality assessment (GMQA), feeds structural features computed from the 3D coordinates into support vector regression models that output absolute quality scores. Experimental results on the CASP7 and LKF data sets confirm that the method outperforms previous approaches; GMQA is thus a useful tool for assessing and ranking model quality. The authors have built a web server, which you may try at http://www.iipl.fudan.edu.cn/gmqa/index.html. In the past, the amount of biomedical data was small and statistical techniques sufficed to build models. Only in recent years have data sets become large and their patterns complex, which creates opportunities for machine learning, and more researchers from the machine learning field are becoming involved in bioinformatics.
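For readers less familiar with support vector regression, the following is a minimal sketch of the regression setup behind GMQA, assuming scikit-learn. Computing real structural features from 3D coordinates is the paper's contribution; random data stands in here, and the function name and hyperparameters are illustrative.

```python
# Sketch of the GMQA-style regression setup: SVR maps structural features
# of a candidate protein model to a quality score, and candidate models
# are then ranked by predicted quality.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def train_quality_model(features, scores):
    # Standardise features, then fit an RBF-kernel SVR.
    model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
    model.fit(features, scores)
    return model

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))   # 100 candidate models, 20 features each
y = rng.uniform(size=100)        # known quality scores in [0, 1]
model = train_quality_model(X, y)
ranking = np.argsort(-model.predict(X))  # best-first ranking of models
```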

As computing power has increased, we try to extract more features from the objects to be recognised. Since some features are strongly relevant to the target properties while others are only weakly relevant, dimensionality reduction is important for obtaining fewer features and improving modelling accuracy. The fifth paper, ‘Prediction of membrane protein types using maximum variance projection’ by Wang and Yang, shows how to quickly and efficiently annotate the type of an uncharacterised membrane protein using a novel dimension reduction technique, maximum variance projection (MVP). The authors propose a hybrid representation of a protein that fuses a position-specific score matrix with pseudo amino acid composition. MVP extracts the essential features from the high-dimensional feature space, and a k-nearest neighbour classifier then identifies the membrane protein type from the reduced low-dimensional features. Experimental results show that the proposed approach is very promising for predicting membrane protein types. Dimension reduction is a critical step for improving the generalization performance of learning techniques, especially for high-dimensional data sets in bioinformatics and in other pattern recognition applications such as face and speech recognition.
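The overall pipeline, reduce then classify, can be sketched in a few lines of Python with scikit-learn. MVP itself is the paper's technique and has no off-the-shelf implementation, so PCA stands in for the reduction step here purely for illustration; the number of components and neighbours are arbitrary.

```python
# The reduce-then-classify pipeline, with PCA as a stand-in for MVP:
# project high-dimensional protein features to a low-dimensional space,
# then identify the membrane protein type with k-nearest neighbours.
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    PCA(n_components=30),                 # dimensionality reduction step
    KNeighborsClassifier(n_neighbors=5),  # type identification step
)
# Usage: pipeline.fit(X_train, y_train); pipeline.predict(X_test),
# where rows of X are fused PSSM / pseudo amino acid feature vectors.
```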

With the development of computing techniques, we attempt to simulate human intelligence, which is a great challenge for machine learning. Deep Blue's defeat of world champion Garry Kasparov in 1997 exhibited the progress of computing technology. Yet while humans easily distinguish basic behaviours such as laughing and crying, these are hard for a computer to learn, and building computational models of human behaviour has become an emerging research topic. The sixth paper, ‘Real-time speech-driven animation of expressive talking faces’ by Liu et al., presents a real-time facial animation system in which speech drives mouth movements and facial expressions synchronously. The system has two layers: the upper layer classifies speech into five emotion types, while the lower layer, operating at the sub-phonemic level, models the relationship between frame-level acoustic features and audio labels within phonemes. The implemented system demonstrates that the two-layer structure succeeds in both emotion and sub-phonemic classification, and the synthesised facial sequences reach a comparatively convincing quality. More effort is needed to realize the dream of an intelligent robot with emotions, which remains a great challenge for machine learning.

As collecting data with individual sensors becomes convenient and cheap, sensor networks are emerging as a technology that promises fast and easy monitoring of the physical world. One example is environmental monitoring, where a large number of sensors collect massive data about the atmosphere; the sheer volume of accumulated data makes centralized monitoring difficult. The last paper, ‘Sensor networks: decentralized monitoring and subspace classification of events’ by Yagnik et al., proposes decentralized monitoring of a sensor network that automatically identifies different event types and the related groups of sensors. The decentralized solution achieves event classification performance equivalent to the centralized solution, while extracting valuable information (event types and groups of sensors) useful to researchers studying events and sensor deployment strategies. The authors evaluate the proposed solution thoroughly through extensive experiments on both benchmark and real-world sensor data, observing consistent performance. With the development of wireless and wired sensor networks, we believe there will be a great need for intelligent massive data processing from the machine learning field.

This special issue presents state-of-the-art technologies and applications of massive data processing in the machine learning field. We believe more people will pay attention to this trend and contribute to the development of massive data processing, and that much excellent work will build on the foundation laid here. Many thanks go to the authors for their contributions to this special issue, especially to Professors Mao-Zu Guo, Wei Song, Shuigeng Zhou, Jie Yang, Mingyu You, and Huan Liu, who also helped to review the papers. Thanks also go to the reviewers, Drs Songnian Yu, Xingming Zhao, Yong Wang, Daoqiang Zhang, Ming Li, Hongbin Shen, Minling Zhang, Xue-Qiang Zeng and Junping Zhang, whose comments greatly improved the quality of this special issue. Last, but not least, my deepest gratitude goes to Professors George J. Klir (Editor-in-Chief) and William J. Tastle of the International Journal of General Systems for their consideration, help, and advice. This work was supported by the Natural Science Foundation of China under Grant Nos. 61005006 and 60873129, the Shanghai Leading Academic Discipline Project No. B004, the STCSM ‘Innovation action plan’ Project of China under Grant No. 07DZ19726, and the Shanghai Rising-Star Programme under Grant No. 08QA1403200.
