Jungjit, Suwimol (2016)
Languages: English
Types: Doctoral thesis
Subjects: Q, T
The very large dimensionality of real-world datasets is a challenging problem for classification algorithms, since many features are often redundant or irrelevant for classification. In addition, a very large number of features leads to a high computational time for classification algorithms. Feature selection methods deal with the large dimensionality of data by selecting a relevant feature subset according to an evaluation criterion. The vast majority of research on feature selection involves conventional single-label classification problems, where each instance is assigned a single class label; but there has been growing research on more complex multi-label classification problems, where each instance can be assigned multiple class labels.

This thesis proposes three types of new Multi-Label Correlation-based Feature Selection (ML-CFS) methods, namely: (a) methods based on hill-climbing search, (b) methods that exploit biological knowledge (still using hill-climbing search), and (c) methods based on genetic algorithms as the search method.

Firstly, we proposed three versions of ML-CFS based on hill-climbing search. In essence, these versions extend the original CFS method to the multi-label classification scenario by extending the merit function (which evaluates candidate feature subsets), as well as modifying the merit function in other ways. A conventional hill-climbing strategy was used to explore the space of candidate solutions (candidate feature subsets) for those three versions, which are described in detail in Chapter 4.

Secondly, in order to try to improve the performance of ML-CFS on cancer-related microarray gene expression datasets, we proposed three versions of the ML-CFS method that exploit biological knowledge.
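The abstract does not spell out the extended merit function (that is left to Chapter 4). As a rough illustration only, the classical CFS merit, Merit(S) = k·r_cf / sqrt(k + k(k−1)·r_ff), can be adapted to the multi-label case by averaging the feature-label correlation over every label column, and searched with forward hill-climbing. The sketch below is a generic reconstruction under those assumptions; the use of Pearson correlation and the helper names are illustrative, not the thesis's exact formulation:

```python
import math
import numpy as np

def ml_cfs_merit(X, Y, subset):
    """Sketch of a multi-label CFS merit: the standard CFS formula
    with the feature-label correlation averaged over all label columns.
    Merit(S) = k * r_cf / sqrt(k + k*(k-1) * r_ff)."""
    k = len(subset)
    if k == 0:
        return 0.0
    # average absolute feature-label correlation, over all labels
    r_cf = np.mean([abs(np.corrcoef(X[:, f], Y[:, l])[0, 1])
                    for f in subset for l in range(Y.shape[1])])
    # average absolute feature-feature correlation within the subset
    if k == 1:
        r_ff = 0.0
    else:
        pairs = [(f, g) for i, f in enumerate(subset) for g in subset[i + 1:]]
        r_ff = np.mean([abs(np.corrcoef(X[:, f], X[:, g])[0, 1])
                        for f, g in pairs])
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def hill_climb(X, Y):
    """Forward hill-climbing: greedily add the feature that most
    improves the merit; stop when no addition helps."""
    subset, best = [], 0.0
    while True:
        candidates = [f for f in range(X.shape[1]) if f not in subset]
        scored = [(ml_cfs_merit(X, Y, subset + [f]), f) for f in candidates]
        if not scored:
            break
        merit, f = max(scored)
        if merit <= best:
            break
        subset, best = subset + [f], merit
    return subset, best
```

Because the merit penalises redundancy (the r_ff term in the denominator), the search naturally favours small subsets of features that each correlate with the labels but not with one another.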
These ML-CFS versions are also based on hill-climbing search, but the merit function was modified in a way that favours the selection of genes (features) involved in pre-defined cancer-related pathways, as discussed in detail in Chapter 5.

Lastly, we proposed two more sophisticated versions of ML-CFS based on genetic algorithms (rather than hill-climbing) as the search method. The first GA-based version uses a conventional single-objective GA, where there is only one objective to be optimized; the second performs lexicographic multi-objective optimization, where there are two objectives to be optimized, as discussed in detail in Chapter 6.

In this thesis, all proposed ML-CFS methods for multi-label classification problems were evaluated by measuring the predictive accuracies obtained by two well-known multi-label classification algorithms when using the selected features, namely the Multi-Label k-Nearest Neighbours (ML-kNN) algorithm and the Back-Propagation Multi-Label Learning (BPMLL) neural network algorithm.

In general, the results obtained by the best of the proposed ML-CFS methods, a GA-based one, were competitive with those of other multi-label feature selection methods and baseline approaches. More precisely, one of our GA-based methods achieved the second-best predictive accuracy out of all methods compared (with both ML-kNN and BPMLL as classifiers), but there was no statistically significant difference in predictive accuracy between that GA-based ML-CFS and the best method.
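Lexicographic multi-objective optimization, mentioned above, can be illustrated generically: the primary objective decides between two candidates unless they are approximately tied, in which case the secondary objective breaks the tie. The sketch below uses hypothetical (merit, subset-size) fitness pairs and a binary tournament; the thesis's actual objectives, tolerance, and GA operators are those described in Chapter 6, not this code:

```python
import random

def lex_better(a, b, tol=1e-3):
    """Lexicographic comparison of two candidates.
    a and b are hypothetical (merit, n_features) pairs. Merit is the
    primary objective; only when the merits are within `tol` of each
    other does the secondary objective (fewer features) decide."""
    merit_a, size_a = a
    merit_b, size_b = b
    if abs(merit_a - merit_b) > tol:
        return merit_a > merit_b
    return size_a < size_b

def tournament(pop, fitness, k=2, tol=1e-3):
    """Binary tournament selection under the lexicographic criterion:
    sample k individuals and return the lexicographically best one."""
    contenders = random.sample(pop, k)
    best = contenders[0]
    for c in contenders[1:]:
        if lex_better(fitness(c), fitness(best), tol):
            best = c
    return best
```

The appeal of the lexicographic scheme over a weighted sum is that it needs no weight tuning: subset size only matters when predictive quality is effectively tied.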
In addition, in the experiments with ML-kNN, the most accurate method selects about twice as many features as our GA-based ML-CFS; whilst in the experiments with BPMLL, the most accurate method was a baseline that performs no feature selection and runs the classifier once (with all original features) for each of the many class labels, which is a very computationally expensive baseline approach.

In summary, one of the proposed GA-based ML-CFS methods achieved substantial data reduction (selecting a much smaller subset of relevant features) without a significant decrease in predictive accuracy with respect to the most accurate method.