Groce, A.; Kulesza, T.; Zhang, C.; Shamasunder, S.; Burnett, M.; Wong, W-K; Stumpf, S.; Das, S.; Shinsel, A.; Bice, F.; McIntosh, K. (2014)
Publisher: Institute of Electrical and Electronics Engineers
Languages: English
Types: Article
Subjects: QA75
How do you test a program when only a single user, with no expertise in software testing, is able to determine if the program is performing correctly? Such programs are common today in the form of machine-learned classifiers. We consider the problem of testing this common kind of machine-generated program when the only oracle is an end user: e.g., only you can determine if your email is properly filed. We present test selection methods that provide very good failure rates even for small test suites, and show that these methods work in both large-scale random experiments using a “gold standard” and in studies with real users. Our methods are inexpensive and largely algorithm-independent. Key to our methods is an exploitation of properties of classifiers that is not possible in traditional software testing. Our results suggest that it is plausible for time-pressured end users to interactively detect failures—even very hard-to-find failures—without wading through a large number of successful (and thus less useful) tests. We additionally show that some methods are able to find the arguably most difficult-to-detect faults of classifiers: cases where machine learning algorithms have high confidence in an incorrect result.
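The abstract does not name the classifier properties its methods exploit. As a minimal sketch, assume one such property is the classifier's own confidence in each prediction: show the user the items the classifier is least sure about, then measure the failure rate of that small suite against the user's judgments. The function names, the sample probabilities, and the least-confidence ranking are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def select_low_confidence(prob_rows: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k items whose top class probability is lowest.

    Hypothetical helper: ranks items by the classifier's confidence
    (max class probability) and picks the least confident ones as tests.
    """
    ranked = sorted(range(len(prob_rows)), key=lambda i: max(prob_rows[i]))
    return ranked[:k]


def failure_rate(selected: Sequence[int],
                 predicted: Sequence[int],
                 oracle: Sequence[int]) -> float:
    """Fraction of selected tests where the prediction disagrees with the oracle.

    Here `oracle` stands in for the end user's judgment of each item.
    """
    if not selected:
        return 0.0
    failures = sum(1 for i in selected if predicted[i] != oracle[i])
    return failures / len(selected)


# Made-up class probabilities for five items (two classes each).
probs = [
    [0.95, 0.05],
    [0.55, 0.45],
    [0.51, 0.49],
    [0.80, 0.20],
    [0.60, 0.40],
]
# Predicted label is the most probable class for each item.
predicted = [row.index(max(row)) for row in probs]
# Made-up user judgments: items 1 and 2 are misclassified.
oracle = [0, 1, 1, 0, 0]

suite = select_low_confidence(probs, k=2)
suite_rate = failure_rate(suite, predicted, oracle)
overall_rate = failure_rate(range(len(probs)), predicted, oracle)
print(suite, suite_rate, overall_rate)
```

On this toy data the two least-confident items are both failures, so the small suite has a higher failure rate than the full set. A ranking like this cannot, by construction, surface the case the abstract calls most difficult: an incorrect result that the classifier holds with high confidence. Finding those would need a selection signal other than confidence.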



Funded by projects

  • NSF | HCC-Medium: End-user debugg...
