Reiss
Department of Computer Science
Brown University
Providence, RI, USA
{qx5,spr}@cs.brown.edu

ABSTRACT

A typical automatic program repair technique that uses a test suite as the correctness criterion can produce a patched program that is test-suite-overfitted, or overfitting, which passes the test suite but does not actually repair the bug. In this paper, we propose DiffTGen, which identifies a patched program as overfitting by first generating new test inputs that uncover semantic differences between the original faulty program and the patched program, then testing the patched program based on the semantic differences, and finally generating test cases. Such a test case could be added to the original test suite to make it stronger and could prevent the repair technique from generating a similar overfitting patch again. We evaluated DiffTGen on 89 patches generated by four automatic repair techniques for Java, with 79 of them being likely to be overfitting and incorrect. DiffTGen identifies in total 39 (49.4%) overfitting patches and yields the corresponding test cases. With the fixed version of a faulty program being the oracle, the average running time is about 7 minutes. We further show that an automatic repair technique, if configured with DiffTGen, could avoid yielding overfitting patches and potentially produce correct ones.

Keywords

patch overfitting; test case generation;
automatic program repair

1. INTRODUCTION

Given a faulty program and a fault-exposing test suite, an automatic program repair technique aims to produce a correct, patched program that passes the test suite. Being automatic, such a technique could potentially save people significant time and effort. Over the past decade, a variety of automatic repair techniques [5, 9, 14–16, 21–23, 37–39] have been developed. Current repair techniques, however, are still far from maturity: they often yield an overfitting patched program, which passes the test suite but does not actually repair the bug. Studies [30, 33] have shown that early repair techniques suffer from severe overfitting problems. According to [30], the majority of patches generated by GenProg [5], AE [37] and RSRepair [29] are incorrect. More recent techniques look at many
other methods (e.g., using human-written patches [10], repair templates and condition synthesis [15], bug-fixing instances [14, 16], and forbidden modifications [34]) for repair. However, their repair performance is still relatively poor. Within a 12-hour time limit, the state-of-the-art repair techniques SPR [15] and Prophet [16] generated plausible patches that pass the test suite for fewer than 60% of the bugs in a dataset containing
69 bugs, with more than 60% of the plausible patches (the first found ones) being incorrect.

The low quality of a test suite is a critical reason why an overfitting patch might be generated. Unlike a formal specification, the specification encoded in a test suite is typically weak and incomplete. For example, the fault-exposing test case in the test suite associated with the bug Math_85 from the Defects4J dataset [8] simply checks whether a method returns a result without any exception thrown, but does not check the correctness of the result. A patch generated by jGenProg (the Java version of GenProg [5]) simply removes the erroneous statement that triggers the exception without actually repairing it. The patch avoids the unexpected exception but deletes the expected functionality of the original program and thus introduces new bugs (a minimal sketch of such a weak test case and a deletion patch that passes it is shown below). It is not surprising that a test suite sometimes contains such weak test cases, since the test suite is designed for humans and not for machines, and a human seldom makes an unreasonable patch by deleting the desirable functionality of a program. However, such a weak test suite harms the performance of an automatic repair technique, e.g., jGenProg. When a patched program that passes the test suite is generated, jGenProg would simply accept it, as there is no extra knowledge other than the given test suite to validate its correctness.

In this paper, we propose DiffTGen, a patch testing technique to be used in the context of automatic program repair. DiffTGen identifies overfitting patches generated by an automatic repair technique through test case generation. Based on the syntactic differences between a faulty program and a patched program, DiffTGen employs an external test generator to generate test methods (test inputs) that could exercise at least one of the syntactic differences upon execution. To actually find any semantic difference, DiffTGen instruments the two programs, runs the programs against the generated test method, and compares ........
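To make the weak-test-case problem concrete, the following is a minimal, hypothetical sketch in the spirit of the Math_85 example above. The class, method, and input value (SolverUnderTest, solve, 3.0) are illustrative only and are not the actual Commons Math code or the actual Defects4J test; the point is that the test only checks that no exception escapes and never inspects the returned value, so a patch that simply deletes the exception-raising statement passes it while discarding intended functionality.

import org.junit.Test;
import static org.junit.Assert.fail;

// Hypothetical program under test (not the real Commons Math code).
class SolverUnderTest {
    double solve(double x) {
        // Buggy guard: it is too strict (it should reject only x < 0), so a
        // valid input such as 3.0 triggers the exception and fails the test.
        // An overfitting patch can make the test pass by deleting this whole
        // 'if' statement, thereby also deleting the legitimate check for
        // negative inputs; the correct fix changes the condition to x < 0.
        if (x <= 3.0) {
            throw new IllegalArgumentException("x out of range");
        }
        return Math.sqrt(x);
    }
}

public class WeakTestExample {
    // A weak fault-exposing test: it only asserts that no exception is
    // thrown and never compares the result against an expected value.
    @Test
    public void testNoExceptionOnly() {
        try {
            new SolverUnderTest().solve(3.0);   // result ignored
        } catch (Exception e) {
            fail("unexpected exception: " + e);
        }
    }
}

Both the correct fix and the deletion patch pass this weak test; only a test that exercises the deleted behavior, e.g., one asserting that solve(-1.0) still throws, can tell them apart, which is the kind of test case DiffTGen aims to generate.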
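DiffTGen itself relies on an external test generator and program instrumentation, so the following is only a schematic sketch, under assumed names (Version, solve, exposesOverfitting), of the differential idea described above and in the abstract: run the original and the patched program on the same generated input, treat a difference in their observable outputs as a semantic difference, and, when the developer-fixed version is available as an oracle, flag the patch as overfitting if its output also disagrees with the oracle.

// Schematic sketch only; not DiffTGen's implementation. 'Version' is a
// hypothetical wrapper around one compiled variant of the program, and
// 'solve' stands for the method touched by the patch.
public class DifferentialCheckSketch {

    interface Version {
        double solve(double x);
    }

    // Capture either the returned value or the thrown exception type as the
    // observable output of one run.
    static Object run(Version v, double input) {
        try {
            return v.solve(input);
        } catch (RuntimeException e) {
            return e.getClass().getName();
        }
    }

    // A generated input exposes the patch as overfitting if it (1) produces a
    // semantic difference between the original and the patched program and
    // (2) makes the patched program's output disagree with the fixed (oracle)
    // version's output.
    static boolean exposesOverfitting(Version original, Version patched,
                                      Version fixedOracle, double generatedInput) {
        Object outOriginal = run(original, generatedInput);
        Object outPatched = run(patched, generatedInput);
        Object outOracle = run(fixedOracle, generatedInput);
        return !outOriginal.equals(outPatched) && !outPatched.equals(outOracle);
    }
}

An input for which exposesOverfitting returns true can then be turned into a concrete test case, with the oracle's output as the expected value, and added to the original test suite so that the repair technique does not produce the same overfitting patch again.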