From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS31976 209.132.180.0/23 X-Spam-Status: No, score=-7.2 required=3.0 tests=AWL,BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,RCVD_IN_DNSWL_HI,RP_MATCHES_RCVD shortcircuit=no autolearn=ham autolearn_force=no version=3.4.0 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by dcvr.yhbt.net (Postfix) with ESMTP id 563712018A for ; Fri, 1 Jul 2016 17:04:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752230AbcGAREj (ORCPT ); Fri, 1 Jul 2016 13:04:39 -0400 Received: from alum-mailsec-scanner-6.mit.edu ([18.7.68.18]:43122 "EHLO alum-mailsec-scanner-6.mit.edu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752201AbcGAREh (ORCPT ); Fri, 1 Jul 2016 13:04:37 -0400 X-AuditID: 12074412-0b3ff7000000491f-35-5776a2a35b7d Received: from outgoing-alum.mit.edu (OUTGOING-ALUM.MIT.EDU [18.7.68.33]) by (Symantec Messaging Gateway) with SMTP id E3.6C.18719.3A2A6775; Fri, 1 Jul 2016 13:04:36 -0400 (EDT) Received: from [192.168.69.130] (p508EA4A4.dip0.t-ipconnect.de [80.142.164.164]) (authenticated bits=0) (User authenticated as mhagger@ALUM.MIT.EDU) by outgoing-alum.mit.edu (8.13.8/8.12.4) with ESMTP id u61H4VZG014440 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES128-SHA bits=128 verify=NOT); Fri, 1 Jul 2016 13:04:32 -0400 Subject: diff heuristics dramatically improved by considering line indentation and blank lines To: Stefan Beller , Jeff King References: <20160616174620.1011-1-sbeller@google.com> <20160617153637.GA9314@sigill.intra.peff.net> <576C1803.5050905@alum.mit.edu> <57752478.1000302@alum.mit.edu> Cc: "git@vger.kernel.org" , Jacob Keller , Junio C Hamano From: Michael Haggerty Message-ID: <5776A29F.4020609@alum.mit.edu> Date: Fri, 1 Jul 2016 19:04:31 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Icedove/38.8.0 MIME-Version: 1.0 In-Reply-To: <57752478.1000302@alum.mit.edu> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFtrHKsWRmVeSWpSXmKPExsUixO6iqLtkUVm4wcGf4hZdV7qZLBp6rzBb 7F7cz2zxd888NosfLT3MFotuHWC1+Pj+PZvFuQu9zBabN7ezOHB6LJz1nMVj56y77B4LNpV6 POvdw+hx8ZKyx+dNcgFsUdw2SYklZcGZ6Xn6dgncGeee6hR8l6pYffc8SwPjecEuRk4OCQET ieWr1zB3MXJxCAlsZZTY0jGRFcI5zyTRvvsSE0iVsECCRP/Mb8wgtoiAo8SLZwvZIYreM0o8 P9zEBOIwC/QwSqzd+ZENpIpNQFdiUU8zWDevgLbErz2LwOIsAioSM+c8B4uLCoRInF+3lRWi RlDi5MwnLCA2p4COxLZ7EL3MAuoSf+ZdYoaw5SW2v53DPIGRfxaSlllIymYhKVvAyLyKUS4x pzRXNzcxM6c4NVm3ODkxLy+1SNdMLzezRC81pXQTIyT8hXYwrj8pd4hRgINRiYdXYHZpuBBr YllxZe4hRkkOJiVRXpfpZeFCfEn5KZUZicUZ8UWlOanFhxglOJiVRHgFgFEnxJuSWFmVWpQP k5LmYFES5/25WN1PSCA9sSQ1OzW1ILUIJivDwaEkwft9IVCjYFFqempFWmZOCUKaiYMTZDiX lEhxal5KalFiaUlGPCgq44uBcQmS4gHaexyknbe4IDEXKArReopRl+PI/ntrmYRY8vLzUqXE eZ+CFAmAFGWU5sGtgCW7V4ziQB8L8zKBvMADTJRwk14BLWECWsJcWgyypCQRISXVwKinc/3n Fg5VprnBs0KEr5+30pIJtue/8qd5pdr8lKSlk8/quq1z+f6X/WtNT8wak6CLh4r7JPt5ph+L C+pt8HkvnnL/s+yXwnqZzUYv9mTLG0kHJ3Y5q3DG9i5nZGw69UV6jvTW6aH+j3kf7zM/56xc 9Vm80OKAYBRfJYeDwxnbu+u9Di9dpMRSnJFoqMVcVJwIAJtEkUdRAwAA Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org On 06/30/2016 03:54 PM, Michael Haggerty wrote: > On 06/23/2016 07:10 PM, Michael Haggerty wrote: >> On 06/17/2016 06:09 PM, Stefan Beller wrote: >>> I think before spending more time on discussing and implementing new >>> (hopefully better) heuristics, I'd want to step back and try to be a bit more >>> systematic, i.e. I'll want to collect lots of test cases in different languages >>> and use cases. A mini test suite, maybe? >> >> Stefan, >> >> I've also been playing around with diff heuristics and also trying to >> find some good test data. Have you put together some test cases yet? > > I put quite of work into tooling to evaluate diff heuristics, and just > published it on GitHub: > > https://github.com/mhagger/diff-slider-tools Today I hand-optimized about 2700 diff sliders in 12 different repositories and used that as a reference corpus against which I tested the accuracy of four different algorithms. The tools and the hand-optimized values are now in the GitHub repository mentioned above. See also this pull request [1]. If I didn't make any mistakes, the number of errors (i.e., cases where the algorithm choose a different slider shift than the one I picked by hand) look like this: > | repository | default | compaction | indent | indent-favor-edges | > | -------------------- | ------- | ---------- | ------ | ------------------ | > | ant | 225 | 102 | 7 | 7 | > | bugzilla | 208 | 81 | 14 | 14 | > | couchdb | 44 | 24 | 13 | 10 | > | docker | 180 | 160 | 29 | 18 | > | git | 446 | 59 | 27 | 19 | > | ipython | 358 | 163 | 61 | 11 | > | junit | 146 | 67 | 5 | 5 | > | nodejs | 489 | 78 | 12 | 12 | > | phpmyadmin | 330 | 49 | 1 | 0 | > | test-more | 15 | 2 | 2 | 0 | > | test-unit | 33 | 13 | 4 | 4 | > | -------------------- | ------- | ---------- | ------ | ------------------ | > | totals | 2474 | 798 | 175 | 100 | The algorithms are: * default -- the default behavior `git diff` on the current master * compaction -- `git diff --compaction-heuristic` * indent -- the indent-based algorithm as reported earlier in this thread * indent-favor-edges -- the indent-based algorithm, with some improved heuristics regarding sliders near the begin/end of file and probably one or two fixes I encourage other people to run the tests against some code samples in their favorite programming languages so that the testing can cover a wider range of inputs, and submit your data to the GitHub project. Also feel free to tweak the heuristic (there's probably still room for improvement). All the tools and raw data for my experiments are already there. I'll be going on vacation for the next two weeks so I probably can't follow up on this for a while. But I think we should build an algorithm like this into Git! Michael [1] https://github.com/mhagger/diff-slider-tools/pull/1