* Optimizing writes to unchanged files during merges?
From: Linus Torvalds @ 2018-04-12 21:14 UTC
To: Junio C Hamano; +Cc: Git Mailing List

So I just had an interesting experience that has happened before too, but this time I decided to try to figure out *why* it happened.

I'm obviously in the latter part of the kernel merge window, and things are slowly calming down. I do the second XFS merge during this window, and it brings in updates just to the fs/xfs/ subdirectory, so I expect that my test build for the full kernel configuration should take about a minute.

Instead, it recompiles pretty much *everything*, and the test build takes 22 minutes.

This happens occasionally, and I blame gremlins. But this time I decided to look at what the gremlins actually *are*.

The diff that git shows for the pull was very clear: only fs/xfs/ changes. But doing

    ls -tr $(git ls-files '*.[chS]') | tail -10

gave the real reason: in between all the fs/xfs/xyz files was this:

    include/linux/mm.h

and yeah, that rather core header file causes pretty much everything to be re-built.

Now, the reason it was marked as changed is that the xfs branch _had_ in fact changed it, but the changes were already upstream and got merged away. But the file still got written out (with the same contents it had before the merge), and 'make' obviously only looks at modification time, so make rebuilt everything.

Now, because it's still the merge window, I don't have much time to look into this, but I was hoping somebody in git land would like to give it a quick look. I'm sure I'm not the only one to have ever been hit by this, and I don't think the kernel is the only project to hit it either.

Because it would be lovely if the merging logic would just notice "oh, that file doesn't change", and not even write out the end result.
For testing, the merge that triggered this git introspection is kernel commit 80aa76bcd364 ("Merge tag 'xfs-4.17-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux"), which can act as a test-case. It's a clean merge, so no kernel knowledge necessary: just re-merge the parents and see if the modification time of include/linux/mm.h changes.

I'm guessing some hack in update_file_flags() to say "if the new object matches the old one, don't do anything" might work. Although I didn't look if we perhaps end up writing to the working tree copy earlier already. Looking at the blame output of that function, most of it is really old, so this "problem" goes back a long long time.

Just to clarify: this is absolutely not a correctness issue. It's not even a *git* performance issue. It's literally just a "not updating the working tree when things don't change results in better behavior for other tools".

So if somebody gets interested in this problem, that would be great. And if not, I'll hopefully remember to come back to this next week when the merge window is over, and take a second look.

Linus
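[Editor's note] The effect described above can be reproduced without the kernel tree. The following is a hedged, stand-alone sketch using a tiny synthetic repo (all file names and contents are made up): a side branch changes a header, the identical change lands on the main branch, and we watch whether merging the side branch rewrites the now-unchanged file. Whether the last line reports "rewritten" or "left alone" depends on the git version in use.

```shell
#!/bin/sh
# Synthetic reproduction of "merge rewrites a file whose merged
# contents are identical to what we already had".
set -e
cd "$(mktemp -d)"
git init -q
git config user.email you@example.com
git config user.name you
git checkout -qb main
echo 'struct page;' >mm.h
echo 'int xfs;' >xfs.c
git add mm.h xfs.c
git commit -qm base

git checkout -qb xfs-side
echo 'struct page { int x; };' >mm.h
git commit -qam 'header change on side branch'
echo 'int xfs2;' >xfs.c
git commit -qam 'xfs work'

git checkout -q main
echo 'struct page { int x; };' >mm.h
git commit -qam 'same header change lands upstream'

before=$(stat -c %Y mm.h)      # GNU stat; BSD stat uses -f %m
sleep 1
git merge -q -m merge xfs-side
after=$(stat -c %Y mm.h)
if test "$before" = "$after"
then
	echo "mm.h left alone"
else
	echo "mm.h rewritten (make would rebuild everything)"
fi
```

Either way the merge itself is clean; only the mtime of mm.h differs between the two outcomes.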
* Re: Optimizing writes to unchanged files during merges?
From: Junio C Hamano @ 2018-04-12 21:46 UTC
To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> Now, the reason it was marked as changed is that the xfs branch _had_
> in fact changed it, but the changes were already upstream and got
> merged away. But the file still got written out (with the same
> contents it had before the merge), and 'make' obviously only looks at
> modification time, so make rebuilt everything.

Thanks for a clear description of the issue. It does sound interesting.
* Re: Optimizing writes to unchanged files during merges?
From: Junio C Hamano @ 2018-04-12 23:17 UTC
To: Linus Torvalds; +Cc: Git Mailing List

Junio C Hamano <gitster@pobox.com> writes:

> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
>> Now, the reason it was marked as changed is that the xfs branch _had_
>> in fact changed it, but the changes were already upstream and got
>> merged away. But the file still got written out (with the same
>> contents it had before the merge), and 'make' obviously only looks at
>> modification time, so make rebuilt everything.
>
> Thanks for a clear description of the issue. It does sound
> interesting.

A bit of detour. "Change in side branch happened to be a subset of the change in trunk and gets subsumed, but we end up writing the same result" happens also with the simpler resolve strategy. Here is a fix.

 git-merge-one-file.sh | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/git-merge-one-file.sh b/git-merge-one-file.sh
index 9879c59395..aa7392f7ff 100755
--- a/git-merge-one-file.sh
+++ b/git-merge-one-file.sh
@@ -137,11 +137,21 @@ case "${1:-.}${2:-.}${3:-.}" in
 		ret=1
 	fi

+	# Does the merge result happen to be identical to what we
+	# already have? Then do not cause unnecessary recompilation.
+	# But don't bother the optimization if we need to chmod
+	if cmp -s "$4" "$src1" && test "$6" = "$7"
+	then
+		:; # happy
+	else
+
 	# Create the working tree file, using "our tree" version from the
 	# index, and then store the result of the merge.
 	git checkout-index -f --stage=2 -- "$4" &&
 	cat "$src1" >"$4" || exit 1
 	rm -f -- "$orig" "$src1" "$src2"
+	fi
+
 	if test "$6" != "$7"
 	then
 		if test -n "$msg"
* Re: Optimizing writes to unchanged files during merges?
From: Linus Torvalds @ 2018-04-12 23:35 UTC
To: Junio C Hamano; +Cc: Git Mailing List

On Thu, Apr 12, 2018 at 4:17 PM, Junio C Hamano <gitster@pobox.com> wrote:
>
> A bit of detour. "Change in side branch happened to be a subset of
> the change in trunk and gets subsumed, but we end up writing the
> same result" happens also with the simpler resolve strategy.
>
> Here is a fix.

Heh. Except git-merge-one-file.sh is *only* used for that case, so this doesn't fix the normal case.

I verified that the normal case situation is that "update_file_flags()" thing that checks out the end result. It's called by this case:

	} else if (a_oid && b_oid) {
		/* Case C: Added in both (check for same permissions) and */
		/* case D: Modified in both, but differently. */
		clean_merge = merge_content(o, path,
					    o_oid, o_mode, a_oid, a_mode,
					    b_oid, b_mode, NULL);

in process_entry(), and I think we could just add a test there for o_oid,o_mode == a_oid,a_mode or something?

Linus
* Re: Optimizing writes to unchanged files during merges?
From: Linus Torvalds @ 2018-04-12 23:41 UTC
To: Junio C Hamano; +Cc: Git Mailing List

On Thu, Apr 12, 2018 at 4:35 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> in process_entry(), and I think we could just add a test there for
> o_oid,o_mode == a_oid,a_mode or something?

Actually, not process_entry, but merge_content(). Oddly, that *already* has the check:

	if (mfi.clean && !df_conflict_remains &&
	    oid_eq(&mfi.oid, a_oid) && mfi.mode == a_mode) {
		int path_renamed_outside_HEAD;
		output(o, 3, _("Skipped %s (merged same as existing)"), path);

but that doesn't seem to actually trigger for some reason.

But the code really seems to have the _intention_ of skipping the case where the result ended up the same as the source.

Maybe I'm missing something.

Linus
* Re: Optimizing writes to unchanged files during merges?
From: Linus Torvalds @ 2018-04-12 23:55 UTC
To: Junio C Hamano, Elijah Newren; +Cc: Git Mailing List

[ Talking to myself ]

On Thu, Apr 12, 2018 at 4:41 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Oddly, that *already* has the check:
>
>	if (mfi.clean && !df_conflict_remains &&
>	    oid_eq(&mfi.oid, a_oid) && mfi.mode == a_mode) {
>		int path_renamed_outside_HEAD;
>		output(o, 3, _("Skipped %s (merged same as existing)"), path);
>
> but that doesn't seem to actually trigger for some reason.

Actually, that check triggers just fine.

> But the code really seems to have the _intention_ of skipping the case
> where the result ended up the same as the source.
>
> Maybe I'm missing something.

The later check that does

	/*
	 * The content merge resulted in the same file contents we
	 * already had. We can return early if those file contents
	 * are recorded at the correct path (which may not be true
	 * if the merge involves a rename).
	 */
	path_renamed_outside_HEAD = !path2 || !strcmp(path, path2);
	if (!path_renamed_outside_HEAD) {

will see that 'path2' is NULL, and not trigger this early-out case, and then this all falls back to the normal cases after all.

So I think that 'path_renamed_outside_HEAD' logic is somehow broken. Did it perhaps mean to say

	path_renamed_outside_HEAD = path2 && !strcmp(path, path2);

instead?

See commit 5b448b853 ("merge-recursive: When we detect we can skip an update, actually skip it") which really implies we want to actually skip it (but then we don't anyway).

Also see commit b2c8c0a76 ("merge-recursive: When we detect we can skip an update, actually skip it") which was an earlier version, and which *actually* skipped it, but it was reverted because of that rename issue.
Adding Elijah Newren to the cc, because he was working on this back then, and is still around, and still working on merging ;)

Linus
* Re: Optimizing writes to unchanged files during merges?
From: Linus Torvalds @ 2018-04-13 0:01 UTC
To: Junio C Hamano, Elijah Newren; +Cc: Git Mailing List

[ Still talking to myself. Very soothing. ]

On Thu, Apr 12, 2018 at 4:55 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> [ Talking to myself ]
>
> Did it perhaps mean to say
>
>	path_renamed_outside_HEAD = path2 && !strcmp(path, path2);
>
> instead?

Probably not correct, but making that change makes my test-case work.

It probably breaks something important, though. I didn't look at why that code happened in the first place.

But it's nice to know I'm at least looking at the right code while I'm talking to myself.

Linus
* Re: Optimizing writes to unchanged files during merges?
From: Elijah Newren @ 2018-04-13 7:02 UTC
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

On Thu, Apr 12, 2018 at 5:01 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> [ Still talking to myself. Very soothing. ]
>
> On Thu, Apr 12, 2018 at 4:55 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> [ Talking to myself ]
>>
>> Did it perhaps mean to say
>>
>>	path_renamed_outside_HEAD = path2 && !strcmp(path, path2);
>>
>> instead?
>
> Probably not correct, but making that change makes my test-case work.
>
> It probably breaks something important, though. I didn't look at why
> that code happened in the first place.
>
> But it's nice to know I'm at least looking at the right code while I'm
> talking to myself.

I hope you don't mind me barging into your conversation, but I figured out some interesting details.

What we want here is that if there are no content conflicts and the contents match the version of the file from HEAD, and there are no mode conflicts and the final mode matches what was in HEAD, and there are no path conflicts (e.g. a rename/rename(1to2) issue or a D/F conflict putting a directory in the way) and the final path matches what was already in HEAD, then we can skip the update.

What does this look like in code? Well, most of it was already there:

	if (mfi.clean && !df_conflict_remains &&
	    oid_eq(&mfi.oid, a_oid) && mfi.mode == a_mode) {

That covers everything except "the final path matches what was already in HEAD". How do we check for that? The current code is just making an approximation; your modification improves the approximation. However, it turns out we have this awesome function called "was_tracked(const char *path)" that was intended for answering this exact question.
So, assuming was_tracked() isn't buggy, the correct patch for this problem would look like:

-	path_renamed_outside_HEAD = !path2 || !strcmp(path, path2);
-	if (!path_renamed_outside_HEAD) {
+	if (was_tracked(path)) {

Turns out that patch was already submitted as c5b761fb ("merge-recursive: ensure we write updates for directory-renamed file", 2018-02-14). While that patch was for a different bug, it is identical to what I would have proposed to fix this bug. A big series including that patch was merged to master two days ago, but unfortunately that exact patch was the one that caused some impressively awful fireworks[1]. So it, along with the rest of the series it was part of, was reverted just a few hours later.

As it turns out, the cause of the problem is that was_tracked() can lie when renames are involved. It just hadn't ever bitten us yet.

I have a fix for was_tracked(), and 10 additional testcases covering interesting should-be-skipped and should-not-be-skipped-but-is-close-to-a-should-be-skipped scenarios, verifying the skipping actually happens if and only if it should happen. That should fix your bug, and the bug with my series.

Rough WIP can be seen at the testme branch in github.com/newren/git for the curious, but I need to sleep and have a bunch of other things on my plate for the next few days. But I thought I'd at least mention what I've found so far.

Elijah

[1] https://public-inbox.org/git/xmqqefjm3w1h.fsf@gitster-ct.c.googlers.com/
* Re: Optimizing writes to unchanged files during merges?
From: Linus Torvalds @ 2018-04-13 17:14 UTC
To: Elijah Newren; +Cc: Junio C Hamano, Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 2434 bytes --]

On Fri, Apr 13, 2018 at 12:02 AM, Elijah Newren <newren@gmail.com> wrote:
>
> I hope you don't mind me barging into your conversation

I was getting tired of my own rambling anyway..

> However, it turns out we have this awesome function called
> "was_tracked(const char *path)" that was intended for answering this
> exact question. So, assuming was_tracked() isn't buggy, the correct
> patch for this problem would look like:

Apparently that causes problems, for some odd reason.

I like the notion of checking the index, but it's not clear that the index is reliable in the presence of renames either.

> A big series
> including that patch was merged to master two days ago, but
> unfortunately that exact patch was the one that caused some
> impressively awful fireworks[1].

Yeah, so this code is fragile.

How about we take a completely different approach? Instead of relying on fragile (but clever) tests, why not rely on stupid brute force?

Yeah, yeah, it's bad to be stupid, but sometimes simple and stupid really does work.

See the attached patch. It gets rid of all the subtle "has this been renamed" tests entirely, and just _always_ does that final update_file(). But instead, it makes update_file() a bit smarter, and says "before we write this file out, let's see if it's already there and already has the expected contents"?

Now, it really shouldn't be _quite_ as stupid as that: we should probably set a flag in the "hey, the oid matches, maybe it's worth checking" case, so that it doesn't do the check in the cases where we know the merge has done things.
But it's actually *fairly* low cost, because before it reads the file it at least checks that the file length matches the expected length (and that the user permission bits match the expected mode). So if the file doesn't match, most of the time the real cost will just be an extra open/fstat/close() sequence. That's pretty cheap.

So even the completely stupid approach is probably not too bad, and it could be made smarter.

NOTE! I have *NOT* tested this on anything interesting. I tested the patch on my stupid test-case, but that's it. I didn't even bother re-doing the kernel merge that started this.

Comments? Because considering the problems this code has had, maybe "stupid" really is the right approach...

[ Ok, I lied. I just tested this on the kernel merge. It worked fine, and avoided modifying <linux/mm.h> ]

Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 2221 bytes --]

 merge-recursive.c | 42 +++++++++++++++++++++++++++++-------------
 1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/merge-recursive.c b/merge-recursive.c
index 0c0d48624..ed2200065 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -815,6 +815,32 @@ static int make_room_for_path(struct merge_options *o, const char *path)
 	return err(o, msg, path, _(": perhaps a D/F conflict?"));
 }

+static int working_tree_matches(const char *path, const char *buf,
+				unsigned long size, unsigned mode)
+{
+	int fd, matches;
+	struct stat st;
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0)
+		return 0;
+	matches = 0;
+	if (!fstat(fd, &st) && st.st_size == size &&
+	    S_ISREG(st.st_mode) && !(0700 & (st.st_mode ^ mode))) {
+		char tmpbuf[1024];
+		while (size) {
+			int n = read(fd, tmpbuf, sizeof(tmpbuf));
+			if (n <= 0 || n > size)
+				break;
+			if (memcmp(tmpbuf, buf, n))
+				break;
+			buf += n;
+			size -= n;
+		}
+		matches = !size;
+	}
+	close(fd);
+	return matches;
+}
+
 static int update_file_flags(struct merge_options *o,
 			     const struct object_id *oid,
 			     unsigned mode,
@@ -856,6 +882,8 @@ static int update_file_flags(struct merge_options *o,
 			size = strbuf.len;
 			buf = strbuf_detach(&strbuf, NULL);
 		}
+		if (working_tree_matches(path, buf, size, mode))
+			goto free_buf;
 	}

 	if (make_room_for_path(o, path) < 0) {
@@ -1782,20 +1810,8 @@ static int merge_content(struct merge_options *o,

 	if (mfi.clean && !df_conflict_remains &&
 	    oid_eq(&mfi.oid, a_oid) && mfi.mode == a_mode) {
-		int path_renamed_outside_HEAD;
 		output(o, 3, _("Skipped %s (merged same as existing)"), path);
-		/*
-		 * The content merge resulted in the same file contents we
-		 * already had. We can return early if those file contents
-		 * are recorded at the correct path (which may not be true
-		 * if the merge involves a rename).
-		 */
-		path_renamed_outside_HEAD = !path2 || !strcmp(path, path2);
-		if (!path_renamed_outside_HEAD) {
-			add_cacheinfo(o, mfi.mode, &mfi.oid, path,
-				      0, (!o->call_depth), 0);
-			return mfi.clean;
-		}
+		/* We could set a flag here and pass it to "update_file()" */
 	} else
 		output(o, 2, _("Auto-merging %s"), path);
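[Editor's note] The core idea of the patch above, compare before writing so that an unchanged file keeps its mtime, is the same trick Junio's git-merge-one-file.sh fix uses via cmp. A minimal stand-alone shell sketch of that idea (write_if_changed is a made-up name, not a git helper):

```shell
#!/bin/sh
# Sketch of the "compare before write" idea: replace the destination
# only when its contents actually differ, so an unchanged file keeps
# its mtime and 'make' stays quiet.
write_if_changed () {
	# $1 = destination path, $2 = file holding the would-be new contents
	if cmp -s -- "$2" "$1"
	then
		: # identical: leave the file, and its mtime, alone
	else
		cat -- "$2" >"$1"
	fi
}

# demo: identical contents leave the mtime untouched
cd "$(mktemp -d)"
printf 'contents\n' >dst
printf 'contents\n' >src
touch -d '2001-02-03 04:05:06' dst	# GNU touch; BSD touch uses -t
write_if_changed dst src
stat -c %Y dst				# GNU stat; still the 2001 timestamp
```

This is the shell analogue only; the C version in the patch additionally short-circuits on size and permission-bit mismatches before reading any data.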
* Re: Optimizing writes to unchanged files during merges?
From: Stefan Beller @ 2018-04-13 17:39 UTC
To: Linus Torvalds; +Cc: Elijah Newren, Junio C Hamano, Git Mailing List

Hi Linus,

On Fri, Apr 13, 2018 at 10:14 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Comments? Because considering the problems this code has had, maybe
> "stupid" really is the right approach...

Would s/read/xread/ make sense in working_tree_matches? If read returns unexpectedly due to EINTR or EAGAIN, this is no correctness issue, but you'd have to recompile the kernel.

Stefan
* Re: Optimizing writes to unchanged files during merges?
From: Linus Torvalds @ 2018-04-13 17:53 UTC
To: Stefan Beller; +Cc: Elijah Newren, Junio C Hamano, Git Mailing List

On Fri, Apr 13, 2018 at 10:39 AM, Stefan Beller <sbeller@google.com> wrote:
>
> Would s/read/xread/ make sense in working_tree_matches?

Makes sense, yes.

That patch was really more of an RFD than anything that should be applied. I would like to see the "might be same" flag pushed down so that we don't do this comparison when it makes no sense, even if the stat info makes that less of an issue than it might otherwise be.

And maybe that whole function should be taught to do the "verify CE against file" check, and do the proper full cache state check (ie time etc) instead of doing the read.

So maybe the best option is a combination of the two patches: my "verify the working tree state" _together_ with the was_tracked() logic that looks up a cache entry.

Again, the problem with "look up the index state" is about files possibly being renamed as part of the merge, but if we then check the index state against the actual contents of the working directory, that isn't an issue.

Linus
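[Editor's note] The "verify CE against file" check Linus mentions, compare the cached stat data in the index against the working tree and only fall back to reading contents when the stat data is inconclusive, is roughly what the existing plumbing command git diff-files exposes. A hedged sketch (the sleep works around the "racily clean" corner case, where a same-second mtime forces a content re-check):

```shell
#!/bin/sh
# Demonstrate the index-vs-worktree check via existing plumbing:
# "git diff-files --quiet" exits 0 when the working tree matches
# the index. The usual idiom is to refresh the cached stat data
# first with "git update-index --refresh".
set -e
cd "$(mktemp -d)"
git init -q
git config user.email you@example.com
git config user.name you
echo contents >file
git add file
sleep 1					# avoid the racily-clean corner case
git update-index --refresh >/dev/null
if git diff-files --quiet -- file
then
	echo "file matches the index"
fi
echo changed >file
git diff-files --quiet -- file || echo "file differs from the index"
```

This is only an illustration of the check itself; inside a merge the complication remains that the index entry may sit at a different (renamed) path than the file being written.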
* Re: Optimizing writes to unchanged files during merges?
From: Elijah Newren @ 2018-04-13 20:04 UTC
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

On Fri, Apr 13, 2018 at 10:14 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Fri, Apr 13, 2018 at 12:02 AM, Elijah Newren <newren@gmail.com> wrote:
>> However, it turns out we have this awesome function called
>> "was_tracked(const char *path)" that was intended for answering this
>> exact question. So, assuming was_tracked() isn't buggy, the correct
>> patch for this problem would look like:
>
> Apparently that causes problems, for some odd reason.
>
> I like the notion of checking the index, but it's not clear that the
> index is reliable in the presence of renames either.

Yes, precisely. Checking the *current* index is not reliable in the presence of renames.

Trying to use the current index as a proxy for what was in the index before the merge started is a problem. But we had a copy of the index before the merge started; we just discarded it at the end of unpack_trees(). We could keep it around instead. That would also have the benefit of making the was_dirty() checks more accurate too, as using the mtimes in the current index as a proxy for what was in the original index has the potential for the same kinds of problems.

>> A big series
>> including that patch was merged to master two days ago, but
>> unfortunately that exact patch was the one that caused some
>> impressively awful fireworks[1].
>
> Yeah, so this code is fragile.
>
> How about we take a completely different approach? Instead of relying
> on fragile (but clever) tests, why not rely on stupid brute force?
>
> Yeah, yeah, it's bad to be stupid, but sometimes simple and stupid
> really does work.
> <snip>
> Comments? Because considering the problems this code has had, maybe
> "stupid" really is the right approach...

It's certainly tempting as an interim solution. I have an alternative interim solution that I think explains well why the code here had been fragile, and how to just implement what we want to know rather than making approximations to it, which I just posted at [2]. But I can see the draw of just gutting the code and replacing it with simple brute force.

Long term, I think the correct solution is still Junio's suggested rewrite[1]. My alternative is slightly closer to that end-state, so I favor it over simple brute force, but if others have strong preferences here, I can be happy with either.

Elijah

[1] https://public-inbox.org/git/xmqqd147kpdm.fsf@gitster.mtv.corp.google.com/
[2] https://public-inbox.org/git/20180413195607.18091-1-newren@gmail.com/
* Re: Optimizing writes to unchanged files during merges?
From: Junio C Hamano @ 2018-04-13 22:27 UTC
To: Elijah Newren; +Cc: Linus Torvalds, Git Mailing List

Elijah Newren <newren@gmail.com> writes:

> Yes, precisely. Checking the *current* index is not reliable in the
> presence of renames.
>
> Trying to use the current index as a proxy for what was in the index
> before the merge started is a problem. But we had a copy of the index
> before the merge started; we just discarded it at the end of
> unpack_trees(). We could keep it around instead. That would also
> have the benefit of making the was_dirty() checks more accurate too,
> as using the mtimes in the current index as a proxy for what was in
> the original index has the potential for the same kinds of problems.

Yeah, I agree that you are going in the right direction with the above.

> It's certainly tempting as an interim solution. I have an alternative
> interim solution that I think explains well why the code here had been
> fragile, and how to just implement what we want to know rather than
> making approximations to it, which I just posted at [2]. But I can
> see the draw of just gutting the code and replacing it with simple
> brute force.

Thanks; I'd love to reserve a block of time to think this through myself, but I am a bit occupied otherwise this weekend, so I'll try to catch up after that.
* Re: Optimizing writes to unchanged files during merges?
From: Junio C Hamano @ 2018-04-16 1:44 UTC
To: Linus Torvalds; +Cc: Elijah Newren, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> How about we take a completely different approach? Instead of relying
> on fragile (but clever) tests, why not rely on stupid brute force?
>
> Yeah, yeah, it's bad to be stupid, but sometimes simple and stupid
> really does work.
>
> See the attached patch. It gets rid of all the subtle "has this been
> renamed" tests entirely, and just _always_ does that final
> update_file().

I think Elijah's corrected was_tracked() also does not care "has this been renamed".

One thing that makes me curious is what happens (and what we want to happen) when such a "we already have the changes the side branch tries to bring in" path has local (i.e. not yet in the index) changes. For a dirty file that trivially merges (e.g. a path we modified since our histories forked, while the other side didn't do anything, has local changes in the working tree), we try hard to make the merge succeed while keeping the local changes, and we should be able to do the same in this case, too.

Your illustration patch that reads from the working tree to compare with the contents we are about to write out of course won't cope with this case ;-).

If we do things in the index like the approach to fix was_tracked(), I suspect that we would just say "as long as the index and the HEAD match, i.e. we are keeping the path as it is found in HEAD as the merge result, we do not touch the working tree" and leave such a local modification alone.
* Re: Optimizing writes to unchanged files during merges?
From: Linus Torvalds @ 2018-04-16 2:03 UTC
To: Junio C Hamano; +Cc: Elijah Newren, Git Mailing List

On Sun, Apr 15, 2018 at 6:44 PM, Junio C Hamano <gitster@pobox.com> wrote:
>
> I think Elijah's corrected was_tracked() also does not care "has
> this been renamed".

I'm perfectly happy with the slightly smarter patches. My patch was really just an RFC and because I had tried it out.

> One thing that makes me curious is what happens (and what we want to
> happen) when such a "we already have the changes the side branch
> tries to bring in" path has local (i.e. not yet in the index)
> changes. For a dirty file that trivially merges (e.g. a path we
> modified since our histories forked, while the other side didn't do
> anything, has local changes in the working tree), we try hard to
> make the merge succeed while keeping the local changes, and we
> should be able to do the same in this case, too.

I think it might be nice, but probably not really worth it.

I find the "you can merge even if some files are dirty" to be really convenient, because I often keep stupid test patches in my tree that I may not even intend to commit, and I then use the same tree for merging.

For example, I sometimes end up editing the Makefile for the release version early, but I won't *commit* that until I actually cut the release. But if I pull some branch that has also changed the Makefile, it's not worth any complexity to try to be nice about the dirty state.

If it's a file that actually *has* been changed in the branch I'm merging, I'm more than happy to just stage the patch (or throw it away - I think it's about 50:50 for me).
So I don't think it's a big deal, and I'd rather have the merge fail very early with "that file has seen changes in the branch you are merging" than add any real complexity to the merge logic.

Linus
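[Editor's note] The "fail very early" behavior Linus prefers is in fact what git already does when the dirty file *is* touched by the branch being merged: the merge is refused before anything is written. A small sketch demonstrating it (all names are synthetic):

```shell
#!/bin/sh
# A dirty file that the incoming branch also changes makes the merge
# refuse to start, leaving the working tree untouched.
set -e
cd "$(mktemp -d)"
git init -q
git config user.email you@example.com
git config user.name you
git checkout -qb main
echo one >file.txt
git add file.txt
git commit -qm base

git checkout -qb side
echo two >file.txt
git commit -qam 'change file on side'

git checkout -q main
echo local-experiment >>file.txt	# uncommitted local edit

if git merge side >/dev/null 2>&1
then
	echo "merge unexpectedly succeeded"
else
	echo "merge refused; file.txt still has the local edit"
fi
```

The error git prints in the refused case is the familiar "Your local changes to the following files would be overwritten by merge" message, and no merge commit is created.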
* Re: Optimizing writes to unchanged files during merges?
From: Lars Schneider @ 2018-04-16 16:07 UTC
To: Linus Torvalds; +Cc: Junio C Hamano, Elijah Newren, Git Mailing List, mgorny, rtc, winserver.support, tytso

> On 16 Apr 2018, at 04:03, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Sun, Apr 15, 2018 at 6:44 PM, Junio C Hamano <gitster@pobox.com> wrote:
>>
>> I think Elijah's corrected was_tracked() also does not care "has
>> this been renamed".
>
> I'm perfectly happy with the slightly smarter patches. My patch was
> really just an RFC and because I had tried it out.
>
>> One thing that makes me curious is what happens (and what we want to
>> happen) when such a "we already have the changes the side branch
>> tries to bring in" path has local (i.e. not yet in the index)
>> changes. For a dirty file that trivially merges (e.g. a path we
>> modified since our histories forked, while the other side didn't do
>> anything, has local changes in the working tree), we try hard to
>> make the merge succeed while keeping the local changes, and we
>> should be able to do the same in this case, too.
>
> I think it might be nice, but probably not really worth it.
>
> I find the "you can merge even if some files are dirty" to be really
> convenient, because I often keep stupid test patches in my tree that I
> may not even intend to commit, and I then use the same tree for
> merging.
>
> For example, I sometimes end up editing the Makefile for the release
> version early, but I won't *commit* that until I actually cut the
> release. But if I pull some branch that has also changed the Makefile,
> it's not worth any complexity to try to be nice about the dirty state.
> > If it's a file that actually *has* been changed in the branch I'm > merging, and I'm more than happy to just stage the patch (or throw it > away - I think it's about 50:50 for me). > > So I don't think it's a big deal, and I'd rather have the merge fail > very early with "that file has seen changes in the branch you are > merging" than add any real complexity to the merge logic. I am happy to see this discussion and the patches, because long rebuilds are a constant annoyance for us. We might have been bitten by the exact case discussed here, but more often, we have a slightly different situation: An engineer works on a task branch and runs incremental builds — all is good. The engineer switches to another branch to review another engineer's work. This other branch changes a low-level header file, but no rebuild is triggered. The engineer switches back to the previous task branch. At this point, the incremental build will rebuild everything, as the compiler thinks that the low-level header file has been changed (because the mtime is different). Of course, this problem can be solved with a separate worktree. However, our engineers forget about that sometimes, and then, they are annoyed by a 4h rebuild. Is this situation a problem for others too? If yes, what do you think about the following approach: What if Git kept a LRU list that contains file path, content hash, and mtime of any file that is removed or modified during a checkout. If a file is checked out later with the exact same path and content hash, then Git could set the mtime to the previous value. This way the compiler would not think that the content has been changed since the last rebuild. I think that would fix the problem that our engineers run into and also the problem that Linus experienced during the merge, wouldn't it? Thanks, Lars ^ permalink raw reply [flat|nested] 29+ messages in thread
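Lars's proposed LRU list might look roughly like the following sketch (hypothetical Python for illustration only, not actual Git code; the class and method names are invented):

```python
import hashlib
import os
from collections import OrderedDict

class MtimeCache:
    """Sketch of the proposed LRU: remember (path, content hash) -> mtime
    for files a checkout rewrites, and restore the old mtime when the
    same content reappears at the same path."""

    def __init__(self, max_entries=10000):
        self.entries = OrderedDict()  # (path, sha1) -> mtime
        self.max_entries = max_entries

    @staticmethod
    def _hash(path):
        with open(path, 'rb') as f:
            return hashlib.sha1(f.read()).hexdigest()

    def note_overwrite(self, path):
        """Call just before a checkout removes or rewrites `path`."""
        key = (path, self._hash(path))
        self.entries[key] = os.stat(path).st_mtime
        self.entries.move_to_end(key)
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict least recently used

    def maybe_restore(self, path):
        """Call after a checkout writes `path`; restore mtime on a hit."""
        key = (path, self._hash(path))
        old_mtime = self.entries.get(key)
        if old_mtime is not None:
            os.utime(path, (old_mtime, old_mtime))
            return True
        return False
```

On a checkout, Git would call note_overwrite() before replacing each file and maybe_restore() after writing it; note that this only helps as long as no build ran between the two checkouts, which is exactly the objection raised later in the thread.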
* Re: Optimizing writes to unchanged files during merges? 2018-04-16 16:07 ` Lars Schneider @ 2018-04-16 17:04 ` Ævar Arnfjörð Bjarmason 2018-04-17 17:23 ` Lars Schneider 2018-04-16 17:43 ` Jacob Keller ` (2 subsequent siblings) 3 siblings, 1 reply; 29+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2018-04-16 17:04 UTC (permalink / raw) To: Lars Schneider Cc: Linus Torvalds, Junio C Hamano, Elijah Newren, Git Mailing List, mgorny, rtc, winserver.support, tytso On Mon, Apr 16 2018, Lars Schneider wrote: >> On 16 Apr 2018, at 04:03, Linus Torvalds <torvalds@linux-foundation.org> wrote: >> >> On Sun, Apr 15, 2018 at 6:44 PM, Junio C Hamano <gitster@pobox.com> wrote: >>> >>> I think Elijah's corrected was_tracked() also does not care "has >>> this been renamed". >> >> I'm perfectly happy with the slightly smarter patches. My patch was >> really just an RFC and because I had tried it out. >> >>> One thing that makes me curious is what happens (and what we want to >>> happen) when such a "we already have the changes the side branch >>> tries to bring in" path has local (i.e. not yet in the index) >>> changes. For a dirty file that trivially merges (e.g. a path we >>> modified since our histories forked, while the other side didn't do >>> anything, has local changes in the working tree), we try hard to >>> make the merge succeed while keeping the local changes, and we >>> should be able to do the same in this case, too. >> >> I think it might be nice, but probably not really worth it. >> >> I find the "you can merge even if some files are dirty" to be really >> convenient, because I often keep stupid test patches in my tree that I >> may not even intend to commit, and I then use the same tree for >> merging. >> >> For example, I sometimes end up editing the Makefile for the release >> version early, but I won't *commit* that until I actually cut the >> release. 
But if I pull some branch that has also changed the Makefile, >> it's not worth any complexity to try to be nice about the dirty state. >> >> If it's a file that actually *has* been changed in the branch I'm >> merging, and I'm more than happy to just stage the patch (or throw it >> away - I think it's about 50:50 for me). >> >> So I don't think it's a big deal, and I'd rather have the merge fail >> very early with "that file has seen changes in the branch you are >> merging" than add any real complexity to the merge logic. > > I am happy to see this discussion and the patches, because long rebuilds > are a constant annoyance for us. We might have been bitten by the exact > case discussed here, but more often, we have a slightly different > situation: > > An engineer works on a task branch and runs incremental builds — all > is good. The engineer switches to another branch to review another > engineer's work. This other branch changes a low-level header file, > but no rebuild is triggered. The engineer switches back to the previous > task branch. At this point, the incremental build will rebuild > everything, as the compiler thinks that the low-level header file has > been changed (because the mtime is different). > > Of course, this problem can be solved with a separate worktree. However, > our engineers forget about that sometimes, and then, they are annoyed by > a 4h rebuild. > > Is this situation a problem for others too? > If yes, what do you think about the following approach: > > What if Git kept a LRU list that contains file path, content hash, and > mtime of any file that is removed or modified during a checkout. If a > file is checked out later with the exact same path and content hash, > then Git could set the mtime to the previous value. This way the > compiler would not think that the content has been changed since the > last rebuild. 
> > I think that would fix the problem that our engineers run into and also > the problem that Linus experienced during the merge, wouldn't it? Could what you're describing be prototyped as a post-checkout hook that looks at the reflog? It sounds to me like it could, but perhaps I've missed some subtlety. Not re-writing out a file that hasn't changed is one thing, but I think for more complex behaviors (such as the "I want everything to have the same mtime" mentioned in another thread on-list), and this, it makes sense to provide some hook mechanism rather than have git itself do all the work. I also don't see how what you're describing could be generalized, or even be made to work reliably in the case you're describing. If the engineer runs "make" on the branch he's testing out, that might produce an object file that'll get used as-is once he switches back, since you've set the mtime in the past for that file when you re-checked it out. ^ permalink raw reply [flat|nested] 29+ messages in thread
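The post-checkout-hook prototype Ævar suggests could be sketched along these lines (hypothetical: the snapshot file name and helper functions are invented, and main() would need to be wired up as an executable .git/hooks/post-checkout script; Git passes <old-head> <new-head> <branch-flag> to that hook):

```python
#!/usr/bin/env python3
import hashlib
import json
import os
import subprocess

def tracked_files():
    # Ask git for the tracked paths in the current checkout.
    out = subprocess.check_output(['git', 'ls-files', '-z'])
    return [p.decode() for p in out.split(b'\0') if p]

def snapshot(paths):
    """Map path -> (content hash, mtime) for every existing path."""
    snap = {}
    for path in paths:
        if os.path.isfile(path):
            with open(path, 'rb') as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            snap[path] = (digest, os.stat(path).st_mtime)
    return snap

def restore_unchanged(old_snap, paths):
    """Reset mtimes for paths whose content hash matches the old snapshot."""
    restored = 0
    for path, (digest, mtime) in snapshot(paths).items():
        old = old_snap.get(path)
        if old and old[0] == digest and old[1] != mtime:
            os.utime(path, (old[1], old[1]))
            restored += 1
    return restored

def main():
    # Hypothetical state file; not something git itself maintains.
    state = os.path.join('.git', 'mtime-snapshot.json')
    paths = tracked_files()
    if os.path.exists(state):
        with open(state) as f:
            old_snap = {p: tuple(v) for p, v in json.load(f).items()}
        restore_unchanged(old_snap, paths)
    with open(state, 'w') as f:
        json.dump(snapshot(paths), f)

# main() would run on every checkout once installed as the hook.
```

As Ævar notes, this still cannot tell whether a build ran while the other branch was checked out, which is the failure mode described in the last paragraph above.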
* Re: Optimizing writes to unchanged files during merges? 2018-04-16 17:04 ` Ævar Arnfjörð Bjarmason @ 2018-04-17 17:23 ` Lars Schneider 0 siblings, 0 replies; 29+ messages in thread From: Lars Schneider @ 2018-04-17 17:23 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Linus Torvalds, Junio C Hamano, Elijah Newren, Git Mailing List, mgorny, rtc, winserver.support, tytso > On 16 Apr 2018, at 19:04, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > > > On Mon, Apr 16 2018, Lars Schneider wrote: > >>> On 16 Apr 2018, at 04:03, Linus Torvalds <torvalds@linux-foundation.org> wrote: >>> >>> On Sun, Apr 15, 2018 at 6:44 PM, Junio C Hamano <gitster@pobox.com> wrote: >>>> >>>> I think Elijah's corrected was_tracked() also does not care "has >>>> this been renamed". >>> >>> I'm perfectly happy with the slightly smarter patches. My patch was >>> really just an RFC and because I had tried it out. >>> >>>> One thing that makes me curious is what happens (and what we want to >>>> happen) when such a "we already have the changes the side branch >>>> tries to bring in" path has local (i.e. not yet in the index) >>>> changes. For a dirty file that trivially merges (e.g. a path we >>>> modified since our histories forked, while the other side didn't do >>>> anything, has local changes in the working tree), we try hard to >>>> make the merge succeed while keeping the local changes, and we >>>> should be able to do the same in this case, too. >>> >>> I think it might be nice, but probably not really worth it. >>> >>> I find the "you can merge even if some files are dirty" to be really >>> convenient, because I often keep stupid test patches in my tree that I >>> may not even intend to commit, and I then use the same tree for >>> merging. >>> >>> For example, I sometimes end up editing the Makefile for the release >>> version early, but I won't *commit* that until I actually cut the >>> release. 
But if I pull some branch that has also changed the Makefile, >>> it's not worth any complexity to try to be nice about the dirty state. >>> >>> If it's a file that actually *has* been changed in the branch I'm >>> merging, and I'm more than happy to just stage the patch (or throw it >>> away - I think it's about 50:50 for me). >>> >>> So I don't think it's a big deal, and I'd rather have the merge fail >>> very early with "that file has seen changes in the branch you are >>> merging" than add any real complexity to the merge logic. >> >> I am happy to see this discussion and the patches, because long rebuilds >> are a constant annoyance for us. We might have been bitten by the exact >> case discussed here, but more often, we have a slightly different >> situation: >> >> An engineer works on a task branch and runs incremental builds — all >> is good. The engineer switches to another branch to review another >> engineer's work. This other branch changes a low-level header file, >> but no rebuild is triggered. The engineer switches back to the previous >> task branch. At this point, the incremental build will rebuild >> everything, as the compiler thinks that the low-level header file has >> been changed (because the mtime is different). >> >> Of course, this problem can be solved with a separate worktree. However, >> our engineers forget about that sometimes, and then, they are annoyed by >> a 4h rebuild. >> >> Is this situation a problem for others too? >> If yes, what do you think about the following approach: >> >> What if Git kept a LRU list that contains file path, content hash, and >> mtime of any file that is removed or modified during a checkout. If a >> file is checked out later with the exact same path and content hash, >> then Git could set the mtime to the previous value. This way the >> compiler would not think that the content has been changed since the >> last rebuild. 
>> >> I think that would fix the problem that our engineers run into and also >> the problem that Linus experienced during the merge, wouldn't it? > > Could what you're describing be prototyped as a post-checkout hook that > looks at the reflog? It sounds to me like it could, but perhaps I've > missed some subtlety. Yeah, probably. You don't even need the reflog I think. I just wanted to get a sense if other people run into this problem too. > Not re-writing out a file that hasn't changed is one thing, but I think > for more complex behaviors (such as the "I want everything to have the > same mtime" mentioned in another thread on-list), and this, it makes > sense to provide some hook mechanism than have git itself do all the > work. > > I also don't see how what you're describing could be generalized, or > even be made to work reliably in the case you're describing. If the > engineer runs "make" on this branch he's testing out that might produce > an object file that'll get used as-is once he switches back, since > you've set the mtime in the past for that file because you re-checked it > out. Ohh... you're right. I thought Visual Studio looks *just* at ctime/mtime of the files. But this seems not to be true [1]: "MSBuild to build it quickly checks if any of a project’s input files are modified later than any of the project’s outputs" In that case my idea outlined above is garbage. Thanks, Lars [1] https://blogs.msdn.microsoft.com/kirillosenkov/2014/08/04/how-to-investigate-rebuilding-in-visual-studio-when-nothing-has-changed/ ^ permalink raw reply [flat|nested] 29+ messages in thread
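The MSBuild up-to-date rule Lars quotes (rebuild if any input is modified later than any output) amounts to comparing the newest input mtime against the oldest output mtime. A minimal sketch (a hypothetical helper, not MSBuild code):

```python
import os

def needs_rebuild(inputs, outputs):
    """Rebuild if any output is missing, or if any input is modified
    later than any output (newest input vs. oldest output)."""
    if not all(os.path.exists(o) for o in outputs):
        return True
    newest_input = max(os.stat(i).st_mtime for i in inputs)
    oldest_output = min(os.stat(o).st_mtime for o in outputs)
    return newest_input > oldest_output
```

This is why restoring an *older* mtime on a checked-out file is risky: it can make a stale output look newer than its input, so no rebuild is triggered.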
* Re: Optimizing writes to unchanged files during merges? 2018-04-16 16:07 ` Lars Schneider 2018-04-16 17:04 ` Ævar Arnfjörð Bjarmason @ 2018-04-16 17:43 ` Jacob Keller 2018-04-16 17:45 ` Jacob Keller 2018-04-16 17:47 ` Phillip Wood 2018-04-16 20:09 ` Stefan Haller 3 siblings, 1 reply; 29+ messages in thread From: Jacob Keller @ 2018-04-16 17:43 UTC (permalink / raw) To: Lars Schneider Cc: Linus Torvalds, Junio C Hamano, Elijah Newren, Git Mailing List, mgorny, Peter Backes, winserver.support, Theodore Ts'o On Mon, Apr 16, 2018 at 9:07 AM, Lars Schneider <larsxschneider@gmail.com> wrote: > >> On 16 Apr 2018, at 04:03, Linus Torvalds <torvalds@linux-foundation.org> wrote: >> >> On Sun, Apr 15, 2018 at 6:44 PM, Junio C Hamano <gitster@pobox.com> wrote: >>> >>> I think Elijah's corrected was_tracked() also does not care "has >>> this been renamed". >> >> I'm perfectly happy with the slightly smarter patches. My patch was >> really just an RFC and because I had tried it out. >> >>> One thing that makes me curious is what happens (and what we want to >>> happen) when such a "we already have the changes the side branch >>> tries to bring in" path has local (i.e. not yet in the index) >>> changes. For a dirty file that trivially merges (e.g. a path we >>> modified since our histories forked, while the other side didn't do >>> anything, has local changes in the working tree), we try hard to >>> make the merge succeed while keeping the local changes, and we >>> should be able to do the same in this case, too. >> >> I think it might be nice, but probably not really worth it. >> >> I find the "you can merge even if some files are dirty" to be really >> convenient, because I often keep stupid test patches in my tree that I >> may not even intend to commit, and I then use the same tree for >> merging. >> >> For example, I sometimes end up editing the Makefile for the release >> version early, but I won't *commit* that until I actually cut the >> release. 
But if I pull some branch that has also changed the Makefile, >> it's not worth any complexity to try to be nice about the dirty state. >> >> If it's a file that actually *has* been changed in the branch I'm >> merging, and I'm more than happy to just stage the patch (or throw it >> away - I think it's about 50:50 for me). >> >> So I don't think it's a big deal, and I'd rather have the merge fail >> very early with "that file has seen changes in the branch you are >> merging" than add any real complexity to the merge logic. > > I am happy to see this discussion and the patches, because long rebuilds > are a constant annoyance for us. We might have been bitten by the exact > case discussed here, but more often, we have a slightly different > situation: > > An engineer works on a task branch and runs incremental builds — all > is good. The engineer switches to another branch to review another > engineer's work. This other branch changes a low-level header file, > but no rebuild is triggered. The engineer switches back to the previous > task branch. At this point, the incremental build will rebuild > everything, as the compiler thinks that the low-level header file has > been changed (because the mtime is different). > > Of course, this problem can be solved with a separate worktree. However, > our engineers forget about that sometimes, and then, they are annoyed by > a 4h rebuild. > > Is this situation a problem for others too? > If yes, what do you think about the following approach: > > What if Git kept a LRU list that contains file path, content hash, and > mtime of any file that is removed or modified during a checkout. If a > file is checked out later with the exact same path and content hash, > then Git could set the mtime to the previous value. This way the > compiler would not think that the content has been changed since the > last rebuild. 
That would only work until they actually *did* a build on the second branch, and upon changing back, how would this detect that it needs to update mtime again? I don't think this solution really works. Ultimately, the problem is that the build tool relies on the mtime to determine what to rebuild. I think this would cause worse problems because we *wouldn't* rebuild in that case. How is git supposed to know that we rebuilt when switching branches or not? Thanks, Jake > > I think that would fix the problem that our engineers run into and also > the problem that Linus experienced during the merge, wouldn't it? > > Thanks, > Lars ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Optimizing writes to unchanged files during merges? 2018-04-16 17:43 ` Jacob Keller @ 2018-04-16 17:45 ` Jacob Keller 2018-04-16 22:34 ` Junio C Hamano 2018-04-17 17:27 ` Lars Schneider 0 siblings, 2 replies; 29+ messages in thread From: Jacob Keller @ 2018-04-16 17:45 UTC (permalink / raw) To: Lars Schneider Cc: Linus Torvalds, Junio C Hamano, Elijah Newren, Git Mailing List, mgorny, Peter Backes, winserver.support, Theodore Ts'o On Mon, Apr 16, 2018 at 10:43 AM, Jacob Keller <jacob.keller@gmail.com> wrote: > On Mon, Apr 16, 2018 at 9:07 AM, Lars Schneider > <larsxschneider@gmail.com> wrote: >> What if Git kept a LRU list that contains file path, content hash, and >> mtime of any file that is removed or modified during a checkout. If a >> file is checked out later with the exact same path and content hash, >> then Git could set the mtime to the previous value. This way the >> compiler would not think that the content has been changed since the >> last rebuild. > > That would only work until they actuall *did* a build on the second > branch, and upon changing back, how would this detect that it needs to > update mtime again? I don't think this solution really works. > Ultimately, the problem is that the build tool relies on the mtime to > determine what to rebuild. I think this would cause worse problems > because we *wouldn't* rebuild in the case. How is git supposed to know > that we rebuilt when switching branches or not? > > Thanks, > Jake I think a better solution for your problem would be to extend the build system you're using to avoid rebuilding when the contents haven't changed since last build (possibly by using hashes?). At the very least, I would not want this to be default, as it could possibly result in *no* build when there should be one, which is far more confusing to debug. Thanks, Jake ^ permalink raw reply [flat|nested] 29+ messages in thread
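The content-hash-based up-to-date check Jacob suggests could look roughly like this (a hypothetical sketch, not any particular build system's implementation; the manifest file layout and helper names are invented):

```python
import hashlib
import json
import os

def content_hash(path):
    with open(path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()

def needs_rebuild(inputs, output, manifest_path):
    """Rebuild only if an input's *content* changed since the manifest
    was written, ignoring mtimes entirely."""
    if not os.path.exists(output) or not os.path.exists(manifest_path):
        return True
    with open(manifest_path) as f:
        manifest = json.load(f)
    return any(manifest.get(i) != content_hash(i) for i in inputs)

def record_build(inputs, manifest_path):
    """Call after a successful build to remember the input hashes."""
    with open(manifest_path, 'w') as f:
        json.dump({i: content_hash(i) for i in inputs}, f)
```

Because the decision is keyed on content rather than timestamps, branch switches that rewrite files with identical contents no longer trigger rebuilds, and the failure mode of silently *skipping* a needed rebuild is confined to the build tool's own bookkeeping.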
* Re: Optimizing writes to unchanged files during merges? 2018-04-16 17:45 ` Jacob Keller @ 2018-04-16 22:34 ` Junio C Hamano 2018-04-17 17:27 ` Lars Schneider 1 sibling, 0 replies; 29+ messages in thread From: Junio C Hamano @ 2018-04-16 22:34 UTC (permalink / raw) To: Jacob Keller Cc: Lars Schneider, Linus Torvalds, Elijah Newren, Git Mailing List, mgorny, Peter Backes, winserver.support, Theodore Ts'o Jacob Keller <jacob.keller@gmail.com> writes: > On Mon, Apr 16, 2018 at 10:43 AM, Jacob Keller <jacob.keller@gmail.com> wrote: > ... > I think a better solution for your problem would be to extend the > build system you're using to avoid rebuilding when the contents > haven't changed since last build (possibly by using hashes?). At the > very least, I would not want this to be default, as it could possibly > result in *no* build when there should be one, which is far more > confusing to debug. Yup. Even though I take many optional features that I do not see the need to support on the core side and I do not plan to use personally, as long as the implementation is cleanly separated to make it clear that those who do not use them won't be negatively affected, I would not want to have this in a system I maintain and get blamed by those who get burned by it. Something like ccache, not a version control system, is meant to serve this audience. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Optimizing writes to unchanged files during merges? 2018-04-16 17:45 ` Jacob Keller 2018-04-16 22:34 ` Junio C Hamano @ 2018-04-17 17:27 ` Lars Schneider 2018-04-17 17:43 ` Jacob Keller 1 sibling, 1 reply; 29+ messages in thread From: Lars Schneider @ 2018-04-17 17:27 UTC (permalink / raw) To: Jacob Keller Cc: Linus Torvalds, Junio C Hamano, Elijah Newren, Git Mailing List, mgorny, Peter Backes, winserver.support, Theodore Ts'o > On 16 Apr 2018, at 19:45, Jacob Keller <jacob.keller@gmail.com> wrote: > > On Mon, Apr 16, 2018 at 10:43 AM, Jacob Keller <jacob.keller@gmail.com> wrote: >> On Mon, Apr 16, 2018 at 9:07 AM, Lars Schneider >> <larsxschneider@gmail.com> wrote: >>> What if Git kept a LRU list that contains file path, content hash, and >>> mtime of any file that is removed or modified during a checkout. If a >>> file is checked out later with the exact same path and content hash, >>> then Git could set the mtime to the previous value. This way the >>> compiler would not think that the content has been changed since the >>> last rebuild. >> >> That would only work until they actuall *did* a build on the second >> branch, and upon changing back, how would this detect that it needs to >> update mtime again? I don't think this solution really works. >> Ultimately, the problem is that the build tool relies on the mtime to >> determine what to rebuild. I think this would cause worse problems >> because we *wouldn't* rebuild in the case. How is git supposed to know >> that we rebuilt when switching branches or not? >> >> Thanks, >> Jake > > I think a better solution for your problem would be to extend the > build system you're using to avoid rebuilding when the contents > haven't changed since last build (possibly by using hashes?). At the > very least, I would not want this to be default, as it could possibly > result in *no* build when there should be one, which is far more > confusing to debug. I am 100% with you that this is a build system issue. 
But changing the build system for many teams in a large organization is really hard. That's why I wondered if Git could help with a shortcut. Looks like there is no shortcut (see my other reply in this thread). Thanks Lars ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Optimizing writes to unchanged files during merges? 2018-04-17 17:27 ` Lars Schneider @ 2018-04-17 17:43 ` Jacob Keller 0 siblings, 0 replies; 29+ messages in thread From: Jacob Keller @ 2018-04-17 17:43 UTC (permalink / raw) To: Lars Schneider Cc: Linus Torvalds, Junio C Hamano, Elijah Newren, Git Mailing List, Michał Górny, Peter Backes, winserver.support, Theodore Ts'o On Tue, Apr 17, 2018 at 10:27 AM, Lars Schneider <larsxschneider@gmail.com> wrote: > >> On 16 Apr 2018, at 19:45, Jacob Keller <jacob.keller@gmail.com> wrote: >> >> On Mon, Apr 16, 2018 at 10:43 AM, Jacob Keller <jacob.keller@gmail.com> wrote: >>> On Mon, Apr 16, 2018 at 9:07 AM, Lars Schneider >>> <larsxschneider@gmail.com> wrote: >>>> What if Git kept a LRU list that contains file path, content hash, and >>>> mtime of any file that is removed or modified during a checkout. If a >>>> file is checked out later with the exact same path and content hash, >>>> then Git could set the mtime to the previous value. This way the >>>> compiler would not think that the content has been changed since the >>>> last rebuild. >>> >>> That would only work until they actuall *did* a build on the second >>> branch, and upon changing back, how would this detect that it needs to >>> update mtime again? I don't think this solution really works. >>> Ultimately, the problem is that the build tool relies on the mtime to >>> determine what to rebuild. I think this would cause worse problems >>> because we *wouldn't* rebuild in the case. How is git supposed to know >>> that we rebuilt when switching branches or not? >>> >>> Thanks, >>> Jake >> >> I think a better solution for your problem would be to extend the >> build system you're using to avoid rebuilding when the contents >> haven't changed since last build (possibly by using hashes?). At the >> very least, I would not want this to be default, as it could possibly >> result in *no* build when there should be one, which is far more >> confusing to debug. 
> > I am 100% with you that this is a build system issue. But changing > the build system for many teams in a large organization is really > hard. That's why I wondered if Git could help with a shortcut. > Looks like there is no shortcut (see my other reply in this thread). > > Thanks > Lars Right. I think that solutions involving hooks or scripts which "fix" the mtimes are the best bet for this problem then, given that building it into git would cause problems for other users. (And personally I would always err on the side of causing rebuilds unless we're 100% sure) Thanks, Jake ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Optimizing writes to unchanged files during merges? 2018-04-16 16:07 ` Lars Schneider 2018-04-16 17:04 ` Ævar Arnfjörð Bjarmason 2018-04-16 17:43 ` Jacob Keller @ 2018-04-16 17:47 ` Phillip Wood 2018-04-16 20:09 ` Stefan Haller 3 siblings, 0 replies; 29+ messages in thread From: Phillip Wood @ 2018-04-16 17:47 UTC (permalink / raw) To: Lars Schneider, Linus Torvalds Cc: Junio C Hamano, Elijah Newren, Git Mailing List, mgorny, rtc, winserver.support, tytso On 16/04/18 17:07, Lars Schneider wrote: > > I am happy to see this discussion and the patches, because long rebuilds > are a constant annoyance for us. We might have been bitten by the exact > case discussed here, but more often, we have a slightly different > situation: > > An engineer works on a task branch and runs incremental builds — all > is good. The engineer switches to another branch to review another > engineer's work. This other branch changes a low-level header file, > but no rebuild is triggered. The engineer switches back to the previous > task branch. At this point, the incremental build will rebuild > everything, as the compiler thinks that the low-level header file has > been changed (because the mtime is different). > > Of course, this problem can be solved with a separate worktree. However, > our engineers forget about that sometimes, and then, they are annoyed by > a 4h rebuild. > > Is this situation a problem for others too? > If yes, what do you think about the following approach: > > What if Git kept a LRU list that contains file path, content hash, and > mtime of any file that is removed or modified during a checkout. If a > file is checked out later with the exact same path and content hash, > then Git could set the mtime to the previous value. This way the > compiler would not think that the content has been changed since the > last rebuild. Hi Lars. But if there has been a rebuild between the checkouts then you want the compiler to rebuild.
I've been using the script below recently to save and restore mtimes around running rebase to squash fixup commits. To avoid restoring the mtimes if there has been a rebuild since they were stored, it takes a list of build sentinels and stores their mtimes too - if any of those change then it will refuse to restore the original mtimes of the tracked files (if you give a path that does not exist when the mtimes are stored then it will refuse to restore the mtimes if that path exists when you run 'git mtimes restore'). The sentinels can be specified on the command line when running 'git mtimes save' or stored in multiple mtimes.sentinel config keys.

Best Wishes

Phillip

--->8---
#!/usr/bin/perl
# Copyright (C) 2018 Phillip Wood <phillip.wood@dunelm.org.uk>
#
# git-mtimes.perl
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, see <http://www.gnu.org/licenses/>.

use 5.008;
use strict;
use warnings;
use File::Copy (qw(copy));
use File::Spec::Functions (qw(abs2rel catfile file_name_is_absolute rel2abs));
use Storable ();

sub git;

my $GIT_DIR = git(qw(rev-parse --git-dir)) or exit 1;
$GIT_DIR = rel2abs($GIT_DIR);
my $mtimes_path = "$GIT_DIR/mtimes";

sub git {
    my @lines;
    # in a scalar context slurp removing any trailing $/
    # in an array context return a list of lines
    {
        local $/ = wantarray ? $/ : undef;
        local $,=' ';
        open my $fh, '-|', 'git', @_ or die "git @_ failed $!";
        @lines = <$fh>;
        chomp @lines;
        unless (close $fh) {
            $? == -1 and die "git @_ not found";
            my $code = $? >> 8;
            $_[0] eq 'config' and $code == 1
                or die "git @_ failed with exit code $code"
        }
    }
    wantarray and return @lines;
    @lines and chomp @lines;
    return $lines[0];
}

sub ls_files {
    # mode, uid, gid, mtime and maybe atime
    my @stat_indices = $_[0] ? (2, 4, 5, 9, 8) : (2, 4, 5, 9);
    local $_;
    local $/ = "\0";
    my @files;
    for (git(qw(ls-files --stage -z))) {
        if (/^[^ ]+ ([^\t]+) 0\t(.*)/) {
            my @stat = stat($2);
            # store name, hash, mode, uid, gid, mtime and maybe atime
            push @files, [ $2, $1, @stat[@stat_indices] ];
        }
    }
    return @files;
}

sub get_config {
    local $/ = "\0";
    my $get = wantarray ? '--get-all' : '--get';
    git(qw(config -z), $get, @_);
}

sub save {
    local $_;
    my @sentinels = get_config('mtimes.sentinel');
    push @sentinels, @ARGV;
    @sentinels or die "No sentinels given";
    @sentinels = map { [ $_, [ stat $_ ] ] } @sentinels;
    my @files = ls_files();
    Storable::nstore( [ [ @sentinels ] , [ @files ] ], $mtimes_path)
        or die "unable to store mtimes $!";
}

sub match_sentinel_data {
    local $_;
    my ($old, $new, $trustctime) = @_;
    if (!@$old) {
        return (@$new) ? undef : 1;
    } else {
        @$new or return undef;
    }
    # Skip hardlink count, atime, blksize
    for (0..2,4..7,9,10,12) {
        next if ($_ == 10 and ! $trustctime);
        $old->[$_] == $new->[$_] or return undef;
    }
    return 1;
}

sub needs_update {
    local $_;
    my ($old, $new) = @_;
    for (0..1) {
        $old->[$_] eq $new->[$_] or return undef;
    }
    for (2..4) {
        $old->[$_] == $new->[$_] or return undef;
    }
    $old->[5] != $new->[5];
}

sub restore {
    local $_;
    my $stored = Storable::retrieve($mtimes_path)
        or die "unable to load stored data";
    my $trustctime = get_config('--bool', 'core.trustctime');
    $trustctime = defined($trustctime) ? $trustctime eq 'true' : 1;
    my ($sentinels, $oldfiles) = @$stored;
    for (@$sentinels) {
        match_sentinel_data( [ stat($_->[0]) ], $_->[1], $trustctime)
            or die "Unable to restore mtimes, stat data for sentinel '$_->[0]' does not match";
    }
    my @newfiles = ls_files(1);
    my ($i, $restored) = (0, 0);
    for (@$oldfiles) {
        while ($newfiles[$i]->[0] lt $_->[0] and $i < @newfiles) {
            $i++;
        }
        if (needs_update($_, $newfiles[$i])) {
            utime($newfiles[$i]->[6], $_->[5], $_->[0]);
            $restored = 1;
        }
    }
    if ($restored) {
        print "restored mtimes\n";
    }
}

my $cmd = shift;

# Keep relative paths relative in case repository directory is renamed
# between saving and restoring mtimes.
if ($ENV{GIT_PREFIX}) {
    @ARGV = map {
        file_name_is_absolute($_) ? $_ : catfile($ENV{GIT_PREFIX}, $_);
    } @ARGV;
}
my $up = git(qw(rev-parse --show-cdup));
if ($up) {
    @ARGV = map {
        file_name_is_absolute($_) ? $_ : abs2rel(rel2abs($_), $up);
    } @ARGV;
    chdir $up;
}

my $tmp_index = catfile($GIT_DIR, "mtimes-index");
my $src_index = $ENV{GIT_INDEX_FILE} ? $ENV{GIT_INDEX_FILE}
                                     : catfile($GIT_DIR, "index");
copy($src_index, $tmp_index)
    or die "cannot create temporary index '$tmp_index'\n";
$ENV{GIT_INDEX_FILE} = $tmp_index;
git(qw(add -u));

if ($cmd eq 'save') {
    save();
} elsif ($cmd eq 'restore' and ! @ARGV) {
    restore();
} else {
    print STDERR "usage: git mtimes <save [sentinels ...] | restore>\n";
}

END { unlink $tmp_index; }

> I think that would fix the problem that our engineers run into and also
> the problem that Linus experienced during the merge, wouldn't it?
>
> Thanks,
> Lars
>

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Optimizing writes to unchanged files during merges?
  2018-04-16 16:07 ` Lars Schneider
  ` (2 preceding siblings ...)
  2018-04-16 17:47 ` Phillip Wood
@ 2018-04-16 20:09 ` Stefan Haller
  3 siblings, 0 replies; 29+ messages in thread

From: Stefan Haller @ 2018-04-16 20:09 UTC (permalink / raw)
To: Lars Schneider; +Cc: Git Mailing List

Lars Schneider <larsxschneider@gmail.com> wrote:

> An engineer works on a task branch and runs incremental builds — all
> is good. The engineer switches to another branch to review another
> engineer's work. This other branch changes a low-level header file,
> but no rebuild is triggered. The engineer switches back to the previous
> task branch. At this point, the incremental build will rebuild
> everything, as the compiler thinks that the low-level header file has
> been changed (because the mtime is different).

If you are using C or C++, look into ccache. We're using it to work
around this problem, and it's a near-perfect solution, I'd say.

^ permalink raw reply	[flat|nested] 29+ messages in thread
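[Editor's note: the ccache setup Stefan alludes to can be wired in roughly as follows. This is a hedged sketch, not part of the thread; it assumes ccache and a gcc toolchain are installed. ccache hashes the preprocessed compiler input, so a header whose mtime moved but whose content did not becomes a cache hit rather than a recompile.]

```shell
# Two common ways to put ccache in front of the compiler.

# 1. Override the compiler variables for this shell / build:
export CC="ccache gcc"
export CXX="ccache g++"

# 2. Or use masquerading symlinks so unmodified Makefiles pick it up
#    (assumes a directory early in $PATH, here ~/bin):
# ln -s "$(command -v ccache)" ~/bin/gcc
# ln -s "$(command -v ccache)" ~/bin/g++
```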
* Re: Optimizing writes to unchanged files during merges?
  2018-04-16  2:03 ` Linus Torvalds
  2018-04-16 16:07 ` Lars Schneider
@ 2018-04-16 22:55 ` Elijah Newren
  1 sibling, 0 replies; 29+ messages in thread

From: Elijah Newren @ 2018-04-16 22:55 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

On Sun, Apr 15, 2018 at 7:03 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Sun, Apr 15, 2018 at 6:44 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> One thing that makes me curious is what happens (and what we want to
>> happen) when such a "we already have the changes the side branch
>> tries to bring in" path has local (i.e. not yet in the index)
>> changes.  For a dirty file that trivially merges (e.g. a path we
>> modified since our histories forked, while the other side didn't do
>> anything, has local changes in the working tree), we try hard to
>> make the merge succeed while keeping the local changes, and we
>> should be able to do the same in this case, too.
>
> I think it might be nice, but probably not really worth it.

<snip>

> So I don't think it's a big deal, and I'd rather have the merge fail
> very early with "that file has seen changes in the branch you are
> merging" than add any real complexity to the merge logic.

That's actually problematic, and not yet possible with the current
merge-recursive code.  The fail-very-early code is in unpack_trees(),
and it doesn't understand renames.  By the time we get to the rest of
the logic of merge-recursive, unpack_trees() has already written
updates to lots of files throughout the tree, so bailing and leaving
the tree in a half-merged state is no longer an option.  (The logic in
merge-recursive.c is excessively complex in part due to this issue,
making it implement what amounts to a four-way merge instead of just a
three-way merge.  It's gross.)

So, if we were to use the brute-force scheme here, when renames are
involved, we'd need to have some special logic for handling dirty
files.  That logic would probably include checking the original index
for both modification times (to determine if the file is dirty), and
for comparison of contents.  In short, we'd still need all the logic
that this scheme was trying to get rid of in the first place.

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: Optimizing writes to unchanged files during merges?
  2018-04-16  1:44 ` Junio C Hamano
  2018-04-16  2:03 ` Linus Torvalds
@ 2018-04-16 23:03 ` Elijah Newren
  1 sibling, 0 replies; 29+ messages in thread

From: Elijah Newren @ 2018-04-16 23:03 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Linus Torvalds, Git Mailing List

On Sun, Apr 15, 2018 at 6:44 PM, Junio C Hamano <gitster@pobox.com> wrote:
> One thing that makes me curious is what happens (and what we want to
> happen) when such a "we already have the changes the side branch
> tries to bring in" path has local (i.e. not yet in the index)
> changes.  For a dirty file that trivially merges (e.g. a path we
> modified since our histories forked, while the other side didn't do
> anything, has local changes in the working tree), we try hard to
> make the merge succeed while keeping the local changes, and we
> should be able to do the same in this case, too.
>
> Your illustration patch that reads from the working tree to compare
> with the contents we are about to write out of course won't cope
> with this case ;-).  If we do things in the index like the approach
> to fix was_tracked(), I suspect that we would just say "as long as
> the index and the HEAD matches, i.e. we are keeping the path as it is
> found in HEAD as the merge result, we do not touch the working tree"
> and leave such a local modification alone.

I could see it going either way (fail early, or succeed because we
don't need to overwrite anything), but my suspicion is also towards
letting the merge succeed.

It looks like my patches need a little more fixing up (the was_dirty()
calls would be mixing both the old and the new index -- it used the
original index via was_tracked(), but used the current index via
cache_file_exists(); it should be more consistent).  I'm fixing that
up, will add this as another testcase to t6046, and will submit a
newer RFC in a day or so.

^ permalink raw reply	[flat|nested] 29+ messages in thread
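[Editor's note: the condition Junio describes — "as long as the index and HEAD match for the path, don't touch the working tree" — can be probed from the command line with git plumbing. This is an illustrative sketch, not merge-recursive's internal code path; `git diff-index --cached` compares only the index against HEAD, ignoring working-tree (dirty) edits.]

```shell
set -e
# Build a throw-away repo whose working tree is dirty for 'file' while
# the index entry for 'file' still matches HEAD -- the case discussed.
cd "$(mktemp -d)"
git init -q
git config user.email demo@example.com
git config user.name  demo
echo 'nine' > file
git add file
git commit -qm base
echo 'local edit' > file    # dirty working tree; index == HEAD

# --cached compares index vs HEAD only, so the local edit is invisible:
if git diff-index --cached --quiet HEAD -- file; then
  echo 'index matches HEAD: merge may skip writing, keeping local edits'
fi
```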
* Re: Optimizing writes to unchanged files during merges?
  2018-04-12 21:46 ` Junio C Hamano
  2018-04-12 23:17 ` Junio C Hamano
@ 2018-04-12 23:18 ` Linus Torvalds
  1 sibling, 0 replies; 29+ messages in thread

From: Linus Torvalds @ 2018-04-12 23:18 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git Mailing List

On Thu, Apr 12, 2018 at 2:46 PM, Junio C Hamano <gitster@pobox.com> wrote:
>
> Thanks for a clear description of the issue.  It does sound
> interesting.

I decided to show it with a simpler case that could be scripted and
doesn't need the kernel.

NOTE! This obviously doesn't happen for files with the trivial merge
cases (ie the ones that don't require any real merging at all).  There
*is* a three-way merge going on, it's just that the end result is the
same as the original file.

Example script just appended here at the end.

NOTE: The script uses "ls -l --full-time", which afaik is a GNU ls
extension.  So the script below is not portable.

                  Linus

---
  # Create throw-away test repository
  mkdir merge-test
  cd merge-test
  git init

  # Create silly baseline file with 10 lines of numbers in it
  for i in $(seq 1 10); do echo $i; done > a
  git add a
  git commit -m"Original"

  # Make a branch that changes '9' to 'nine'
  git checkout -b branch
  sed -i 's/9/nine/' a
  git commit -m "Nine" a

  # On the master, change '2' to 'two' _and_ '9' to 'nine'
  git checkout master
  sed -i 's/9/nine/' a
  sed -i 's/2/two/' a
  git commit -m "Two and nine" a

  # sleep to show the time difference
  sleep 1

  # show the date on 'a' and do the merge
  ls -l --full-time a
  git merge -m "Merge contents" branch

  # The merge didn't change the contents, but did rewrite the file
  ls -l --full-time a

^ permalink raw reply	[flat|nested] 29+ messages in thread
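[Editor's note: the fix Linus gestures at — skip the write when the merged result matches what is already on disk, so the mtime survives and `make` stays quiet — can be sketched as a tiny shell helper. Illustrative only; the real change would live in git's C code (e.g. around `update_file_flags()`), and the helper name is invented for this sketch.]

```shell
# Write stdin to $1 only when the content actually differs; an unchanged
# merge result then keeps its old mtime.
write_if_changed() {
  path=$1
  tmp=$(mktemp) || return 1
  cat >"$tmp"
  if cmp -s "$tmp" "$path" 2>/dev/null; then
    rm -f "$tmp"        # identical content: leave the mtime untouched
  else
    mv "$tmp" "$path"   # real change (or new file): replace it
  fi
}

# Demo: rewriting identical content does not bump the mtime.
cd "$(mktemp -d)"
echo 'nine' > a
touch -t 202001010000 a                # pretend 'a' is old
printf 'nine\n' | write_if_changed a   # same content: no-op
ls -l a                                # still dated 2020-01-01
```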
* Re: Optimizing writes to unchanged files during merges?
  2018-04-12 21:14 Optimizing writes to unchanged files during merges? Linus Torvalds
  2018-04-12 21:46 ` Junio C Hamano
@ 2018-04-13  0:01 ` Elijah Newren
  1 sibling, 0 replies; 29+ messages in thread

From: Elijah Newren @ 2018-04-13 0:01 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

On Thu, Apr 12, 2018 at 2:14 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> So I just had an interesting experience that has happened before too,
> but this time I decided to try to figure out *why* it happened.
>
<snip>
> include/linux/mm.h
>
> and yeah, that rather core header file causes pretty much everything
> to be re-built.
>
> Now, the reason it was marked as changed is that the xfs branch _had_
> in fact changed it, but the changes were already upstream and got
> merged away.  But the file still got written out (with the same
> contents it had before the merge), and 'make' obviously only looks at
> modification time, so make rebuilt everything.
<snip>

Wow, timing.  I've been looking at this independently -- it turns out
to be in the exact same code area involved in the breakage my recent
series caused.

The bug you are observing here will happen with current git (going
back about seven years, at which time it had different bugs) whenever
no rename is involved and the modifications on the other branch are a
subset of the changes made on HEAD's side.  My series (reverted
yesterday morning) made this particular code break in a new, totally
different way.

This is a code path that would be trivial to get right with Junio's
suggested re-design of merge-recursive[1], but the current design just
makes this a bit of a mess, thus resulting in various code changes
over the years with different breakages, each with its own curious
flavor of incorrectness.

Anyway, I'm writing a bunch of test cases, and will try to get a patch
(or patches) out soon.  Pretty busy this week, but I should definitely
have something out next week.

Elijah

[1] https://public-inbox.org/git/xmqqd147kpdm.fsf@gitster.mtv.corp.google.com/

^ permalink raw reply	[flat|nested] 29+ messages in thread
end of thread, other threads:[~2018-04-17 17:43 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-12 21:14 Optimizing writes to unchanged files during merges? Linus Torvalds
2018-04-12 21:46 ` Junio C Hamano
2018-04-12 23:17 ` Junio C Hamano
2018-04-12 23:35 ` Linus Torvalds
2018-04-12 23:41 ` Linus Torvalds
2018-04-12 23:55 ` Linus Torvalds
2018-04-13  0:01 ` Linus Torvalds
2018-04-13  7:02 ` Elijah Newren
2018-04-13 17:14 ` Linus Torvalds
2018-04-13 17:39 ` Stefan Beller
2018-04-13 17:53 ` Linus Torvalds
2018-04-13 20:04 ` Elijah Newren
2018-04-13 22:27 ` Junio C Hamano
2018-04-16  1:44 ` Junio C Hamano
2018-04-16  2:03 ` Linus Torvalds
2018-04-16 16:07 ` Lars Schneider
2018-04-16 17:04 ` Ævar Arnfjörð Bjarmason
2018-04-17 17:23 ` Lars Schneider
2018-04-16 17:43 ` Jacob Keller
2018-04-16 17:45 ` Jacob Keller
2018-04-16 22:34 ` Junio C Hamano
2018-04-17 17:27 ` Lars Schneider
2018-04-17 17:43 ` Jacob Keller
2018-04-16 17:47 ` Phillip Wood
2018-04-16 20:09 ` Stefan Haller
2018-04-16 22:55 ` Elijah Newren
2018-04-16 23:03 ` Elijah Newren
2018-04-12 23:18 ` Linus Torvalds
2018-04-13  0:01 ` Elijah Newren
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions for how to clone and
mirror all data and code used for this inbox; as well as URLs for
read-only IMAP folder(s) and NNTP newsgroup(s).