* [PATCH 0/2] line-log: avoid unnecessary full tree diffs @ 2019-08-21 11:04 SZEDER Gábor 2019-08-21 11:04 ` [PATCH 1/2] line-log: extract pathspec parsing from line ranges into a helper function SZEDER Gábor 2019-08-21 11:04 ` [PATCH 2/2] line-log: avoid unnecessary full tree diffs SZEDER Gábor 0 siblings, 2 replies; 13+ messages in thread From: SZEDER Gábor @ 2019-08-21 11:04 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Thomas Rast, Derrick Stolee, SZEDER Gábor Line-level log can be very slow with rename detection enabled (and its enabled by default), which is caused by it using the diff machinery very inefficiently. The second patch fixes the issue and make it much faster, especially in large repositories. This patch series is independent from my other patch series making line-level log incremental [1]. The two can be merged easily, as the changes to 'line-log.c' don't overlap, and the conflict in 't4211-line-log.sh' is trivial (both series add new tests at the end of that test script). [1] https://public-inbox.org/git/6a576e13-79e6-43be-c4a8-065e7a8310ea@gmail.com/T/ SZEDER Gábor (2): line-log: extract pathspec parsing from line ranges into a helper function line-log: avoid unnecessary full tree diffs line-log.c | 71 ++++++++++++++++++++++++++++----------- t/t4211-line-log.sh | 82 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 134 insertions(+), 19 deletions(-) -- 2.23.0.352.gebb2b55eae ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 1/2] line-log: extract pathspec parsing from line ranges into a helper function 2019-08-21 11:04 [PATCH 0/2] line-log: avoid unnecessary full tree diffs SZEDER Gábor @ 2019-08-21 11:04 ` SZEDER Gábor 2019-08-21 11:04 ` [PATCH 2/2] line-log: avoid unnecessary full tree diffs SZEDER Gábor 1 sibling, 0 replies; 13+ messages in thread From: SZEDER Gábor @ 2019-08-21 11:04 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Thomas Rast, Derrick Stolee, SZEDER Gábor A helper function to parse the paths involved in the line ranges and to turn them into a pathspec will be useful in the next patch. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> --- line-log.c | 32 ++++++++++++++++++-------------- 1 file changed, 18 insertions(+), 14 deletions(-) diff --git a/line-log.c b/line-log.c index 3aff1849e7..fddd91f060 100644 --- a/line-log.c +++ b/line-log.c @@ -737,6 +737,22 @@ static struct line_log_data *lookup_line_range(struct rev_info *revs, return ret; } +static void parse_pathspec_from_ranges(struct pathspec *pathspec, + struct line_log_data *range) +{ + struct line_log_data *r; + struct argv_array array = ARGV_ARRAY_INIT; + const char **paths; + + for (r = range; r; r = r->next) + argv_array_push(&array, r->path); + paths = argv_array_detach(&array); + + parse_pathspec(pathspec, 0, PATHSPEC_PREFER_FULL, "", paths); + /* strings are now owned by pathspec */ + free(paths); +} + void line_log_init(struct rev_info *rev, const char *prefix, struct string_list *args) { struct commit *commit = NULL; @@ -746,20 +762,8 @@ void line_log_init(struct rev_info *rev, const char *prefix, struct string_list range = parse_lines(rev->diffopt.repo, commit, prefix, args); add_line_range(rev, commit, range); - if (!rev->diffopt.detect_rename) { - struct line_log_data *r; - struct argv_array array = ARGV_ARRAY_INIT; - const char **paths; - - for (r = range; r; r = r->next) - argv_array_push(&array, r->path); - paths = argv_array_detach(&array); - - parse_pathspec(&rev->diffopt.pathspec, 0, - PATHSPEC_PREFER_FULL, "", paths); - /* strings are now owned by pathspec */ - free(paths); - } + if (!rev->diffopt.detect_rename) + parse_pathspec_from_ranges(&rev->diffopt.pathspec, range); } static void move_diff_queue(struct diff_queue_struct *dst, -- 2.23.0.352.gebb2b55eae ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 2/2] line-log: avoid unnecessary full tree diffs 2019-08-21 11:04 [PATCH 0/2] line-log: avoid unnecessary full tree diffs SZEDER Gábor 2019-08-21 11:04 ` [PATCH 1/2] line-log: extract pathspec parsing from line ranges into a helper function SZEDER Gábor @ 2019-08-21 11:04 ` SZEDER Gábor 2019-08-21 15:53 ` Derrick Stolee 2019-08-21 17:29 ` Junio C Hamano 1 sibling, 2 replies; 13+ messages in thread From: SZEDER Gábor @ 2019-08-21 11:04 UTC (permalink / raw) To: git; +Cc: Junio C Hamano, Thomas Rast, Derrick Stolee, SZEDER Gábor With rename detection enabled the line-level log is able to trace the evolution of line ranges across whole-file renames [1]. Alas, to achieve that it uses the diff machinery very inefficiently, making the operation very slow [2]. And since rename detection is enabled by default, the line-level log is very slow by default. When the line-level log processes a commit with rename detection enabled, it currently does the following (see queue_diffs()): 1. Computes a full tree diff between the commit and (one of) its parent(s), i.e. invokes diff_tree_oid() with an empty 'diffopt->pathspec'. 2. Checks whether any paths in the line ranges were modified. 3. Checks whether any modified paths in the line ranges are missing in the parent commit's tree. 4. If there is such a missing path, then calls diffcore_std() to figure out whether the path was indeed renamed based on the previously computed full tree diff. 5. Continues doing stuff that are unrelated to the slowness. So basically the line-level log computes a full tree diff for each commit-parent pair in step (1) to be used for rename detection in step (4) in the off chance that an interesting path is missing from the parent. Avoid these expensive and mostly unnecessary full tree diffs by limiting the diffs to paths in the line ranges. This is much cheaper, and makes step (2) unnecessary. If it turns out that an interesting path is missing from the parent, then fall back and compute a full tree diff, so the rename detection will still work. Care must be taken when to update the pathspec used to limit the diff in case of renames. A path might be renamed on one branch and modified on several parallel running branches, and while processing commits on these branches the line-level log might have to alternate between looking at a path's new and old name. However, at any one time there is only a single 'diffopt->pathspec'. So add a step (0) to the above to ensure that the paths in the pathspec match the paths in the line ranges associated with the currently processed commit, and re-parse the pathspec from the paths in the line ranges if they differ. The new test cases include a specially crafted piece of history with two merged branches and two files, where each branch modifies both files, renames on of them, and then modifies both again. Then two separate 'git log -L' invocations check the line-level log of each of those two files, which ensures that at least one of those invocations have to do that back-and-forth between the file's old and new name (no matter which branch is traversed first). 't/t4211-line-log.sh' already contains two tests involving renames, they don't don't trigger this back-and-forth. Avoiding these unnecessary full tree diffs can have huge impact on performance, especially in big repositories with big trees and mergy history. Tracing the evolution of a function through the whole history: # git.git $ time git --no-pager log -L:read_alternate_refs:sha1-file.c v2.23.0 Before: real 0m8.874s user 0m8.816s sys 0m0.057s After: real 0m2.516s user 0m2.456s sys 0m0.060s # linux.git $ time ~/src/git/git --no-pager log \ -L:build_restore_work_registers:arch/mips/mm/tlbex.c v5.2 Before: real 3m50.033s user 3m48.041s sys 0m0.300s After: real 0m2.599s user 0m2.466s sys 0m0.157s That's just over 88x speedup. [1] Line-level log's rename following is quite similar to 'git log --follow path', with the notable differences that it does handle multiple paths at once as well, and that it doesn't show the commit performing the rename if it's an exact rename. [2] This slowness might not have been apparent initially, because back when the line-level log feature was introduced rename detection was not yet enabled by default; 12da1d1f6f (Implement line-history search (git log -L), 2013-03-28) and 5404c116aa (diff: activate diff.renames by default, 2016-02-25). Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> --- line-log.c | 43 ++++++++++++++++++++---- t/t4211-line-log.sh | 82 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 118 insertions(+), 7 deletions(-) diff --git a/line-log.c b/line-log.c index fddd91f060..9010e00950 100644 --- a/line-log.c +++ b/line-log.c @@ -737,6 +737,22 @@ static struct line_log_data *lookup_line_range(struct rev_info *revs, return ret; } +static int same_paths_in_pathspec_and_range(struct pathspec *pathspec, + struct line_log_data *range) +{ + int i; + struct line_log_data *r; + + for (i = 0, r = range; i < pathspec->nr && r; i++, r = r->next) + if (strcmp(pathspec->items[i].match, r->path)) + return 0; + if (i < pathspec->nr || r) + /* different number of pathspec items and ranges */ + return 0; + + return 1; +} + static void parse_pathspec_from_ranges(struct pathspec *pathspec, struct line_log_data *range) { @@ -762,8 +778,7 @@ void line_log_init(struct rev_info *rev, const char *prefix, struct string_list range = parse_lines(rev->diffopt.repo, commit, prefix, args); add_line_range(rev, commit, range); - if (!rev->diffopt.detect_rename) - parse_pathspec_from_ranges(&rev->diffopt.pathspec, range); + parse_pathspec_from_ranges(&rev->diffopt.pathspec, range); } static void move_diff_queue(struct diff_queue_struct *dst, @@ -821,15 +836,29 @@ static void queue_diffs(struct line_log_data *range, struct diff_queue_struct *queue, struct commit *commit, struct commit *parent) { + struct object_id *tree_oid, *parent_tree_oid; + assert(commit); + tree_oid = get_commit_tree_oid(commit); + parent_tree_oid = parent ? get_commit_tree_oid(parent) : NULL; + + if (opt->detect_rename && + !same_paths_in_pathspec_and_range(&opt->pathspec, range)) { + clear_pathspec(&opt->pathspec); + parse_pathspec_from_ranges(&opt->pathspec, range); + } DIFF_QUEUE_CLEAR(&diff_queued_diff); - diff_tree_oid(parent ? get_commit_tree_oid(parent) : NULL, - get_commit_tree_oid(commit), "", opt); - if (opt->detect_rename) { + diff_tree_oid(parent_tree_oid, tree_oid, "", opt); + if (opt->detect_rename && diff_might_be_rename()) { + /* must look at the full tree diff to detect renames */ + clear_pathspec(&opt->pathspec); + DIFF_QUEUE_CLEAR(&diff_queued_diff); + + diff_tree_oid(parent_tree_oid, tree_oid, "", opt); + filter_diffs_for_paths(range, 1); - if (diff_might_be_rename()) - diffcore_std(opt); + diffcore_std(opt); filter_diffs_for_paths(range, 0); } move_diff_queue(queue, &diff_queued_diff); diff --git a/t/t4211-line-log.sh b/t/t4211-line-log.sh index 1db7bd0f59..8319163744 100755 --- a/t/t4211-line-log.sh +++ b/t/t4211-line-log.sh @@ -132,4 +132,86 @@ test_expect_success '--raw is forbidden' ' test_must_fail git log -L1,24:b.c --raw ' +test_expect_success 'setup for checking fancy rename following' ' + git checkout --orphan moves-start && + git reset --hard && + + printf "%s\n" 12 13 14 15 b c d e >file-1 && + printf "%s\n" 22 23 24 25 B C D E >file-2 && + git add file-1 file-2 && + test_tick && + git commit -m "Add file-1 and file-2" && + oid_add_f1_f2=$(git rev-parse --short HEAD) && + + git checkout -b moves-main && + printf "%s\n" 11 12 13 14 15 b c d e >file-1 && + git commit -a -m "Modify file-1 on main" && + oid_mod_f1_main=$(git rev-parse --short HEAD) && + + printf "%s\n" 21 22 23 24 25 B C D E >file-2 && + git commit -a -m "Modify file-2 on main #1" && + oid_mod_f2_main_1=$(git rev-parse --short HEAD) && + + git mv file-1 renamed-1 && + git commit -m "Rename file-1 to renamed-1 on main" && + + printf "%s\n" 11 12 13 14 15 b c d e f >renamed-1 && + git commit -a -m "Modify renamed-1 on main" && + oid_mod_r1_main=$(git rev-parse --short HEAD) && + + printf "%s\n" 21 22 23 24 25 B C D E F >file-2 && + git commit -a -m "Modify file-2 on main #2" && + oid_mod_f2_main_2=$(git rev-parse --short HEAD) && + + git checkout -b moves-side moves-start && + printf "%s\n" 12 13 14 15 16 b c d e >file-1 && + git commit -a -m "Modify file-1 on side #1" && + oid_mod_f1_side_1=$(git rev-parse --short HEAD) && + + printf "%s\n" 22 23 24 25 26 B C D E >file-2 && + git commit -a -m "Modify file-2 on side" && + oid_mod_f2_side=$(git rev-parse --short HEAD) && + + git mv file-2 renamed-2 && + git commit -m "Rename file-2 to renamed-2 on side" && + + printf "%s\n" 12 13 14 15 16 a b c d e >file-1 && + git commit -a -m "Modify file-1 on side #2" && + oid_mod_f1_side_2=$(git rev-parse --short HEAD) && + + printf "%s\n" 22 23 24 25 26 A B C D E >renamed-2 && + git commit -a -m "Modify renamed-2 on side" && + oid_mod_r2_side=$(git rev-parse --short HEAD) && + + git checkout moves-main && + git merge moves-side && + oid_merge=$(git rev-parse --short HEAD) +' + +test_expect_success 'fancy rename following #1' ' + cat >expect <<-EOF && + $oid_merge Merge branch '\''moves-side'\'' into moves-main + $oid_mod_f1_side_2 Modify file-1 on side #2 + $oid_mod_f1_side_1 Modify file-1 on side #1 + $oid_mod_r1_main Modify renamed-1 on main + $oid_mod_f1_main Modify file-1 on main + $oid_add_f1_f2 Add file-1 and file-2 + EOF + git log -L1:renamed-1 --oneline --no-patch >actual && + test_cmp expect actual +' + +test_expect_success 'fancy rename following #2' ' + cat >expect <<-EOF && + $oid_merge Merge branch '\''moves-side'\'' into moves-main + $oid_mod_r2_side Modify renamed-2 on side + $oid_mod_f2_side Modify file-2 on side + $oid_mod_f2_main_2 Modify file-2 on main #2 + $oid_mod_f2_main_1 Modify file-2 on main #1 + $oid_add_f1_f2 Add file-1 and file-2 + EOF + git log -L1:renamed-2 --oneline --no-patch >actual && + test_cmp expect actual +' + test_done -- 2.23.0.352.gebb2b55eae ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] line-log: avoid unnecessary full tree diffs 2019-08-21 11:04 ` [PATCH 2/2] line-log: avoid unnecessary full tree diffs SZEDER Gábor @ 2019-08-21 15:53 ` Derrick Stolee 2019-08-21 17:35 ` SZEDER Gábor 2019-08-21 17:29 ` Junio C Hamano 1 sibling, 1 reply; 13+ messages in thread From: Derrick Stolee @ 2019-08-21 15:53 UTC (permalink / raw) To: SZEDER Gábor, git; +Cc: Junio C Hamano, Thomas Rast On 8/21/2019 7:04 AM, SZEDER Gábor wrote: > With rename detection enabled the line-level log is able to trace the > evolution of line ranges across whole-file renames [1]. Alas, to > achieve that it uses the diff machinery very inefficiently, making the > operation very slow [2]. And since rename detection is enabled by > default, the line-level log is very slow by default. > > When the line-level log processes a commit with rename detection > enabled, it currently does the following (see queue_diffs()): > > 1. Computes a full tree diff between the commit and (one of) its > parent(s), i.e. invokes diff_tree_oid() with an empty > 'diffopt->pathspec'. > 2. Checks whether any paths in the line ranges were modified. > 3. Checks whether any modified paths in the line ranges are missing > in the parent commit's tree. > 4. If there is such a missing path, then calls diffcore_std() to > figure out whether the path was indeed renamed based on the > previously computed full tree diff. > 5. Continues doing stuff that are unrelated to the slowness. > > So basically the line-level log computes a full tree diff for each > commit-parent pair in step (1) to be used for rename detection in step > (4) in the off chance that an interesting path is missing from the > parent. > > Avoid these expensive and mostly unnecessary full tree diffs by > limiting the diffs to paths in the line ranges. This is much cheaper, > and makes step (2) unnecessary. If it turns out that an interesting > path is missing from the parent, then fall back and compute a full > tree diff, so the rename detection will still work. I applied your patches and tried them on our VFS-enabled version of Git (see [1]). Unfortunately, the new logic is still triggering rename detection, as measured by the number of objects being downloaded. [1] https://github.com/microsoft/git/pull/182 My *guess* is that the repo has a lot of merge commits, and for many of those, the file does not exist in the first parent. Since we are essentially doing a --full-history, this means that edge tries a rename detection. If we used the file-history simplification route of traveling along a treesame edge instead of caring about both parents, then maybe this would be avoided. I could also be completely wrong about how this line-log code works with regards to --full-history. > Care must be taken when to update the pathspec used to limit the diff > in case of renames. A path might be renamed on one branch and > modified on several parallel running branches, and while processing > commits on these branches the line-level log might have to alternate > between looking at a path's new and old name. However, at any one > time there is only a single 'diffopt->pathspec'. > > So add a step (0) to the above to ensure that the paths in the > pathspec match the paths in the line ranges associated with the > currently processed commit, and re-parse the pathspec from the paths > in the line ranges if they differ. > > The new test cases include a specially crafted piece of history with > two merged branches and two files, where each branch modifies both > files, renames on of them, and then modifies both again. Then two > separate 'git log -L' invocations check the line-level log of each of > those two files, which ensures that at least one of those invocations > have to do that back-and-forth between the file's old and new name (no > matter which branch is traversed first). 't/t4211-line-log.sh' > already contains two tests involving renames, they don't don't trigger > this back-and-forth. > > Avoiding these unnecessary full tree diffs can have huge impact on > performance, especially in big repositories with big trees and mergy > history. Tracing the evolution of a function through the whole > history: > > # git.git > $ time git --no-pager log -L:read_alternate_refs:sha1-file.c v2.23.0 > > Before: > > real 0m8.874s > user 0m8.816s > sys 0m0.057s > > After: > > real 0m2.516s > user 0m2.456s > sys 0m0.060s > > # linux.git > $ time ~/src/git/git --no-pager log \ > -L:build_restore_work_registers:arch/mips/mm/tlbex.c v5.2 > > Before: > > real 3m50.033s > user 3m48.041s > sys 0m0.300s > > After: > > real 0m2.599s > user 0m2.466s > sys 0m0.157s > > That's just over 88x speedup. These performance numbers are great! Please don't let my complaints of "it doesn't work for my particularly bad example" be a deterrent to this change. If I figure out what is going on in my case, then I can create an update on top of your changes. > diff --git a/line-log.c b/line-log.c > index fddd91f060..9010e00950 100644 > --- a/line-log.c > +++ b/line-log.c > @@ -737,6 +737,22 @@ static struct line_log_data *lookup_line_range(struct rev_info *revs, > return ret; > } > > +static int same_paths_in_pathspec_and_range(struct pathspec *pathspec, > + struct line_log_data *range) > +{ > + int i; > + struct line_log_data *r; > + > + for (i = 0, r = range; i < pathspec->nr && r; i++, r = r->next) > + if (strcmp(pathspec->items[i].match, r->path)) > + return 0; > + if (i < pathspec->nr || r) > + /* different number of pathspec items and ranges */ > + return 0; > + > + return 1; > +} This method is easy to digest. Looks correct. > @@ -762,8 +778,7 @@ void line_log_init(struct rev_info *rev, const char *prefix, struct string_list > range = parse_lines(rev->diffopt.repo, commit, prefix, args); > add_line_range(rev, commit, range); > > - if (!rev->diffopt.detect_rename) > - parse_pathspec_from_ranges(&rev->diffopt.pathspec, range); > + parse_pathspec_from_ranges(&rev->diffopt.pathspec, range); > } So we always parse the pathspec, even if we don't do detect renames. > @@ -821,15 +836,29 @@ static void queue_diffs(struct line_log_data *range, > struct diff_queue_struct *queue, > struct commit *commit, struct commit *parent) > { > + struct object_id *tree_oid, *parent_tree_oid; > + > assert(commit); > > + tree_oid = get_commit_tree_oid(commit); > + parent_tree_oid = parent ? get_commit_tree_oid(parent) : NULL; > + > + if (opt->detect_rename && > + !same_paths_in_pathspec_and_range(&opt->pathspec, range)) { > + clear_pathspec(&opt->pathspec); > + parse_pathspec_from_ranges(&opt->pathspec, range); > + } If we are detecting renames and our pathspec is not up-to-date with the range, then clear the pathspec and reparse. Makes sense. > DIFF_QUEUE_CLEAR(&diff_queued_diff); > - diff_tree_oid(parent ? get_commit_tree_oid(parent) : NULL, > - get_commit_tree_oid(commit), "", opt); > + diff_tree_oid(parent_tree_oid, tree_oid, "", opt); (I rearranged a pair of -/+ lines in the diff to highlight this change.) Makes sense, parent_tree_oid above was set using the same conditional. > - if (opt->detect_rename) { > + if (opt->detect_rename && diff_might_be_rename()) { Here is the crux of the matter: diff_might_be_rename() can prevent the full tree diff. > + /* must look at the full tree diff to detect renames */ > + clear_pathspec(&opt->pathspec); > + DIFF_QUEUE_CLEAR(&diff_queued_diff); > + > + diff_tree_oid(parent_tree_oid, tree_oid, "", opt); > + > filter_diffs_for_paths(range, 1); > - if (diff_might_be_rename()) > - diffcore_std(opt); > + diffcore_std(opt); > filter_diffs_for_paths(range, 0); > } So before, diff_might_be_rename() already prevented diffcore_std(), but now it also prevents clearing the pathspec. diff_might_be_rename() has a simple implementation: static inline int diff_might_be_rename(void) { int i; for (i = 0; i < diff_queued_diff.nr; i++) if (!DIFF_FILE_VALID(diff_queued_diff.queue[i]->one)) { /* fprintf(stderr, "diff_might_be_rename found creation of: %s\n", */ /* diff_queued_diff.queue[i]->two->path); */ return 1; } return 0; } So yes, it is triggered by any path appearing in the child but not a parent. > move_diff_queue(queue, &diff_queued_diff); > diff --git a/t/t4211-line-log.sh b/t/t4211-line-log.sh > index 1db7bd0f59..8319163744 100755 > --- a/t/t4211-line-log.sh > +++ b/t/t4211-line-log.sh > @@ -132,4 +132,86 @@ test_expect_success '--raw is forbidden' ' > test_must_fail git log -L1,24:b.c --raw > ' > > +test_expect_success 'setup for checking fancy rename following' ' > + git checkout --orphan moves-start && > + git reset --hard && > + > + printf "%s\n" 12 13 14 15 b c d e >file-1 && > + printf "%s\n" 22 23 24 25 B C D E >file-2 && > + git add file-1 file-2 && > + test_tick && > + git commit -m "Add file-1 and file-2" && > + oid_add_f1_f2=$(git rev-parse --short HEAD) && > + > + git checkout -b moves-main && > + printf "%s\n" 11 12 13 14 15 b c d e >file-1 && > + git commit -a -m "Modify file-1 on main" && > + oid_mod_f1_main=$(git rev-parse --short HEAD) && > + > + printf "%s\n" 21 22 23 24 25 B C D E >file-2 && > + git commit -a -m "Modify file-2 on main #1" && > + oid_mod_f2_main_1=$(git rev-parse --short HEAD) && > + > + git mv file-1 renamed-1 && > + git commit -m "Rename file-1 to renamed-1 on main" && > + > + printf "%s\n" 11 12 13 14 15 b c d e f >renamed-1 && > + git commit -a -m "Modify renamed-1 on main" && > + oid_mod_r1_main=$(git rev-parse --short HEAD) && > + > + printf "%s\n" 21 22 23 24 25 B C D E F >file-2 && > + git commit -a -m "Modify file-2 on main #2" && > + oid_mod_f2_main_2=$(git rev-parse --short HEAD) && > + > + git checkout -b moves-side moves-start && > + printf "%s\n" 12 13 14 15 16 b c d e >file-1 && > + git commit -a -m "Modify file-1 on side #1" && > + oid_mod_f1_side_1=$(git rev-parse --short HEAD) && > + > + printf "%s\n" 22 23 24 25 26 B C D E >file-2 && > + git commit -a -m "Modify file-2 on side" && > + oid_mod_f2_side=$(git rev-parse --short HEAD) && > + > + git mv file-2 renamed-2 && > + git commit -m "Rename file-2 to renamed-2 on side" && > + > + printf "%s\n" 12 13 14 15 16 a b c d e >file-1 && > + git commit -a -m "Modify file-1 on side #2" && > + oid_mod_f1_side_2=$(git rev-parse --short HEAD) && > + > + printf "%s\n" 22 23 24 25 26 A B C D E >renamed-2 && > + git commit -a -m "Modify renamed-2 on side" && > + oid_mod_r2_side=$(git rev-parse --short HEAD) && > + > + git checkout moves-main && > + git merge moves-side && > + oid_merge=$(git rev-parse --short HEAD) > +' > + > +test_expect_success 'fancy rename following #1' ' > + cat >expect <<-EOF && > + $oid_merge Merge branch '\''moves-side'\'' into moves-main > + $oid_mod_f1_side_2 Modify file-1 on side #2 > + $oid_mod_f1_side_1 Modify file-1 on side #1 > + $oid_mod_r1_main Modify renamed-1 on main > + $oid_mod_f1_main Modify file-1 on main > + $oid_add_f1_f2 Add file-1 and file-2 > + EOF > + git log -L1:renamed-1 --oneline --no-patch >actual && > + test_cmp expect actual > +' > + > +test_expect_success 'fancy rename following #2' ' > + cat >expect <<-EOF && > + $oid_merge Merge branch '\''moves-side'\'' into moves-main > + $oid_mod_r2_side Modify renamed-2 on side > + $oid_mod_f2_side Modify file-2 on side > + $oid_mod_f2_main_2 Modify file-2 on main #2 > + $oid_mod_f2_main_1 Modify file-2 on main #1 > + $oid_add_f1_f2 Add file-1 and file-2 > + EOF > + git log -L1:renamed-2 --oneline --no-patch >actual && > + test_cmp expect actual > +' These look to be suitably interesting test cases. Thanks! Looking at your patch, I can mostly follow the logic, but my unfamiliarity with the code is keeping me from being confident in full understanding. I hope someone who is familiar can chime in, because I really like the direction here. Hopefully I will have time in the next few weeks to revisit this and work to resolve my abnormal case. -Stolee ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] line-log: avoid unnecessary full tree diffs 2019-08-21 15:53 ` Derrick Stolee @ 2019-08-21 17:35 ` SZEDER Gábor 2019-08-21 18:12 ` Derrick Stolee 2019-08-22 8:41 ` SZEDER Gábor 0 siblings, 2 replies; 13+ messages in thread From: SZEDER Gábor @ 2019-08-21 17:35 UTC (permalink / raw) To: Derrick Stolee; +Cc: git, Junio C Hamano, Thomas Rast On Wed, Aug 21, 2019 at 11:53:28AM -0400, Derrick Stolee wrote: > On 8/21/2019 7:04 AM, SZEDER Gábor wrote: > > With rename detection enabled the line-level log is able to trace the > > evolution of line ranges across whole-file renames [1]. Alas, to > > achieve that it uses the diff machinery very inefficiently, making the > > operation very slow [2]. And since rename detection is enabled by > > default, the line-level log is very slow by default. > > > > When the line-level log processes a commit with rename detection > > enabled, it currently does the following (see queue_diffs()): > > > > 1. Computes a full tree diff between the commit and (one of) its > > parent(s), i.e. invokes diff_tree_oid() with an empty > > 'diffopt->pathspec'. > > 2. Checks whether any paths in the line ranges were modified. > > 3. Checks whether any modified paths in the line ranges are missing > > in the parent commit's tree. > > 4. If there is such a missing path, then calls diffcore_std() to > > figure out whether the path was indeed renamed based on the > > previously computed full tree diff. > > 5. Continues doing stuff that are unrelated to the slowness. > > > > So basically the line-level log computes a full tree diff for each > > commit-parent pair in step (1) to be used for rename detection in step > > (4) in the off chance that an interesting path is missing from the > > parent. > > > > Avoid these expensive and mostly unnecessary full tree diffs by > > limiting the diffs to paths in the line ranges. This is much cheaper, > > and makes step (2) unnecessary. If it turns out that an interesting > > path is missing from the parent, then fall back and compute a full > > tree diff, so the rename detection will still work. > > I applied your patches and tried them on our VFS-enabled version of Git > (see [1]). Unfortunately, the new logic is still triggering rename > detection, as measured by the number of objects being downloaded. Well, the goal of this patch was to avoid full tree diffs if possible, not to avoid rename detection :) Anyway, I wonder how does 'git log -L1:your-evil-path --no-renames' fare as a baseline? > My *guess* is that the repo has a lot of merge commits, and for many > of those, the file does not exist in the first parent. Yeah, actual renames are only one of the possible causes that trigger rename detection. Surprisingly (well, to me, at least), in git.git rename detection in line-level log is most often triggered by the subtree merges of gitk and git-gui, not by actual renames. > Since we are > essentially doing a --full-history, this means that edge tries a > rename detection. If we used the file-history simplification route of > traveling along a treesame edge instead of caring about both parents, > then maybe this would be avoided. > > I could also be completely wrong about how this line-log code works > with regards to --full-history. Line-level log doesn't do '--full-history', though it does seem to compute more diffs than a simple 'git log -- path'. I applied the following diff on current master to add a bit of ad-hoc tracing to see how many and which commit-parent diffs are computed: diff --git a/line-log.c b/line-log.c index 9010e00950..64f2c4d216 100644 --- a/line-log.c +++ b/line-log.c @@ -839,6 +839,8 @@ static void queue_diffs(struct line_log_data *range, struct object_id *tree_oid, *parent_tree_oid; assert(commit); + fprintf(stderr, "%s %s\n", oid_to_hex(&commit->object.oid), + parent ? oid_to_hex(&parent->object.oid) : "-"); tree_oid = get_commit_tree_oid(commit); parent_tree_oid = parent ? get_commit_tree_oid(parent) : NULL; diff --git a/revision.c b/revision.c index 07412297f0..3f2182e32a 100644 --- a/revision.c +++ b/revision.c @@ -627,6 +627,8 @@ static int rev_compare_tree(struct rev_info *revs, struct tree *t1 = get_commit_tree(parent); struct tree *t2 = get_commit_tree(commit); + fprintf(stderr, "%s %s\n", oid_to_hex(&commit->object.oid), + parent ? oid_to_hex(&parent->object.oid) : "-"); if (!t1) return REV_TREE_NEW; if (!t2) $ ./git log v2.23.0 -- builtin/rev-list.c >/dev/null 2>P $ ./git log --full-history v2.23.0 -- builtin/rev-list.c >/dev/null 2>FH $ ./git log -L1:builtin/rev-list.c v2.23.0 >/dev/null 2>LL $ wc -l P FH LL 17230 P 70842 FH 25995 LL 114067 total So line-level log clearly computes a lot less diffs than '--full-history', though still about 50% more than a regular pathspec-limited history traversal. Looking at the commit-parent pairs in the output, it appears that the difference comes mostly from merge commits, because line-level log compares a merge commit with all of its parents. The number of processed commits is close enough, though: $ cut -d' ' -f1 P |sort -u |wc -l 17164 $ cut -d' ' -f1 LL |sort -u |wc -l 17894 It seems there is still more room for improvements by avoiding commit-non_first_parent diffs when the first parent is TREESAME, and doing so could hopefully avoid triggering rename detection in those subtree merges or in case of your evil path. > > @@ -762,8 +778,7 @@ void line_log_init(struct rev_info *rev, const char *prefix, struct string_list > > range = parse_lines(rev->diffopt.repo, commit, prefix, args); > > add_line_range(rev, commit, range); > > > > - if (!rev->diffopt.detect_rename) > > - parse_pathspec_from_ranges(&rev->diffopt.pathspec, range); > > + parse_pathspec_from_ranges(&rev->diffopt.pathspec, range); > > } > > So we always parse the pathspec, even if we don't do detect renames. Erm, no: we always parse the pathspec, even if we DO detect renames. This condition made 'git log -L... --no-renames' fast. > > @@ -821,15 +836,29 @@ static void queue_diffs(struct line_log_data *range, > > struct diff_queue_struct *queue, > > struct commit *commit, struct commit *parent) > > { > > + struct object_id *tree_oid, *parent_tree_oid; > > + > > assert(commit); > > > > + tree_oid = get_commit_tree_oid(commit); > > + parent_tree_oid = parent ? get_commit_tree_oid(parent) : NULL; > > + > > + if (opt->detect_rename && > > + !same_paths_in_pathspec_and_range(&opt->pathspec, range)) { > > + clear_pathspec(&opt->pathspec); > > + parse_pathspec_from_ranges(&opt->pathspec, range); > > + } > > If we are detecting renames and our pathspec is not up-to-date with the > range, then clear the pathspec and reparse. Makes sense. Renames (and subtree merges, etc.) are rare, so the paths rarely change, and clearing and parsing pathspecs involves 2-3 memory allocations and frees per path, so it's worth checking first. It had a slight but measurable performance impact, about 2% in the Linux repository. ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] line-log: avoid unnecessary full tree diffs 2019-08-21 17:35 ` SZEDER Gábor @ 2019-08-21 18:12 ` Derrick Stolee 2019-08-22 8:41 ` SZEDER Gábor 1 sibling, 0 replies; 13+ messages in thread From: Derrick Stolee @ 2019-08-21 18:12 UTC (permalink / raw) To: SZEDER Gábor; +Cc: git, Junio C Hamano, Thomas Rast On 8/21/2019 1:35 PM, SZEDER Gábor wrote: > On Wed, Aug 21, 2019 at 11:53:28AM -0400, Derrick Stolee wrote: >> On 8/21/2019 7:04 AM, SZEDER Gábor wrote: >>> With rename detection enabled the line-level log is able to trace the >>> evolution of line ranges across whole-file renames [1]. Alas, to >>> achieve that it uses the diff machinery very inefficiently, making the >>> operation very slow [2]. And since rename detection is enabled by >>> default, the line-level log is very slow by default. >>> >>> When the line-level log processes a commit with rename detection >>> enabled, it currently does the following (see queue_diffs()): >>> >>> 1. Computes a full tree diff between the commit and (one of) its >>> parent(s), i.e. invokes diff_tree_oid() with an empty >>> 'diffopt->pathspec'. >>> 2. Checks whether any paths in the line ranges were modified. >>> 3. Checks whether any modified paths in the line ranges are missing >>> in the parent commit's tree. >>> 4. If there is such a missing path, then calls diffcore_std() to >>> figure out whether the path was indeed renamed based on the >>> previously computed full tree diff. >>> 5. Continues doing stuff that are unrelated to the slowness. >>> >>> So basically the line-level log computes a full tree diff for each >>> commit-parent pair in step (1) to be used for rename detection in step >>> (4) in the off chance that an interesting path is missing from the >>> parent. >>> >>> Avoid these expensive and mostly unnecessary full tree diffs by >>> limiting the diffs to paths in the line ranges. This is much cheaper, >>> and makes step (2) unnecessary. If it turns out that an interesting >>> path is missing from the parent, then fall back and compute a full >>> tree diff, so the rename detection will still work. >> >> I applied your patches and tried them on our VFS-enabled version of Git >> (see [1]). Unfortunately, the new logic is still triggering rename >> detection, as measured by the number of objects being downloaded. > > Well, the goal of this patch was to avoid full tree diffs if possible, > not to avoid rename detection :) > > Anyway, I wonder how does 'git log -L1:your-evil-path --no-renames' > fare as a baseline? Yeah, adding --no-renames does really well, comparatively. Perhaps I'll just recommend to users to use that flag for now. Thanks, -Stolee ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] line-log: avoid unnecessary full tree diffs 2019-08-21 17:35 ` SZEDER Gábor 2019-08-21 18:12 ` Derrick Stolee @ 2019-08-22 8:41 ` SZEDER Gábor 2019-08-22 14:53 ` Derrick Stolee ` (2 more replies) 1 sibling, 3 replies; 13+ messages in thread From: SZEDER Gábor @ 2019-08-22 8:41 UTC (permalink / raw) To: Derrick Stolee; +Cc: git, Junio C Hamano, Thomas Rast On Wed, Aug 21, 2019 at 07:35:15PM +0200, SZEDER Gábor wrote: > So line-level log clearly computes a lot less diffs than > '--full-history', though still about 50% more than a regular > pathspec-limited history traversal. Looking at the commit-parent > pairs in the output, it appears that the difference comes mostly from > merge commits, because line-level log compares a merge commit with all > of its parents. > It seems there is still more room for improvements by avoiding > commit-non_first_parent diffs when the first parent is TREESAME, and > doing so could hopefully avoid triggering rename detection in those > subtree merges or in case of your evil path. Well, that fruit hung much lower than I though, just look at the size of the WIP patch below. I just hope that there are no unexpected surprises, but FWIW it produces the exact same output for all files up to 't/t5515' in v2.23.0 as the previous patch. Can't wait to see how it fares with that evil Windows path :) --- >8 --- Subject: [PATCH 3/2] WIP line-log: stop diff-ing after first TREESAME merge parent # git.git, ~25% of all commits are merges $ time git --no-pager log -L:read_alternate_refs:sha1-file.c v2.23.0 Before: real 0m2.516s user 0m2.456s sys 0m0.060s After: real 0m1.132s user 0m1.096s sys 0m0.036s # linux.git, ~7% of all commits are merges $ time ~/src/git/git --no-pager log \ -L:build_restore_work_registers:arch/mips/mm/tlbex.c v5.2 Before: real 0m2.599s user 0m2.466s sys 0m0.157s After: real 0m1.976s user 0m1.856s sys 0m0.121s [TODO: get rid of unnecessary arrays, tests?, write commit message...] --- line-log.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/line-log.c b/line-log.c index 9010e00950..a4b032f83a 100644 --- a/line-log.c +++ b/line-log.c @@ -1184,13 +1184,11 @@ static int process_ranges_merge_commit(struct rev_info *rev, struct commit *comm p = commit->parents; for (i = 0; i < nparents; i++) { + int changed; parents[i] = p->item; p = p->next; queue_diffs(range, &rev->diffopt, &diffqueues[i], commit, parents[i]); - } - for (i = 0; i < nparents; i++) { - int changed; cand[i] = NULL; changed = process_all_files(&cand[i], rev, &diffqueues[i], range); if (!changed) { @@ -1203,7 +1201,7 @@ static int process_ranges_merge_commit(struct rev_info *rev, struct commit *comm commit_list_append(parents[i], &commit->parents); free(parents); free(cand); - free_diffqueues(nparents, diffqueues); + free_diffqueues(i, diffqueues); /* NEEDSWORK leaking like a sieve */ return 0; } -- 2.23.0.352.gebb2b55eae ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] line-log: avoid unnecessary full tree diffs 2019-08-22 8:41 ` SZEDER Gábor @ 2019-08-22 14:53 ` Derrick Stolee 2019-08-22 16:01 ` Junio C Hamano 2019-08-23 10:04 ` SZEDER Gábor 2 siblings, 0 replies; 13+ messages in thread From: Derrick Stolee @ 2019-08-22 14:53 UTC (permalink / raw) To: SZEDER Gábor; +Cc: git, Junio C Hamano, Thomas Rast On 8/22/2019 4:41 AM, SZEDER Gábor wrote: > On Wed, Aug 21, 2019 at 07:35:15PM +0200, SZEDER Gábor wrote: >> So line-level log clearly computes a lot less diffs than >> '--full-history', though still about 50% more than a regular >> pathspec-limited history traversal. Looking at the commit-parent >> pairs in the output, it appears that the difference comes mostly from >> merge commits, because line-level log compares a merge commit with all >> of its parents. > >> It seems there is still more room for improvements by avoiding >> commit-non_first_parent diffs when the first parent is TREESAME, and >> doing so could hopefully avoid triggering rename detection in those >> subtree merges or in case of your evil path. > > Well, that fruit hung much lower than I though, just look at the size > of the WIP patch below. I just hope that there are no unexpected > surprises, but FWIW it produces the exact same output for all files up > to 't/t5515' in v2.23.0 as the previous patch. > > Can't wait to see how it fares with that evil Windows path :) Thanks for this! With this patch, we finally have the time down to ~20s. This is a HUGE improvement, especially considering there is only one result for the particular section, so the entire history is explored in that time. > --- >8 --- > > Subject: [PATCH 3/2] WIP line-log: stop diff-ing after first TREESAME merge parent > > # git.git, ~25% of all commits are merges > $ time git --no-pager log -L:read_alternate_refs:sha1-file.c v2.23.0 > > Before: > > real 0m2.516s > user 0m2.456s > sys 0m0.060s > > After: > > real 0m1.132s > user 0m1.096s > sys 0m0.036s > > # linux.git, ~7% of all commits are merges > $ time ~/src/git/git --no-pager log \ > -L:build_restore_work_registers:arch/mips/mm/tlbex.c v5.2 > > Before: > > real 0m2.599s > user 0m2.466s > sys 0m0.157s > > After: > > real 0m1.976s > user 0m1.856s > sys 0m0.121s > > [TODO: get rid of unnecessary arrays, tests?, write commit message...] > --- > line-log.c | 6 ++---- > 1 file changed, 2 insertions(+), 4 deletions(-) > > diff --git a/line-log.c b/line-log.c > index 9010e00950..a4b032f83a 100644 > --- a/line-log.c > +++ b/line-log.c > @@ -1184,13 +1184,11 @@ static int process_ranges_merge_commit(struct rev_info *rev, struct commit *comm > > p = commit->parents; > for (i = 0; i < nparents; i++) { > + int changed; > parents[i] = p->item; > p = p->next; > queue_diffs(range, &rev->diffopt, &diffqueues[i], commit, parents[i]); > - } > > - for (i = 0; i < nparents; i++) { > - int changed; > cand[i] = NULL; > changed = process_all_files(&cand[i], rev, &diffqueues[i], range); > if (!changed) { Interesting. The old logic computed ALL the diffs, then started navigating. By navigating before computing all the diffs, we are now avoiding the rename logic on the SECOND parent, and there will be a lot of second parents that do not include the file (depending on the number of parallel topics being merged independently). That's why git.git has a better performance difference than linux.git. > @@ -1203,7 +1201,7 @@ static int process_ranges_merge_commit(struct rev_info *rev, struct commit *comm > commit_list_append(parents[i], &commit->parents); > free(parents); > free(cand); > - free_diffqueues(nparents, diffqueues); > + free_diffqueues(i, diffqueues); Good point here, as we haven't initialized all of the queues. Thanks, -Stolee ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] line-log: avoid unnecessary full tree diffs 2019-08-22 8:41 ` SZEDER Gábor 2019-08-22 14:53 ` Derrick Stolee @ 2019-08-22 16:01 ` Junio C Hamano 2019-08-22 16:26 ` SZEDER Gábor 2019-08-23 10:04 ` SZEDER Gábor 2 siblings, 1 reply; 13+ messages in thread From: Junio C Hamano @ 2019-08-22 16:01 UTC (permalink / raw) To: SZEDER Gábor; +Cc: Derrick Stolee, git, Thomas Rast SZEDER Gábor <szeder.dev@gmail.com> writes: > Well, that fruit hung much lower than I though, just look at the size > of the WIP patch below. I just hope that there are no unexpected > surprises, but FWIW it produces the exact same output for all files up > to 't/t5515' in v2.23.0 as the previous patch. > > Can't wait to see how it fares with that evil Windows path :) > > --- >8 --- > > Subject: [PATCH 3/2] WIP line-log: stop diff-ing after first TREESAME merge parent A quick question. That we need "stop diffing after first treesame" patch in the first place means we have always been attempting to follow all the parents of a merge? I'd expect that to happen when "--full-history" was given to "git log -L..." invocation. When we are simplifying side branches without "--full-history", I agree that we should see if any parent is treesame with respect to the paths we are interested in, and if so ignore all other parents. Or am I misunderstanding the issue here? ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] line-log: avoid unnecessary full tree diffs 2019-08-22 16:01 ` Junio C Hamano @ 2019-08-22 16:26 ` SZEDER Gábor 2019-08-22 16:51 ` Derrick Stolee 0 siblings, 1 reply; 13+ messages in thread From: SZEDER Gábor @ 2019-08-22 16:26 UTC (permalink / raw) To: Junio C Hamano; +Cc: Derrick Stolee, git, Thomas Rast On Thu, Aug 22, 2019 at 09:01:44AM -0700, Junio C Hamano wrote: > SZEDER Gábor <szeder.dev@gmail.com> writes: > > > Well, that fruit hung much lower than I though, just look at the size > > of the WIP patch below. I just hope that there are no unexpected > > surprises, but FWIW it produces the exact same output for all files up > > to 't/t5515' in v2.23.0 as the previous patch. > > > > Can't wait to see how it fares with that evil Windows path :) > > > > --- >8 --- > > > > Subject: [PATCH 3/2] WIP line-log: stop diff-ing after first TREESAME merge parent > > A quick question. That we need "stop diffing after first treesame" > patch in the first place means we have always been attempting to > follow all the parents of a merge? To follow, no. But there are two subsequent loops: the first loop computed the diffs between the merge and each of its parents, while the second processed those diffs, and returned as soon as it found a treesame parent, without following the others. This patch unified those two loops so it computes the diff with the first parent, and then processes that diff right away, and returns if treesame, thereby avoding diffing the remaining parents. > I'd expect that to happen when > "--full-history" was given to "git log -L..." invocation. Oh, right, I didn't consider '--full-history'. In that case it should not stop at the first parent. Hmm, looking into this, it seems that line-level log doesn't work with '--full-history' to begin with: # Each commit does what the subject says. $ git log --oneline --graph * f9bf557 (HEAD -> master) Merge branch 'branch' |\ | * 1b573fb (branch) Revert "Modify file" | * 3634cf3 Modify file |/ * 8842c18 Add file $ git log --oneline file 8842c18 Add file $ git log --oneline --full-history file 1b573fb (branch) Revert "Modify file" 3634cf3 Modify file 8842c18 Add file $ ~/src/git/BUILDS/v2.23.0/bin/git log -L1:file --oneline --no-patch 8842c18 Add file $ ~/src/git/BUILDS/v2.23.0/bin/git log -L1:file --oneline --no-patch --full-history 8842c18 Add file ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] line-log: avoid unnecessary full tree diffs 2019-08-22 16:26 ` SZEDER Gábor @ 2019-08-22 16:51 ` Derrick Stolee 0 siblings, 0 replies; 13+ messages in thread From: Derrick Stolee @ 2019-08-22 16:51 UTC (permalink / raw) To: SZEDER Gábor, Junio C Hamano; +Cc: git, Thomas Rast On 8/22/2019 12:26 PM, SZEDER Gábor wrote: > On Thu, Aug 22, 2019 at 09:01:44AM -0700, Junio C Hamano wrote: >> SZEDER Gábor <szeder.dev@gmail.com> writes: >> >>> Well, that fruit hung much lower than I though, just look at the size >>> of the WIP patch below. I just hope that there are no unexpected >>> surprises, but FWIW it produces the exact same output for all files up >>> to 't/t5515' in v2.23.0 as the previous patch. >>> >>> Can't wait to see how it fares with that evil Windows path :) >>> >>> --- >8 --- >>> >>> Subject: [PATCH 3/2] WIP line-log: stop diff-ing after first TREESAME merge parent >> >> A quick question. That we need "stop diffing after first treesame" >> patch in the first place means we have always been attempting to >> follow all the parents of a merge? > > To follow, no. > > But there are two subsequent loops: the first loop computed the diffs > between the merge and each of its parents, while the second processed > those diffs, and returned as soon as it found a treesame parent, > without following the others. > > This patch unified those two loops so it computes the diff with the > first parent, and then processes that diff right away, and returns if > treesame, thereby avoding diffing the remaining parents. The change you've proposed could be made a bit better in the following way: first look for a treesame parent by computing the diffs without rename detection. Re-enable rename detection only if no treesame parent is found. That would be sure to avoid the rename detection even if the merge has a treesame second parent. Thanks, -Stolee ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] line-log: avoid unnecessary full tree diffs 2019-08-22 8:41 ` SZEDER Gábor 2019-08-22 14:53 ` Derrick Stolee 2019-08-22 16:01 ` Junio C Hamano @ 2019-08-23 10:04 ` SZEDER Gábor 2 siblings, 0 replies; 13+ messages in thread From: SZEDER Gábor @ 2019-08-23 10:04 UTC (permalink / raw) To: Derrick Stolee; +Cc: git, Junio C Hamano, Thomas Rast On Thu, Aug 22, 2019 at 10:41:58AM +0200, SZEDER Gábor wrote: > On Wed, Aug 21, 2019 at 07:35:15PM +0200, SZEDER Gábor wrote: > Subject: [PATCH 3/2] WIP line-log: stop diff-ing after first TREESAME merge parent > # linux.git, ~7% of all commits are merges > $ time ~/src/git/git --no-pager log \ > -L:build_restore_work_registers:arch/mips/mm/tlbex.c v5.2 > > Before: > > real 0m2.599s > user 0m2.466s > sys 0m0.157s > > After: > > real 0m1.976s > user 0m1.856s > sys 0m0.121s So, to recap, the above command follows the given line range through the whole history, and the timings were done with these three patches on top of v2.23.0 and with a commit-graph file present and used. When merged with my more responsive line-level log series, the same command takes about 12% longer: real 0m2.216s user 0m2.108s sys 0m0.109s Looking at the flame graphs generated from the perf profiles I notice the following: - (line_log_)process_ranges_arbitrary_commit(), i.e. the function responsible for processing all commits for the line-level log, shows up in 2408 samples before the merge and in 2387 samples after the merge. I'm inclined to write it off as noise. - Before the merge limit_list() and sort_in_topological_order(), the two functions responsible for topo-ordering without generation numbers, show up in 2484 and 707 samples, respectively, which is 3191 samples combined. - After the merge expand_topo_walk() alone shows up in 4341 samples, with next_topo_commit() in 72 samples. So it appears that the generation numbers-based topo-ordering is almost 40% slower than "traditional" topo-ordering. It came as a surprise to me, but perhaps to you it's expected? Anyway, the time to show the first commit is still much better with generarion numbers, so overall I think it more than offsets this ~10% difference. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] line-log: avoid unnecessary full tree diffs 2019-08-21 11:04 ` [PATCH 2/2] line-log: avoid unnecessary full tree diffs SZEDER Gábor 2019-08-21 15:53 ` Derrick Stolee @ 2019-08-21 17:29 ` Junio C Hamano 1 sibling, 0 replies; 13+ messages in thread From: Junio C Hamano @ 2019-08-21 17:29 UTC (permalink / raw) To: SZEDER Gábor; +Cc: git, Thomas Rast, Derrick Stolee SZEDER Gábor <szeder.dev@gmail.com> writes: > So basically the line-level log computes a full tree diff for each > commit-parent pair in step (1) to be used for rename detection in step > (4) in the off chance that an interesting path is missing from the > parent. Good explanation. As we are not supporting swapping of two (or more) files, we only need rename processing when a path we have been inspecting disappears, at which point it is worth spending cycles to see where the path used to be in the parent commit. > [1] Line-level log's rename following is quite similar to 'git log > --follow path', with the notable differences that it does handle > multiple paths at once as well, and that it doesn't show the > commit performing the rename if it's an exact rename. Yeah, it's one of the reasons why "log --follow" is not (yet) a serious "feature" but merely a "checkbox item". > [2] This slowness might not have been apparent initially, because back > when the line-level log feature was introduced rename detection s/introduced/&,/ > was not yet enabled by default; 12da1d1f6f (Implement line-history > search (git log -L), 2013-03-28) and 5404c116aa (diff: activate > diff.renames by default, 2016-02-25). s/\.$/ are about 3 years apart&/ or something like that to make what follows the semicolon a full sentence? ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2019-08-23 10:04 UTC | newest] Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-08-21 11:04 [PATCH 0/2] line-log: avoid unnecessary full tree diffs SZEDER Gábor 2019-08-21 11:04 ` [PATCH 1/2] line-log: extract pathspec parsing from line ranges into a helper function SZEDER Gábor 2019-08-21 11:04 ` [PATCH 2/2] line-log: avoid unnecessary full tree diffs SZEDER Gábor 2019-08-21 15:53 ` Derrick Stolee 2019-08-21 17:35 ` SZEDER Gábor 2019-08-21 18:12 ` Derrick Stolee 2019-08-22 8:41 ` SZEDER Gábor 2019-08-22 14:53 ` Derrick Stolee 2019-08-22 16:01 ` Junio C Hamano 2019-08-22 16:26 ` SZEDER Gábor 2019-08-22 16:51 ` Derrick Stolee 2019-08-23 10:04 ` SZEDER Gábor 2019-08-21 17:29 ` Junio C Hamano
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).