git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: "SZEDER Gábor" <szeder.dev@gmail.com>
Cc: "Junio C Hamano" <gitster@pobox.com>,
	"Derrick Stolee" <stolee@gmail.com>,
	"Git List" <git@vger.kernel.org>,
	"Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
Subject: Re: We should add a "git gc --auto" after "git clone" due to commit graph
Date: Thu, 11 Oct 2018 01:01:51 +0200	[thread overview]
Message-ID: <87lg75e78g.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <20181010220739.GF23446@szeder.dev>


On Wed, Oct 10 2018, SZEDER Gábor wrote:

> On Thu, Oct 04, 2018 at 11:09:58PM -0700, Junio C Hamano wrote:
>> SZEDER Gábor <szeder.dev@gmail.com> writes:
>>
>> >>     git-gc - Cleanup unnecessary files and optimize the local repository
>> >>
>> >> Creating these indexes like the commit-graph falls under "optimize the
>> >> local repository",
>> >
>> > But it doesn't fall under "cleanup unnecessary files", which the
>> > commit-graph file is, since, strictly speaking, it's purely
>> > optimization.
>>
>> I won't be actively engaged in this discussion soon, but I must say
>> that "git gc" doing "garbage collection" is merely an implementation
>> detail of optimizing the repository for further use.  And from that
>> point of view, what needs to be updated is the synopsis of the
>> git-gc doc.  It states "X and Y" above, but it actually is "Y by
>> doing X and other things".
>
> Well, then perhaps the name of the command should be updated, too, to
> better reflect what it actually does...

I don't disagree, but between "git gc" being a longstanding thing called
not just by us , but third parties expecting it to a be a general swiss
army knife for "make repo better" (so for a new tool they'd need to
update their code), and general name bikeshedding I think it's best if
we just proceed for the sake of argument with the assumption that none
of us find the name confusing in this context.

At least that's the discussion I'm interested in. I.e. whether it makes
conceptual / structural sense for this stuff to live in the same place /
function in the code. We can always argue about the name of the function
as a separate topic.

>> I understand your "by definition there is no garbage immediately
>> after clone" position, and also I would understand if you find it
>> (perhaps philosophically) disturbing that "git clone" may give users
>> a suboptimal repository that immediately needs optimizing [*1*].
>>
>> But that bridge was crossed long time ago ever since pack transfer
>> was invented.  The data source sends only the pack data stream, and
>> the receiving end is responsible for spending cycles to build .idx
>> file.  Theoretically, .pack should be all that is needed---you
>> should be able to locate any necessary object by parsing the .pack
>> file every time you open it, and .idx is mere optimization.  You can
>> think of the .midx and graph files the same way.
>
> I don't think this is a valid comparison, because, practically, Git
> just didn't work after I deleted all pack index files.  So while they
> can be easily (re)generated, they are essential to make pack files
> usable.  The commit-graph and .midx files, however, can be safely
> deleted, and everything keeps working as before.

For "things that would run in 20ms now run in 30 seconds" (actual
numbers on a repo I have) values of "keeps working".

So I think this line gets blurred somewhat. In practice if you're
expecting the graph to be there to run the sort of commands that most
benefit from it it's essential that it be generated, not some nice
optional extra.

> OTOH, this is an excellent comparison, and I do think of the .midx and
> graph files the same way as the pack index files.  During a clone, the
> pack index file isn't generated by running a separate 'git gc
> (--auto)', but by clone (or fetch-pack?) running 'git index-pack'.
> The way I see it that should be the case for these other files as well.
>
> And it is much simpler, shorter, and cleaner to either run 'git
> commit-graph ...' or even to call write_commit_graph_reachable()
> directly from cmd_clone(), than to bolt on another option and config
> variable on 'git gc' [1] to coax it into some kind of an "after clone"
> mode, that it shouldn't be doing in the first place.  At least for
> now, so when we'll eventually get as far ...
>
>> I would not be surprised by a future in which the initial index-pack
>> that is responsible for receiving the incoming pack stream and
>> storing that in .pack file(s) while creating corresponding .idx
>> file(s) becomes also responsible for building .midx and graph files
>> in the same pass, or at least smaller number of passes.  Once we
>> gain experience and confidence with these new auxiliary files, that
>> ought to happen naturally.  And at that point, we won't be having
>> this discussion---we'd all happily run index-pack to receive the
>> pack data, because that is pretty much the fundamental requirement
>> to make use of the data.
>
> ... that what you wrote here becomes a reality (and I fully agree that
> this is what we should ultimately aim for), then we won't have that
> option and config variable still lying around and requiring
> maintenance because of backwards compatibility.

We'll still have the use-case of wanting to turn on
gc.writeCommitGraph=true or equivalent and wanting previously-cloned
repositories to "catch up" and get a commit graph ASAP (but not do a
full repack).

This is why my patch tries to unify those two codepaths, i.e. so I can
turn this on and know that on the next "git fetch" we'll have the graph
(without needing to also run a full repack if it's not needed).

> 1 - https://public-inbox.org/git/87in2hgzin.fsf@evledraar.gmail.com/
>
>> [Footnote]
>>
>> *1* Even without considering these recent invention of auxiliary
>>     files, cloning from a sloppily packed server whose primary focus
>>     is to avoid spending cycles by not computing better deltas will
>>     give the cloner a suboptimal repository.  If we truly want to
>>     have an optimized repository ready to be used after cloning, we
>>     should run an equivalent of "repack -a -d -f" immediately after
>>     "git clone".
>
> I noticed a few times that I got surprisingly large packs from GitHub,
> e.g. there is over 70% size difference between --single-branch cloning
> v2.19.0 from GitHub and from my local clone or from kernel.org (~95MB
> vs. ~55MB vs ~52MB).  After running 'git repack -a -d -f' they all end
> up at ~65MB, which is a nice size reduction for the clone from GitHub,
> but the others just gained 10-13 more MBs.

To me this sounds like even more reason for why we shouldn't be
splitting up the entry point for the "does the repo look shitty? Fix
it!" function (currently called git-gc).

We might get such crappy packs from a clone, or from a fetch, or maybe
in the future when e.g. "git add" or "git commit" generate a pack
instead of a bunch of loose objects we'd want to immediately kick off
something in the background to optimize / consolidate them.

So instead of having clone/commit/add/whatever call some custom
function(s) to just do the things they *think* they want, we just call
"gc --auto", because that entry point needs to eventually know how to
"recover" from any of those states without any prior knowledge or hints
about what just happened.

This is why I'm calling "gc --auto" from clone in this WIP series, with
the only special sauce being "if you have stuff to do, don't background
yourself", because not having a commit graph at all after a clone is
just one special case of many where we have no commit graph and want to
have one made ASAP (e.g. someone rm'd it, or the "I want a commit graph"
config variable just got turned on).

So since we need all those smarts in some function anyway, let's just
have one entry point to that logic. Discarding what it doesn't need to
do (e.g. too_many_loose_objects() just after a clone) is a trivial cost,
so if we don't care about that we have less complexity to worry about by
not having a proliferation of entry points into what are now subsets of
the GC code.

  reply	other threads:[~2018-10-10 23:01 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-03 13:23 We should add a "git gc --auto" after "git clone" due to commit graph Ævar Arnfjörð Bjarmason
2018-10-03 13:36 ` SZEDER Gábor
2018-10-03 13:42   ` Derrick Stolee
2018-10-03 14:18     ` Ævar Arnfjörð Bjarmason
2018-10-03 14:01   ` Ævar Arnfjörð Bjarmason
2018-10-03 14:17     ` SZEDER Gábor
2018-10-03 14:22       ` Ævar Arnfjörð Bjarmason
2018-10-03 14:53         ` SZEDER Gábor
2018-10-03 15:19           ` Ævar Arnfjörð Bjarmason
2018-10-03 16:59             ` SZEDER Gábor
2018-10-05  6:09               ` Junio C Hamano
2018-10-10 22:07                 ` SZEDER Gábor
2018-10-10 23:01                   ` Ævar Arnfjörð Bjarmason [this message]
2018-10-03 19:08           ` Stefan Beller
2018-10-03 19:21             ` Jeff King
2018-10-03 20:35               ` Ævar Arnfjörð Bjarmason
2018-10-03 17:47         ` Stefan Beller
2018-10-03 18:47           ` Ævar Arnfjörð Bjarmason
2018-10-03 18:51             ` Jeff King
2018-10-03 18:59               ` Derrick Stolee
2018-10-03 19:18                 ` Jeff King
2018-10-08 16:41                   ` SZEDER Gábor
2018-10-08 16:57                     ` Derrick Stolee
2018-10-08 18:10                       ` SZEDER Gábor
2018-10-08 18:29                         ` Derrick Stolee
2018-10-09  3:08                           ` Jeff King
2018-10-09 13:48                             ` Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph) Derrick Stolee
2018-10-09 18:45                               ` Ævar Arnfjörð Bjarmason
2018-10-09 18:46                               ` Jeff King
2018-10-09 19:03                                 ` Derrick Stolee
2018-10-09 21:14                                   ` Jeff King
2018-10-09 23:12                                     ` Bloom Filters Jeff King
2018-10-09 23:13                                       ` [PoC -- do not apply 1/3] initial tree-bitmap proof of concept Jeff King
2018-10-09 23:14                                       ` [PoC -- do not apply 2/3] test-tree-bitmap: add "dump" mode Jeff King
2018-10-10  0:48                                         ` Junio C Hamano
2018-10-11  3:13                                           ` Jeff King
2018-10-09 23:14                                       ` [PoC -- do not apply 3/3] test-tree-bitmap: replace ewah with custom rle encoding Jeff King
2018-10-10  0:58                                         ` Junio C Hamano
2018-10-11  3:20                                           ` Jeff King
2018-10-11 12:33                                       ` Bloom Filters Derrick Stolee
2018-10-11 13:43                                         ` Jeff King
2018-10-09 21:30                             ` We should add a "git gc --auto" after "git clone" due to commit graph SZEDER Gábor
2018-10-09 19:34                       ` [PATCH 0/4] Bloom filter experiment SZEDER Gábor
2018-10-09 19:34                         ` [PATCH 1/4] Add a (very) barebones Bloom filter implementation SZEDER Gábor
2018-10-09 19:34                         ` [PATCH 2/4] commit-graph: write a Bloom filter containing changed paths for each commit SZEDER Gábor
2018-10-09 21:06                           ` Jeff King
2018-10-09 21:37                             ` SZEDER Gábor
2018-10-09 19:34                         ` [PATCH 3/4] revision.c: use the Bloom filter to speed up path-limited revision walks SZEDER Gábor
2018-10-09 19:34                         ` [PATCH 4/4] revision.c: add GIT_TRACE_BLOOM_FILTER for a bit of statistics SZEDER Gábor
2018-10-09 19:47                         ` [PATCH 0/4] Bloom filter experiment Derrick Stolee
2018-10-11  1:21                         ` [PATCH 0/2] Per-commit filter proof of concept Jonathan Tan
2018-10-11  1:21                           ` [PATCH 1/2] One filter per commit Jonathan Tan
2018-10-11 12:49                             ` Derrick Stolee
2018-10-11 19:11                               ` [PATCH] Per-commit and per-parent filters for 2 parents Jonathan Tan
2018-10-11  1:21                           ` [PATCH 2/2] Only make bloom filter for first parent Jonathan Tan
2018-10-11  7:37                           ` [PATCH 0/2] Per-commit filter proof of concept Ævar Arnfjörð Bjarmason
2018-10-15 14:39                         ` [PATCH 0/4] Bloom filter experiment Derrick Stolee
2018-10-16  4:45                           ` Junio C Hamano
2018-10-16 11:13                             ` Derrick Stolee
2018-10-16 12:57                               ` Ævar Arnfjörð Bjarmason
2018-10-16 13:03                                 ` Derrick Stolee
2018-10-18  2:00                                 ` Junio C Hamano
2018-10-16 23:41                           ` Jonathan Tan
2018-10-08 23:02                     ` We should add a "git gc --auto" after "git clone" due to commit graph Junio C Hamano
2018-10-03 14:32     ` Duy Nguyen
2018-10-03 16:45 ` Duy Nguyen
2018-10-04 21:42 ` [RFC PATCH] " Ævar Arnfjörð Bjarmason
2018-10-05 12:05   ` Derrick Stolee
2018-10-05 13:05     ` Ævar Arnfjörð Bjarmason
2018-10-05 13:45       ` Derrick Stolee
2018-10-05 14:04         ` Ævar Arnfjörð Bjarmason
2018-10-05 19:21         ` Jeff King
2018-10-05 19:41           ` Derrick Stolee
2018-10-05 19:47             ` Jeff King
2018-10-05 20:00               ` Derrick Stolee
2018-10-05 20:02                 ` Jeff King
2018-10-05 20:01               ` Ævar Arnfjörð Bjarmason
2018-10-05 20:09                 ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87lg75e78g.fsf@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=pclouds@gmail.com \
    --cc=stolee@gmail.com \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).