git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
Search results ordered by [date|relevance]  view[summary|nested|Atom feed]
thread overview below | download mbox.gz: |
* Re: What's cooking in git.git (Mar 2018, #03; Wed, 14)
  2018-03-15  8:36  7% ` Ævar Arnfjörð Bjarmason
@ 2018-03-19 21:16  0%   ` Derrick Stolee
  0 siblings, 0 replies; 4+ results
From: Derrick Stolee @ 2018-03-19 21:16 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Junio C Hamano
  Cc: git, Duy Nguyen, Takuto Ikuta, git, Derrick Stolee, Ben Peart

On 3/15/2018 4:36 AM, Ævar Arnfjörð Bjarmason wrote:
> On Thu, Mar 15 2018, Junio C. Hamano jotted:
>
>> * nd/repack-keep-pack (2018-03-07) 6 commits
>>   - SQUASH???
>>   - pack-objects: display progress in get_object_details()
>>   - pack-objects: show some progress when counting kept objects
>>   - gc --auto: exclude base pack if not enough mem to "repack -ad"
>>   - repack: add --keep-pack option
>>   - t7700: have closing quote of a test at the beginning of line
>>
>>   "git gc" in a large repository takes a lot of time as it considers
>>   to repack all objects into one pack by default.  The command has
>>   been taught to pretend as if the largest existing packfile is
>>   marked with ".keep" so that it is left untouched while objects in
>>   other packs and loose ones are repacked.
>>
>>   Expecting a reroll.
>>   cf. <CACsJy8BW_EtxQvgL=YrCXCQY7cEWCQxgfkeH=Gd=X=uVYhPJcw@mail.gmail.com>
>>   Except for final finishing touches, this looked more-or-less ready
>>   for 'next'.
>>
>>
> As I noted in 87a7vdqegi.fsf@evledraar.gmail.com and
> 877eqhq7ha.fsf@evledraar.gmail.com (both at:
> https://public-inbox.org/git/?q=87a7vdqegi.fsf%40evledraar.gmail.com) I
> think we should change the too-specific behavior here to be more generic
> (and am happy to do the work pending feedback from Duy on what he thinks
> about it).
>
> I'm also interested to know from those at Microsoft (CC'd some) if the
> mechanism I've proposed is something closer to what they could
> eventually use to gc windows.git.

Sorry that I couldn't get to this message sooner, I was traveling. While 
I was gone, the others who you CC'd volunteered me as the best person to 
respond ;)

In the interest of full disclosure and hopefully starting an interesting 
discussion, I want to share as much detail as possible as well as a few 
future directions that can inform our actions now. Here are some rough 
ideas that we are thinking about in this space:

   I. Use the multi-pack index (MIDX) [1] to track "packfile state" so 
we can do GC/repack incrementally.
  II. Replace our prefetch packs model with partial clone [2].
III. Stop including all trees and focus the fetch down a cone of the 
working directory.
  IV. Provide a way to defer certain read-only commands to a remote when 
the local repo doesn't have sufficient data.

> I know that now it doesn't GC now, and they have some side-channel
> mechanism for pre-deploying large (daily?) packs to clients, if it's
> adjusted as I suggest gc could be told not to touch packs of that size,
> leaving only stray small packs from "git pull" and loose objects to GC.
>
> I may also have entirely misunderstood how it works, this is from brief
> in-person conversations at Git Merge.
>
> But as far as mainlining some of that eventually I think it would be
> good to get feedback on whether the mechanism I proposed would get in
> their way more or less than what Duy has, or be entirely irrelevant
> because they need something else.
>
> Thanks!

The GVFS cache servers pre-compute daily packfiles filled with every 
commit and tree introduced that day. When a client calls 'git fetch' the 
GVFS hook runs a "prefetch" command to get these daily packs from the 
cache servers and place them in an alternate we call the "shared object 
cache". GVFS also disables the "receive-pack" portion of the fetch. The 
MIDX is updated to cover the new packs.

Something we are going to enable soon is the addition of "hourly packs" 
where the cache servers keep a list of up to 24 hourly packs and on 
prefetch the client receives an extra pack that concatenates that day's 
pack up to that point. We are doing this because the refs they are 
fetching are far enough ahead of the daily snapshots, which triggers a 
decent amount of loose object downloads when those refs are checked out. 
Since the next day will create a new daily pack, we can delete that 
hourly pack (I don't think we do this currently). At the very least the 
MIDX prefers the duplicates in the new packfile.

We do not GC in GVFS-enabled repos. This doesn't destroy clients' disks 
so far because the prefetch packs do not contain blobs -- blobs are a 
large portion of the full repo size. As the repo grows, we are 
re-examining how to make GC and repack work in this environment. An 
important part of any solution will be to make it incremental: we cannot 
afford to create a second copy of the repo and we can't take the repo 
offline for a significant portion of time.

I'm not sure that we can use Duy's solution out-of-the-box, in 
particular because we haven't integrated our prefetch packs with the 
"promisor" concept from partial clone. Hence, GC will fail with missing 
objects. I think the core idea is good: perform GC incrementally and 
don't touch previously-GC'd packs. Since it is rare for an object to be 
reachable for a long time and then become unreachable, performing GC and 
then keeping the resulting pack forever is a good idea. Especially when 
paired with the MIDX so you care less about the number of packfiles.

I.

One of our thoughts is to use the MIDX to mark the packfiles with 
metadata. For instance, we could mark new packfiles as "raw" and after 
enough time we could run a GC on just one packfile (or a batch of 
packfiles) to remove unreachable objects and mark it as "clean". In our 
situation, we don't want to immediately GC the daily prefetch pack 
because a large portion of those objects will be reachable in a few days 
due to branch integrations.

II.

Our long-term plan is to remove these GVFS-specific fetch patterns and 
replace them with partial clone logic. We still see the cache servers as 
an important part of this process, but they are essentially read-only 
copies of the main server object database (no refs). We haven't done the 
work to see how expensive it will be to replace the precomputed prefetch 
packs with a partial clone negotiation. Then we will need to see how to 
make the shared object cache be transparent to the user.

An aside: it would be good to investigate how Git could provide 
replication like a cache server with minimal setup, including client 
configuration. The GVFS protocol has a "gvfs/config" endpoint that can 
provide a list of cache server URLs, including a default. (For the 
Windows repo, these URLs point to load balancers that pass-through to an 
array of machines. I imagine most users will only need one server per 
location.) Does Git already have a way to redirect the call to upload-pack?

III.

Partial clone has options for filtering trees by paths [3]. We want to 
take advantage of this functionality to reduce the number of objects 
stored on disk. In a perfect world, we would use the virtualization 
layer in GVFS to auto-detect the paths that are important to the user 
and create a sparse filter for them that way. Partial clone doesn't have 
a mechanism for that right now (we require a sparse-checkout 
specification to be present on the remote) but perhaps it could be part 
of a verb in protocol v2. We are working to collect sparse-checkout 
files from our users to see if it is possible to cluster their usage 
patterns into a reasonably small list of sparse-checkout specifications. 
Alternatively, how large would it be to dynamically send a list of paths 
based on the current virtual projection? In either case, we expect some 
extra trees by including some amount of sibling paths as we expect users 
to be interested in reading nearby files or inspecting history for those 
paths.

IV.

One big reason we started computing daily prefetch packs for GVFS is 
that certain important Git commands became too slow if all objects were 
obtained from the loose-object download. Simple commands like "git log" 
would take forever. With all commits and trees, we can run "git log -- 
path" at the same speed as a vanilla repo. "git diff" can still be slow 
because it needs blob contents, but hopefully the diff was for a single 
commit and there are not too many calls. "git grep" could be a disaster.

When we start restricting to sparse-checkout specifications, not all 
"git log -- deep/paths/in/particular" commands will have enough data 
locally to run successfully.

One approach is to have a way to pass-through to a remote. Certain 
commands could be requested (via a protocol v2 verb) to the server. The 
server can compute the answer and send a (paged) response faster than 
the client can download objects and then compute locally. This also 
prevents users from getting a bloated object database full of loose 
blobs they don't need.

At the very least, we will want to detect that we are missing many 
objects and ask the user something like "Did you really want to do that? 
It will cause 1000+ objects to be downloaded. [Y/n]"


I hope this combo of information and half-baked ideas is helpful. We are 
definitely approaching a lot of interesting scale limits with many 
different possible solutions.

Thanks,
-Stolee

[1] 
https://public-inbox.org/git/20180107181459.222909-1-dstolee@microsoft.com/T/#u
     [RFC PATCH 00/18] Multi-pack index (MIDX)

[2] 
https://public-inbox.org/git/20171214152404.35708-1-git@jeffhostetler.com/T/#u
     [PATCH v2] Partial clone design document

[3] 
https://github.com/git/git/blob/0afbf6caa5b16dcfa3074982e5b48e27d452dbbb/Documentation/rev-list-options.txt#L727-L733
     rev-list-options.txt Documentation for "--filter=sparse:*"

^ permalink raw reply	[relevance 0%]

* Re: What's cooking in git.git (Mar 2018, #03; Wed, 14)
  @ 2018-03-15  8:36  7% ` Ævar Arnfjörð Bjarmason
  2018-03-19 21:16  0%   ` Derrick Stolee
  0 siblings, 1 reply; 4+ results
From: Ævar Arnfjörð Bjarmason @ 2018-03-15  8:36 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Duy Nguyen, Takuto Ikuta, git, Derrick Stolee, Ben Peart


On Thu, Mar 15 2018, Junio C. Hamano jotted:

> * ti/fetch-everything-local-optim (2018-03-14) 1 commit
>  - fetch-pack.c: use oidset to check existence of loose object

Looks good to me, and great to have such an optimization in.

> * ab/pcre-v2 (2018-03-14) 3 commits
>
>  Will merge to 'next'.
>
>[...]
>
> * ab/perl-fixes (2018-03-05) 13 commits
>
>  Will merge to 'master'.

Given 2.17-rc1 next Tuesday according to your calendar, what's the
status of these two landing in 2.17?

I'd like to annoy packagers with all this perl/ stuff in just one
release instead of spreading out out over two, and without ab/perl-fixes
they'll need another hack to avoid installing our Error.pm.

The ab/pcre-v2 is less important, but given the minimal nature of it
would be very nice to have in 2.17 as well so we get off the
mostly-unmaintained v1 sooner than later.

> * nd/repack-keep-pack (2018-03-07) 6 commits
>  - SQUASH???
>  - pack-objects: display progress in get_object_details()
>  - pack-objects: show some progress when counting kept objects
>  - gc --auto: exclude base pack if not enough mem to "repack -ad"
>  - repack: add --keep-pack option
>  - t7700: have closing quote of a test at the beginning of line
>
>  "git gc" in a large repository takes a lot of time as it considers
>  to repack all objects into one pack by default.  The command has
>  been taught to pretend as if the largest existing packfile is
>  marked with ".keep" so that it is left untouched while objects in
>  other packs and loose ones are repacked.
>
>  Expecting a reroll.
>  cf. <CACsJy8BW_EtxQvgL=YrCXCQY7cEWCQxgfkeH=Gd=X=uVYhPJcw@mail.gmail.com>
>  Except for final finishing touches, this looked more-or-less ready
>  for 'next'.
>
>

As I noted in 87a7vdqegi.fsf@evledraar.gmail.com and
877eqhq7ha.fsf@evledraar.gmail.com (both at:
https://public-inbox.org/git/?q=87a7vdqegi.fsf%40evledraar.gmail.com) I
think we should change the too-specific behavior here to be more generic
(and am happy to do the work pending feedback from Duy on what he thinks
about it).

I'm also interested to know from those at Microsoft (CC'd some) if the
mechanism I've proposed is something closer to what they could
eventually use to gc windows.git.

I know that now it doesn't GC now, and they have some side-channel
mechanism for pre-deploying large (daily?) packs to clients, if it's
adjusted as I suggest gc could be told not to touch packs of that size,
leaving only stray small packs from "git pull" and loose objects to GC.

I may also have entirely misunderstood how it works, this is from brief
in-person conversations at Git Merge.

But as far as mainlining some of that eventually I think it would be
good to get feedback on whether the mechanism I proposed would get in
their way more or less than what Duy has, or be entirely irrelevant
because they need something else.

Thanks!

^ permalink raw reply	[relevance 7%]

* Re: [PATCH v2 3/5] gc --auto: exclude base pack if not enough mem to "repack -ad"
  @ 2018-03-12 22:01  5%               ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 4+ results
From: Ævar Arnfjörð Bjarmason @ 2018-03-12 22:01 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Duy Nguyen, Eric Wong, Git Mailing List, Jeff King


On Mon, Mar 12 2018, Junio C. Hamano jotted:

> On Mon, Mar 12, 2018 at 11:56 AM, Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>> As someone who expects to use this (although hopefully in slightly
>> modified form), it's very useful if we can keep the useful semantics in
>> gc.* config values without needing some external job finding repos and
>> creating *.keep files to get custom behavior.
>>
>> E.g. I have the use-case of wanting to set this on servers that I know
>> are going to be used for cloning some big repos in user's ~ directory
>> manually, so if I can set something sensible in /etc/gitconfig that's
>> great, but it sucks a lot more to need to write some cronjob that goes
>> hunting for repos in those ~ directories and tweaks *.keep files.
>
> Yeah, but that is exactly what I suggested, no? That is, if you don't do any
> specific marking to describe _which_ ones need to be kept, this new thing
> would kick in and pick the largest one and repack all others. If you choose
> to want more control, on the other hand, you can mark those packs you
> would want to keep, and this mechanism will not kick in to countermand
> your explicit settings done via those .keep files.

Yes, this configurable mechanism as it stands only needs /etc/gitconfig.

What I was pointing out in this mail is that we really should get the
advanced use-cases right as well (see my
87a7vdqegi.fsf@evledraar.gmail.com for details) via the config, because
it's a pain to cross the chasm between setting config centrally on the
one hand, and needing to track down .git's in arbitrary locations on the
FS (you may not have cloned them yourself) to set *.keep flags.

Doubly so if the machines in questions are just the laptops of some
developers. It's relatively easy to tell them "we work with git repos,
run this git config commands", not so easy to have them install & keep
up-to-date some arbitrary cronjob that needs to hunt down their repos
and set *.keep flags.

^ permalink raw reply	[relevance 5%]

* Re: [PATCH v2 3/5] gc --auto: exclude base pack if not enough mem to "repack -ad"
    @ 2018-03-12 19:30  3%     ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 4+ results
From: Ævar Arnfjörð Bjarmason @ 2018-03-12 19:30 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: e, git, peff, Junio C Hamano


On Tue, Mar 06 2018, Nguyễn Thái Ngọc Duy jotted:

> pack-objects could be a big memory hog especially on large repos,
> everybody knows that. The suggestion to stick a .keep file on the
> giant base pack to avoid this problem is also known for a long time.
>
> Let's do the suggestion automatically instead of waiting for people to
> come to Git mailing list and get the advice. When a certain condition
> is met, "gc --auto" tells "git repack" to keep the base pack around.
> The end result would be two packs instead of one.
>
> On linux-2.6.git, valgrind massif reports 1.6GB heap in "pack all"
> case, and 535MB [1] in "pack all except the base pack" case. We save
> roughly 1GB memory by excluding the base pack.
>
> gc --auto decides to do this based on an estimation of pack-objects
> memory usage, which is quite accurate at least for the heap part, and
> whether that fits in half of system memory (the assumption here is for
> desktop environment where there are many other applications running).
>
> Since the estimation may be inaccurate and that 1/2 threshold is
> really arbitrary, give the user a finer control over this mechanism:
> if the largest pack is larger than gc.bigBasePackThreshold, it's kept.
>
> PS. A big chunk of the remaining 535MB is the result of pack-objects
> running rev-list internally. This will be dealt with when we could run
> rev-list externally. Right now we can't because pack-objects internal
> rev-list does more regarding unreachable objects, which cannot be done
> by "git rev-list".
>
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
>  Documentation/config.txt |   7 ++
>  Documentation/git-gc.txt |  13 ++++
>  builtin/gc.c             | 153 +++++++++++++++++++++++++++++++++++++--
>  builtin/pack-objects.c   |   2 +-
>  config.mak.uname         |   1 +
>  git-compat-util.h        |   4 +
>  pack-objects.h           |   2 +
>  t/t6500-gc.sh            |  29 ++++++++
>  8 files changed, 204 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index f57e9cf10c..120cf6bac9 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -1549,6 +1549,13 @@ gc.autoDetach::
>  	Make `git gc --auto` return immediately and run in background
>  	if the system supports it. Default is true.
>
> +gc.bigBasePackThreshold::
> +	Make `git gc --auto` only enable `--keep-base-pack` when the
> +	base pack's size is larger than this limit (in bytes).
> +	Defaults to zero, which disables this check and lets
> +	`git gc --auto` determine when to enable `--keep-base-pack`
> +	based on memory usage.
> +

I'm really keen to use this (and would be happy to apply a patch on
top), but want to get your thoughts first, see also my just-sent
87bmftqg1n.fsf@evledraar.gmail.com
(https://public-inbox.org/git/87bmftqg1n.fsf@evledraar.gmail.com/).

The thing I'd like to change is that the underlying --keep-pack= takes a
list of paths (good!), but then I think this patch needlessly
complicates things by talking about "base packs" and having the
implementation limitation that we only ever pass one --keep-pack down to
pack-objects (bad!).

Why don't we instead just have a gc.* variable that you can set to some
size of pack that we'll always implicitly *.keep? That way I can
e.g. clone a 5GB pack and set the limit to 2GB, then keep adding new
content per the rules of gc.autoPackLimit, until I finally create a
larger than 2GB pack, at that point I'll have 5GB, 2GB, and some smaller
packs and loose objects.

We already have pack.packSizeLimit, perhaps we could call this
e.g. gc.keepPacksSize=2GB?

Or is there a use-case for still having the concept of a "base" pack? Is
it magic in some way? Maybe I'm missing something but I don't see why,
we can just stop thinking about whether some one pack is larger than X,
and consider all packs larger than X specially.

But if we do maybe an extra gc.keepBasePack=true?

Finally I wonder if there should be something equivalent to
gc.autoPackLimit for this. I.e. with my proposed semantics above it's
possible that we end up growing forever, i.e. I could have 1000 2GB
packs and then 50 very small packs per gc.autoPackLimit.

Maybe we need a gc.keepPackLimit=100 to deal with that, then e.g. if
gc.keepPacksSize=2GB is set and we have 101 >= 2GB packs, we'd pick the
two smallest one and not issue a --keep-pack for those, although then
maybe our memory use would spike past the limit.

I don't know, maybe we can leave that for later, but I'm quite keen to
turn the top-level config variable into something that just considers
size instead of "base" if possible, and it seems we're >95% of the way
to that already with this patch.

Finally, I don't like the way the current implementation conflates a
"size" variable with auto detecting the size from memory, leaving no way
to fallback to the auto-detection if you set it manually.

I think we should split out the auto-memory behavior into another
variable, and also make the currently hardcoded 50% of memory
configurable.

That way you could e.g. say you'd always like to keep 2GB packs, but if
you happen to have ended up with a 1GB pack and it's time to repack, and
you only have 500MB free memory on that system, it would keep the 1GB
one until such time as we have more memory.

Actually maybe that should be a "if we're that low on memory, forget
about GC for now" config, but urgh, there's a lot of potential
complexity to be handled here...

^ permalink raw reply	[relevance 3%]

Results 1-4 of 4 | reverse | options above
-- pct% links below jump to the message on this page, permalinks otherwise --
2018-03-01  9:20     [PATCH/RFC 0/1] Avoid expensive 'repack -ad' in gc --auto Nguyễn Thái Ngọc Duy
2018-03-06 10:41     ` [PATCH v2 0/5] " Nguyễn Thái Ngọc Duy
2018-03-06 10:41       ` [PATCH v2 3/5] gc --auto: exclude base pack if not enough mem to "repack -ad" Nguyễn Thái Ngọc Duy
2018-03-06 19:19         ` Junio C Hamano
2018-03-07 10:48           ` Duy Nguyen
2018-03-07 18:38             ` Junio C Hamano
2018-03-12 18:56               ` Ævar Arnfjörð Bjarmason
2018-03-12 21:16                 ` Junio C Hamano
2018-03-12 22:01  5%               ` Ævar Arnfjörð Bjarmason
2018-03-12 19:30  3%     ` Ævar Arnfjörð Bjarmason
2018-03-15  1:34     What's cooking in git.git (Mar 2018, #03; Wed, 14) Junio C Hamano
2018-03-15  8:36  7% ` Ævar Arnfjörð Bjarmason
2018-03-19 21:16  0%   ` Derrick Stolee

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).