git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: John Cai <johncai86@gmail.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: John Cai via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org,
	Christian Couder <christian.couder@gmail.com>
Subject: Re: [PATCH 2/2] repack: add --filter=<filter-spec> option
Date: Sun, 30 Jan 2022 08:02:07 -0500	[thread overview]
Message-ID: <8D4655AA-8D2A-4D4C-A7CD-B79A8A9E66D7@gmail.com> (raw)
In-Reply-To: <A4BAD509-FA1F-49C3-87AF-CF4B73C559F1@gmail.com>

Sorry forgot to include the link to Christian's demo. included below

On 29 Jan 2022, at 14:14, John Cai wrote:

> Hi Stolee,
>
> Thanks for taking the time to review this patch! I added some points of clarification
> down below.
>
> On 27 Jan 2022, at 10:03, Derrick Stolee wrote:
>
>> On 1/26/2022 8:49 PM, John Cai via GitGitGadget wrote:
>>> From: John Cai <johncai86@gmail.com>
>>>
>>> Currently, repack does not work with partial clones. When repack is run
>>> on a partially cloned repository, it grabs all missing objects from
>>> promisor remotes. This also means that when gc is run for repository
>>> maintenance on a partially cloned repository, it will end up getting
>>> missing objects, which is not what we want.
>>
>> This shouldn't be what is happening. Do you have a demonstration of
>> this happening? repack_promisor_objects() should be avoiding following
>> links outside of promisor packs so we can safely 'git gc' in a partial
>> clone without downloading all reachable blobs.
>
> You're right, sorry I was mistaken about this detail of how partial clones work.
>>
>>> In order to make repack work with partial clone, teach repack a new
>>> option --filter, which takes a <filter-spec> argument. repack will skip
>>> any objects that are matched by <filter-spec> similar to how the clone
>>> command will skip fetching certain objects.
>>
>> This is a bit misleading, since 'git clone' doesn't "skip fetching",
>> but instead requests a filter and the server can choose to write a
>> pack-file using that filter. I'm not sure if it's worth how pedantic
>> I'm being here.
>
> Thanks for the more precise description of the mechanics of partial clone.
> I'll improve the wording in the next version of this patch series.
>
>>
>> The thing that I find confusing here is that you are adding an option
>> that could be run on a _full_ repository. If I have a set of packs
>> and none of them are promisor (I have every reachable object), then
>> what is the end result after 'git repack -adf --filter=blob:none'?
>> Those existing pack-files shouldn't be deleted because they have
>> objects that are not in the newly-created pack-file.
>>
>> I'd like to see some additional clarity on this before continuing
>> to review this series.
>
> Apologies for the lack of clarity. Indeed, I can see why this is the most important
> detail of this patch to provide enough context on, as it involves deleting
> objects from a full repository as you said.
>
> To back up a little, the goal is to be able to offload large
> blobs to a separate http server. Christian Couder has a demo [1] that shows
> this in detail.
>
> If we had the following:
> A. an http server to use as a generalized object store
> B. a server update hook that uploads large blobs to 1.
> C. a git server
> D. a regular job that runs `git repack --filter` to remove large
> blobs from C.
>
> Clients would need to configure both C) and A) as promisor remotes to
> be able to get everything. When they push new large blobs, they can
> still push them to C), as B) will upload them to A), and D) will
> regularly remove those large blobs from C).
>
> This way with a little bit of client and server configuration, we can have
> a native way to support offloading large files without git LFS.
> It would be more flexible as you can easily tweak which blobs are considered large
> files by tweaking B) and D).
>

[1] https://gitlab.com/chriscool/partial-clone-demo/-/blob/master/http-promisor/server_demo.txt

>>
>>> The final goal of this feature, is to be able to store objects on a
>>> server other than the regular git server itself.
>>>
>>> There are several scripts added so we can test the process of using a
>>> remote helper to upload blobs to an http server:
>>>
>>> - t/lib-httpd/list.sh lists blobs uploaded to the http server.
>>> - t/lib-httpd/upload.sh uploads blobs to the http server.
>>> - t/t0410/git-remote-testhttpgit a remote helper that can access blobs
>>>   onto from an http server. Copied over from t/t5801/git-remote-testhttpgit
>>>   and modified to upload blobs to an http server.
>>> - t/t0410/lib-http-promisor.sh convenience functions for uploading
>>>   blobs
>>
>> I think these changes to the tests should be extracted to a new
>> patch where this can be discussed in more detail. I didn't look
>> too closely at them because I want to focus on whether this
>> --filter option is a good direction for 'git repack'.
>>
>>>  		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
>>> @@ -819,6 +824,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>>>  		if (line.len != the_hash_algo->hexsz)
>>>  			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
>>>  		string_list_append(&names, line.buf);
>>> +		if (po_args.filter) {
>>> +			char *promisor_name = mkpathdup("%s-%s.promisor", packtmp,
>>> +							line.buf);
>>> +			write_promisor_file(promisor_name, NULL, 0);
>>
>> This code is duplicated in repack_promisor_objects(), so it would be
>> good to extract that logic into a helper method called by both places.
>
> Thanks for pointing this out. I'll incorporate this into the next version.
>>
>>> +		}
>>>  	}
>>>  	fclose(out);
>>>  	ret = finish_command(&cmd);
>>
>>> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
>>> index e489869dd94..78cc1858cb6 100755
>>> --- a/t/t7700-repack.sh
>>> +++ b/t/t7700-repack.sh
>>> @@ -237,6 +237,26 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
>>>  	test_must_be_empty actual
>>>  '
>>>
>>> +test_expect_success 'repack with filter does not fetch from remote' '
>>> +	rm -rf server client &&
>>> +	test_create_repo server &&
>>> +	git -C server config uploadpack.allowFilter true &&
>>> +	git -C server config uploadpack.allowAnySHA1InWant true &&
>>> +	echo content1 >server/file1 &&
>>> +	git -C server add file1 &&
>>> +	git -C server commit -m initial_commit &&
>>> +	expected="?$(git -C server rev-parse :file1)" &&
>>> +	git clone --bare --no-local server client &&
>>
>> You could use "file:://$(pwd)/server" here instead of "server".
>
> good point, thanks
>
>>
>>> +	git -C client config remote.origin.promisor true &&
>>> +	git -C client -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
>> This isn't testing what you want it to test, because your initial
>> clone doesn't use --filter=blob:none, so you already have all of
>> the objects in the client. You would never trigger a need for a
>> fetch from the remote.
>
> right, so this test is actually testing that repack --filter would shed objects to show
> that it can be used as D) as a regular cleanup job for git servers that utilize another
> http server to host large blobs.
>
>>
>>> +	git -C client rev-list --objects --all --missing=print >objects &&
>>> +	grep "$expected" objects &&
>>> +	git -C client repack -a -d &&
>>> +	expected="$(git -C server rev-parse :file1)" &&
>>
>> This is signalling to me that you are looking for a remote fetch
>> now that you are repacking everything, and that can only happen
>> if you deleted objects from the client during your first repack.
>> That seems incorrect.
>>
>>> +	git -C client rev-list --objects --all --missing=print >objects &&
>>> +	grep "$expected" objects
>>> +'
>>
>> Based on my current understanding, this patch seems unnecessary (repacks
>> should already be doing the right thing when in the presence of a partial
>> clone) and incorrect (we should not delete existing reachable objects
>> when repacking with a filter).
>>
>> I look forward to hearing more about your intended use of this feature so
>> we can land on a better way to solve the problems you are having.
>
> Thanks for the callouts on the big picture of this proposed change. Looking
> forward to getting your thoughts on this!
>>
>> Thanks,
>> -Stolee

  parent reply	other threads:[~2022-01-30 13:02 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-27  1:49 [PATCH 0/2] repack: add --filter= John Cai via GitGitGadget
2022-01-27  1:49 ` [PATCH 1/2] pack-objects: allow --filter without --stdout John Cai via GitGitGadget
2022-01-27  1:49 ` [PATCH 2/2] repack: add --filter=<filter-spec> option John Cai via GitGitGadget
2022-01-27 15:03   ` Derrick Stolee
2022-01-29 19:14     ` John Cai
2022-01-30  8:16       ` Christian Couder
2022-01-30 13:02       ` John Cai [this message]
2022-02-09  2:10 ` [PATCH v2 0/4] [RFC] repack: add --filter= John Cai via GitGitGadget
2022-02-09  2:10   ` [PATCH v2 1/4] pack-objects: allow --filter without --stdout John Cai via GitGitGadget
2022-02-09  2:10   ` [PATCH v2 2/4] repack: add --filter=<filter-spec> option John Cai via GitGitGadget
2022-02-09  2:10   ` [PATCH v2 3/4] upload-pack: allow missing promisor objects John Cai via GitGitGadget
2022-02-09  2:10   ` [PATCH v2 4/4] tests for repack --filter mode John Cai via GitGitGadget
2022-02-17 16:14     ` Robert Coup
2022-02-17 20:36       ` John Cai
2022-02-09  2:27   ` [PATCH v2 0/4] [RFC] repack: add --filter= John Cai
2022-02-16 15:39   ` Robert Coup
2022-02-16 21:07     ` John Cai
2022-02-21  3:11       ` Taylor Blau
2022-02-21 15:38         ` Robert Coup
2022-02-21 17:57           ` Taylor Blau
2022-02-21 21:10         ` Christian Couder
2022-02-21 21:42           ` Taylor Blau
2022-02-22 17:11             ` Christian Couder
2022-02-22 17:33               ` Taylor Blau
2022-02-23 15:40               ` Robert Coup
2022-02-23 19:31               ` Junio C Hamano
2022-02-26 16:01                 ` John Cai
2022-02-26 17:29                   ` Taylor Blau
2022-02-26 20:19                     ` John Cai
2022-02-26 20:30                       ` Taylor Blau
2022-02-26 21:05                         ` John Cai
2022-02-26 21:44                           ` Taylor Blau
2022-02-22 18:52             ` John Cai
2022-02-22 19:35               ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8D4655AA-8D2A-4D4C-A7CD-B79A8A9E66D7@gmail.com \
    --to=johncai86@gmail.com \
    --cc=christian.couder@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).