From: Dan Holmsand <holmsand@gmail.com>
To: Daniel Barkalow <barkalow@iabervon.org>
Cc: git@vger.kernel.org
Subject: Re: [RFC] Design for http-pull on repo with packs
Date: Sun, 10 Jul 2005 23:39:11 +0200
Message-ID: <42D1957F.1050609@gmail.com>
In-Reply-To: <Pine.LNX.4.21.0507101557510.30848-100000@iabervon.org>

Daniel Barkalow wrote:
> On Sun, 10 Jul 2005, Dan Holmsand wrote:
>>Daniel Barkalow wrote:
>>> If an individual file is not available, figure out what packs are
>>> available:
>>>
>>> Get the list of pack files the repository has
>>> (currently, I just use "e3117bbaf6a59cb53c3f6f0d9b17b9433f0e4135")
>>> For any packs we don't have, get the index files.
>>
>>This part might be slightly expensive, for large repositories. If one
>>assumes that packs are named as by git-repack-script, however, one might
>>cache indexes we've already seen (again, see below). Or, if you go for
>>the mandatory "pack-index-file", require that it has a reliable order,
>>so that you can get the last added index first.
>
>
> Nothing bad happens if you have index files for pack files you don't have,
> as it turns out; the library ignores them. So we can keep the index files
> around so we can quickly check if they have the objects we want. That way,
> we don't have to worry about skipping something now (because it's not
> needed) and then ignoring it when the branch gets merged in.
>
> So what I actually do is make a list of the pack files that aren't already
> downloaded that are available from the server, and download the index
> files for any where the index file isn't downloaded, either.

Aah. In other words, you do the caching thing as well. It seems a little
ugly, though, to store index files that have no matching pack alongside
the real packs. It might be preferable to introduce something like
$GIT_DIR/index-cache, so that it can be easily cleaned (and doesn't
follow us around forever when cloning by hardlinking the entire object
directory).

You might end up with quite a large number of index files after a while,
though, if you pull from several repositories that are regularly repacked.

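To make the separate-cache idea concrete, here is a rough sketch in Python (chosen only for brevity). The directory layout and the function name are hypothetical, not anything that exists in git; it merely shows how the decision "which index files still need downloading" could look if cached indexes lived apart from the real packs:

```python
from pathlib import Path


def indexes_to_fetch(server_packs, pack_dir, cache_dir):
    """Return the index file names that still need to be downloaded.

    server_packs: pack base names advertised by the server
                  (e.g. "pack-<sha1>", without extension).
    pack_dir:     where complete local packs live (assumed layout).
    cache_dir:    hypothetical cache for index files without a pack.
    """
    wanted = []
    for name in server_packs:
        if (Path(pack_dir) / f"{name}.pack").exists():
            continue  # we already have the whole pack
        if (Path(cache_dir) / f"{name}.idx").exists():
            continue  # index was fetched on an earlier pull
        wanted.append(f"{name}.idx")
    return wanted
```

With this split, cleaning up is just a matter of removing the cache directory, and a hardlinked clone of the object directory does not inherit the cached indexes.
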
>>> Keep a list of the struct packed_gits for the packs the server has
>>> (these are not used as places to look for objects)
>>>
>>> Each time we need an object, check the list for it. If it is in there,
>>> download the corresponding pack and report success.
>>
>>Here you will need some strategy to deal with packs that overlap with
>>what we've already got. Basically, small and overlapping packs should be
>>unpacked, big and non-overlapping ones saved as is (since
>>git-unpack-objects is painfully slow and memory-hungry...).
>
>
> I don't think there's an issue to having overlapping packs, either with
> each other or with separate objects. If the user wants, stuff can be
> repacked outside of the pull operation (note, though, that the index files
> should be truncated rather than removed, so that the program doesn't fetch
> them again next time some object can't be found easily).

Well, the only issue is obviously wasted space. If you fetch a lot of
branches from independently packed repos, that might add up to a lot of
waste, though.

About truncating index files: this seems a bit ugly. You get a file that
doesn't contain what it says it contains, which may cause trouble if, for
example, git-prune-script is used.

You might be better off with a simple list of the index files whose
objects we know we already have in full (and make sure that
git-prune-script deletes this list, since pruning may break that
guarantee).

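A minimal sketch of such a list, again in Python and with hypothetical names throughout (the file location and these helpers are assumptions for illustration, not existing git behaviour):

```python
from pathlib import Path


def load_complete(list_path):
    """Read the set of index names whose objects we fully have."""
    path = Path(list_path)
    if not path.exists():
        return set()
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def mark_complete(list_path, index_name):
    """Record that every object listed in index_name is present locally."""
    names = load_complete(list_path)
    names.add(index_name)
    Path(list_path).write_text("".join(f"{n}\n" for n in sorted(names)))


def invalidate(list_path):
    """What a prune step would call: the guarantee may no longer hold."""
    path = Path(list_path)
    if path.exists():
        path.unlink()
```

The pull code would skip any index named in the list, and the index files themselves stay intact, so nothing on disk claims to contain objects it does not.
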
>>One could also optimize the pack-download bit, by figuring out the last
>>object in the pack that we need (easy enough to do from the index file),
>> and just get the part of the pack file leading up to that object. That
>>could be a huge win for independently packed repositories (I don't do
>>that in my code below, though).
>
>
> That's only possible if you can figure out what you want to have before
> you get it. My code is walking the reachability graph on the client; it
> can only figure out what other objects it needs after it's mapped the pack
> file.

No, but we can find out which objects we *don't* want (i.e. the ones we
already have). And that may be a lot, e.g. if a repository is fully
repacked, or if we track branches on several similar but independently
packed repositories. And as far as I understand git-pack-objects, it
tries to put recent objects at the front.

I don't have any numbers to back this up, though. Some testing may be
needed, but since the population of packed public repositories is 1,
this is tricky...

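For illustration, the "stop after the last object we lack" calculation could be sketched like this in Python. It assumes the index has already been parsed into (object name, offset) pairs and that the pack ends in a 20-byte SHA-1 trailer; the function name and parameters are made up for the example:

```python
def bytes_needed(index_entries, have, pack_size, trailer_len=20):
    """Return how many leading bytes of the pack cover every missing object.

    index_entries: iterable of (object_name, offset_in_pack) pairs.
    have:          set of object names already present locally.
    pack_size:     total size of the pack file in bytes.
    Returns 0 when nothing in the pack is missing.
    """
    entries = sorted(index_entries, key=lambda entry: entry[1])
    end = 0
    for i, (name, _offset) in enumerate(entries):
        if name in have:
            continue
        # An object ends where the next one starts; the last object
        # ends where the trailing checksum begins.
        if i + 1 < len(entries):
            object_end = entries[i + 1][1]
        else:
            object_end = pack_size - trailer_len
        end = max(end, object_end)
    return end


entries = [("a", 12), ("b", 100), ("c", 250)]
print(bytes_needed(entries, have={"c"}, pack_size=400))  # 250
```

The result could then be used in an HTTP range request (for example `Range: bytes=0-249` here), provided the server honours ranges. Note that a truncated pack is not a valid pack file as it stands, so the client would still have to extract the objects or rebuild a proper pack from what it received.
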
> I might use that method for listing the available packs, although I'd sort
> of like to encourage a clean solution first.

Encouraging cleanliness is obviously a good thing :-)

/dan