git@vger.kernel.org mailing list mirror (one of many)
From: Dan Holmsand <holmsand@gmail.com>
To: Daniel Barkalow <barkalow@iabervon.org>
Cc: git@vger.kernel.org
Subject: Re: [RFC] Design for http-pull on repo with packs
Date: Sun, 10 Jul 2005 23:39:11 +0200
Message-ID: <42D1957F.1050609@gmail.com>
In-Reply-To: <Pine.LNX.4.21.0507101557510.30848-100000@iabervon.org>

Daniel Barkalow wrote:
> On Sun, 10 Jul 2005, Dan Holmsand wrote:
>>Daniel Barkalow wrote:
>>> If an individual file is not available, figure out what packs are
>>>  available:
>>>
>>>   Get the list of pack files the repository has
>>>    (currently, I just use "e3117bbaf6a59cb53c3f6f0d9b17b9433f0e4135")
>>>   For any packs we don't have, get the index files.
>>
>>This part might be slightly expensive, for large repositories. If one 
>>assumes that packs are named as by git-repack-script, however, one might 
>>cache indexes we've already seen (again, see below). Or, if you go for 
>>the mandatory "pack-index-file", require that it has a reliable order, 
>>so that you can get the last added index first.
> 
> 
> Nothing bad happens if you have index files for pack files you don't have,
> as it turns out; the library ignores them. So we can keep the index files
> around so we can quickly check if they have the objects we want. That way,
> we don't have to worry about skipping something now (because it's not
> needed) and then ignoring it when the branch gets merged in.
> 
> So what I actually do is make a list of the pack files that aren't already
> downloaded that are available from the server, and download the index
> files for any where the index file isn't downloaded, either.

Aah. In other words, you do the caching thing as well. It seems a little 
ugly, though, to store index files that have no pack alongside the real 
packs. It might be preferable to introduce something like 
$GIT_DIR/index-cache, so that it can be easily cleaned (and doesn't 
follow us around forever when cloning by hardlinking the entire object 
directory).

You might end up with quite a large number of index files after a while, 
though, if you pull from several repositories that are regularly repacked.
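
A minimal sketch of the index-cache idea, only to make the bookkeeping 
concrete. The function name, the directory arguments and the assumption 
that packs are named "<name>.pack" with a matching "<name>.idx" are all 
hypothetical; this is not how http-pull is implemented.

```python
import os


def indexes_to_fetch(remote_packs, pack_dir, index_cache_dir):
    """Return the index file names we still need to download.

    remote_packs:    pack file names advertised by the server (assumed
                     to end in ".pack"; hypothetical naming).
    pack_dir:        where complete local packs and their indexes live.
    index_cache_dir: hypothetical cache of indexes without packs,
                     e.g. $GIT_DIR/index-cache.
    """
    have = set()
    for directory in (pack_dir, index_cache_dir):
        # A missing directory simply contributes nothing.
        if os.path.isdir(directory):
            have.update(n for n in os.listdir(directory) if n.endswith(".idx"))

    wanted = []
    for pack in remote_packs:
        idx = pack[: -len(".pack")] + ".idx"
        if idx not in have:
            wanted.append(idx)
    return wanted
```

Keeping the cache in its own directory means cleaning it up is just a 
matter of removing that directory.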

>>>   Keep a list of the struct packed_gits for the packs the server has
>>>    (these are not used as places to look for objects)
>>>
>>> Each time we need an object, check the list for it. If it is in there,
>>>  download the corresponding pack and report success.
>>
>>Here you will need some strategy to deal with packs that overlap with 
>>what we've already got. Basically, small and overlapping packs should be 
>>unpacked, big and non-overlapping ones saved as is (since 
>>git-unpack-objects is painfully slow and memory-hungry...).
> 
> 
> I don't think there's an issue to having overlapping packs, either with
> each other or with separate objects. If the user wants, stuff can be
> repacked outside of the pull operation (note, though, that the index files
> should be truncated rather than removed, so that the program doesn't fetch
> them again next time some object can't be found easily).

Well, the only issue is obviously wasted space. If you fetch a lot of 
branches from independently packed repos, though, that might add up to 
a lot of waste.
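
To make the overlap question concrete, here is a rough sketch of one 
possible policy. The thresholds, the function name and the three 
outcomes are assumptions picked for illustration, and treating "small 
or mostly overlapping" as the unpack case is only one reading of the 
strategy quoted above.

```python
def plan_for_pack(pack_objects, local_objects, pack_size,
                  small_limit=1 << 20, overlap_limit=0.5):
    """Decide what to do with a remote pack: "skip", "unpack" or "keep".

    pack_objects:  object ids listed in the pack's index.
    local_objects: object ids we already have.
    pack_size:     size of the pack file in bytes.
    small_limit and overlap_limit are hypothetical tuning knobs.
    """
    pack_objects = set(pack_objects)
    if not pack_objects:
        return "skip"

    # Fraction of the pack that duplicates what we already store.
    overlap = len(pack_objects & set(local_objects)) / len(pack_objects)
    if overlap == 1.0:
        return "skip"      # nothing new in this pack
    if pack_size <= small_limit or overlap >= overlap_limit:
        return "unpack"    # cheap to unpack, or mostly duplicate data
    return "keep"          # big and mostly new: store the pack as is
```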

About truncating index files: this seems a bit ugly. You get a file that 
doesn't contain what it says it contains, which may cause trouble if, 
for example, git-prune-script is used.

You might be better off with a simple list of the index files whose 
objects we know we already have in full (and make sure that 
git-prune-script deletes this list, since pruning may invalidate it).
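
Such a list could be as simple as a text file with one index name per 
line. A minimal sketch, where the file format and the three helper names 
are hypothetical and `forget_all` stands in for whatever a prune step 
would do:

```python
import os


def load_complete(path):
    """Read the set of index names we believe are fully covered locally."""
    try:
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        return set()


def mark_complete(path, index_name):
    """Record that we hold every object listed in index_name."""
    names = load_complete(path)
    names.add(index_name)
    with open(path, "w") as f:
        for name in sorted(names):
            f.write(name + "\n")


def forget_all(path):
    """Drop the list, e.g. after pruning, when its claims may be stale."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass
```

Unlike a truncated index file, nothing here pretends to be something it 
is not, and deleting the list only costs a few extra index downloads.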

>>One could also optimize the pack-download bit, by figuring out the last 
>>object in the pack that we need (easy enough to do from the index file), 
>>and just get the part of the pack file leading up to that object. That 
>>could be a huge win for independently packed repositories (I don't do 
>>that in my code below, though).
> 
> 
> That's only possible if you can figure out what you want to have before
> you get it. My code is walking the reachability graph on the client; it
> can only figure out what other objects it needs after it's mapped the pack
> file.

No, but we can find out which objects we *don't* want (i.e. the ones we 
have). And that may be a lot, e.g. if a repository is fully repacked, or 
if we track branches on several similar but independently packed 
repositories. And as far as I understand git-pack-objects, it tries to 
put recent objects at the front.

I don't have any numbers to back this up with, though. Some testing may 
be needed, but since the population of packed public repositories is 1, 
this is tricky...
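
The partial-download idea can be sketched as follows, assuming we have 
already read (offset, object id) pairs out of the index file. The 
function name and the input shape are hypothetical. The sketch also 
ignores deltas: if a needed object is stored as a delta whose base sits 
beyond the cut-off and we don't have that base, more of the pack would 
have to be fetched.

```python
def prefix_length(entries, have, pack_size, trailer=20):
    """Return how many leading bytes of a pack cover every object we lack.

    entries:   iterable of (offset, object_id) pairs from the pack index.
    have:      set of object ids already present locally.
    pack_size: total size of the pack file in bytes.
    trailer:   size of the SHA-1 checksum at the end of the pack.
    """
    ordered = sorted(entries)  # order by position in the pack

    last_needed = None
    for i, (_offset, oid) in enumerate(ordered):
        if oid not in have:
            last_needed = i
    if last_needed is None:
        return 0  # we already have everything in this pack

    if last_needed + 1 < len(ordered):
        # The needed object ends where the next object starts.
        return ordered[last_needed + 1][0]
    # The last object in the pack runs up to the trailing checksum.
    return pack_size - trailer
```

With a result of n > 0, an HTTP request could ask for "Range: bytes=0-(n-1)". 
The truncated file has no valid trailing checksum, so it would have to 
be unpacked or rewritten rather than stored as a pack.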

> I might use that method for listing the available packs, although I'd sort
> of like to encourage a clean solution first.

Encouraging cleanliness is obviously a good thing :-)

/dan

