git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git@vger.kernel.org, markbt@efaref.net, git@jeffhostetler.com,
	kevin.david@microsoft.com
Subject: Re: Proposal for missing blob support in Git repos
Date: Sun, 30 Apr 2017 20:57:18 -0700	[thread overview]
Message-ID: <xmqqinllgrfl.fsf@gitster.mtv.corp.google.com> (raw)
In-Reply-To: <20170426221346.25337-1-jonathantanmy@google.com> (Jonathan Tan's message of "Wed, 26 Apr 2017 15:13:46 -0700")

Jonathan Tan <jonathantanmy@google.com> writes:

> In order to determine the code changes in sha1_file.c necessary, I
> investigated the following:
>  (1) functions in sha1_file that take in a hash, without the user
>      regarding how the object is stored (loose or packed)
>  (2) functions in sha1_file that operate on packed objects (because I
>      need to check callers that know about the loose/packed distinction
>      and operate on both differently, and ensure that they can handle
>      the concept of objects that are neither loose nor packed)
>
> For (1), I looked through all non-static functions in sha1_file.c that
> take in an unsigned char * parameter. The ones that are relevant, and my
> modifications to them to resolve this problem, are:
>  - sha1_object_info_extended (fixed in this commit)
>  - sha1_object_info (auto-fixed by sha1_object_info_extended)
>  - read_sha1_file_extended (fixed by fixing read_object)
>  - read_object_with_reference (auto-fixed by read_sha1_file_extended)
>  - force_object_loose (only called from builtin/pack-objects.c, which
>    already knows that at least one pack contains this object)
>  - has_sha1_file_with_flags (fixed in this commit)
>  - assert_sha1_type (auto-fixed by sha1_object_info)
>
> As described in the list above, several changes have been included in
> this commit to fix the necessary functions.
>
> For (2), I looked through the same functions as in (1) and also
> for_each_packed_object. The ones that are relevant are:
>  - parse_pack_index
>    - http - indirectly from http_get_info_packs
>  - find_pack_entry_one
>    - this searches a single pack that is provided as an argument; the
>      caller already knows (through other means) that the sought object
>      is in a specific pack
>  - find_sha1_pack
>    - fast-import - appears to be an optimization to not store a
>      file if it is already in a pack
>    - http-walker - to search through a struct alt_base
>    - http-push - to search through remote packs
>  - has_sha1_pack
>    - builtin/fsck - fixed in this commit
>    - builtin/count-objects - informational purposes only (check if loose
>      object is also packed)
>    - builtin/prune-packed - check if object to be pruned is packed (if
>      not, don't prune it)
>    - revision - used to exclude packed objects if requested by user
>    - diff - just for optimization
>  - for_each_packed_object
>    - reachable - only to find recent objects
>    - builtin/fsck - fixed in this commit
>    - builtin/cat-file - see below
>
> As described in the list above, builtin/fsck has been updated. I have
> left builtin/cat-file alone; this means that cat-file
> --batch-all-objects will only operate on objects physically in the repo.

One thing I wonder is what the performance impact of a change like
this to the codepath that wants to see if an object does _not_ exist
in the repository.  When creating a new object by hashing raw data,
we see if an object with the same name already exists before writing
the compressed loose object out (or comparing the payload to detect
hash collision).  With a "missing blob" support, we'd essentially
spawn an extra process every time we want to create a new blob
locally, and most of the time that is done only to hear the external
command to say "no, we've never heard of such an object", with a
possibly large latency.

If we do not have to worry about that (or if it is no use to worry
about it, because we cannot avoid it if we wanted to do the lazy
loading of objects from elsewhere), then the patch presented here
looked like a sensible first step towards the stated goal.

Thanks.

  reply	other threads:[~2017-05-01  3:57 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-26 22:13 Proposal for missing blob support in Git repos Jonathan Tan
2017-05-01  3:57 ` Junio C Hamano [this message]
2017-05-01 19:12   ` Jonathan Tan
2017-05-01 23:29     ` Junio C Hamano
2017-05-02  0:33       ` Jonathan Tan
2017-05-02  0:38         ` Brandon Williams
2017-05-02  1:41         ` Junio C Hamano
2017-05-02 17:21           ` Jonathan Tan
2017-05-02 18:32             ` Ævar Arnfjörð Bjarmason
2017-05-02 21:45               ` Jonathan Tan
2017-05-04  4:29                 ` Junio C Hamano
2017-05-04 17:09                   ` Jonathan Tan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqinllgrfl.fsf@gitster.mtv.corp.google.com \
    --to=gitster@pobox.com \
    --cc=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    --cc=kevin.david@microsoft.com \
    --cc=markbt@efaref.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).