git@vger.kernel.org mailing list mirror (one of many)
* Understanding pack format
@ 2018-11-02  5:23 Farhan Khan
  2018-11-02  6:15 ` Junio C Hamano
  2018-11-02 15:54 ` Duy Nguyen
  0 siblings, 2 replies; 7+ messages in thread
From: Farhan Khan @ 2018-11-02  5:23 UTC
  To: git

Hi all,

I am trying to understand the pack file format and have been reading
the documentation, specifically https://git-scm.com/docs/pack-format
(which is in git's own git repository as
"Documentation/technical/pack-format.txt"). I see that the file starts
with the "PACK" signature, followed by the 4 byte version and 4 byte
number of objects. After this, the documentation speaks about
Undeltified and Deltified representations. I understand conceptually
what each is, but do not know specifically how git parses it out.
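
For concreteness, the 12-byte header described above could be read as in
this minimal Python sketch. It is not git's code: the function name
parse_pack_header is made up for illustration, and the big-endian
(network) byte order is taken from pack-format.txt.

```python
import struct


def parse_pack_header(data: bytes):
    """Return (version, object_count) from the first 12 bytes of a pack.

    Hypothetical helper, not a function in git.
    """
    if len(data) < 12:
        raise ValueError("a pack header needs at least 12 bytes")
    # 4-byte signature, then two 4-byte integers in network byte order.
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("missing PACK signature")
    return version, count
```

Everything after those 12 bytes is the sequence of object entries, which
is the part I am unsure about.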

Can someone please explain this to me? Is there any sample code of how
to interpret each entry? Where is this in the git code? That might
serve as a good guide.

I see a few references to "PACK_SIGNATURE", but I am not certain which
one actually reads the data.

Thanks!
--
Farhan Khan
PGP Fingerprint: B28D 2726 E2BC A97E 3854 5ABE 9A9F 00BC D525 16EE

* Re: Understanding pack format
  2018-11-02  5:23 Understanding pack format Farhan Khan
@ 2018-11-02  6:15 ` Junio C Hamano
  2018-11-02 16:00   ` Duy Nguyen
  2018-11-02 15:54 ` Duy Nguyen
  1 sibling, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2018-11-02  6:15 UTC
  To: Farhan Khan; +Cc: git

Farhan Khan <khanzf@gmail.com> writes:

> ...Where is this in the git code? That might
> serve as a good guide.

There are two major codepaths.  One is used at runtime, giving us
random access into the packfile with the help of the .idx file.  The
other is used when receiving a new packstream to create an .idx
file.

Personally I find the latter a bit too dense for those who are new
to the codebase, and the former would probably be easier to grok.
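
To give a feel for what "random access with the help of the .idx file"
means, here is a rough Python sketch of looking up an object name in a
version 2 pack index, following the layout in pack-format.txt (magic,
256-entry fanout table, sorted names, CRC32s, offsets). It is not how
git's C code is written; find_offset_in_idx is a hypothetical name and
20-byte SHA-1 object names are assumed.

```python
import struct

IDX_V2_MAGIC = b"\xfftOc\x00\x00\x00\x02"  # "\377tOc" followed by version 2


def find_offset_in_idx(idx: bytes, oid: bytes):
    """Return the pack offset recorded for a 20-byte object name, or None.

    Hypothetical helper; assumes a version 2 index with SHA-1 names.
    """
    if idx[:8] != IDX_V2_MAGIC:
        raise ValueError("not a version 2 pack index")
    # fanout[i] is the number of objects whose first name byte is <= i.
    fanout = struct.unpack(">256I", idx[8:8 + 1024])
    total = fanout[255]
    names = 8 + 1024                          # start of the sorted name table
    offsets = names + 20 * total + 4 * total  # skip the names and the CRC32s
    first = oid[0]
    lo = fanout[first - 1] if first else 0
    hi = fanout[first]
    # Binary search inside the slice of names that share the first byte.
    while lo < hi:
        mid = (lo + hi) // 2
        name = idx[names + 20 * mid:names + 20 * mid + 20]
        if name < oid:
            lo = mid + 1
        elif name > oid:
            hi = mid
        else:
            (off,) = struct.unpack(">I", idx[offsets + 4 * mid:offsets + 4 * mid + 4])
            if off & 0x80000000:
                # High bit set: the low 31 bits index the 8-byte offset table.
                large = offsets + 4 * total + 8 * (off & 0x7FFFFFFF)
                (off,) = struct.unpack(">Q", idx[large:large + 8])
            return off
    return None
```

The returned offset is where the object's entry starts in the matching
.pack file, which is the starting point for the reading code described
next.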

Start from sha1-file.c::read_object(), which will eventually lead
you to oid_object_info_extended() that essentially boils down to

 - a call to find_pack_entry() with the object name, and then

 - a call to packed_object_info() with the pack entry found earlier.

Following packfile.c::packed_object_info() will lead you to
cache_or_unpack_entry(); the unpack_entry() function is where all
the action to read from the packstream for one object's worth of
data and to reconstruct the object out of its deltified representation
takes place.
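
As a rough picture of what reading "one object's worth of data" involves
for the simplest (undeltified) case, here is a minimal Python sketch
based on the entry layout in pack-format.txt: a variable-length header
carrying a 3-bit type and the inflated size, followed by zlib-compressed
data. The names parse_entry_header and read_undeltified are made up for
this sketch and are not git functions.

```python
import zlib

OBJ_TYPES = {1: "commit", 2: "tree", 3: "blob", 4: "tag",
             6: "ofs_delta", 7: "ref_delta"}


def parse_entry_header(buf: bytes, pos: int):
    """Decode the type/size header at pos; return (type, size, next_pos)."""
    byte = buf[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x07  # bits 4-6 hold the object type
    size = byte & 0x0F             # the low 4 bits start the size
    shift = 4
    while byte & 0x80:             # MSB set: another size byte follows
        byte = buf[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return obj_type, size, pos


def read_undeltified(buf: bytes, pos: int):
    """Return (type_name, content) for a non-delta entry starting at pos."""
    obj_type, size, pos = parse_entry_header(buf, pos)
    if obj_type not in (1, 2, 3, 4):
        raise ValueError("deltified entry: a base object is needed first")
    # decompressobj() stops at the end of the zlib stream and ignores
    # whatever follows it (the next entry or the pack trailer).
    content = zlib.decompressobj().decompress(buf[pos:])
    if len(content) != size:
        raise ValueError("inflated size does not match the header")
    return OBJ_TYPES[obj_type], content
```

Deltified entries carry a reference to a base object after this header,
and the inflated data is then a delta to apply to that base rather than
the object content itself.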

* Re: Understanding pack format
  2018-11-02  5:23 Understanding pack format Farhan Khan
  2018-11-02  6:15 ` Junio C Hamano
@ 2018-11-02 15:54 ` Duy Nguyen
  1 sibling, 0 replies; 7+ messages in thread
From: Duy Nguyen @ 2018-11-02 15:54 UTC
  To: khanzf; +Cc: Git Mailing List

On Fri, Nov 2, 2018 at 6:26 AM Farhan Khan <khanzf@gmail.com> wrote:
>
> Hi all,
>
> I am trying to understand the pack file format and have been reading
> the documentation, specifically https://git-scm.com/docs/pack-format
> (which is in git's own git repository as
> "Documentation/technical/pack-format.txt"). I see that the file starts
> with the "PACK" signature, followed by the 4 byte version and 4 byte
> number of objects. After this, the documentation speaks about
> Undeltified and Deltified representations. I understand conceptually
> what each is, but do not know specifically how git parses it out.

If by "it" you mean the deltified representations, I think that is
actually documented in pack-format.txt. If you prefer C over English,
look at patch-delta.c.
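
For readers who prefer neither, the delta encoding described in
pack-format.txt can be sketched in a few lines of Python. This is only
an illustration of the documented format, not a translation of
patch-delta.c; read_varint and apply_delta are hypothetical names.

```python
def read_varint(delta: bytes, pos: int):
    """Read a little-endian base-128 size; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = delta[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def apply_delta(base: bytes, delta: bytes) -> bytes:
    """Rebuild the target object from its base and an inflated delta."""
    src_size, pos = read_varint(delta, 0)
    if src_size != len(base):
        raise ValueError("delta was made against a different base")
    dst_size, pos = read_varint(delta, pos)
    out = bytearray()
    while pos < len(delta):
        cmd = delta[pos]
        pos += 1
        if cmd & 0x80:
            # Copy from base: bits 0-3 say which offset bytes are present,
            # bits 4-6 say which size bytes are present.
            offset = size = 0
            for i in range(4):
                if cmd & (1 << i):
                    offset |= delta[pos] << (8 * i)
                    pos += 1
            for i in range(3):
                if cmd & (0x10 << i):
                    size |= delta[pos] << (8 * i)
                    pos += 1
            if size == 0:
                size = 0x10000
            if offset + size > len(base):
                raise ValueError("copy instruction runs past the base")
            out += base[offset:offset + size]
        elif cmd:
            # Insert: the next cmd (1..127) bytes are literal data.
            out += delta[pos:pos + cmd]
            pos += cmd
        else:
            raise ValueError("instruction byte 0 is reserved")
    if len(out) != dst_size:
        raise ValueError("result size does not match the delta header")
    return bytes(out)
```

A delta is therefore just two sizes followed by a stream of "copy this
range from the base" and "insert these literal bytes" instructions.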

-- 
Duy

* Re: Understanding pack format
  2018-11-02  6:15 ` Junio C Hamano
@ 2018-11-02 16:00   ` Duy Nguyen
  2018-11-06  2:23     ` Farhan Khan
  0 siblings, 1 reply; 7+ messages in thread
From: Duy Nguyen @ 2018-11-02 16:00 UTC
  To: Junio C Hamano; +Cc: khanzf, Git Mailing List

On Fri, Nov 2, 2018 at 7:19 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Farhan Khan <khanzf@gmail.com> writes:
>
> > ...Where is this in the git code? That might
> > serve as a good guide.
>
> There are two major codepaths.  One is used at runtime, giving us
> random access into the packfile with the help of the .idx file.  The
> other is used when receiving a new packstream to create an .idx
> file.

The third path is copying/reusing objects in
builtin/pack-objects.c::write_reuse_object(). Since it is mostly
encoding the header of new objects in the pack, it could also be a good
starting point. Then you can move on to write_no_reuse_object() to see
how the data is encoded, deltified or not (yes, this is writing rather
than parsing, but I think it's more or less the same thing conceptually).
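
To illustrate the writing direction, here is a minimal Python sketch of
encoding an entry header and an undeltified entry as pack-format.txt
describes them. It is not the code in builtin/pack-objects.c;
encode_entry_header and encode_undeltified are names invented for this
example.

```python
import zlib


def encode_entry_header(obj_type: int, size: int) -> bytes:
    """Encode the 3-bit type and the inflated size as a pack entry header."""
    byte = (obj_type << 4) | (size & 0x0F)  # type plus the low 4 size bits
    size >>= 4
    out = bytearray()
    while size:
        out.append(byte | 0x80)  # MSB set: more size bits follow
        byte = size & 0x7F       # next 7 bits of the size
        size >>= 7
    out.append(byte)
    return bytes(out)


def encode_undeltified(obj_type: int, content: bytes) -> bytes:
    """Return a whole non-delta entry: header plus zlib-deflated content."""
    return encode_entry_header(obj_type, len(content)) + zlib.compress(content)
```

For a 6-byte blob (type 3) the header is the single byte 0x36; larger
sizes spill into further bytes, 7 bits at a time.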
-- 
Duy

* Re: Understanding pack format
  2018-11-02 16:00   ` Duy Nguyen
@ 2018-11-06  2:23     ` Farhan Khan
  2018-11-06  6:09       ` Jeff King
  2018-11-06 16:06       ` Duy Nguyen
  0 siblings, 2 replies; 7+ messages in thread
From: Farhan Khan @ 2018-11-06  2:23 UTC
  To: pclouds; +Cc: gitster, git

On Fri, Nov 2, 2018 at 12:00 PM Duy Nguyen <pclouds@gmail.com> wrote:
>
> On Fri, Nov 2, 2018 at 7:19 AM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > Farhan Khan <khanzf@gmail.com> writes:
> >
> > > ...Where is this in the git code? That might
> > > serve as a good guide.
> >
> > There are two major codepaths.  One is used at runtime, giving us
> > random access into the packfile with the help of the .idx file.  The
> > other is used when receiving a new packstream to create an .idx
> > file.
>
> The third path is copying/reusing objects in
> builtin/pack-objects.c::write_reuse_object(). Since it's mostly
> encoding the header of new objects in pack, it could also be a good
> starting point. Then you can move to write_no_reuse_object() and get
> how the data is encoded, deltified or not (yeah not parsed, but I
> think it's more or less the same thing conceptually).
> --
> Duy

Hi all,

To follow up from the other day, I have been reading the code that
retrieves the pack entry for the past 3 days without much success.
There are quite a few abstractions, and I get lost halfway down the
line.

I am trying to identify where the content from a pack comes from. I
traced it back to sha1-file.c:read_object(), which returns the
'content'. That content seems to come from
sha1-file.c:oid_object_info_extended(), which goes into
packfile.c:find_pack_entry(), but from here I get lost. I do not
understand what is happening.

How does it retrieve the pack content? Please assist. This is covered
in the technical git documentation, but it was not clear to me.

Thank you,

--
Farhan Khan
PGP Fingerprint: B28D 2726 E2BC A97E 3854 5ABE 9A9F 00BC D525 16EE

* Re: Understanding pack format
  2018-11-06  2:23     ` Farhan Khan
@ 2018-11-06  6:09       ` Jeff King
  2018-11-06 16:06       ` Duy Nguyen
  1 sibling, 0 replies; 7+ messages in thread
From: Jeff King @ 2018-11-06  6:09 UTC
  To: Farhan Khan; +Cc: pclouds, gitster, git

On Mon, Nov 05, 2018 at 09:23:45PM -0500, Farhan Khan wrote:

> I am trying to identify where the content from a pack comes from. I
> traced it back to sha1-file.c:read_object(), which will return the
> 'content'. I want to know where the 'content' comes from, which seems
> to come from sha1-file.c:oid_object_info_extended. This goes into
> packfile.c:find_pack_entry(), but from here I get lost. I do not
> understand what is happening.
> 
> How does it retrieve the pack content? I am lost here. Please assist.
> This is in the technical git documentation, but it was not clear.

After find_pack_entry() tells us the object is in a pack, we end up in
packed_object_info(). Depending on what the caller is asking for, there
are a couple of different strategies (because we try to avoid loading
the whole object if we don't need it).

Probably the one you're interested in is just grabbing the content,
which happens via cache_or_unpack_entry(). The cached case is less
interesting, so try unpack_entry(), which is what actually reads the
bytes out of the packfile.

-Peff

* Re: Understanding pack format
  2018-11-06  2:23     ` Farhan Khan
  2018-11-06  6:09       ` Jeff King
@ 2018-11-06 16:06       ` Duy Nguyen
  1 sibling, 0 replies; 7+ messages in thread
From: Duy Nguyen @ 2018-11-06 16:06 UTC
  To: Farhan Khan; +Cc: Junio C Hamano, Git Mailing List

On Tue, Nov 6, 2018 at 3:23 AM Farhan Khan <khanzf@gmail.com> wrote:
> To follow up from the other day, I have been reading the code that
> retrieves the pack entry for the past 3 days without much success.
> There are quite a few abstractions, and I get lost halfway down the
> line.

Jeff already gave you some pointers. This is just a side note.

I think it's easier to run the code under a debugger and see what it
does than just reading it. You can create a repo with just one blob to
have better control over it (small packs also make it possible to
examine with a hex editor in parallel), e.g.

git init foo
cd foo
echo hello >file
git add file
git repack -ad
gdb --args git show :./file

then put a breakpoint in some interesting functions (perhaps one of
those Jeff pointed out).
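
If you want something to compare against the hex editor, a tiny pack
like that one can also be picked apart by hand. The following Python
sketch is only an illustration under loud assumptions: the pack holds a
single undeltified blob smaller than 16 bytes (so the entry header is
one byte), the repository uses SHA-1, and inspect_single_blob_pack is a
made-up name rather than anything in git.

```python
import hashlib
import struct
import zlib


def inspect_single_blob_pack(data: bytes) -> dict:
    """Decode a pack that is assumed to hold one small undeltified blob."""
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("missing PACK signature")
    # The last 20 bytes are the SHA-1 of everything that precedes them.
    checksum_ok = hashlib.sha1(data[:-20]).digest() == data[-20:]
    first = data[12]
    if first & 0x80:
        raise ValueError("multi-byte entry header; this sketch handles size < 16")
    obj_type = (first >> 4) & 0x07
    if obj_type != 3:
        raise ValueError("this sketch only handles a blob entry")
    content = zlib.decompressobj().decompress(data[13:-20])
    if len(content) != (first & 0x0F):
        raise ValueError("inflated size does not match the entry header")
    # A blob's object name is the SHA-1 of "blob <size>\0" plus its content.
    header = b"blob " + str(len(content)).encode() + b"\0"
    return {
        "version": version,
        "objects": count,
        "checksum_ok": checksum_ok,
        "content": content,
        "oid": hashlib.sha1(header + content).hexdigest(),
    }
```

To try it on the repo above, read the single *.pack file under
.git/objects/pack/ in binary mode and pass its bytes to the function;
the "oid" it reports should match the name git prints for the blob with
"git rev-parse :file".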
-- 
Duy
