From: Jeff King <peff@peff.net>
To: Farhan Khan <farhan@farhan.codes>
Cc: git@vger.kernel.org
Subject: Re: Calculating pack file SHA value
Date: Wed, 27 Mar 2019 22:02:27 -0400
Message-ID: <20190328020227.GB7887@sigill.intra.peff.net>
In-Reply-To: <a48b86698802006045ed0af060b4e822@farhan.codes>

On Wed, Mar 27, 2019 at 09:06:20PM -0400, Farhan Khan wrote:

> I am trying to figure out how to calculate the SHA value of a pack file when you
> run `git index-pack file.pack`. I am close, but having a bit of trouble at the
> end. Here's my understanding so far.

The checksum covers all but the last 20 bytes of the file. You should be
able to reproduce it with:

  # the computed sha1 (GNU stat; on BSD/macOS use `stat -f %z`)
  size=$(stat --format=%s "$pack")
  head -c $((size-20)) "$pack" | sha1sum

  # the sha1 stored in the file, which should match
  tail -c 20 "$pack" | xxd -p
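
For anyone who would rather script the same check, here is a minimal
Python sketch of it. The helper name `pack_checksums` is hypothetical
(it is not part of git), and it assumes a SHA-1 pack with a 20-byte
trailer that is small enough to hold in memory:

```python
import hashlib
from typing import Tuple

TRAILER_LEN = 20  # length of a raw SHA-1 digest in bytes


def pack_checksums(data: bytes) -> Tuple[str, str]:
    """Return (computed, stored) hex digests for the raw bytes of a pack.

    computed: SHA-1 over everything except the last 20 bytes
    stored:   the last 20 bytes, rendered as hex
    """
    if len(data) < TRAILER_LEN:
        raise ValueError("input is too short to hold a SHA-1 trailer")
    body, trailer = data[:-TRAILER_LEN], data[-TRAILER_LEN:]
    return hashlib.sha1(body).hexdigest(), trailer.hex()
```

Calling it as `pack_checksums(open(path, "rb").read())` on an intact
pack should yield two equal strings.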

> Git buffers data to be processed and, when it's exhausted, updates the SHA
> checksum with the previously read data. This is from builtin/index-pack.c,
> specifically fill(), which calls flush() to update the SHA value. My question
> is, how does git determine how many bytes at a time to process?
> 
> The size of the buffer is the file-scope variable input_len. This size seems to
> be 4096 several times until the very end, where it drops below 4096
> (obviously this depends on the pack file, but in my case it's 1074 bytes).
> Ordinarily I would think it's a result of the read() call not receiving the
> full 4096 bytes, but my manual verification shows there are still bytes
> remaining in the file which are not run through the SHA checksum.

On the fill() side, we may over-read bytes into our buffer. But it's on
the use() side that we actually decide bytes have been used. Note that
use() increments input_offset, and then flush() only hashes bytes up to
that offset.

So index-pack is not just blindly hashing N-20 bytes. It's actually
parsing the packfile as it goes, and putting any data it has parsed
correctly into the hash. At the end, we _should_ be left with exactly 20
bytes, and they should match exactly the hash we've computed up to that
point. And in --verify mode (and maybe even other modes) it should be
confirming that.

> How does git calculate a pack file's SHA verification? How does it know what
> size (number of bytes) to read when running flush() to update the buffer?
> (typically 4096). How does it know when in the file to stop updating the SHA1
> value?

The key observation is that flush() isn't actually reading into the
buffer. It's throwing away bytes that have already been marked as used
by use() and shifting the rest to the front of the buffer. And then
fill() is free to read more data into the rest of it.
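
To make the interplay concrete, here is a toy Python model of that
scheme. It is a simplified sketch, not a translation of
builtin/index-pack.c: the `InputBuffer` class, its `capacity`
parameter, and `demo()` are all made up for illustration, and the
"parser" just consumes the body in arbitrary 1000-byte steps.

```python
import hashlib
import io


class InputBuffer:
    """Toy fill()/use()/flush() buffer; an illustration, not git's code."""

    def __init__(self, stream, capacity=4096):
        self.stream = stream
        self.capacity = capacity
        self.buf = bytearray()  # used-but-unhashed bytes, then unused bytes
        self.offset = 0         # how many bytes at the front are marked used
        self.ctx = hashlib.sha1()

    def available(self):
        """Number of bytes read from the stream but not yet marked used."""
        return len(self.buf) - self.offset

    def flush(self):
        """Hash the used prefix, then shift the unused rest to the front."""
        if self.offset:
            self.ctx.update(self.buf[:self.offset])
            del self.buf[:self.offset]
            self.offset = 0

    def fill(self, need):
        """Ensure `need` unused bytes are buffered; may over-read."""
        if need > self.capacity:
            raise ValueError("cannot fill more than the buffer holds")
        if self.available() < need:
            self.flush()
            while len(self.buf) < need:
                chunk = self.stream.read(self.capacity - len(self.buf))
                if not chunk:
                    raise EOFError("early EOF")
                self.buf += chunk
        return bytes(self.buf[self.offset:])

    def use(self, nbytes):
        """Mark bytes as consumed; only these will ever be hashed."""
        if nbytes > self.available():
            raise ValueError("used more bytes than were available")
        self.offset += nbytes


def demo():
    body = bytes(range(256)) * 20                 # 5120 bytes of fake content
    data = body + hashlib.sha1(body).digest()     # append a 20-byte trailer
    buf = InputBuffer(io.BytesIO(data))
    remaining = len(body)
    while remaining:                              # stand-in for real parsing
        step = min(remaining, 1000)
        buf.fill(step)
        buf.use(step)
        remaining -= step
    trailer = buf.fill(20)[:20]                   # peek at it, never use() it
    buf.flush()                                   # hash the last used bytes
    return buf.ctx.digest() == trailer, buf.available()
```

In this model the trailer is over-read into the buffer along with the
tail of the body, but because it is never passed to use(), it never
reaches the hash; `demo()` ends with exactly 20 unused bytes that match
the computed digest.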

-Peff
