git@vger.kernel.org mailing list mirror (one of many)
* Calculating pack file SHA value
@ 2019-03-28  1:06 Farhan Khan
  2019-03-28  2:02 ` Jeff King
  0 siblings, 1 reply; 2+ messages in thread
From: Farhan Khan @ 2019-03-28  1:06 UTC
  To: git

Hi all,

I am trying to figure out how to calculate the SHA value of a pack file when you
run `git index-pack file.pack`. I am close, but having a bit of trouble at the
end. Here's my understanding so far.

Git buffers the data to be processed and, when the buffer is exhausted, updates
the SHA checksum with the previously read data. This happens in
builtin/index-pack.c, specifically in fill(), which calls flush() to update the
SHA value. My question is: how does git determine how many bytes to process at
a time?

The size of the buffer is the file-scope variable input_len. This size seems to
be 4096 several times until the very end, where it drops below 4096 (obviously
this depends on the pack file, but in my case it's 1074 bytes). Ordinarily I
would think that's the result of the read() call not receiving the full 4096
bytes, but my manual verification shows there are still bytes remaining in the
file which are not run through the SHA checksum.

How does git compute the SHA checksum it uses to verify a pack file? How does
it know how many bytes (typically 4096) to read when running flush() to update
the buffer? How does it know where in the file to stop updating the SHA1 value?

I hope my questions are clear. Thanks!

---
Farhan Khan
PGP Fingerprint: 1312 89CE 663E 1EB2 179C 1C83 C41D 2281 F8DA C0DE


* Re: Calculating pack file SHA value
  2019-03-28  1:06 Calculating pack file SHA value Farhan Khan
@ 2019-03-28  2:02 ` Jeff King
  0 siblings, 0 replies; 2+ messages in thread
From: Jeff King @ 2019-03-28  2:02 UTC
  To: Farhan Khan; +Cc: git

On Wed, Mar 27, 2019 at 09:06:20PM -0400, Farhan Khan wrote:

> I am trying to figure out how to calculate the SHA value of a pack file when you
> run `git index-pack file.pack`. I am close, but having a bit of trouble at the
> end. Here's my understanding so far.

It's all but the last 20 bytes. You should be able to reproduce it with:

  # the computed sha1 (everything except the trailing 20 bytes)
  size=$(stat --format=%s "$pack")
  head -c $((size-20)) "$pack" | sha1sum

  # the sha1 stored in the file, which should match
  tail -c 20 "$pack" | xxd -p

> Git buffers the data to be processed and, when the buffer is exhausted, updates
> the SHA checksum with the previously read data. This happens in
> builtin/index-pack.c, specifically in fill(), which calls flush() to update the
> SHA value. My question is: how does git determine how many bytes to process at
> a time?
> 
> The size of the buffer is the file-scope variable input_len. This size seems to
> be 4096 several times until the very end, where it drops below 4096 (obviously
> this depends on the pack file, but in my case it's 1074 bytes). Ordinarily I
> would think that's the result of the read() call not receiving the full 4096
> bytes, but my manual verification shows there are still bytes remaining in the
> file which are not run through the SHA checksum.

On the fill() side, we may over-read bytes into our buffer. But it's on
the use() side that we actually decide bytes have been used. Note that
it increments input_offset, and then flush() only hashes bytes up to
that offset.

So index-pack is not just blindly hashing N-20 bytes. It's actually
parsing the packfile as it goes, and putting any data it has parsed
correctly into the hash. At the end, we _should_ be left with exactly 20
bytes, and they should match exactly the hash we've computed up to that
point. And in --verify mode (and maybe even other modes) it should be
confirming that.
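
To make the "parse as you go" idea concrete, here is a rough sketch in
Python rather than git's C. It is not what index-pack does internally:
check_pack() is a made-up name, it assumes a version-2 pack with SHA-1
object names held entirely in memory, and it inflates each object only to
find where its zlib stream ends. The point is only that the hash is fed
exactly the bytes the parser consumed, and that 20 bytes should be left
over at the end.

```python
import hashlib
import struct
import zlib


def check_pack(data: bytes) -> bool:
    """Parse a version-2 pack held in memory, hashing only the bytes consumed."""
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK" or version != 2:
        raise ValueError("not a version-2 pack")
    hasher = hashlib.sha1(data[:12])
    pos = 12

    for _ in range(count):
        start = pos
        # Object header: type in bits 4-6 of the first byte; the high bit
        # of each byte says whether another size byte follows.
        byte = data[pos]
        pos += 1
        obj_type = (byte >> 4) & 7
        while byte & 0x80:
            byte = data[pos]
            pos += 1
        if obj_type == 6:
            # OFS_DELTA: variable-length offset back to the base object.
            byte = data[pos]
            pos += 1
            while byte & 0x80:
                byte = data[pos]
                pos += 1
        elif obj_type == 7:
            # REF_DELTA: 20-byte SHA-1 name of the base object.
            pos += 20
        # The zlib stream marks its own end; anything after it is unused.
        inflater = zlib.decompressobj()
        inflater.decompress(data[pos:])
        if not inflater.eof:
            raise ValueError("truncated object data")
        pos = len(data) - len(inflater.unused_data)
        hasher.update(data[start:pos])

    # Exactly 20 bytes should remain, equal to the hash computed so far.
    return data[pos:] == hasher.digest()
```

Something like check_pack(open("file.pack", "rb").read()) should return
True for an intact pack, with the caveat that slicing data[pos:] for every
object makes this far too slow for anything but small test packs.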

> How does git compute the SHA checksum it uses to verify a pack file? How does
> it know how many bytes (typically 4096) to read when running flush() to update
> the buffer? How does it know where in the file to stop updating the SHA1 value?

The key observation is that flush() isn't actually reading into the
buffer. It's throwing away bytes that have already been marked as used
by use() and shifting the rest to the front of the buffer. And then
fill() is free to read more data into the rest of it.
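
If it helps, the interplay can be modelled with a toy buffer. This is a
Python sketch of the idea, not a translation of builtin/index-pack.c: the
InputBuffer class, its attribute names, and verify_trailer() are all
invented for illustration, and the real code tracks its state with
file-scope variables and memmove() instead.

```python
import hashlib
import io


class InputBuffer:
    """Toy model of the fill()/use()/flush() trio; not git's actual code."""

    def __init__(self, stream, capacity=4096):
        self.stream = stream
        self.capacity = capacity
        self.buf = bytearray()   # bytes read from the stream, not yet flushed
        self.offset = 0          # how many bytes of buf have been use()d
        self.hasher = hashlib.sha1()

    def fill(self, need):
        """Make at least `need` unconsumed bytes available; may over-read."""
        assert need <= self.capacity
        if len(self.buf) - self.offset >= need:
            return
        self.flush()
        while len(self.buf) < need:
            chunk = self.stream.read(self.capacity - len(self.buf))
            if not chunk:
                raise EOFError("early EOF")
            self.buf += chunk

    def use(self, n):
        """Mark n bytes as parsed; only bytes marked this way get hashed."""
        assert n <= len(self.buf) - self.offset
        self.offset += n

    def flush(self):
        """Hash the consumed prefix, then shift the unread tail to the front."""
        self.hasher.update(self.buf[:self.offset])
        del self.buf[:self.offset]
        self.offset = 0


def verify_trailer(buf):
    """After parsing, the next 20 bytes should equal the hash so far."""
    buf.flush()
    buf.fill(20)
    return bytes(buf.buf[:20]) == buf.hasher.digest()
```

Note that fill() can pull the trailing checksum into the buffer long
before the parser reaches it, which is harmless here because flush() never
hashes past offset.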

-Peff
