git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: Ben Peart <benpeart@microsoft.com>
Cc: git@vger.kernel.org, pclouds@gmail.com, chriscool@tuxfamily.org,
	Johannes.Schindelin@gmx.de, alexmv@dropbox.com, peff@peff.net
Subject: Re: [PATCH v1 1/4] fastindex: speed up index load through parallelization
Date: Fri, 10 Nov 2017 13:46:36 +0900	[thread overview]
Message-ID: <xmqqbmkahhar.fsf@gitster.mtv.corp.google.com> (raw)
In-Reply-To: <20171109141737.47976-2-benpeart@microsoft.com> (Ben Peart's message of "Thu, 9 Nov 2017 09:17:34 -0500")

Ben Peart <benpeart@microsoft.com> writes:

> To make this work for V4 indexes, when writing the index, it periodically
> "resets" the prefix-compression by encoding the current entry as if the
> path name for the previous entry is completely different and saves the
> offset of that entry in the IEOT.  Basically, with V4 indexes, it
> generates offsets into blocks of prefix-compressed entries.

You specifically mentioned about this change in your earlier
correspondence on this series, and it reminds me of Shawn's reftable
proposal that also is heavily depended on prefix compression with
occasional reset to allow scanning from an entry in the middle.  I
find it a totally sensible design choice.

> To enable reading the IEOT extension before reading all the variable
> length cache entries and other extensions, the IEOT is written last,
> right before the trailing SHA1.

OK, we have already closed the door for further enhancements to
place something other than the index extensions after this block by
mistakenly made it the rule that the end of the series of extended
sections must coincide with the beginning of the trailing checksum,
so it does not sound all that bad to insist on this particular one
to be the last, I guess.  But see below...

> The format of that extension has the signature bytes and size at the
> beginning (like a normal extension) as well as at the end in reverse
> order to enable parsing the extension by seeking back from the end of
> the file.

I think this is a reasonable workaround to allow a single extension
that needs to be read before the main index is loaded.  

But I'd suggest taking this approach one step further.  Instead,
what if we introduce a new extension EOIE ("end of index entries")
whose sole purpose is to sit at the end of the series of extensions
and point at the beginning of the index extension section of the
file, to tell you where to seek in order to skip the main index?

That way, you can

 - seek to the end of the index file;

 - go backward skiping the trailing file checksum;

 - now you might be at the end of the EOIE extension.  seek back
   necessary number of bytes and verify EOIE header, pick up the
   recorded file offset of the beginning of the extensions section.

 - The 4-byte sequence you found may happen to match EOIE but that
   is not enough to be sure that you indeed have such an extension.
   So the following must be done carefully, allowing the possibility
   that there wasn't any EOIE extension at the end.
   Seek back to that offset, and repeat these three steps to skip
   over all extensions:

   - read the (possible) 4-byte header
   - read the (possible) extsize (validate that this is a sane value)
   - skip that many bytes

   until you come back to the location you assumed that you found
   your EOIE header, to make sure you did not get fooled by bytes
   that happened to look like one.  Some "extsize" you picked up
   during that process may lead you beyond the end of the index
   file, which would be a sure sign that what you found at the end
   of the index file was *not* the EOIE extension but a part of
   some other extension who happened to have these four bytes at the
   right place.

which would be a lot more robust way to allow any extensions to be
read before the main body of the index.

And a lot more importantly, we will leave the door open to allow
more than one index extensions that we would prefer to read before
reading the main body if we do it this way, because we can easily
skip things over without spending cycles once we have a robust way
to find where the end of the main index is.  After all, the reason
you placed IEOT at the end is not because you wanted to have it at
the very end.  You only wanted to be able to find where it is
without having to parse the variable-length records of the main
index.  And there may be other index extensions that wants to be
readable without reading the main part just like IEOT, and we do not
want to close the door for them.

Hmm?

  reply	other threads:[~2017-11-10  4:46 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-11-09 14:17 [PATCH v1 0/4] Speed up index load through parallelization Ben Peart
2017-11-09 14:17 ` [PATCH v1 1/4] fastindex: speed " Ben Peart
2017-11-10  4:46   ` Junio C Hamano [this message]
2017-11-13 16:42     ` Ben Peart
2017-11-14  1:10       ` Junio C Hamano
2017-11-14 14:31         ` Ben Peart
2017-11-14 15:04           ` Junio C Hamano
2017-11-14 15:40             ` Ben Peart
2017-11-15  1:12               ` Junio C Hamano
2017-11-15  4:16                 ` Ben Peart
2017-11-15  4:40                   ` Junio C Hamano
2017-11-20 14:01                     ` Ben Peart
2017-11-20 14:20                       ` Jeff King
2017-11-20 15:38                         ` Jeff King
2017-11-20 23:51                       ` Ramsay Jones
2017-11-21  0:45                         ` Ben Peart
2017-11-09 14:17 ` [PATCH v1 2/4] update-index: add fastindex support to update-index Ben Peart
2017-11-09 14:17 ` [PATCH v1 3/4] fastindex: add test tools and a test script Ben Peart
2017-11-09 14:17 ` [PATCH v1 4/4] fastindex: add documentation for the fastindex extension Ben Peart

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqbmkahhar.fsf@gitster.mtv.corp.google.com \
    --to=gitster@pobox.com \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=alexmv@dropbox.com \
    --cc=benpeart@microsoft.com \
    --cc=chriscool@tuxfamily.org \
    --cc=git@vger.kernel.org \
    --cc=pclouds@gmail.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).