Index format v5

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* Index format v5
@ 2012-05-03 17:25 Thomas Gummerer
  2012-05-03 18:16 ` Thomas Rast
                   ` (9 more replies)
  0 siblings, 10 replies; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-03 17:25 UTC (permalink / raw)
  To: git; +Cc: trast, gitster, mhagger, peff, spearce, davidbarr

I have been drafting the Version 5 of the index format over the past
few days with the help of Thomas Rast, Michael Haggerty, cmn and
barrbrain on IRC. It will save with prefix compression on the path, and
using a crc32 over the stat data, instead of the full data, since it is only
used for checking if the file is changed. (Thanks Michael Haggerty for
this hint. Unless we are missing something this will save another
~4 MB on the Webkit index.



GIT index format
================

= The git index file has the following format

  All binary numbers are in network byte order. Version 5 is described
  here.

   - A 16-byte header consisting of

     4-byte signature:
       The signature is { 'D', 'I', 'R', 'C' } (stands for "dircache")

     4-byte version number:
       The current supported versions are 2, 3, 4 and 5.

     32-bit number of directories.

     32-bit number of file entries.

   - Offset to the extensions.

     32-bit number of extensions.

     32-bit number offset to the extension. (Possibly none, as many as
     indicated in the 4-byte number of extensions)

   - A number of directory offsets (see below). [1]

   - A number of sorted directories (see below). [2]

   - 32-bit crc32 checksum for the header, extension offsets and directories.

   - A number of file offsets (see below). [1]

   - A number of file entries (see below).

   - A number of entries for conflicted data/resolved conflicts (see below).

   - Extensions

     Extensions are identified by signature. Optional extensions can
     be ignored if GIT does not understand them.

     GIT supports an arbitrary number of extension, but currently none
     is implemented. [3]

     4-byte extension signature. If the first byte is 'A'..'Z' the
     extension is optional and can be ignored.

     32-bit size of the extension

     32-bit crc32 checksum of the extension signature and size

     Extension data


== Directory entry offsets

  32-bit offset to the directory.

  This part is needed for making the directory entries bisectable and
    thus allowing a binary search.

== Directory entry

  Directory entries are sorted in lexicographic order by the name
  of their path starting with the root.

  Path names (variable length) relative to top level directory (without the
    leading slash). '/' is used as path separator. '.' indicates the root
    directory. The special patch components ".." and ".git" (without quotes)
    are disallowed. Trailing slash is also disallowed.

  1 nul byte to terminate the path.

  32-bit offset to the first file of a directory

  32-bit offset to conflicted/resolved data at the end of the index.
    0 if there is no such data. [4]

  4-byte number of subtrees this tree has

  4-byte number of entries in the index that is covered by the tree this
    entry represents. (entry_count) (-1 if the entry is invalid)

  160-bit object name for the object that would result from writing
    this span of index as a tree.

  The last 24 bytes are for the cache tree. An entry can be in an
    invalidated state which is represented by having -1 in the entry_count
    field. If an entry is in invalidated state, the next entry will begin
    after the number of subtrees, and the 160-bit object name is dropped.

  The entries are written out in the top-down, depth-first order. The
    first entry represents the root level of the repository, followed by
    the first subtree - let's call it A - of the root level, followed by
    the first subtree of A, ...

== File entry offsets

  32-bit offset to the directory.

  This part is needed for making the directory entries bisectable and
    thus allowing a binary search.

== File entry

  File entries are sorted in ascending order on the name field, after the
  respective offset given by the directory entries.

  File name (variable length). Nul bytes are not allowed in file names and
    they have no leading slash. They are 7-bit ASCII encoded.

  1 nul byte to terminate the filename.

  A 16-bit 'flags' field split into (high to low bits)

    1-bit assume-valid flag

    1-bit conflict flag

    2-bit stage (during merge)

    2-bit mode (0 = 1000644 (regular file without execution
      permission), 1 = 1000755 (regular file with execution
      permission), 2 = 1010000 (symbolic link), 3 = 1110
      (gitlink)) [5]

    1-bit skip-worktree flag (used by sparse checkout)

    1-bit intent-to-add flag (used by "git add -N")

    8-bit unused, must be zero [6]

  32-bit mtime seconds, the last time a file's data changed
    this is stat(2) data

  32-bit mtime nanosecond fractions
    this is stat(2) data

  32-bit crc32 checksum over ctime seconds, ctime nanoseconds,
    ino, file size, dev, uid, gid (All stat(2) data except mtime) [7]

  160-bit SHA-1 for the represented object

  32-bit crc32 checksum for the file entry

== Conflicted data

  A conflict is represented in the index as a set of higher stage entries.
  These entries are stored at the end of the index. When a conflict is
  resolved (e.g. with "git add path"). A bit is flipped, to indicate that
  the conflict is resolved, but the entries will be kept, so that
  conflicts can be recreated (e.g. with "git checkout -m", in case users
  want to redo a conflict resolution from scratch.

  - NUL-terminated filename of the entry

  - A 8-bit 'flags' field split into:

    - 1-bit conflicted state (conflicted/resolved) (1 if conflicted)

    - 7-bit unused

  - Three 4-byte octal numbers, entry mode of entries in stage 1 to 3 (a
    missing stage is represented by "0" in this field);
    and

  - At most three 160-bit object names of the entry in stages from 1 to 3
    (nothing is written for a missing stage).

  - 32-bit crc32 checksum over one conflicted entry.

== Design explanations

[1] The directory and file offsets are included in the index format
    to enable bisectability of the index, for binary searches.Updating
    a single entry and partial reading will benefit from this.

[2] The directories are saved in their own block, to be able to
    quickly search for a directory in the index. They include a
    offset to the (lexically) first file in the directory.

[3] The data of the cache-tree extension and the resolve undo
    extension is now part of the index itself, but if other extensions
    come up in the future, there is no need to change the index, they
    can simply be added at the end.

[4] To avoid rewrites of the whole index when there are conflicts or
    conflicts are being resolved, conflicted data will be stored at
    the end of the index. To mark the conflict resolved, just a bit
    has to be flipped. The data will still be there, if a user wants
    to redo the conflict resolution.

[5] Since only 4 modes are effectively allowed in git but 32-bit are
    used to store them, having a two bit flag for the mode is enough
    and saves 4 byte per entry.

[6] The length of the file name was dropped, since each file name is
    nul terminated anyway.

[7] Since all stat data (except mtime and ctime) is just used for
    checking if a file has changed a checksum of the data is enough.
    In addition to that Thomas Rast suggested ctime could be ditched
    completely (core.trustctime=false) and thus included in the
    checksum. This would save 24 bytes per index entry, which would
    be about 4 MB on the Webkit index.
    (Thanks for the suggestion to Michael Haggerty)

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-03 17:25 Index format v5 Thomas Gummerer
@ 2012-05-03 18:16 ` Thomas Rast
  2012-05-03 19:03   ` Junio C Hamano
  2012-05-04  7:12   ` Michael Haggerty
  2012-05-03 18:21 ` Ronan Keryell
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 49+ messages in thread
From: Thomas Rast @ 2012-05-03 18:16 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git, trast, gitster, mhagger, peff, spearce, davidbarr

Thomas Gummerer <t.gummerer@gmail.com> writes:

>   32-bit crc32 checksum over ctime seconds, ctime nanoseconds,
>     ino, file size, dev, uid, gid (All stat(2) data except mtime) [7]
[...]
> [7] Since all stat data (except mtime and ctime) is just used for
>     checking if a file has changed a checksum of the data is enough.
>     In addition to that Thomas Rast suggested ctime could be ditched
>     completely (core.trustctime=false) and thus included in the
>     checksum. This would save 24 bytes per index entry, which would
>     be about 4 MB on the Webkit index.
>     (Thanks for the suggestion to Michael Haggerty)

This is the part I'm most curious about.  Are we missing anything?
Michael brought it up on IRC: the stat() results are only used to test
whether they are still the same, with the exception of the mtime (which
also undergoes raciness checks).

As far as I can see, none of st_{ino,dev,uid,gid} are useful for
anything.  st_size might conceivably be used as a hint for a buffer
size, but nobody actually does that.  The ctime undergoes stricter
checks, but AFAICS it's also all about whether it has changed, and
besides that can be turned off.  We think all of those fields can be
replaced by an arbitrary hash/CRC and only tested for equality.  32 bits
should be plenty, probably even if we just xor the values together.

So what's wrong in this thinking?

[The one flaw I found so far is that this makes it impossible to convert
back to v2-4 without at the very least refreshing the index.  Do we
care?]

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-03 17:25 Index format v5 Thomas Gummerer
  2012-05-03 18:16 ` Thomas Rast
@ 2012-05-03 18:21 ` Ronan Keryell
  2012-05-03 20:36   ` Thomas Gummerer
  2012-05-03 18:54 ` Junio C Hamano
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 49+ messages in thread
From: Ronan Keryell @ 2012-05-03 18:21 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git

>>>>> On Thu, 3 May 2012 19:25:12 +0200, Thomas Gummerer <t.gummerer@gmail.com> said:

    Thomas> I have been drafting the Version 5 of the index format over
    Thomas> the past few days with the help of Thomas Rast, Michael
    Thomas> Haggerty, cmn and barrbrain on IRC. It will save with prefix
    Thomas> compression on the path, and using a crc32 over the stat
    Thomas> data, instead of the full data, since it is only used for
    Thomas> checking if the file is changed. (Thanks Michael Haggerty
    Thomas> for this hint. Unless we are missing something this will
    Thomas> save another ~4 MB on the Webkit index.

Great!

But I wonder whether it may not worth to investigate a 64-bit version for
the offsets and so on, just in case...
-- 
  Ronan KERYELL                            |\/  Phone:  +1 408 658 9453
  Wild Systems / HPC Project               |/)
  5201 Great America Parkway, Suite 320    K    Ronan.Keryell@wild-systems.com
  Santa Clara, CA 95054                    |\   skype:keryell
  USA                                      | \  http://wild-systems.com

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-03 17:25 Index format v5 Thomas Gummerer
  2012-05-03 18:16 ` Thomas Rast
  2012-05-03 18:21 ` Ronan Keryell
@ 2012-05-03 18:54 ` Junio C Hamano
  2012-05-03 19:11   ` Thomas Rast
                     ` (2 more replies)
  2012-05-03 19:38 ` solo-git
                   ` (6 subsequent siblings)
  9 siblings, 3 replies; 49+ messages in thread
From: Junio C Hamano @ 2012-05-03 18:54 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git, trast, gitster, mhagger, peff, spearce, davidbarr

Thomas Gummerer <t.gummerer@gmail.com> writes:

> I have been drafting the Version 5 of the index format over the past
> few days with the help of Thomas Rast, Michael Haggerty, cmn and
> barrbrain on IRC.

Hrm, so if there is anything glaringly wrong below, should I reduce the
"trustable reviewer karma point" from these people?  Or did you forget to
say "but remaining errors are mine" ;-)?

> GIT index format
> ================
>
> = The git index file has the following format
>
>   All binary numbers are in network byte order. Version 5 is described
>   here.
>
>    - A 16-byte header consisting of
>
>      4-byte signature:
>        The signature is { 'D', 'I', 'R', 'C' } (stands for "dircache")
>
>      4-byte version number:
>        The current supported versions are 2, 3, 4 and 5.
>
>      32-bit number of directories.

>      32-bit number of file entries.

I take these two numbers mean "we have this many directory entries in the
index" and "we have this many file entries in the index"; I found it
unclear during my first scan.

Have you considered expressing these new "directories", "files" and
associated data as a new (mandatory) extension?

>    - Offset to the extensions.
>
>      32-bit number of extensions.

Why is this necessary?  It means that the writer needs to enumerate how
many extensions it is going to write before starting to write, unless it
is willing to seek back and fill this.  And it is not like the reader will
first allocate an array to hold uniformly sized extension and will be
helped to have the number of entries upfront in order to do the
allocation.

What purpose does this serve?

>      32-bit number offset to the extension. (Possibly none, as many as
>      indicated in the 4-byte number of extensions)

Why is this needed?  It appears that there is no field that points at the
beginning of array that stores "directory offsets", nor "file offsets", so
I am assuming that you will be scanning the index file to find them.  Why
can't the extensions be handled exactly the same way?

>    - A number of directory offsets (see below). [1]
>
>    - A number of sorted directories (see below). [2]
>
>    - 32-bit crc32 checksum for the header, extension offsets and directories.

What does "32" in "crc32" stand for ? ;-)

>    - A number of file offsets (see below). [1]
>
>    - A number of file entries (see below).
>
>    - A number of entries for conflicted data/resolved conflicts (see below).
>
>    - Extensions
>
>      Extensions are identified by signature. Optional extensions can
>      be ignored if GIT does not understand them.
>
>      GIT supports an arbitrary number of extension, but currently none
>      is implemented. [3]
>
>      4-byte extension signature. If the first byte is 'A'..'Z' the
>      extension is optional and can be ignored.
>
>      32-bit size of the extension
>
>      32-bit crc32 checksum of the extension signature and size
>
>      Extension data
>
>
> == Directory entry offsets

The name "directory entry offset" does not seem to appear anywhere before
this, so it is unclear what this section of the documentation is trying to
describe.  Is this the same as "directory offsets" above?  In other words,
"A number of directory offsets (see below)." above is an array of something,
and this section is trying to describe what that something is?

>   32-bit offset to the directory.

I take it is "offset relative to the beginning of the file in the index".

>   This part is needed for making the directory entries bisectable and
>     thus allowing a binary search.
>
> == Directory entry
>
>   Directory entries are sorted in lexicographic order by the name
>   of their path starting with the root.
>
>   Path names (variable length) relative to top level directory (without the
>     leading slash). '/' is used as path separator. '.' indicates the root
>     directory. The special patch components ".." and ".git" (without quotes)
>     are disallowed. Trailing slash is also disallowed.

I understood "the root" to mean "the top level of the directory hierarchy"
(i.e. the directory that corresponds to the top level of the working
tree), but it needs to be explained better.  Using '.' for root sounds
somewhat questionable, though.  Why not a string of length 0?

When an index represents a D/F conflict, some stages may have a directory
D while others do not have D (but have a regular file at D).  Doesn't
directory entry need to have a stage information?

>   1 nul byte to terminate the path.
>
>   32-bit offset to the first file of a directory

What does this point at?  Does it point into the array of "file offsets"
or the array of "file entries"?

>   32-bit offset to conflicted/resolved data at the end of the index.
>     0 if there is no such data. [4]
>
>   4-byte number of subtrees this tree has

Which is an undefined number unless you specify which stage you are
talking about.

>   4-byte number of entries in the index that is covered by the tree this
>     entry represents. (entry_count) (-1 if the entry is invalid)
>
>   160-bit object name for the object that would result from writing
>     this span of index as a tree.
>
>   The last 24 bytes are for the cache tree. An entry can be in an
>     invalidated state which is represented by having -1 in the entry_count
>     field. If an entry is in invalidated state, the next entry will begin
>     after the number of subtrees, and the 160-bit object name is dropped.

By "The last 24 bytes", do you mean the "4-byte number of entries..." and
"160-bit object name"?

>   The entries are written out in the top-down, depth-first order. The
>     first entry represents the root level of the repository, followed by
>     the first subtree - let's call it A - of the root level, followed by
>     the first subtree of A, ...
>
> == File entry offsets
>
>   32-bit offset to the directory.

What directory?  The containing directory?  What does this point at?  Does
it point into which array?

>   This part is needed for making the directory entries bisectable and
>     thus allowing a binary search.
>
> == File entry
>
>   File entries are sorted in ascending order on the name field, after the
>   respective offset given by the directory entries.
>
>   File name (variable length). Nul bytes are not allowed in file names and
>     they have no leading slash. They are 7-bit ASCII encoded.

Is this a name relative to its containing directory (i.e. without leading
components)?  Or is it a full path relative to the top level of the
working tree?

I have some UTF-8 encoded files in my repository.  Are they
now disallowed?

>   1 nul byte to terminate the filename.
>
>   A 16-bit 'flags' field split into (high to low bits)
>
>     1-bit assume-valid flag

Is this "assume unchanged"?

>     1-bit conflict flag
>
>     2-bit stage (during merge)

Huh?  When stage #0 entry exists for a given path, no other stages for the
same path can exist in the index.  By definition, that is how a conflicted
path is resolved.  What is this separate "conflict flag" for?

>     2-bit mode (0 = 1000644 (regular file without execution
>       permission), 1 = 1000755 (regular file with execution
>       permission), 2 = 1010000 (symbolic link), 3 = 1110
>       (gitlink)) [5]

Don't penny-pinch bits like this. 

>     1-bit skip-worktree flag (used by sparse checkout)
>
>     1-bit intent-to-add flag (used by "git add -N")
>
>     8-bit unused, must be zero [6]
>
>   32-bit mtime seconds, the last time a file's data changed
>     this is stat(2) data
>
>   32-bit mtime nanosecond fractions
>     this is stat(2) data
>
>   32-bit crc32 checksum over ctime seconds, ctime nanoseconds,
>     ino, file size, dev, uid, gid (All stat(2) data except mtime) [7]

Giving occassional false positive to "did this change?" is acceptable, but
any false negative is absolutely unacceptable.  How does this work with
something like "racy git" situation (i.e. coming from "mtime happens to be
the same as before") but due to crc32 collisions?

If there is no good answer to the above question, I would have to say that
anybody who suggested or passed this through review loses all the
accumulated reviewer karma points (if s/he has accumulated any, that is).

>   160-bit SHA-1 for the represented object
>
>   32-bit crc32 checksum for the file entry
>
> == Conflicted data

I do not think the data described in this section should be conflicting.
It ought to be data that describe conflicted state.  Perhaps you meant
"Conflict data"?

>   A conflict is represented in the index as a set of higher stage entries.
>   These entries are stored at the end of the index. When a conflict is
>   resolved (e.g. with "git add path"). A bit is flipped, to indicate that
>   the conflict is resolved, but the entries will be kept, so that
>   conflicts can be recreated (e.g. with "git checkout -m", in case users
>   want to redo a conflict resolution from scratch.
>
>   - NUL-terminated filename of the entry

Is this a name relative to its containing directory (i.e. without leading
components)?  Or a full path relative to the top-level of the working
tree?

>   - A 8-bit 'flags' field split into:
>
>     - 1-bit conflicted state (conflicted/resolved) (1 if conflicted)
>
>     - 7-bit unused
>
>   - Three 4-byte octal numbers, entry mode of entries in stage 1 to 3 (a
>     missing stage is represented by "0" in this field);
>     and
>
>   - At most three 160-bit object names of the entry in stages from 1 to 3
>     (nothing is written for a missing stage).

It is allowed to have more than 1 entries in stage #1 to represent
multiple merge-base, so this needs to be rethought.

>   - 32-bit crc32 checksum over one conflicted entry.

There is no definition of "one conflicted entry"; be consistent and say
"Conflicted data" as what the section header claims to describe.

> == Design explanations
>
> [1] The directory and file offsets are included in the index format
>     to enable bisectability of the index, for binary searches.Updating
>     a single entry and partial reading will benefit from this.
>
> [2] The directories are saved in their own block, to be able to
>     quickly search for a directory in the index. They include a
>     offset to the (lexically) first file in the directory.
>
> [3] The data of the cache-tree extension and the resolve undo
>     extension is now part of the index itself, but if other extensions
>     come up in the future, there is no need to change the index, they
>     can simply be added at the end.
>
> [4] To avoid rewrites of the whole index when there are conflicts or
>     conflicts are being resolved, conflicted data will be stored at
>     the end of the index. To mark the conflict resolved, just a bit
>     has to be flipped. The data will still be there, if a user wants
>     to redo the conflict resolution.
>
> [5] Since only 4 modes are effectively allowed in git but 32-bit are
>     used to store them, having a two bit flag for the mode is enough
>     and saves 4 byte per entry.
>
> [6] The length of the file name was dropped, since each file name is
>     nul terminated anyway.

This is micronit, but I think we do this to save one strlen() for each
read of the entry, except for unusually long paths where we fall back to
strlen(). A change like this needs to be justified better than simply
saying "because we _could_ compute in a different way by spending extra
cycles".

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-03 18:16 ` Thomas Rast
@ 2012-05-03 19:03   ` Junio C Hamano
  2012-05-04  7:12   ` Michael Haggerty
  1 sibling, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2012-05-03 19:03 UTC (permalink / raw)
  To: Thomas Rast
  Cc: Thomas Gummerer, git, gitster, mhagger, peff, spearce, davidbarr

Thomas Rast <trast@student.ethz.ch> writes:

> So what's wrong in this thinking?

We tolerate giving occasional false positive "Yes" to "Has this changed?",
but do not allow false negative "No".  So the value st_{ino,dev,uid,gid}
gives us is strictly "these are not likely to change, but if even a single
bit in them change, we need to suspect that it may have changed."  The
primary field that protect us is mtime, and all the rest are more or less
belt-and-suspender safety.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-03 18:54 ` Junio C Hamano
@ 2012-05-03 19:11   ` Thomas Rast
  2012-05-03 19:31   ` Thomas Rast
  2012-05-03 21:38   ` Thomas Gummerer
  2 siblings, 0 replies; 49+ messages in thread
From: Thomas Rast @ 2012-05-03 19:11 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Thomas Gummerer, git, trast, mhagger, peff, spearce, davidbarr

Junio C Hamano <gitster@pobox.com> writes:

>> [6] The length of the file name was dropped, since each file name is
>>     nul terminated anyway.
>
> This is micronit, but I think we do this to save one strlen() for each
> read of the entry, except for unusually long paths where we fall back to
> strlen(). A change like this needs to be justified better than simply
> saying "because we _could_ compute in a different way by spending extra
> cycles".

(Partially also answering the whole "what are these offsets" confusion)

The bisectability has evolved to the point where we envision the
structure of a "flat list" (the directories, and the files in each
directory) to have the format

  offset to entry 1 [1]
  ....
  offset to entry n

  entry 1, consisting of:
    name, nul-terminated
    rest of data:
      for dirs: cache-tree sha1, offset to files, etc.
      for files: stat data, content sha1, flags, etc.

That makes bisection very easy: the offsets point at the start of each
string, so you just strcmp() and get on with it.

On the other hand, by the time you can look at the flags, it's too late
for the strlen() optimization anyway, so meh.  If you think it's
important, we can perhaps lay it out so the rest of the data goes
immediately after the pointer.  But so far it wasn't clear whether it's
fixed-size, or uses some smart compression scheme.  Tonight's edition
has a fixed length, so perhaps it would be preferable to keep the
length.

Footnotes: 
[1]  It probably doesn't matter whether this is relative to the position
of the offset, or absolute (in terms of file pointer).

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-03 18:54 ` Junio C Hamano
  2012-05-03 19:11   ` Thomas Rast
@ 2012-05-03 19:31   ` Thomas Rast
  2012-05-03 19:32     ` Thomas Rast
  2012-05-03 21:38   ` Thomas Gummerer
  2 siblings, 1 reply; 49+ messages in thread
From: Thomas Rast @ 2012-05-03 19:31 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Thomas Gummerer, git, trast, mhagger, peff, spearce, davidbarr

Junio C Hamano <gitster@pobox.com> writes:

> Thomas Gummerer <t.gummerer@gmail.com> writes:
>
>> I have been drafting the Version 5 of the index format over the past
>> few days with the help of Thomas Rast, Michael Haggerty, cmn and
>> barrbrain on IRC.
>
> Hrm, so if there is anything glaringly wrong below, should I reduce the
> "trustable reviewer karma point" from these people?  Or did you forget to
> say "but remaining errors are mine" ;-)?

Heh.

Partly it's my fault, I told Thomas to get this out tonight for a single
reason: Michael's CRC-over-stat idea is so radical that I wanted to know
whether there's anything wrong with it.  So the misunderstanding is
really that this is not anywhere near the final result.

But yeah:

> >   32-bit crc32 checksum over ctime seconds, ctime nanoseconds,
> >     ino, file size, dev, uid, gid (All stat(2) data except mtime) [7]
> 
> Giving occassional false positive to "did this change?" is acceptable, but
> any false negative is absolutely unacceptable.  How does this work with
> something like "racy git" situation (i.e. coming from "mtime happens to be
> the same as before") but due to crc32 collisions?
> 
> If there is no good answer to the above question, I would have to say that
> anybody who suggested or passed this through review loses all the
> accumulated reviewer karma points (if s/he has accumulated any, that is).

If this is a problem, then the fault is with me (and I do hope I have
some karma to lose...).

Note that the scenario you outlined is not an issue.  The entries other
than mtime and ctime are really only compared for equality, see
e.g. ce_match_stat_basic().  Comparisons for equality never have false
negatives with any hash function.  Collisions are false positives.

Upon closer reading I noticed that the ie_match_stat and ce_match_stat_*
family actually to distinguish between basically all the fields that can
change.  But this knowledge is never actually put to use, except that
there's an optimized code path where

* mode/type difference implies changed entry
* size difference implies changed entry

So that does constitute an argument to not put the size in the
stat-hash.  Mode and type aren't part of it in the proposed format
anyway.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-03 19:31   ` Thomas Rast
@ 2012-05-03 19:32     ` Thomas Rast
  2012-05-03 20:32       ` Junio C Hamano
  0 siblings, 1 reply; 49+ messages in thread
From: Thomas Rast @ 2012-05-03 19:32 UTC (permalink / raw)
  To: Thomas Rast
  Cc: Junio C Hamano, Thomas Gummerer, git, mhagger, peff, spearce,
	davidbarr

Thomas Rast <trast@student.ethz.ch> writes:

> Comparisons for equality never have false negatives with any hash
> function.  Collisions are false positives.

Bah, shouldn't hit "send" so fast.  There goes my karma :-D

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-03 17:25 Index format v5 Thomas Gummerer
                   ` (2 preceding siblings ...)
  2012-05-03 18:54 ` Junio C Hamano
@ 2012-05-03 19:38 ` solo-git
  2012-05-04 13:20 ` Nguyen Thai Ngoc Duy
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 49+ messages in thread
From: solo-git @ 2012-05-03 19:38 UTC (permalink / raw)
  To: git

On Thu, May 03, 2012 at 07:25:12PM +0200, Thomas Gummerer wrote:

>   160-bit object name for the object that would result from writing
>     this span of index as a tree.

Would now be a fun time to bring up the idea of eventually migrating
away from 160-bit hashes?  I love to keep reminding people that SHA-1
is showing its age[1].

Even if we don't add support for variable length hashes at this point,
it would be nice to at least have the code written in a relatively
hash-length-independent way, such that someone could come along and
make a new, very similar, format without rewriting much of the code.

If there is going to be support for different hash algorithms, or even
different length hashes, we need space in the file (before the first
hash, obviously) to indicate what we're using; currently reserved.

For this, we need to decide on how we're going to store the algorithm;
my guesses would be either:

* a (1-octet) enum, like tls (0: sha1/160, 1: sha2-256/160 [sha2's
256-bit mode truncated to 160-bits using a recommended truncation mode],
2: keccak-512, etc.); stopping at 127 (i.e. reserving the high-bit for
future use), or

* a lowascii string, with prefixed 1 (or 2?)-octet length, like ssh,
using the above format. If people believe that the chance of this being
implemented (within the lifespan of v5) is tiny, this octet can be
reserved; i.e. a zero-length string means sha1/160.

tl;dr: Please reserve one or more octet(s) early on for the details of
the hash, and think about people who want more than 20/40 octets
available when implementing it.

--
Chris West (FauxFaux).

[1]: http://csrc.nist.gov/groups/ST/hash/policy.html "...and must use
the SHA-2 family of hash functions for these applications after 2010."

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-03 19:32     ` Thomas Rast
@ 2012-05-03 20:32       ` Junio C Hamano
  0 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2012-05-03 20:32 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Thomas Gummerer, git, mhagger, peff, spearce, davidbarr

Thomas Rast <trast@student.ethz.ch> writes:

> Thomas Rast <trast@student.ethz.ch> writes:
>
>> Comparisons for equality never have false negatives with any hash
>> function.  Collisions are false positives.
>
> Bah, shouldn't hit "send" so fast.  There goes my karma :-D

Heh, don't take "karma" thing too seriously. I was merely trying to be
funny and I seem to have failed the attempt.

It depends on what question you are asking, and against "Did this change?"
question, saying "No, it didn't" when the right answer is "Yes, it did" is
a false negative.  You are asking for a false negative if you substitute
byte-for-byte comparison with comparison of hashed results of the values
that needs to be compared.

But as I said, mtime is the primary thing that protects us, so it may not
be such a big deal.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-03 18:21 ` Ronan Keryell
@ 2012-05-03 20:36   ` Thomas Gummerer
  0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-03 20:36 UTC (permalink / raw)
  To: Ronan Keryell; +Cc: git

On Thu, May 3, 2012 at 8:21 PM, Ronan Keryell
<Ronan.Keryell@hpc-project.com> wrote:
>>>>>> On Thu, 3 May 2012 19:25:12 +0200, Thomas Gummerer <t.gummerer@gmail.com> said:
>
>    Thomas> I have been drafting the Version 5 of the index format over
>    Thomas> the past few days with the help of Thomas Rast, Michael
>    Thomas> Haggerty, cmn and barrbrain on IRC. It will save with prefix
>    Thomas> compression on the path, and using a crc32 over the stat
>    Thomas> data, instead of the full data, since it is only used for
>    Thomas> checking if the file is changed. (Thanks Michael Haggerty
>    Thomas> for this hint. Unless we are missing something this will
>    Thomas> save another ~4 MB on the Webkit index.
>
> Great!
>
> But I wonder whether it may not worth to investigate a 64-bit version for
> the offsets and so on, just in case...

64-bit versions of the offsets were taken into consideration, but currently
the Webkit index (the largest I know) has a size of 26 Mb, which is
reduced to about 15 MB or less with the v5 format. With 32-bit we can
address 4GB, which is about 266 times the Webkit index. Therefore there
probably is no use for 64-bit offsets in the years to come.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-03 18:54 ` Junio C Hamano
  2012-05-03 19:11   ` Thomas Rast
  2012-05-03 19:31   ` Thomas Rast
@ 2012-05-03 21:38   ` Thomas Gummerer
  2012-05-07 18:57     ` Robin Rosenberg
  2 siblings, 1 reply; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-03 21:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, trast, mhagger, peff, spearce, davidbarr

On Thu, May 3, 2012 at 8:54 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Thomas Gummerer <t.gummerer@gmail.com> writes:
>
>> I have been drafting the Version 5 of the index format over the past
>> few days with the help of Thomas Rast, Michael Haggerty, cmn and
>> barrbrain on IRC.
>
> Hrm, so if there is anything glaringly wrong below, should I reduce the
> "trustable reviewer karma point" from these people?  Or did you forget to
> say "but remaining errors are mine" ;-)?

Heh yes most of the remaining errors are mine ;-) (And I hope nothing is
*that* wrong ;-) )

>> GIT index format
>> ================
>>
>> = The git index file has the following format
>>
>>   All binary numbers are in network byte order. Version 5 is described
>>   here.
>>
>>    - A 16-byte header consisting of
>>
>>      4-byte signature:
>>        The signature is { 'D', 'I', 'R', 'C' } (stands for "dircache")
>>
>>      4-byte version number:
>>        The current supported versions are 2, 3, 4 and 5.
>>
>>      32-bit number of directories.
>
>>      32-bit number of file entries.
>
> I take these two numbers mean "we have this many directory entries in the
> index" and "we have this many file entries in the index"; I found it
> unclear during my first scan.

Yes, that's exactly what they mean. I'll write this down more clearly in the
next version.

> Have you considered expressing these new "directories", "files" and
> associated data as a new (mandatory) extension?

Since they are the main part of the index, I don't consider them
extensions, unless we say "everything in the index is a extension"

>>    - Offset to the extensions.
>>
>>      32-bit number of extensions.
>
> Why is this necessary?  It means that the writer needs to enumerate how
> many extensions it is going to write before starting to write, unless it
> is willing to seek back and fill this.  And it is not like the reader will
> first allocate an array to hold uniformly sized extension and will be
> helped to have the number of entries upfront in order to do the
> allocation.
>
> What purpose does this serve?

It's to determine how many offsets the reader has to read (or to skip
to get to the first directory offset).

>>      32-bit number offset to the extension. (Possibly none, as many as
>>      indicated in the 4-byte number of extensions)
>
> Why is this needed?  It appears that there is no field that points at the
> beginning of array that stores "directory offsets", nor "file offsets", so
> I am assuming that you will be scanning the index file to find them.  Why
> can't the extensions be handled exactly the same way?

The offsets are necessary to quickly skip to the extension data, in case it
makes sense without knowing about the index/without reading the whole
index. The directories and files are different in the sense that the reader
needs to read to the first "directory offset" entry in any case, since it needs
to read the header in any case, then the number of extensions (and
if it doesn't know about any extensions skipping them), and then there come
the directory offsets. Those then have the pointer to the actual directory
data, which itself has the pointer to it's files. And since the files don't make
sense without knowing about the directory, there is no need for offsets for
the files.

>>    - A number of directory offsets (see below). [1]
>>
>>    - A number of sorted directories (see below). [2]
>>
>>    - 32-bit crc32 checksum for the header, extension offsets and directories.
>
> What does "32" in "crc32" stand for ? ;-)

32 bit I guess ;-) It's just standing there to make it clear on first
sight that there
is a 32-bit entry, since every other entry starts with the bit/byte count too.

>>    - A number of file offsets (see below). [1]
>>
>>    - A number of file entries (see below).
>>
>>    - A number of entries for conflicted data/resolved conflicts (see below).
>>
>>    - Extensions
>>
>>      Extensions are identified by signature. Optional extensions can
>>      be ignored if GIT does not understand them.
>>
>>      GIT supports an arbitrary number of extension, but currently none
>>      is implemented. [3]
>>
>>      4-byte extension signature. If the first byte is 'A'..'Z' the
>>      extension is optional and can be ignored.
>>
>>      32-bit size of the extension
>>
>>      32-bit crc32 checksum of the extension signature and size
>>
>>      Extension data
>>
>>
>> == Directory entry offsets
>
> The name "directory entry offset" does not seem to appear anywhere before
> this, so it is unclear what this section of the documentation is trying to
> describe.  Is this the same as "directory offsets" above?  In other words,
> "A number of directory offsets (see below)." above is an array of something,
> and this section is trying to describe what that something is?

Yes, exactly, it's the directory offsets above. I'll write this down more
clearly too.

>>   32-bit offset to the directory.
>
> I take it is "offset relative to the beginning of the file in the index".

Yes, that will be written more clearly too.

>>   This part is needed for making the directory entries bisectable and
>>     thus allowing a binary search.
>>
>> == Directory entry
>>
>>   Directory entries are sorted in lexicographic order by the name
>>   of their path starting with the root.
>>
>>   Path names (variable length) relative to top level directory (without the
>>     leading slash). '/' is used as path separator. '.' indicates the root
>>     directory. The special patch components ".." and ".git" (without quotes)
>>     are disallowed. Trailing slash is also disallowed.
>
> I understood "the root" to mean "the top level of the directory hierarchy"
> (i.e. the directory that corresponds to the top level of the working
> tree), but it needs to be explained better.  Using '.' for root sounds
> somewhat questionable, though.  Why not a string of length 0?

Yes, that's right. I thought about '.' for root since *nix environments do it
that way too. But the string of length 0 would also work better with the
lexicographic sorting.

> When an index represents a D/F conflict, some stages may have a directory
> D while others do not have D (but have a regular file at D).  Doesn't
> directory entry need to have a stage information?

I didn't think about that, I'll include that in the next version.

>>   1 nul byte to terminate the path.
>>
>>   32-bit offset to the first file of a directory
>
> What does this point at?  Does it point into the array of "file offsets"
> or the array of "file entries"?

It points to the "file offsets".

>>   32-bit offset to conflicted/resolved data at the end of the index.
>>     0 if there is no such data. [4]
>>
>>   4-byte number of subtrees this tree has
>
> Which is an undefined number unless you specify which stage you are
> talking about.

I'm talking about the number of subtrees, that are in the index. This
is used for further compressing the index, and not being redundant
with path components. E.g: we have a directory A/A then the index
saves it as

A...1Subtree...
A...0Subtrees...

(... is the rest of the data that's stored)

>>   4-byte number of entries in the index that is covered by the tree this
>>     entry represents. (entry_count) (-1 if the entry is invalid)
>>
>>   160-bit object name for the object that would result from writing
>>     this span of index as a tree.
>>
>>   The last 24 bytes are for the cache tree. An entry can be in an
>>     invalidated state which is represented by having -1 in the entry_count
>>     field. If an entry is in invalidated state, the next entry will begin
>>     after the number of subtrees, and the 160-bit object name is dropped.
>
> By "The last 24 bytes", do you mean the "4-byte number of entries..." and
> "160-bit object name"?

Yes, exactly.

>>   The entries are written out in the top-down, depth-first order. The
>>     first entry represents the root level of the repository, followed by
>>     the first subtree - let's call it A - of the root level, followed by
>>     the first subtree of A, ...
>>
>> == File entry offsets
>>
>>   32-bit offset to the directory.
>
> What directory?  The containing directory?  What does this point at?  Does
> it point into which array?

That's actually wrong it should say file entry, as it should below. That's what
I get from copy and pasting ;-)

>>   This part is needed for making the directory entries bisectable and
>>     thus allowing a binary search.
>>
>> == File entry
>>
>>   File entries are sorted in ascending order on the name field, after the
>>   respective offset given by the directory entries.
>>
>>   File name (variable length). Nul bytes are not allowed in file names and
>>     they have no leading slash. They are 7-bit ASCII encoded.
>
> Is this a name relative to its containing directory (i.e. without leading
> components)?  Or is it a full path relative to the top level of the
> working tree?

The path is relative to its containing directory, to save space, as in your
v4 index format.

> I have some UTF-8 encoded files in my repository.  Are they
> now disallowed?

Right I should have taken the old format there, and leave the
files in undefined encoding?

>>   1 nul byte to terminate the filename.
>>
>>   A 16-bit 'flags' field split into (high to low bits)
>>
>>     1-bit assume-valid flag
>
> Is this "assume unchanged"?

I have taken this from the old index format documentation, which
describes it as assume-valid flag, but I guess it's assume unchanged then.

>>     1-bit conflict flag
>>
>>     2-bit stage (during merge)
>
> Huh?  When stage #0 entry exists for a given path, no other stages for the
> same path can exist in the index.  By definition, that is how a conflicted
> path is resolved.  What is this separate "conflict flag" for?

I probably got this wrong, and the conflict flag isn't needed. So if I have a
stage #1 entry in the index, I'm sure that it's conflicted?

>>     2-bit mode (0 = 1000644 (regular file without execution
>>       permission), 1 = 1000755 (regular file with execution
>>       permission), 2 = 1010000 (symbolic link), 3 = 1110
>>       (gitlink)) [5]
>
> Don't penny-pinch bits like this.

Heh ok, I thought this saves 4-bytes per entry so it would be good,
but if it's better the other way we'll invest 16-bits for this.

>>     1-bit skip-worktree flag (used by sparse checkout)
>>
>>     1-bit intent-to-add flag (used by "git add -N")
>>
>>     8-bit unused, must be zero [6]
>>
>>   32-bit mtime seconds, the last time a file's data changed
>>     this is stat(2) data
>>
>>   32-bit mtime nanosecond fractions
>>     this is stat(2) data
>>
>>   32-bit crc32 checksum over ctime seconds, ctime nanoseconds,
>>     ino, file size, dev, uid, gid (All stat(2) data except mtime) [7]
>
> Giving occassional false positive to "did this change?" is acceptable, but
> any false negative is absolutely unacceptable.  How does this work with
> something like "racy git" situation (i.e. coming from "mtime happens to be
> the same as before") but due to crc32 collisions?
>
> If there is no good answer to the above question, I would have to say that
> anybody who suggested or passed this through review loses all the
> accumulated reviewer karma points (if s/he has accumulated any, that is).
>
>>   160-bit SHA-1 for the represented object
>>
>>   32-bit crc32 checksum for the file entry
>>
>> == Conflicted data
>
> I do not think the data described in this section should be conflicting.
> It ought to be data that describe conflicted state.  Perhaps you meant
> "Conflict data"?

Yes, it's conflict data. It's the data that used to have more entries in the
in the index, and it has been moved to the far end to avoid lots of rewrites
of the index, when a conflict is resolved.

>>   A conflict is represented in the index as a set of higher stage entries.
>>   These entries are stored at the end of the index. When a conflict is
>>   resolved (e.g. with "git add path"). A bit is flipped, to indicate that
>>   the conflict is resolved, but the entries will be kept, so that
>>   conflicts can be recreated (e.g. with "git checkout -m", in case users
>>   want to redo a conflict resolution from scratch.
>>
>>   - NUL-terminated filename of the entry
>
> Is this a name relative to its containing directory (i.e. without leading
> components)?  Or a full path relative to the top-level of the working
> tree?

This is again relative to it's containing directory. That's what the second
offsets in the directory entry part is about. (The offset is pointing to the
filename)

>>   - A 8-bit 'flags' field split into:
>>
>>     - 1-bit conflicted state (conflicted/resolved) (1 if conflicted)
>>
>>     - 7-bit unused
>>
>>   - Three 4-byte octal numbers, entry mode of entries in stage 1 to 3 (a
>>     missing stage is represented by "0" in this field);
>>     and
>>
>>   - At most three 160-bit object names of the entry in stages from 1 to 3
>>     (nothing is written for a missing stage).
>
> It is allowed to have more than 1 entries in stage #1 to represent
> multiple merge-base, so this needs to be rethought.

Ok, I'll rethink this. Thought it's this way, because it currently
works this way
in the resolve undo extension.

>>   - 32-bit crc32 checksum over one conflicted entry.
>
> There is no definition of "one conflicted entry"; be consistent and say
> "Conflicted data" as what the section header claims to describe.

Ok, I'll change that.


>> == Design explanations
>>
>> [1] The directory and file offsets are included in the index format
>>     to enable bisectability of the index, for binary searches.Updating
>>     a single entry and partial reading will benefit from this.
>>
>> [2] The directories are saved in their own block, to be able to
>>     quickly search for a directory in the index. They include a
>>     offset to the (lexically) first file in the directory.
>>
>> [3] The data of the cache-tree extension and the resolve undo
>>     extension is now part of the index itself, but if other extensions
>>     come up in the future, there is no need to change the index, they
>>     can simply be added at the end.
>>
>> [4] To avoid rewrites of the whole index when there are conflicts or
>>     conflicts are being resolved, conflicted data will be stored at
>>     the end of the index. To mark the conflict resolved, just a bit
>>     has to be flipped. The data will still be there, if a user wants
>>     to redo the conflict resolution.
>>
>> [5] Since only 4 modes are effectively allowed in git but 32-bit are
>>     used to store them, having a two bit flag for the mode is enough
>>     and saves 4 byte per entry.
>>
>> [6] The length of the file name was dropped, since each file name is
>>     nul terminated anyway.
>
> This is micronit, but I think we do this to save one strlen() for each
> read of the entry, except for unusually long paths where we fall back to
> strlen(). A change like this needs to be justified better than simply
> saying "because we _could_ compute in a different way by spending extra
> cycles".

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-03 18:16 ` Thomas Rast
  2012-05-03 19:03   ` Junio C Hamano
@ 2012-05-04  7:12   ` Michael Haggerty
  2012-05-07 22:18     ` Robin Rosenberg
  1 sibling, 1 reply; 49+ messages in thread
From: Michael Haggerty @ 2012-05-04  7:12 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Thomas Gummerer, git, gitster, peff, spearce, davidbarr

On 05/03/2012 08:16 PM, Thomas Rast wrote:
> Thomas Gummerer<t.gummerer@gmail.com>  writes:
>
>>    32-bit crc32 checksum over ctime seconds, ctime nanoseconds,
>>      ino, file size, dev, uid, gid (All stat(2) data except mtime) [7]
> [...]
>> [7] Since all stat data (except mtime and ctime) is just used for
>>      checking if a file has changed a checksum of the data is enough.
>>      In addition to that Thomas Rast suggested ctime could be ditched
>>      completely (core.trustctime=false) and thus included in the
>>      checksum. This would save 24 bytes per index entry, which would
>>      be about 4 MB on the Webkit index.
>>      (Thanks for the suggestion to Michael Haggerty)
>
> This is the part I'm most curious about.  Are we missing anything?
> Michael brought it up on IRC: the stat() results are only used to test
> whether they are still the same, with the exception of the mtime (which
> also undergoes raciness checks).
>
> As far as I can see, none of st_{ino,dev,uid,gid} are useful for
> anything.  st_size might conceivably be used as a hint for a buffer
> size, but nobody actually does that.  The ctime undergoes stricter
> checks, but AFAICS it's also all about whether it has changed, and
> besides that can be turned off.  We think all of those fields can be
> replaced by an arbitrary hash/CRC and only tested for equality.  32 bits
> should be plenty, probably even if we just xor the values together.

XOR is definitely *not* adequate; for example, changing uid=gid="you" to 
uid=gid="me" would not affect the XOR of the values (assuming, as is 
often the case, that each user has his own uid/gid with the same 
numerical values).

Which hash to use depends on some estimate of the likelihood that the 
hashes collide and simultaneously that the other metadata coincide.  It 
seems to me that CRC-32 would be adequate.  But if not, a longer hash 
could be used (albeit with less space savings).

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-03 17:25 Index format v5 Thomas Gummerer
                   ` (3 preceding siblings ...)
  2012-05-03 19:38 ` solo-git
@ 2012-05-04 13:20 ` Nguyen Thai Ngoc Duy
  2012-05-04 15:44   ` Thomas Gummerer
  2012-05-04 13:25 ` Philip Oakley
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 49+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-05-04 13:20 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git, trast, gitster, mhagger, peff, spearce, davidbarr

On Fri, May 4, 2012 at 12:25 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> GIT index format
> ================
>
> = The git index file has the following format
>
>  All binary numbers are in network byte order. Version 5 is described
>  here.
>   ...
>   - A number of directory offsets (see below). [1]
>
>   - A number of sorted directories (see below). [2]
>
>   - 32-bit crc32 checksum for the header, extension offsets and directories.

So we use one checksum for all dirs? I thought we could do checksum
per dir, so if I'm interested in path/to/here only, I only need to
verify data of three directories.

> == Directory entry offsets
>
>  32-bit offset to the directory.
>
>  This part is needed for making the directory entries bisectable and
>    thus allowing a binary search.

How is this (I assume) array ordered? The same top-down depth-first
with "Directory entry" section below? I can see ordering as
top-down/breadth-first help bsearch though.

> == Directory entry
>
>  Directory entries are sorted in lexicographic order by the name
>  of their path starting with the root.
>
>  Path names (variable length) relative to top level directory (without the
>    leading slash). '/' is used as path separator. '.' indicates the root
>    directory. The special patch components ".." and ".git" (without quotes)
>    are disallowed. Trailing slash is also disallowed.
>
>  1 nul byte to terminate the path.

I don't see it mention prefix compression here, nor in "file entry"
section. Does it use it here? If so I don't think prefix compression
plays well with bsearch (on path name). In the worst case you may have
to process up to the first entry in order to get a path name (e.g. a
directory with entries "a", "aa", "aaa", "aaaa"...)

>  The entries are written out in the top-down, depth-first order. The
>    first entry represents the root level of the repository, followed by
>    the first subtree - let's call it A - of the root level, followed by
>    the first subtree of A, ...

So depth-first traversal becomes natural even without the help of
directory offset table above. Nice.

> == File entry
>
>  File entries are sorted in ascending order on the name field, after the
>  respective offset given by the directory entries.

I wonder if we need to keep file entry table separate from directory
entry. It feels more natural to put the sequence of file entries of a
directory right after the directory entry, might help read-ahead too
during traversal. You save 4 bytes (for file entry offset) in each
directory entry. You still have file offset table for random access.

>  File name (variable length). Nul bytes are not allowed in file names and
>    they have no leading slash. They are 7-bit ASCII encoded.

Why can't it be 8-bit? I suppose file name is also prefix compressed?
-- 
Duy

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-03 17:25 Index format v5 Thomas Gummerer
                   ` (4 preceding siblings ...)
  2012-05-04 13:20 ` Nguyen Thai Ngoc Duy
@ 2012-05-04 13:25 ` Philip Oakley
  2012-05-04 15:46   ` Junio C Hamano
  2012-05-06 10:23 ` Nguyen Thai Ngoc Duy
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 49+ messages in thread
From: Philip Oakley @ 2012-05-04 13:25 UTC (permalink / raw)
  To: Thomas Gummerer, git; +Cc: trast, gitster, mhagger, peff, spearce, davidbarr

From: "Thomas Gummerer" <t.gummerer@gmail.com> Sent: Thursday, May 03, 2012 
6:25 PM
>I have been drafting the Version 5 of the index format over the past
> few days with the help of Thomas Rast, Michael Haggerty, cmn and
> barrbrain on IRC. It will save with prefix compression on the path, and
> using a crc32 over the stat data, instead of the full data, since it is 
> only
> used for checking if the file is changed. (Thanks Michael Haggerty for
> this hint. Unless we are missing something this will save another
> ~4 MB on the Webkit index.
>
>
>
> GIT index format
> ================
>
> = The git index file has the following format
>
xxx
>
> == Directory entry
>
>  Directory entries are sorted in lexicographic order by the name
>  of their path starting with the root.
>
>  Path names (variable length) relative to top level directory (without the
>    leading slash). '/' is used as path separator. '.' indicates the root
>    directory. The special patch components ".." and ".git" (without 
> quotes)
>    are disallowed. Trailing slash is also disallowed.
>

Does the prohibition of ".git" prevent the potential for versioning of the 
.git directory itself (e.g. .gitignore the pack & objects themselves)?

It should be possible for one's current repo status to be tracked and 
summarised by a single sha1, say as an indepenedent branch, as you would the 
code itself.

[my use case is in a managed environment where one has plenty of networked 
project storage, but no real availability of a network server, so anybody 
and his mate could corrupt ones network repo with badly thought through 
tweaks. The lack of a server is the root cause, but that's not something the 
project can fix, so thinks ... 'version the meta data...']

I can see that the lack of the leading "/", and the ".." are a policy 
statement about paths being relative to top level directory, with direct 
path referencing. The typical avoidance of ".git" should be a note about how 
regular git works, rather than an absolute prohibition.

Philip 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-04 13:20 ` Nguyen Thai Ngoc Duy
@ 2012-05-04 15:44   ` Thomas Gummerer
  0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-04 15:44 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy
  Cc: git, trast, gitster, mhagger, peff, spearce, davidbarr

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=unknown-8bit, Size: 4205 bytes --]



On 05/04, Nguyen Thai Ngoc Duy wrote:
> On Fri, May 4, 2012 at 12:25 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> > GIT index format
> > ================
> >
> > = The git index file has the following format
> >
> >  All binary numbers are in network byte order. Version 5 is described
> >  here.
> >   ...
> >   - A number of directory offsets (see below). [1]
> >
> >   - A number of sorted directories (see below). [2]
> >
> >   - 32-bit crc32 checksum for the header, extension offsets and directories.
> 
> So we use one checksum for all dirs? I thought we could do checksum
> per dir, so if I'm interested in path/to/here only, I only need to
> verify data of three directories.

Good point. Not sure how they could exactly be implemented, but probably
one checksum for offset + directory data. I'll definitely think about this.

> > == Directory entry offsets
> >
> >  32-bit offset to the directory.
> >
> >  This part is needed for making the directory entries bisectable and
> >    thus allowing a binary search.
> 
> How is this (I assume) array ordered? The same top-down depth-first
> with "Directory entry" section below? I can see ordering as
> top-down/breadth-first help bsearch though.

True, the breadth-first approach might be better, since we are using
prefix compression for the pathname. It will need some more offsets
(or calculation, but should still be faster)

> > == Directory entry
> >
> >  Directory entries are sorted in lexicographic order by the name
> >  of their path starting with the root.
> >
> >  Path names (variable length) relative to top level directory (without the
> >    leading slash). '/' is used as path separator. '.' indicates the root
> >    directory. The special patch components ".." and ".git" (without quotes)
> >    are disallowed. Trailing slash is also disallowed.
> >
> >  1 nul byte to terminate the path.
> 
> I don't see it mention prefix compression here, nor in "file entry"
> section. Does it use it here? If so I don't think prefix compression
> plays well with bsearch (on path name). In the worst case you may have
> to process up to the first entry in order to get a path name (e.g. a
> directory with entries "a", "aa", "aaa", "aaaa"...)

I planned to use prefix compression here, which would benefit especially
the reader (we're reading more often then writing). By designing the
offsets carefully we should still be able to get log(n) (n = number of
directories in the index) search time for a directory.

> >  The entries are written out in the top-down, depth-first order. The
> >    first entry represents the root level of the repository, followed by
> >    the first subtree - let's call it A - of the root level, followed by
> >    the first subtree of A, ...
> 
> So depth-first traversal becomes natural even without the help of
> directory offset table above. Nice.
> 
> > == File entry
> >
> >  File entries are sorted in ascending order on the name field, after the
> >  respective offset given by the directory entries.
> 
> I wonder if we need to keep file entry table separate from directory
> entry. It feels more natural to put the sequence of file entries of a
> directory right after the directory entry, might help read-ahead too
> during traversal. You save 4 bytes (for file entry offset) in each
> directory entry. You still have file offset table for random access.

The reason for this design choice is the fast searching of a directory, 
(for partial reading or changing a single file in the index). Keeping
them separate also simplifies the reading of the cache-tree, which will
be included in the directory section. Instead of offsets to the first file
we'd need offsets to the next directory to enable fast reading of the
cache-tree.

> >  File name (variable length). Nul bytes are not allowed in file names and
> >    they have no leading slash. They are 7-bit ASCII encoded.
> 
> Why can't it be 8-bit? I suppose file name is also prefix compressed?

I changed that, the file name can have UTF8 or ASCII encoding, as it was
allowed in the old index.

--
Thomas

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-04 13:25 ` Philip Oakley
@ 2012-05-04 15:46   ` Junio C Hamano
  0 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2012-05-04 15:46 UTC (permalink / raw)
  To: Philip Oakley
  Cc: Thomas Gummerer, git, trast, mhagger, peff, spearce, davidbarr

"Philip Oakley" <philipoakley@iee.org> writes:

> Does the prohibition of ".git" prevent the potential for versioning of
> the .git directory itself (e.g. .gitignore the pack & objects
> themselves)?

It is a good thing that you are forbidden from adding .git/anything to the
index and the trees.  The information inside your .git/ repository does
have to change from time to time.  For example, you may change your e-mail
address and update user.email in your .git/config file.

But the history of them and the history of the evolution of the project
itself are distinct [*1*]. When you clone git.git, you have no business
learning what is in .git/config in my repository.  The project should not
be able to overwrite what is in your .git/hooks/. These are only a few
reasons why an attempt to "git add .git/something" is always a mistake.

And it is irrelevant to this discussion, as the rest of .git does not
allow putting .git anyway.  In that sense, the index "format" does not
have to enforce it.  For that reason, I would suggest dropping the mention
of ".git" (but not "..") from the format description.

[Footnote]

*1* You could choose to track the contents of .git/ in another repository
with creative uses of GIT_DIR/GIT_WORK_TREE/core.worktree and friends.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-03 17:25 Index format v5 Thomas Gummerer
                   ` (5 preceding siblings ...)
  2012-05-04 13:25 ` Philip Oakley
@ 2012-05-06 10:23 ` Nguyen Thai Ngoc Duy
  2012-05-07 13:44   ` Thomas Gummerer
  2012-05-06 16:49 ` Phil Hord
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 49+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-05-06 10:23 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git, trast, gitster, mhagger, peff, spearce, davidbarr

On Fri, May 4, 2012 at 12:25 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> == Directory entry
>
>  Directory entries are sorted in lexicographic order by the name
>  of their path starting with the root.
>
>  Path names (variable length) relative to top level directory (without the
>    leading slash). '/' is used as path separator. '.' indicates the root
>    directory. The special patch components ".." and ".git" (without quotes)
>    are disallowed. Trailing slash is also disallowed.
>
>  1 nul byte to terminate the path.
>
>  32-bit offset to the first file of a directory
>
>  32-bit offset to conflicted/resolved data at the end of the index.
>    0 if there is no such data. [4]

If it's non-zero, how do we know how many conflict entries we have?

>  4-byte number of subtrees this tree has

let's name this nr_subtrees

>  4-byte number of entries in the index that is covered by the tree this
>    entry represents. (entry_count) (-1 if the entry is invalid)

and this nr_entries.

So how do we know how many entries (including all dirs, files, staged
files) this directory has? I assume if enry_count != -1, the number
would be nr_subtrees + nr_entries (or just nr_entries, depending on
your definition). When entry_count == -1, how do we calculate this
number?

>  160-bit object name for the object that would result from writing
>    this span of index as a tree.
>
>  The last 24 bytes are for the cache tree. An entry can be in an
>    invalidated state which is represented by having -1 in the entry_count
>    field. If an entry is in invalidated state, the next entry will begin
>    after the number of subtrees, and the 160-bit object name is dropped.
>
>  The entries are written out in the top-down, depth-first order. The
>    first entry represents the root level of the repository, followed by
>    the first subtree - let's call it A - of the root level, followed by
>    the first subtree of A, ...

Assume the command is "git diff -- path/to/h*", we don't need full
index, just stuff in "path/to/h*" from the index. I'm trying to see
how to load just those paths from index, not full index.

I assume again that you won't invent a new function and use
tree_entry_interesting() to do tree pruning while loading index.
t_e_i() is designed to read tree objects. But I think we can make it
read on-disk directory/file entries with a few small changes. t_e_i()
is recursive and fits quite well with depth-first directory layout in
the proposed index format.

I have difficulties figuring out how you skip subtrees though. Assume
we are at "path" and we are not interested in anything there until we
meet "path/to", how do you skip subtrees "path/abc" and "path/def"?
Processing directory entries sequentially will eventually get us to
"path/to", but that could be a lot of entries if "path/abc" is deep. A
file offset pointer to the next sibling directory entry might help.
Does such a pointer exist but I did not see it, or you have other
means to do this?

Also the file/dir separation makes it more difficult to match the last
"h*" part, if there are both "here" directory and "howto" file.
-- 
Duy

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-03 17:25 Index format v5 Thomas Gummerer
                   ` (6 preceding siblings ...)
  2012-05-06 10:23 ` Nguyen Thai Ngoc Duy
@ 2012-05-06 16:49 ` Phil Hord
  2012-05-07 13:08   ` Thomas Gummerer
  2012-05-07 15:15 ` Michael Haggerty
  2012-05-13 21:01 ` Philip Oakley
  9 siblings, 1 reply; 49+ messages in thread
From: Phil Hord @ 2012-05-06 16:49 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git, trast, gitster, mhagger, peff, spearce, davidbarr

On Thu, May 3, 2012 at 1:25 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> I have been drafting the Version 5 of the index format over the past
> few days with the help of Thomas Rast, Michael Haggerty, cmn and
> barrbrain on IRC. It will save with prefix compression on the path, and
> using a crc32 over the stat data, instead of the full data, since it is only
> used for checking if the file is changed. (Thanks Michael Haggerty for
> this hint. Unless we are missing something this will save another
> ~4 MB on the Webkit index.

...

> == Directory entry
>
>  Directory entries are sorted in lexicographic order by the name
>  of their path starting with the root.
>
>  Path names (variable length) relative to top level directory (without the
>    leading slash). '/' is used as path separator. '.' indicates the root
>    directory. The special patch components ".." and ".git" (without quotes)
>    are disallowed. Trailing slash is also disallowed.

typo: The special _path_ components ".." and ".git" ...

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-06 16:49 ` Phil Hord
@ 2012-05-07 13:08   ` Thomas Gummerer
  0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-07 13:08 UTC (permalink / raw)
  To: Phil Hord; +Cc: git, trast, gitster, mhagger, peff, spearce, davidbarr

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=unknown-8bit, Size: 386 bytes --]

> >  Path names (variable length) relative to top level directory (without the
> >    leading slash). '/' is used as path separator. '.' indicates the root
> >    directory. The special patch components ".." and ".git" (without quotes)
> >    are disallowed. Trailing slash is also disallowed.
> 
> typo: The special _path_ components ".." and ".git" ...

Thanks, I changed it.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-06 10:23 ` Nguyen Thai Ngoc Duy
@ 2012-05-07 13:44   ` Thomas Gummerer
  0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-07 13:44 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy
  Cc: git, trast, gitster, mhagger, peff, spearce, davidbarr

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=unknown-8bit, Size: 4265 bytes --]





On 05/06, Nguyen Thai Ngoc Duy wrote:
> On Fri, May 4, 2012 at 12:25 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> > == Directory entry
> >
> >  Directory entries are sorted in lexicographic order by the name
> >  of their path starting with the root.
> >
> >  Path names (variable length) relative to top level directory (without the
> >    leading slash). '/' is used as path separator. '.' indicates the root
> >    directory. The special patch components ".." and ".git" (without quotes)
> >    are disallowed. Trailing slash is also disallowed.
> >
> >  1 nul byte to terminate the path.
> >
> >  32-bit offset to the first file of a directory
> >
> >  32-bit offset to conflicted/resolved data at the end of the index.
> >    0 if there is no such data. [4]
> 
> If it's non-zero, how do we know how many conflict entries we have?

Right, I'll add this to the next version. 

(32-bit number of conflicted/resolved data entries at the end of the
  index if the offset is non 0.)

> >  4-byte number of subtrees this tree has
> 
> let's name this nr_subtrees
> 
> >  4-byte number of entries in the index that is covered by the tree this
> >    entry represents. (entry_count) (-1 if the entry is invalid)
> 
> and this nr_entries.
> 
> So how do we know how many entries (including all dirs, files, staged
> files) this directory has? I assume if enry_count != -1, the number
> would be nr_subtrees + nr_entries (or just nr_entries, depending on
> your definition). When entry_count == -1, how do we calculate this
> number?

You're right. In order to maintain the bisectability we'll have to
keep the entry count up to date.

> >  160-bit object name for the object that would result from writing
> >    this span of index as a tree.
> >
> >  The last 24 bytes are for the cache tree. An entry can be in an
> >    invalidated state which is represented by having -1 in the entry_count
> >    field. If an entry is in invalidated state, the next entry will begin
> >    after the number of subtrees, and the 160-bit object name is dropped.
> >
> >  The entries are written out in the top-down, depth-first order. The
> >    first entry represents the root level of the repository, followed by
> >    the first subtree - let's call it A - of the root level, followed by
> >    the first subtree of A, ...
> 
> Assume the command is "git diff -- path/to/h*", we don't need full
> index, just stuff in "path/to/h*" from the index. I'm trying to see
> how to load just those paths from index, not full index.
> 
> I assume again that you won't invent a new function and use
> tree_entry_interesting() to do tree pruning while loading index.
> t_e_i() is designed to read tree objects. But I think we can make it
> read on-disk directory/file entries with a few small changes. t_e_i()
> is recursive and fits quite well with depth-first directory layout in
> the proposed index format.
> 
> I have difficulties figuring out how you skip subtrees though. Assume
> we are at "path" and we are not interested in anything there until we
> meet "path/to", how do you skip subtrees "path/abc" and "path/def"?
> Processing directory entries sequentially will eventually get us to
> "path/to", but that could be a lot of entries if "path/abc" is deep. A
> file offset pointer to the next sibling directory entry might help.
> Does such a pointer exist but I did not see it, or you have other
> means to do this?

I have changed the index format slightly, not to use prefix compression
on the directory entries, so that binary search through the index gets
simple. Using the directory offsets, we can binary search to the
path/to. We always have log(n) (n is the number of directories) search
time for each path.

> Also the file/dir separation makes it more difficult to match the last
> "h*" part, if there are both "here" directory and "howto" file.

It doesn't make it more difficult to find, since we can first just
check if there exists the directory with a binary search (and
possibly more directories, around it) and then search for the file
in the superdirectory (can you say that?) with another binary search.

-- 
Thomas

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-03 17:25 Index format v5 Thomas Gummerer
                   ` (7 preceding siblings ...)
  2012-05-06 16:49 ` Phil Hord
@ 2012-05-07 15:15 ` Michael Haggerty
  2012-05-08 14:11   ` Thomas Gummerer
  2012-05-13 21:01 ` Philip Oakley
  9 siblings, 1 reply; 49+ messages in thread
From: Michael Haggerty @ 2012-05-07 15:15 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git, trast, gitster, peff, spearce, davidbarr

On 05/03/2012 07:25 PM, Thomas Gummerer wrote:
> I have been drafting the Version 5 of the index format over the past
> few days with the help of Thomas Rast, Michael Haggerty, cmn and
> barrbrain on IRC. It will save with prefix compression on the path, and
> using a crc32 over the stat data, instead of the full data, since it is only
> used for checking if the file is changed. (Thanks Michael Haggerty for
> this hint. Unless we are missing something this will save another
> ~4 MB on the Webkit index.
>
>
>
> GIT index format
> ================
 > [...]

Here are some comments about the format document, version 55047b3d. 
They probably overlap with other feedback that you have gotten, but I 
don't have time to cross-correlate everything.  Hope it helps.

Overall
=======

I find the format definition pretty difficult to read.  The following
suggestions might make it easier to follow.

* You might want to give each of the elements C-like names; e.g.,
   "ndir (32 bits): number of directories in the index".  These names
   could be used as unambiguous references from other parts of the
   documentation.  The names could also be used in the implementing
   code, to make the correspondence with the format definition easier
   to follow.

* Name things consistently.  For example, "A number of sorted
   directories (see below)" should be changed to "A number of directory
   entries (see below), sorted by path name".

* Putting the above two things together, the description "A number of
   sorted directories" might become "direntries (ndir * directory
   entries): one directory entry for each of the ndir directories in
   the index, sorted by path name".

* Are all "offsets" relative to the start of the index file?  If so,
   it would be good to state this explicitly at the start (and maybe
   rename them "file positions"?).  If not, please be very explicit
   about where each offset is measured from, and whether it is signed
   or unsigned.

* You seem to switch randomly between counting sizes in bits vs bytes.
   Please try to be more consistent.  BTW, I think the size of an SHA1
   object name is more commonly described as "20 bytes" rather than
   "160 bits".

* The details of the extension data blocks are described in the first
   (overview) section, whereas it seems like they should be described
   in their own section following the "conflict data" section.  But
   wouldn't the presence of extension data blocks prevent the addition
   of conflict data?

* Are there situations (for example during conflicts?) when it is
   allowed to have directory and file entries with the same name?

* Does the index file include directory entries for empty directories?
   What about directories that contain only other directories?

Overview
========

* Does "32-bit size of the extension" include the whole extension data
   block (including header) or only the "extension data"?

Directory entry
===============

* "4-byte number of entries in the index that is covered by the tree
   this entry represents."  What does this include?
   Files/directories/both?  Recursive or non-recursive?

* "160-bit object name for the object that would result from writing
   this span of index as a tree."  Is this always valid?

* It might be convenient to store directory names with trailing '/'
   characters (except for the top-level directory, which can be stored
   as "").  That way (1) it is trivial to concatenate the directory
   name and the filename to produce the file path; (2) directories and
   files can be distinguished by name and therefore intermingled in
   lists; (3) such lists can be sorted using strcmp() without having to
   simulate an invisible '/' character.

File entry
==========

* I believe that only the basename is stored, not the whole path.  But
   then why is the encoding for '/' specified (there should be no '/'
   characters)?

* Why is the encoding of '.' specified?  Is it somehow significant for
   the index file format?

* Are file entries sorted by entire path or only by the basename?

Flat loading
============

* I found the explanation pretty incomprehensible.  Perhaps some
   pseudo-code would make it clearer?

* Since I can't understand the explanation, I'm not sure if this
   comment makes any sense.  But when traversing into a subdirectory,
   don't *all* of the remaining files from the parent directory need to
   be tucked away somewhere?

* At an even higher level of abstraction, your directory entry
   contains a count "number of entries in the index that is covered by
   the tree this entry represents".  If this count is recursive, then
   it seems like it would be enough to know how many entries will come
   from traversing a whole subdirectory tree.  So it should be possible
   to skip that many entries in the in-memory array and continue
   reading the file entries for the parent subdirectory.  For example,
   suppose our files are [A, B/1, B/2/a, B/2/b, B/3, C].  If I start by
   reading the files in the root directory, then I can fill the index
   array entries

       [A, -, -, -, -, C]

   because when I see that "B" is a directory containing a total of
   four entries, I just leave fours spaces for them in the array and
   continue with the next file, "C".  Of course I would have to
   remember (e.g., on a queue) that directory "B" still needs to be
   processed, and the starting index in the array where its entries
   have to be filled in.  Still, this jumping around would be in the
   RAM holding the index array pointers rather than in the file
   positions.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-03 21:38   ` Thomas Gummerer
@ 2012-05-07 18:57     ` Robin Rosenberg
  0 siblings, 0 replies; 49+ messages in thread
From: Robin Rosenberg @ 2012-05-07 18:57 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Junio C Hamano, git, trast, mhagger, peff, spearce, davidbarr

Thomas Gummerer skrev 2012-05-03 23.38:
> On Thu, May 3, 2012 at 8:54 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> Thomas Gummerer <t.gummerer@gmail.com> writes:
>>>    1 nul byte to terminate the filename.
>>>
>>>    A 16-bit 'flags' field split into (high to low bits)
>>>
>>>      1-bit assume-valid flag
>>
>> Is this "assume unchanged"?
>

> I have taken this from the old index format documentation, which
> describes it as assume-valid flag, but I guess it's assume unchanged then.

The bit for "assume unchanged" is called CE_VALID in the code.

-- robin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-04  7:12   ` Michael Haggerty
@ 2012-05-07 22:18     ` Robin Rosenberg
  0 siblings, 0 replies; 49+ messages in thread
From: Robin Rosenberg @ 2012-05-07 22:18 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Thomas Rast, Thomas Gummerer, git, gitster, peff, spearce,
	davidbarr

Michael Haggerty skrev 2012-05-04 09.12:
> On 05/03/2012 08:16 PM, Thomas Rast wrote:
>> Thomas Gummerer<t.gummerer@gmail.com>  writes:
>>
>>>    32-bit crc32 checksum over ctime seconds, ctime nanoseconds,
>>>      ino, file size, dev, uid, gid (All stat(2) data except mtime) [7]
>> [...]
>>> [7] Since all stat data (except mtime and ctime) is just used for
>>>      checking if a file has changed a checksum of the data is enough.
>>>      In addition to that Thomas Rast suggested ctime could be ditched
>>>      completely (core.trustctime=false) and thus included in the
>>>      checksum. This would save 24 bytes per index entry, which would
>>>      be about 4 MB on the Webkit index.
>>>      (Thanks for the suggestion to Michael Haggerty)
>>
>> This is the part I'm most curious about.  Are we missing anything?
>> Michael brought it up on IRC: the stat() results are only used to test
>> whether they are still the same, with the exception of the mtime (which
>> also undergoes raciness checks).
>>
>> As far as I can see, none of st_{ino,dev,uid,gid} are useful for
>> anything.  st_size might conceivably be used as a hint for a buffer
>> size, but nobody actually does that.  The ctime undergoes stricter
>> checks, but AFAICS it's also all about whether it has changed, and
>> besides that can be turned off.  We think all of those fields can be
>> replaced by an arbitrary hash/CRC and only tested for equality.  32 bits
>> should be plenty, probably even if we just xor the values together.
>
> XOR is definitely *not* adequate; for example, changing uid=gid="you" to uid=gid="me"
 > would not affect the XOR of the values (assuming, as is often the case, that each user
> has his own uid/gid with the same numerical values).

If you change uid/gid, that has no relevance for the content that git tracks. If the CRC
is equal you have to check the content. Ideally a change that does not change the content
should not change the CRC either, so there is really no absolute need to see that change.

I assume the idea is that if you do "tar xvf" or something like that, then changes in file,
mtime etc could be picked up by looking at these attributes, but it seems that those that
mess with mtime such that it goes back in time are out of luck with git anyway.

> Which hash to use depends on some estimate of the likelihood that the hashes collide and
 > simultaneously that the other metadata coincide.  It seems to me that CRC-32 would
> be adequate.  But if not, a longer hash could be used (albeit with less space savings).
>
> Michael
>

JGit simply ignores ctime, ino, dev, uid and gid.  The real reason is of course that
standard Java does not have an API for these extra attributes. On the the other hand
nobody is going to fix this bug. The reason is that if you follow the rule that mtime
must always change to "now" if content change, then all changes will be found simply
by looking at mtime or performing a content check for the racy case.  Those that mess
with mtime tend to be unhappy anyway.

Then there is the issue of how often we can detect changes without checking content. Ino
usually changes, but when it changes mtime usually does too, so how often does it speed
up.

Has anyone instrumented git to see how much the different attributes actually
contribute to performance and accuracy?

I'd like to extend the size field to 64 bits. We rarely need the extra bits, but we
cannot differ between 3 bytes and 4294967299 bytes so avoiding the very expensive
content check there would be welcome, even it it's a rare event. I haven't thought
too much about this though. I just felt uncomfortable when looking at the code and
knowing that performing a content check of a 4 GB file could take a minute or two.

-- robin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-07 15:15 ` Michael Haggerty
@ 2012-05-08 14:11   ` Thomas Gummerer
  2012-05-08 14:25     ` Nguyen Thai Ngoc Duy
  2012-05-09  8:37     ` Michael Haggerty
  0 siblings, 2 replies; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-08 14:11 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, trast, gitster, peff, spearce, davidbarr

On 05/07, Michael Haggerty wrote:
> Here are some comments about the format document, version 55047b3d.
> They probably overlap with other feedback that you have gotten, but
> I don't have time to cross-correlate everything.  Hope it helps.

Thanks for the feedback!

For those who may not know yet, I've created a wiki on github, since
this file is going through a lot of revisions, it may not be a good
idea to post every revision on the mailing list.
https://github.com/tgummerer/git/wiki/Index-format-v5

> Overall
> =======
> 
> I find the format definition pretty difficult to read.  The following
> suggestions might make it easier to follow.

Thanks, I've incorporated your feedback. I hope it's easier readable now.

> [...] 
> * You seem to switch randomly between counting sizes in bits vs bytes.
>   Please try to be more consistent.  BTW, I think the size of an SHA1
>   object name is more commonly described as "20 bytes" rather than
>   "160 bits".

All sizes are now in bits. I have taken the 160 bits for the SHA1 from
the old index documentation in Documentation/technical. I chose to
leave it as 160bits for now, as all the other sizes were changed to
bits too.

> * The details of the extension data blocks are described in the first
>   (overview) section, whereas it seems like they should be described
>   in their own section following the "conflict data" section.  But
>   wouldn't the presence of extension data blocks prevent the addition
>   of conflict data?

Only the details that should be there for every extension are described
in the overview (the header of the extension), to make sure every
extension has the same header format, and thus a reader which doesn't
understand a specific extension still can read its header and know 
what's going on.

They won't prevent the addition of conflicted data, since when a
conflict is created, other files were probably added and the index has
to be rewritten anyway. Once the conflict is resolved however only a
bit has to be flipped, so there is no rewrite necessary.

> * Are there situations (for example during conflicts?) when it is
>   allowed to have directory and file entries with the same name?

Yes, that's why I have added the stage data to the directory.

> * Does the index file include directory entries for empty directories?
>   What about directories that contain only other directories?

In theory the index is able to include empty directories. I'm however
not sure if this should be implemented. I'd be happy to get more
feedback there.

> Overview
> ========
> 
> * Does "32-bit size of the extension" include the whole extension data
>   block (including header) or only the "extension data"?

It includes only the extension data. It's clarified in the documentation
now.

> Directory entry
> ===============
> 
> * "4-byte number of entries in the index that is covered by the tree
>   this entry represents."  What does this include?
>   Files/directories/both?  Recursive or non-recursive?

This is from the cache-tree. I'm not sure but I think it includes both
files and directories, recursively.

> * "160-bit object name for the object that would result from writing
>   this span of index as a tree."  Is this always valid?

No, this is only valid if the entry count is not -1. It's clarified
now.

> * It might be convenient to store directory names with trailing '/'
>   characters (except for the top-level directory, which can be stored
>   as "").  That way (1) it is trivial to concatenate the directory
>   name and the filename to produce the file path; (2) directories and
>   files can be distinguished by name and therefore intermingled in
>   lists; (3) such lists can be sorted using strcmp() without having to
>   simulate an invisible '/' character.

Good point. Changed this in the documentation.

> File entry
> ==========
> 
> * I believe that only the basename is stored, not the whole path.  But
>   then why is the encoding for '/' specified (there should be no '/'
>   characters)?
>
> * Why is the encoding of '.' specified?  Is it somehow significant for
>   the index file format?

Yes, you are right, only the basename is stored. '.' and '/' don't need
a specific encoding, it's removed from the documentation.

> * Are file entries sorted by entire path or only by the basename?

They are sorted by the basename, in the respective block of their
directories.
Example: paths: a/a a/z b/b
File entries in the index:
a ...
z ...
b ...

> Flat loading
> ============
> 
> * I found the explanation pretty incomprehensible.  Perhaps some
>   pseudo-code would make it clearer?
> 
> * Since I can't understand the explanation, I'm not sure if this
>   comment makes any sense.  But when traversing into a subdirectory,
>   don't *all* of the remaining files from the parent directory need to
>   be tucked away somewhere?
> 
> * At an even higher level of abstraction, your directory entry
>   contains a count "number of entries in the index that is covered by
>   the tree this entry represents".  If this count is recursive, then
>   it seems like it would be enough to know how many entries will come
>   from traversing a whole subdirectory tree.  So it should be possible
>   to skip that many entries in the in-memory array and continue
>   reading the file entries for the parent subdirectory.  For example,
>   suppose our files are [A, B/1, B/2/a, B/2/b, B/3, C].  If I start by
>   reading the files in the root directory, then I can fill the index
>   array entries
> 
>       [A, -, -, -, -, C]
> 
>   because when I see that "B" is a directory containing a total of
>   four entries, I just leave fours spaces for them in the array and
>   continue with the next file, "C".  Of course I would have to
>   remember (e.g., on a queue) that directory "B" still needs to be
>   processed, and the starting index in the array where its entries
>   have to be filled in.  Still, this jumping around would be in the
>   RAM holding the index array pointers rather than in the file
>   positions.

The entry_count in the index is only valid, if the cache-tree is valid,
which is not always the case. Therefore it's impossible to rely on that
for the reading. I have changed the flat loading in the documentation,
hope it's more understandable now.

--
Thomas

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-08 14:11   ` Thomas Gummerer
@ 2012-05-08 14:25     ` Nguyen Thai Ngoc Duy
  2012-05-08 14:34       ` Nguyen Thai Ngoc Duy
  2012-05-09  8:37     ` Michael Haggerty
  1 sibling, 1 reply; 49+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-05-08 14:25 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Michael Haggerty, git, trast, gitster, peff, spearce, davidbarr

On Tue, May 8, 2012 at 9:11 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> * "160-bit object name for the object that would result from writing
>>   this span of index as a tree."  Is this always valid?
>
> No, this is only valid if the entry count is not -1. It's clarified
> now.

..and..

> The entry_count in the index is only valid, if the cache-tree is valid,
> which is not always the case.

I think your trees are the cache-trees already. For invalid
cache-trees, you can just use all-zero sha-1 as the indicator. Then
entry_count can go away.
-- 
Duy

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-08 14:25     ` Nguyen Thai Ngoc Duy
@ 2012-05-08 14:34       ` Nguyen Thai Ngoc Duy
  2012-05-10  6:53         ` Thomas Gummerer
  0 siblings, 1 reply; 49+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-05-08 14:34 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Michael Haggerty, git, trast, gitster, peff, spearce, davidbarr

Sorry I replied too fast.

On Tue, May 8, 2012 at 9:25 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Tue, May 8, 2012 at 9:11 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>>> * "160-bit object name for the object that would result from writing
>>>   this span of index as a tree."  Is this always valid?
>>
>> No, this is only valid if the entry count is not -1. It's clarified
>> now.
>
> ..and..
>
>> The entry_count in the index is only valid, if the cache-tree is valid,
>> which is not always the case.
>
> I think your trees are the cache-trees already. For invalid
> cache-trees, you can just use all-zero sha-1 as the indicator. Then
> entry_count can go away.

Furthermore, in directory entry format:

  The last 24 bytes (4-byte number of entries + 160-bit object name) are
    for the cache tree. An entry can be in an invalidated state which is
    represented by having -1 in the entry_count field. If an entry is in
    invalidated state, the next entry will begin after the number of
    subtrees, and the 160-bit object name is dropped.

Dropping objname out of invalid (cache-)trees is a bad idea. When you
generate tree objects (aka cache_tree_update), you'll need objname
field again, which means structure change and directory entry rewrite.
If objname is always there, you can just overwrite objname with new
value. Though this may bring race condition issue back to directory
entries. The same approach on file entries might be reused.
-- 
Duy

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-08 14:11   ` Thomas Gummerer
  2012-05-08 14:25     ` Nguyen Thai Ngoc Duy
@ 2012-05-09  8:37     ` Michael Haggerty
  2012-05-10 12:19       ` Thomas Gummerer
  1 sibling, 1 reply; 49+ messages in thread
From: Michael Haggerty @ 2012-05-09  8:37 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git, trast, gitster, peff, spearce, davidbarr

On 05/08/2012 04:11 PM, Thomas Gummerer wrote:
>> * The details of the extension data blocks are described in the first
>>    (overview) section, whereas it seems like they should be described
>>    in their own section following the "conflict data" section.  But
>>    wouldn't the presence of extension data blocks prevent the addition
>>    of conflict data?
>
> Only the details that should be there for every extension are described
> in the overview (the header of the extension), to make sure every
> extension has the same header format, and thus a reader which doesn't
> understand a specific extension still can read its header and know
> what's going on.
>
> They won't prevent the addition of conflicted data, since when a
> conflict is created, other files were probably added and the index has
> to be rewritten anyway. Once the conflict is resolved however only a
> bit has to be flipped, so there is no rewrite necessary.

In other words, the presence of extensions *does indeed* prevent the 
addition of conflict data, but you don't think that it is a problem.

Moving the conflict data to after the extensions, on the other hand, 
would mean that conflict data can sometimes be added without a rewrite. 
  I cannot judge whether this would be useful.

Handling conflict data *as* an extension would allow the conflict data 
to be added at any time without rewriting.  I cannot judge whether this 
would be useful.

>> * Does the index file include directory entries for empty directories?
>>    What about directories that contain only other directories?
>
> In theory the index is able to include empty directories. I'm however
> not sure if this should be implemented. I'd be happy to get more
> feedback there.

Currently git does not keep track of empty directories.  Even though 
there have been proposals to fix this, it is far beyond the scope of 
your project to implement the handling of empty directories.  The 
question is whether your format definition *forbids* the presence of 
empty directories in the index file (in the interest of definiteness, 
and it might make the reader implementation a little bit simpler, but it 
imposes a constraint on the writer).  Obviously empty directories, even 
if present, mustn't have an effect on the SHA1 of the trees containing them.

>> Directory entry
>> ===============
>>
>> * "4-byte number of entries in the index that is covered by the tree
>>    this entry represents."  What does this include?
>>    Files/directories/both?  Recursive or non-recursive?
>
> This is from the cache-tree. I'm not sure but I think it includes both
> files and directories, recursively.

Please figure this out for the final spec.

>> File entry
>> ==========
>> [...]
>
>> * Are file entries sorted by entire path or only by the basename?
>
> They are sorted by the basename, in the respective block of their
> directories.
> Example: paths: a/a a/z b/b
> File entries in the index:
> a ...
> z ...
> b ...

OK, so in other words, the file entries of all files in a directory (not 
including files in subdirectories) are stored contiguously, sorted by 
basename.  (The thing that wasn't immediately clear is whether files 
from subdirectories are intermingled with those of the parent directory.)

>> Flat loading
>> ============
>>
>> * I found the explanation pretty incomprehensible.  Perhaps some
>>    pseudo-code would make it clearer?
>> [...]
> [...] I have changed the flat loading in the documentation,
> hope it's more understandable now.

Maybe it's just be, but I still don't think it is very clear.  Here is 
version fbf8add1b026:

> == Flat loading
>
> Since internally git expects and works with lexicografic ordering,
> a simple linear scan throught the subdirectories doesn't give
> the right internal sorting. To achieve the right internal sorting
> the loading will be done in the following way:
>
> 1. Start with the root directory, and read also the name of the
>   first subdirectory (=next directory in the list).
>
> 1a. Use the next directory (the one against which the filenames
>   were checked previously), and read the next directory name,
>   to check the files against.
>
> 2. Check the stack if the element at the top is < then the current
>   directoryname.
>
>   If it's < then current directory name, add files from the stack
>     to the entry list, until the file name is > then the
>     directory name.
>
> 2. While filename < directoryname add the filenames to the entry
>   list
>
> 3. Add the rest of the files to a stack.
>
> 4. Continue with 1a, if there are more directories left.
>
> 5. Add the rest of the files from the stack to the end of the
>   entry list.

Aside from the fact that there are two number (2)s,

* What does "Use the next directory (the one against which the filenames 
were checked previously)" mean?  What does it mean to "use a directory"? 
  Does it mean to recurse into the directory?  Is the stack preserved 
passed down to the recursive function calls, or does each level of the 
recursion have its own stack?  What does "against which the filenames 
were checked previously" mean (there are no filenames mentioned in the 
earlier steps)?

* You talk about a stack, and "Add the rest of the files to a stack". 
But when you retrieve entries from a stack, they come out in reverse 
order.  So are you imagining that each element of the stack is an array 
of file entries?  Or do you push the files onto the stack in reverse 
order?  Or do you really mean a queue rather than a stack?

* Are the file entries read before they are put on the stack, or does 
the stack just remember where to read them from when their turn comes?

* "Continue with 1a, if there are more directories left": I assume you 
mean subdirectories of the current directory, but maybe you are talking 
about all directories?

There is a reason that I asked for pseudocode, namely because it forces 
you to be more precise in your description.  I can certainly imagine 
several workable algorithms for reading the index file, and the 
different algorithms have different tradeoffs particularly regarding the 
amount of temporary space needed and locality of reference in the index 
file (which, I understand, will be mmapped when practical but it is not 
practical on all platforms).  Once you express the algorithm in 
pseudocode it is possible to be sure which variant you have chosen and 
consider whether it is really workable.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-08 14:34       ` Nguyen Thai Ngoc Duy
@ 2012-05-10  6:53         ` Thomas Gummerer
  2012-05-10 11:06           ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-10  6:53 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy
  Cc: Michael Haggerty, git, trast, gitster, peff, spearce, davidbarr

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=unknown-8bit, Size: 1925 bytes --]



On 05/08, Nguyen Thai Ngoc Duy wrote:
> Sorry I replied too fast.
> 
> On Tue, May 8, 2012 at 9:25 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> > On Tue, May 8, 2012 at 9:11 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> >>> * "160-bit object name for the object that would result from writing
> >>>   this span of index as a tree."  Is this always valid?
> >>
> >> No, this is only valid if the entry count is not -1. It's clarified
> >> now.
> >
> > ..and..
> >
> >> The entry_count in the index is only valid, if the cache-tree is valid,
> >> which is not always the case.
> >
> > I think your trees are the cache-trees already. For invalid
> > cache-trees, you can just use all-zero sha-1 as the indicator. Then
> > entry_count can go away.

How is it a cache-tree already? The subtree is covered, but the 
entry_count is calculated recursively, while nfiles only keeps track of
the files directly in the directory, which is used for bisectability.

> Furthermore, in directory entry format:
> 
>   The last 24 bytes (4-byte number of entries + 160-bit object name) are
>     for the cache tree. An entry can be in an invalidated state which is
>     represented by having -1 in the entry_count field. If an entry is in
>     invalidated state, the next entry will begin after the number of
>     subtrees, and the 160-bit object name is dropped.
> 
> Dropping objname out of invalid (cache-)trees is a bad idea. When you
> generate tree objects (aka cache_tree_update), you'll need objname
> field again, which means structure change and directory entry rewrite.
> If objname is always there, you can just overwrite objname with new
> value. Though this may bring race condition issue back to directory
> entries. The same approach on file entries might be reused.

Yes you're right, at least the field for the object name (even if 0)
should always be there.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-10  6:53         ` Thomas Gummerer
@ 2012-05-10 11:06           ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 49+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-05-10 11:06 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Michael Haggerty, git, trast, gitster, peff, spearce, davidbarr

On Thu, May 10, 2012 at 1:53 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>
>
> On 05/08, Nguyen Thai Ngoc Duy wrote:
>> Sorry I replied too fast.
>>
>> On Tue, May 8, 2012 at 9:25 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
>> > On Tue, May 8, 2012 at 9:11 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> >>> * "160-bit object name for the object that would result from writing
>> >>>   this span of index as a tree."  Is this always valid?
>> >>
>> >> No, this is only valid if the entry count is not -1. It's clarified
>> >> now.
>> >
>> > ..and..
>> >
>> >> The entry_count in the index is only valid, if the cache-tree is valid,
>> >> which is not always the case.
>> >
>> > I think your trees are the cache-trees already. For invalid
>> > cache-trees, you can just use all-zero sha-1 as the indicator. Then
>> > entry_count can go away.
>
> How is it a cache-tree already? The subtree is covered, but the
> entry_count is calculated recursively, while nfiles only keeps track of
> the files directly in the directory, which is used for bisectability.

Cache-tree maintains the information that what part of the index
already has corresponding tree objects (and what sha-1). If a tree is
changed, the tree all the way up must be invalidated (i.e. throw out
the old sha-1 of trees). You have trees and objname for each tree.
Invalidating a tree is simply erasing objname. Before performing a
commit, we just need to traverse the trees from leaves up, generate
tree objects for trees that do not have valid objname yet.
-- 
Duy

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-09  8:37     ` Michael Haggerty
@ 2012-05-10 12:19       ` Thomas Gummerer
  2012-05-10 18:17         ` Michael Haggerty
  0 siblings, 1 reply; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-10 12:19 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, trast, gitster, peff, spearce, davidbarr

On 05/09, Michael Haggerty wrote:
> On 05/08/2012 04:11 PM, Thomas Gummerer wrote:
> >>* The details of the extension data blocks are described in the first
> >>   (overview) section, whereas it seems like they should be described
> >>   in their own section following the "conflict data" section.  But
> >>   wouldn't the presence of extension data blocks prevent the addition
> >>   of conflict data?
> >
> >Only the details that should be there for every extension are described
> >in the overview (the header of the extension), to make sure every
> >extension has the same header format, and thus a reader which doesn't
> >understand a specific extension still can read its header and know
> >what's going on.
> >
> >They won't prevent the addition of conflicted data, since when a
> >conflict is created, other files were probably added and the index has
> >to be rewritten anyway. Once the conflict is resolved however only a
> >bit has to be flipped, so there is no rewrite necessary.
> 
> In other words, the presence of extensions *does indeed* prevent the
> addition of conflict data, but you don't think that it is a problem.

Exactly.

> Moving the conflict data to after the extensions, on the other hand,
> would mean that conflict data can sometimes be added without a
> rewrite.  I cannot judge whether this would be useful.
>
> Handling conflict data *as* an extension would allow the conflict
> data to be added at any time without rewriting.  I cannot judge
> whether this would be useful.

Since there are offsets in the directory data to the conflicted data
I don't think it's good to call this data extension data. It may
however be beneficial to have the conflict data after the extension.
I'll investigate this.

> >>* Does the index file include directory entries for empty directories?
> >>   What about directories that contain only other directories?
> >
> >In theory the index is able to include empty directories. I'm however
> >not sure if this should be implemented. I'd be happy to get more
> >feedback there.
> 
> Currently git does not keep track of empty directories.  Even though
> there have been proposals to fix this, it is far beyond the scope of
> your project to implement the handling of empty directories.  The
> question is whether your format definition *forbids* the presence of
> empty directories in the index file (in the interest of
> definiteness, and it might make the reader implementation a little
> bit simpler, but it imposes a constraint on the writer).  Obviously
> empty directories, even if present, mustn't have an effect on the
> SHA1 of the trees containing them.

No, the index format doesn't forbid the presence of empty directories.
Empty directories will have a fileoffset of 0, and the reader will
just ignore them as long as there is no empty directory tracking.

> >>Directory entry
> >>===============
> >>
> >>* "4-byte number of entries in the index that is covered by the tree
> >>   this entry represents."  What does this include?
> >>   Files/directories/both?  Recursive or non-recursive?
> >
> >This is from the cache-tree. I'm not sure but I think it includes both
> >files and directories, recursively.
> 
> Please figure this out for the final spec.

It includes only files, in a recursive manner. I've written this down
in the spec.

> >>File entry
> >>==========
> >>[...]
> >
> >>* Are file entries sorted by entire path or only by the basename?
> >
> >They are sorted by the basename, in the respective block of their
> >directories.
> >Example: paths: a/a a/z b/b
> >File entries in the index:
> >a ...
> >z ...
> >b ...
> 
> OK, so in other words, the file entries of all files in a directory
> (not including files in subdirectories) are stored contiguously,
> sorted by basename.  (The thing that wasn't immediately clear is
> whether files from subdirectories are intermingled with those of the
> parent directory.)

Yes, exactly.

> >>Flat loading
> >>============
> >>
> >>* I found the explanation pretty incomprehensible.  Perhaps some
> >>   pseudo-code would make it clearer?
> >>[...]
> >[...] I have changed the flat loading in the documentation,
> >hope it's more understandable now.
> 
> Maybe it's just be, but I still don't think it is very clear.  Here
> is version fbf8add1b026:
> 
> >== Flat loading
> >
> >Since internally git expects and works with lexicografic ordering,
> >a simple linear scan throught the subdirectories doesn't give
> >the right internal sorting. To achieve the right internal sorting
> >the loading will be done in the following way:
> >
> >1. Start with the root directory, and read also the name of the
> >  first subdirectory (=next directory in the list).
> >
> >1a. Use the next directory (the one against which the filenames
> >  were checked previously), and read the next directory name,
> >  to check the files against.
> >
> >2. Check the stack if the element at the top is < then the current
> >  directoryname.
> >
> >  If it's < then current directory name, add files from the stack
> >    to the entry list, until the file name is > then the
> >    directory name.
> >
> >2. While filename < directoryname add the filenames to the entry
> >  list
> >
> >3. Add the rest of the files to a stack.
> >
> >4. Continue with 1a, if there are more directories left.
> >
> >5. Add the rest of the files from the stack to the end of the
> >  entry list.
> 
> [..] 
> There is a reason that I asked for pseudocode, namely because it
> forces you to be more precise in your description.  I can certainly
> imagine several workable algorithms for reading the index file, and
> the different algorithms have different tradeoffs particularly
> regarding the amount of temporary space needed and locality of
> reference in the index file (which, I understand, will be mmapped
> when practical but it is not practical on all platforms).  Once you
> express the algorithm in pseudocode it is possible to be sure which
> variant you have chosen and consider whether it is really workable.

Ok, here is the variant in pseudo code. I hope it's understandable
this way. It needs some temporary space, but never more then the
actual entries will need in the end anyway.

== Flat loading

Since internally git expects and works with lexicografic ordering,
a simple linear scan throught the subdirectories doesn't give
the right internal sorting. To achieve the right internal sorting
the loading will be done in the following way:

The data structure is a stack of queues, to allow continous reading
of the file.

s -> queue1
t -> queue2
a -> queue3
c -> queue4
k -> queue5

dirs = read_all_directories

foreach dir in dirs do
    file = read_next_file

    while element_on_top_of_stack.first_element < nextdir
        indexentries.append(dequeue(element_on_top_of_stack))
        if element_on_top_of_stack == emtpy:
            remove_element_on_top_of_stack

    if file[filename] < nextdir
        indexentries.append(file)
    else
        queue.add(file)
        foreach f in rest_of_files_in_directory:
            queue.add(f)
        stack.push(queue)

foreach queue in stack:
    foreach entry in queue:
        indexentry.append(entry)

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-10 12:19       ` Thomas Gummerer
@ 2012-05-10 18:17         ` Michael Haggerty
  2012-05-11 17:12           ` Thomas Gummerer
  0 siblings, 1 reply; 49+ messages in thread
From: Michael Haggerty @ 2012-05-10 18:17 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git, trast, gitster, peff, spearce, davidbarr

On 05/10/2012 02:19 PM, Thomas Gummerer wrote:
> == Flat loading
>
> Since internally git expects and works with lexicografic ordering,
> a simple linear scan throught the subdirectories doesn't give
> the right internal sorting. To achieve the right internal sorting
> the loading will be done in the following way:
>
> The data structure is a stack of queues, to allow continous reading
> of the file.
>
> s ->  queue1
> t ->  queue2
> a ->  queue3
> c ->  queue4
> k ->  queue5
>
> dirs = read_all_directories
>
> foreach dir in dirs do
>      file = read_next_file
>
>      while element_on_top_of_stack.first_element<  nextdir
>          indexentries.append(dequeue(element_on_top_of_stack))
>          if element_on_top_of_stack == emtpy:
>              remove_element_on_top_of_stack
>
>      if file[filename]<  nextdir
>          indexentries.append(file)
>      else
>          queue.add(file)
>          foreach f in rest_of_files_in_directory:
>              queue.add(f)
>          stack.push(queue)
>
> foreach queue in stack:
>      foreach entry in queue:
>          indexentry.append(entry)

1. It seems to me that the first time that a file with a filename before 
nextdir is encountered, the reading of the directory containing the file 
will be terminated.  This would, for example, be the case for any 
directory that contains multiple files but no subdirectories.

2. There is still a lot that is unnecessarily obscure.  For example, I 
suppose (but you don't say) that "rest_of_files_in_directory" means to 
read the files at that moment.  It would be more explicit (and no more 
verbose) to write

     while (f = read_next_file()) != NULL:
         queue.add(f)

3. You don't cover corner cases, like when read_next_file() is called 
but there are no files left in the directory, or when there is no 
nextdir (which itself is not defined).  OK, this pseudocode is only 
meant to be illustrative, so I guess we can wait until your real 
implementation to see such details.  On the other hand, you probably 
want to get all the details straight in pseudocode or Python before you 
start translating it into C.

4. I think the algorithm would be easier to understand and implement if 
it were written recursively.  The call stack would replace your explicit 
stack (but you would still need one queue per directory level).  A key 
observation is that when "nextdir" precedes the next file, then all of 
the files in subdirectories of nextdir do as well.  Thus an obvious 
recursion would be to call a function like 
read_all_files_under_dir(indexentries, dirs, dirindex) at this point, 
which would consume all of the directories that are subdirectories of 
dirs[dirindex] (i.e., all directories whose names have the string 
dirs[dirindex] as a prefix).  Using this method would mean that there is 
no need to compare each file under dirs[dirindex] against the next file 
in the outer directory.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-10 18:17         ` Michael Haggerty
@ 2012-05-11 17:12           ` Thomas Gummerer
  2012-05-13 19:50             ` Michael Haggerty
  0 siblings, 1 reply; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-11 17:12 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, trast, gitster, peff, spearce, davidbarr

On 05/10, Michael Haggerty wrote:
> On 05/10/2012 02:19 PM, Thomas Gummerer wrote:
> >== Flat loading
> >
> >Since internally git expects and works with lexicografic ordering,
> >a simple linear scan throught the subdirectories doesn't give
> >the right internal sorting. To achieve the right internal sorting
> >the loading will be done in the following way:
> >
> >The data structure is a stack of queues, to allow continous reading
> >of the file.
> >
> >s ->  queue1
> >t ->  queue2
> >a ->  queue3
> >c ->  queue4
> >k ->  queue5
> >
> >dirs = read_all_directories
> >
> >foreach dir in dirs do
> >     file = read_next_file
> >
> >     while element_on_top_of_stack.first_element<  nextdir
> >         indexentries.append(dequeue(element_on_top_of_stack))
> >         if element_on_top_of_stack == emtpy:
> >             remove_element_on_top_of_stack
> >
> >     if file[filename]<  nextdir
> >         indexentries.append(file)
> >     else
> >         queue.add(file)
> >         foreach f in rest_of_files_in_directory:
> >             queue.add(f)
> >         stack.push(queue)
> >
> >foreach queue in stack:
> >     foreach entry in queue:
> >         indexentry.append(entry)
> 
> 1. It seems to me that the first time that a file with a filename
> before nextdir is encountered, the reading of the directory
> containing the file will be terminated.  This would, for example, be
> the case for any directory that contains multiple files but no
> subdirectories.
> 
> 2. There is still a lot that is unnecessarily obscure.  For example,
> I suppose (but you don't say) that "rest_of_files_in_directory"
> means to read the files at that moment.  It would be more explicit
> (and no more verbose) to write
> 
>     while (f = read_next_file()) != NULL:
>         queue.add(f)
> 
> 3. You don't cover corner cases, like when read_next_file() is
> called but there are no files left in the directory, or when there
> is no nextdir (which itself is not defined).  OK, this pseudocode is
> only meant to be illustrative, so I guess we can wait until your
> real implementation to see such details.  On the other hand, you
> probably want to get all the details straight in pseudocode or
> Python before you start translating it into C.
> 
> 4. I think the algorithm would be easier to understand and implement
> if it were written recursively.  The call stack would replace your
> explicit stack (but you would still need one queue per directory
> level).  A key observation is that when "nextdir" precedes the next
> file, then all of the files in subdirectories of nextdir do as well.
> Thus an obvious recursion would be to call a function like
> read_all_files_under_dir(indexentries, dirs, dirindex) at this
> point, which would consume all of the directories that are
> subdirectories of dirs[dirindex] (i.e., all directories whose names
> have the string dirs[dirindex] as a prefix).  Using this method
> would mean that there is no need to compare each file under
> dirs[dirindex] against the next file in the outer directory.
> 
> Michael

Thanks for your feedback! To get clearer code I've now written a
working reader for the v5 index format in Python. The full reader
would probably be to long for the mailing list, but here is the
interesting part:


def readfiles(directories, dirnr, entries):
    global filedata
    f.seek(directories[dirnr]["foffset"])
    offset = struct.unpack("!I", fread(4))[0]
    f.seek(offset)
    filedata = list()
    queue = list()
    i = 0
    while i < directories[dirnr]["nfiles"]:
        filedata.append(struct.pack("!I", f.tell()))
        filename = ""
        byte = fread(1)
        while byte != '\0':
            filename += byte
            byte = fread(1)

        data = struct.unpack("!HHIII", fread(16))
        objhash = fread(20)
        readcrc = struct.pack("!i", binascii.crc32("".join(filedata)))
        crc = f.read(4)
        if readcrc != crc:
            print "Wrong CRC: " + filename
        filedata = list()

        i += 1

        queue.append(dict({"name": directories[dirnr]["pathname"] + filename, "flags": data[0], "mode": data[1], "mtimes": data[2], "mtimens": data[3], "statcrc": data[4], "objhash": binascii.hexlify(objhash)}))

    if len(directories) > dirnr:
        i = 0
        while i < len(queue):
            if len(directories) - 1 > dirnr and queue[i]["name"] > directories[dirnr + 1]["pathname"]:
                entries, dirnr = readfiles(directories, dirnr + 1, entries)
            else:
                entries.append(queue[i])
                i += 1
        return entries, dirnr

The full reader can be found here:
https://github.com/tgummerer/git/blob/pythonprototype/git-read-index-v5.py

-- 
Thomas

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-11 17:12           ` Thomas Gummerer
@ 2012-05-13 19:50             ` Michael Haggerty
  2012-05-14 15:01               ` Thomas Gummerer
  0 siblings, 1 reply; 49+ messages in thread
From: Michael Haggerty @ 2012-05-13 19:50 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git, trast, gitster, peff, spearce, davidbarr

On 05/11/2012 07:12 PM, Thomas Gummerer wrote:
> Thanks for your feedback! To get clearer code I've now written a
> working reader for the v5 index format in Python. The full reader
> would probably be to long for the mailing list, but here is the
> interesting part:
>
> [...]
> The full reader can be found here:
> https://github.com/tgummerer/git/blob/pythonprototype/git-read-index-v5.py

Good.

I tried to review your code 3fe08f9b:git-read-index-v5.py and compare it 
to your file spec f858cf6a9:Index-format-v5.textile.  I have the 
following comments (some of them already discussed in IRC):

1. Your script seems to be reading a different version of the file than 
described in the spec.  [When I mentioned this on IRC you pushed a new 
version a4ee558ea of the spec.]

2. Your script seems to assume that the index file has no extensions. 
It would be better (for documentation purposes and to ensure that there 
are no surprises) to make sure that the code knows how to handle extensions.

3. Please document briefly how the scripts should be used.

4. Please limit line length to 80 columns (like the main git project).

5. Python has a nicer way to initialize dictionaries whose keys are all 
valid identifiers; for example:

-        return dict({"signature": signature, "vnr": header[0], "ndir": 
header[1], "nfile": header[2], "next": header[3]})
+        return dict(signature=signature, vnr=header[0], ndir=header[1],
+                    nfile=header[2], next=header[3])

6. Some of your print statements are just begging to be written using 
string interpolation; e.g.,

-        print d["pathname"] + " " + str(d["flags"]) + " " + 
str(d["foffset"]) + " " + str(d["cr"]) + " " + str(d["ncr"]) + " " + 
str(d["nsubtrees"]) + " " + str(d["nfiles"]) + " " + str(d["nentries"]) 
+ " " + str(binascii.hexlify(d["objname"]))
+        print ("%(pathname)s %(flags)s %(foffset)s %(cr)s %(ncr)s "
+               "%(nsubtrees)s %(nfiles)s %(nentries)s " % d
+               + str(binascii.hexlify(d["objname"])))

printheader() can be rewritten similarly.

7. You have a couple of while loops that would be easier to read if 
written as for loops.

8. There is no need to use global variables.  Global variables have lots 
of disadvantages, one of which is that it is hard to tell what functions 
have side effects via the global variables.  It is better to pass the 
needed variables explicitly to functions that need them.

9. ...after you eliminate the global variables, you will see that the 
checksums are mostly needed over limited areas of code then can be 
discarded.  Rewriting the checksum handling in this way would make it 
easier to see exactly what range of bytes is included in a particular 
checksum.

10. There is no need to keep track of all of the data that will go into 
a checksum.  The CRC32 checksum can be computed incrementally via the 
second argument of binascii.crc32(data, crc).  Therefore, you only need 
to retain a 32-bit running checksum instead of the filedata array of 
data strings.

11. It is bad style to generate output from within the 
readindexentries() function.  Given that it reads the whole array of 
file entries anyway, it would be cleaner to return the array to the 
caller and let the caller print out what it wants.

12. Your handling of checksum errors is inconsistent.  In some places 
you generate exceptions; in another you simply print an error to stdout 
(not stderr!) and proceed to use the corrupt data.

13. It is probably clearer to unpack the tuples returned by 
struct.unpack() directly into local variables with meaningful names 
instead of carrying them around as a tuple; e.g.,

-    header = struct.unpack('!IIII', checksum.add(f.read(16)))
+    (vnr, ndir, nfile, next) = struct.unpack('!IIII', fread(16))

14. It is more correct to check the file signature and version 
explicitly before plowing into the rest of the file (that's what they're 
there for!)

That's as far as I've got.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-03 17:25 Index format v5 Thomas Gummerer
                   ` (8 preceding siblings ...)
  2012-05-07 15:15 ` Michael Haggerty
@ 2012-05-13 21:01 ` Philip Oakley
  2012-05-14 14:54   ` Thomas Gummerer
  9 siblings, 1 reply; 49+ messages in thread
From: Philip Oakley @ 2012-05-13 21:01 UTC (permalink / raw)
  To: Thomas Gummerer, Git List
  Cc: trast, gitster, mhagger, peff, spearce, davidbarr

From: "Thomas Gummerer" <t.gummerer@gmail.com> Sent: Thursday, May 03, 2012 6:25 PM
> I have been drafting the Version 5 of the index format over the past
> few days with the help of Thomas Rast, Michael Haggerty, cmn and
> barrbrain on IRC. 

>
> GIT index format
> ================
>
> = The git index file has the following format
>

Given the discussions on the list about the general naming of Staging vs Index [1], 
would a careful change to the title, and the adding of an introductory line 
help in putting the index file format in the appropriate (implementation) context?

I'm thinking that perhaps -
Title: "GIT index file format (V5)", i.e. add the 'file' qualifier.

Introduction line:
"The git index file (.git/index) documents the status of the files in the git staging area."
    i.e. this is an implementation document for this particular file, but using the terms
    suggested in [1]. Followed by
"The staging area is used for preparing commits, merging, etc.".
   i.e. show the purpose of this index relative to the overall 'staging area'. 
   IIRC the use of the staging area for merging was one of Linus's key features;-)

By crafting the title and the introduction line(s) the confusion e.g. [2], between implementation 
details (this document) and conceptual operation can be clearly separated.

Philip

[1] http://article.gmane.org/gmane.comp.version-control.git/197111 
    [1.8.0] use 'stage' term consistently

[2] http://raflabs.com/blogs/silence-is-foo/2011/04/07/staging-area-index-cache-git/
    ... Git it's a little bit confusing to undersand some of its terminology

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-13 21:01 ` Philip Oakley
@ 2012-05-14 14:54   ` Thomas Gummerer
  0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-14 14:54 UTC (permalink / raw)
  To: Philip Oakley; +Cc: Git List, trast, gitster, mhagger, peff, spearce, davidbarr



On 05/13, Philip Oakley wrote:
> From: "Thomas Gummerer" <t.gummerer@gmail.com> Sent: Thursday, May 03, 2012 6:25 PM
> >I have been drafting the Version 5 of the index format over the past
> >few days with the help of Thomas Rast, Michael Haggerty, cmn and
> >barrbrain on IRC.
> 
> >
> >GIT index format
> >================
> >
> >= The git index file has the following format
> >
> 
> Given the discussions on the list about the general naming of
> Staging vs Index [1], would a careful change to the title, and the
> adding of an introductory line help in putting the index file format
> in the appropriate (implementation) context?
> 
> I'm thinking that perhaps -
> Title: "GIT index file format (V5)", i.e. add the 'file' qualifier.
> 
> Introduction line:
> "The git index file (.git/index) documents the status of the files in the git staging area."
>    i.e. this is an implementation document for this particular file, but using the terms
>    suggested in [1]. Followed by
> "The staging area is used for preparing commits, merging, etc.".
>   i.e. show the purpose of this index relative to the overall
> 'staging area'.   IIRC the use of the staging area for merging was
> one of Linus's key features;-)
> 
> By crafting the title and the introduction line(s) the confusion
> e.g. [2], between implementation details (this document) and
> conceptual operation can be clearly separated.
> 
> Philip
> 
> [1] http://article.gmane.org/gmane.comp.version-control.git/197111
> [1.8.0] use 'stage' term consistently
> 
> [2] http://raflabs.com/blogs/silence-is-foo/2011/04/07/staging-area-index-cache-git/
>    ... Git it's a little bit confusing to undersand some of its terminology

Thanks for your suggestion. I've changed the documentation to take this
into account.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-13 19:50             ` Michael Haggerty
@ 2012-05-14 15:01               ` Thomas Gummerer
  2012-05-14 21:08                 ` Michael Haggerty
  0 siblings, 1 reply; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-14 15:01 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, trast, gitster, peff, spearce, davidbarr



On 05/13, Michael Haggerty wrote:
> On 05/11/2012 07:12 PM, Thomas Gummerer wrote:
> >Thanks for your feedback! To get clearer code I've now written a
> >working reader for the v5 index format in Python. The full reader
> >would probably be to long for the mailing list, but here is the
> >interesting part:
> >
> >[...]
> >The full reader can be found here:
> >https://github.com/tgummerer/git/blob/pythonprototype/git-read-index-v5.py
> 
> Good.
> 
> I tried to review your code 3fe08f9b:git-read-index-v5.py and
> compare it to your file spec f858cf6a9:Index-format-v5.textile.  I
> have the following comments (some of them already discussed in IRC):
> 
> 1. Your script seems to be reading a different version of the file
> than described in the spec.  [When I mentioned this on IRC you
> pushed a new version a4ee558ea of the spec.]
> 
> 2. Your script seems to assume that the index file has no
> extensions. It would be better (for documentation purposes and to
> ensure that there are no surprises) to make sure that the code knows
> how to handle extensions.
> 
> 3. Please document briefly how the scripts should be used.
> 
> 4. Please limit line length to 80 columns (like the main git project).
> 
> 5. Python has a nicer way to initialize dictionaries whose keys are
> all valid identifiers; for example:
> 
> -        return dict({"signature": signature, "vnr": header[0],
> "ndir": header[1], "nfile": header[2], "next": header[3]})
> +        return dict(signature=signature, vnr=header[0], ndir=header[1],
> +                    nfile=header[2], next=header[3])
> 
> 6. Some of your print statements are just begging to be written
> using string interpolation; e.g.,
> 
> -        print d["pathname"] + " " + str(d["flags"]) + " " +
> str(d["foffset"]) + " " + str(d["cr"]) + " " + str(d["ncr"]) + " " +
> str(d["nsubtrees"]) + " " + str(d["nfiles"]) + " " +
> str(d["nentries"]) + " " + str(binascii.hexlify(d["objname"]))
> +        print ("%(pathname)s %(flags)s %(foffset)s %(cr)s %(ncr)s "
> +               "%(nsubtrees)s %(nfiles)s %(nentries)s " % d
> +               + str(binascii.hexlify(d["objname"])))
> 
> printheader() can be rewritten similarly.
> 
> 7. You have a couple of while loops that would be easier to read if
> written as for loops.
> 
> 8. There is no need to use global variables.  Global variables have
> lots of disadvantages, one of which is that it is hard to tell what
> functions have side effects via the global variables.  It is better
> to pass the needed variables explicitly to functions that need them.
> 
> 9. ...after you eliminate the global variables, you will see that
> the checksums are mostly needed over limited areas of code then can
> be discarded.  Rewriting the checksum handling in this way would
> make it easier to see exactly what range of bytes is included in a
> particular checksum.
> 
> 10. There is no need to keep track of all of the data that will go
> into a checksum.  The CRC32 checksum can be computed incrementally
> via the second argument of binascii.crc32(data, crc).  Therefore,
> you only need to retain a 32-bit running checksum instead of the
> filedata array of data strings.
> 
> 11. It is bad style to generate output from within the
> readindexentries() function.  Given that it reads the whole array of
> file entries anyway, it would be cleaner to return the array to the
> caller and let the caller print out what it wants.
> 
> 12. Your handling of checksum errors is inconsistent.  In some
> places you generate exceptions; in another you simply print an error
> to stdout (not stderr!) and proceed to use the corrupt data.
> 
> 13. It is probably clearer to unpack the tuples returned by
> struct.unpack() directly into local variables with meaningful names
> instead of carrying them around as a tuple; e.g.,
> 
> -    header = struct.unpack('!IIII', checksum.add(f.read(16)))
> +    (vnr, ndir, nfile, next) = struct.unpack('!IIII', fread(16))
> 
> 14. It is more correct to check the file signature and version
> explicitly before plowing into the rest of the file (that's what
> they're there for!)
> 
> That's as far as I've got.
> 
> Michael

Thanks a lot for your feedback. I've now refactored the code and
thanks to your suggestions hopefully made it simpler and easier
to read. The reader should now read exactly the data from the
spec.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-14 15:01               ` Thomas Gummerer
@ 2012-05-14 21:08                 ` Michael Haggerty
  2012-05-14 22:10                   ` Thomas Rast
  2012-05-15 13:49                   ` Thomas Gummerer
  0 siblings, 2 replies; 49+ messages in thread
From: Michael Haggerty @ 2012-05-14 21:08 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git, trast, gitster, peff, spearce, davidbarr

On 05/14/2012 05:01 PM, Thomas Gummerer wrote:
> Thanks a lot for your feedback. I've now refactored the code and
> thanks to your suggestions hopefully made it simpler and easier
> to read. The reader should now read exactly the data from the
> spec.

Yes, the style is much better now.  Here is the next round of feedback:

1. f is still a global variable.  This is unnecessary.

2. read_name() is indented incorrectly.

3. The signature and version number should be checked within 
read_header() *before* reading anything else.  Otherwise, if some random 
file is given to the script, it will read some random, probably very 
large number for "nextensions" and then try to read that number of 
extension offsets into RAM.  It would be a pretty innocent error to have 
the wrong version of the index file lying around, and the script should 
detect such an error quickly and painlessly rather than triggering a lot 
of pointless disk activity (and likely an out-of-memory error) before 
figuring out that it shouldn't be reading the file in the first place.

4. For readability reasons, it is better to raise an exception 
immediately in an error situation rather than put the error-handling 
code in the "else" part of an if statement; i.e., instead of

     if not error_condition:
         non-error-actions
     else:
         raise exception

use

     if error_condition:
         raise exception

     non-error-actions

The reasons are: (a) the reader can see immediately what will happen in 
the error case, then forget it and continue on with the "normal" flow 
instead of having to remember the "if" condition through the whole long 
"then" block before finally seeing the point in the "else" block; (b) 
then the "normal" case doesn't have to be within the if statement at 
all, thereby requiring less block nesting and less indentation.  For 
example:

> -    if crc == datacrc:
> -        return dict(signature=signature, vnr=vnr, ndir=ndir, nfile=nfile,
> -                nextensions=nextensions, extoffsets=extoffsets)
> -    else:
> -        raise Exception("Wrong header crc")
> +    if crc != datacrc:
> +        raise Exception("Wrong header crc")
> +
> +    return dict(signature=signature, vnr=vnr, ndir=ndir, nfile=nfile,
> +                nextensions=nextensions, extoffsets=extoffsets)

5. If the first limit of a range() or xrange() is zero, it is usually 
omitted; e.g., "xrange(0, nextensions)" -> "xrange(nextensions)".

6. It is possible to precompile "struct" patterns, which should be 
faster and also allows you to ask the Struct object its size, reducing 
the number of magic numbers needed in the code.  And fixed-length 
strings can also be read via struct.  For example:

> DIR_DATA_STRUCT = struct.Struct("!HIIIIII 20s")
>
>
> def read_dirs(f, ndir):
>     dirs = list()
>     for i in xrange(0, ndir):
>         (pathname, partialcrc) = read_name(f)
>
>         (filedata, partialcrc) = read_calc_crc(f, DIR_DATA_STRUCT.size, partialcrc)
>         (flags, foffset, cr, ncr, nsubtrees, nfiles,
>                 nentries, objname) = DIR_DATA_STRUCT.unpack(filedata)
>     # ...

7. The "if dirnr == 0" stuff in read_files() should probably be done in 
read_index_entries() (the first invocation is the only place dirnr==0, 
right?)

8. read_files() only returns a result "if len(directories) > dirnr". 
But actually I don't see the purpose for the check.  Earlier in the 
function you dereference directories[dirnr] several times, so it *must* 
be that dirnr < len(directories).  I think you can take out this test 
and unindent its block.

9. read_files() doesn't need to return "entries".  Since entries is an 
array that is only mutated in place, the return value will always be the 
same as the "entries" argument (albeit fuller).

10. It would make sense to extract a function read_file(), which reads a 
single file entry from the current position in the current file (using 
code from read_files()).  Similarly for read_dir()/read_dirs().

11. It is good form to move the file-level code into a main() function, 
then call that from the bottom of the file, something like this:

 > def main(args):
 >     ....
 >
 > main(sys.argv[1:])

This avoids creating global variables that are accidentally used within 
functions.

What is your plan for testing this code, and later the C version?  For 
example, you might want to have a suite of index files with various 
contents, and compare the "git ls-files --debug" output with the output 
that is expected.  How would you create index files like this?  Via git 
commands?  Or should one of your Python scripts be taught how to do it?

To make testing easier, you probably don't want to hard-code the name of 
the input file in git-read-index-v5.py, so that you can use it to read 
arbitrary files.  For example, you might want to honor the 
GIT_INDEX_FILES environment variable in some form, or to take the name 
of the index file as a command-line argument.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-14 21:08                 ` Michael Haggerty
@ 2012-05-14 22:10                   ` Thomas Rast
  2012-05-15  6:43                     ` Michael Haggerty
  2012-05-15 13:49                   ` Thomas Gummerer
  1 sibling, 1 reply; 49+ messages in thread
From: Thomas Rast @ 2012-05-14 22:10 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Thomas Gummerer, git, trast, gitster, peff, spearce, davidbarr

Michael Haggerty <mhagger@alum.mit.edu> writes:

First of all, many thanks for taking up this time-consuming job while I
was away!  There's not much I can add at this point, just a few minor
points:

> 9. read_files() doesn't need to return "entries".  Since entries is an
> array that is only mutated in place, the return value will always be
> the same as the "entries" argument (albeit fuller).

(Ab)using an array in this fashion is somewhat iffy.  It seems
unavoidable in this case (while still retaining the runtime), but try
not to do it too often, and perhaps name the parameter something that
makes this clear (such as 'out').  Usually changing it to use a
generator function (with 'yield') helps.

> 11. It is good form to move the file-level code into a main()
> function, then call that from the bottom of the file, something like
> this:
>
>> def main(args):
>>     ....
>>
>> main(sys.argv[1:])

It's customary to wrap it as

if __name__ == '__main__':
    main(sys.argv[1:])

That way your script becomes 'import'-able, which can be handy (if only
for testing).

Cheers,
Thomas

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-14 22:10                   ` Thomas Rast
@ 2012-05-15  6:43                     ` Michael Haggerty
  0 siblings, 0 replies; 49+ messages in thread
From: Michael Haggerty @ 2012-05-15  6:43 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Thomas Gummerer, git, gitster, peff, spearce, davidbarr

On 05/15/2012 12:10 AM, Thomas Rast wrote:
> Michael Haggerty<mhagger@alum.mit.edu>  writes:
>> 9. read_files() doesn't need to return "entries".  Since entries is an
>> array that is only mutated in place, the return value will always be
>> the same as the "entries" argument (albeit fuller).
>
> (Ab)using an array in this fashion is somewhat iffy.  It seems
> unavoidable in this case (while still retaining the runtime), but try
> not to do it too often, and perhaps name the parameter something that
> makes this clear (such as 'out').  Usually changing it to use a
> generator function (with 'yield') helps.

If the goal were an ideal Python program, then by all means generators 
are the way to go.  But since the goal is a prototype for a C program, 
then a change to using generators would just have to be undone when 
converting to C.

>> 11. It is good form to move the file-level code into a main()
>> function, then call that from the bottom of the file, something like
>> this:
>>
>>> def main(args):
>>>      ....
>>>
>>> main(sys.argv[1:])
>
> It's customary to wrap it as
>
> if __name__ == '__main__':
>      main(sys.argv[1:])
>
> That way your script becomes 'import'-able, which can be handy (if only
> for testing).

+1

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-14 21:08                 ` Michael Haggerty
  2012-05-14 22:10                   ` Thomas Rast
@ 2012-05-15 13:49                   ` Thomas Gummerer
  2012-05-15 15:02                     ` Michael Haggerty
  2012-05-16  5:01                     ` Michael Haggerty
  1 sibling, 2 replies; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-15 13:49 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, trast, gitster, peff, spearce, davidbarr



On 05/14, Michael Haggerty wrote:
> On 05/14/2012 05:01 PM, Thomas Gummerer wrote:
> >Thanks a lot for your feedback. I've now refactored the code and
> >thanks to your suggestions hopefully made it simpler and easier
> >to read. The reader should now read exactly the data from the
> >spec.
> 
> Yes, the style is much better now.  Here is the next round of feedback:
> [...]
> 
> >DIR_DATA_STRUCT = struct.Struct("!HIIIIII 20s")
> >
> >
> >def read_dirs(f, ndir):
> >    dirs = list()
> >    for i in xrange(0, ndir):
> >        (pathname, partialcrc) = read_name(f)
> >
> >        (filedata, partialcrc) = read_calc_crc(f, DIR_DATA_STRUCT.size, partialcrc)
> >        (flags, foffset, cr, ncr, nsubtrees, nfiles,
> >                nentries, objname) = DIR_DATA_STRUCT.unpack(filedata)
> >    # ...
> 
> [...]

Thanks again for your feedback. I've refactored the code again,
thanks to your suggestions. If I'm correct it's fine to have the
compiled structs global?

> What is your plan for testing this code, and later the C version?
> For example, you might want to have a suite of index files with
> various contents, and compare the "git ls-files --debug" output with
> the output that is expected.  How would you create index files like
> this?  Via git commands?  Or should one of your Python scripts be
> taught how to do it?

I thought of using real world examples for this, for example the
WebKit index, which is pretty large, and some others, for example the
git index and the linux kernel index.

There would be some changes necessary to the output format of 
git ls-files --debug, to work with the new index format, but those
should be fairly simple.

> To make testing easier, you probably don't want to hard-code the
> name of the input file in git-read-index-v5.py, so that you can use
> it to read arbitrary files.  For example, you might want to honor
> the GIT_INDEX_FILES environment variable in some form, or to take
> the name of the index file as a command-line argument.

I've changed this in the script, which now takes a --file argument
with the file name of the index file that should be read.
(git-read-index-v5.py --file=FILENAME)

-- 
Thomas

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-15 13:49                   ` Thomas Gummerer
@ 2012-05-15 15:02                     ` Michael Haggerty
  2012-05-18 15:38                       ` Thomas Gummerer
  2012-05-16  5:01                     ` Michael Haggerty
  1 sibling, 1 reply; 49+ messages in thread
From: Michael Haggerty @ 2012-05-15 15:02 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git, trast, gitster, peff, spearce, davidbarr

On 05/15/2012 03:49 PM, Thomas Gummerer wrote:
> Thanks again for your feedback. I've refactored the code again,
> thanks to your suggestions.

Good.  I'll try to review the new version as soon as possible.

I suggest that you apply the same kinds of cleanups to 
git-convert-index.py (which I personally haven't looked at yet at all). 
  If you want my feedback on that script, please let me know when you 
think it is ready.

 > If I'm correct it's fine to have the
> compiled structs global?

Yes, that's OK because they are constants so there is no risk of them 
propagating side-effects.  (Of course, Python doesn't enforce the 
constness of identifiers, but by convention ALL_CAPS identifiers are 
constants and it would be an obvious no-no to modify one.)

A real Pythonic solution would probably encapsulate all of your code 
(including the constants) in classes.  But since you will eventually 
translate the code to C, I don't think that step is crucial.  If, on the 
other hand, you propose to include Python scripts in the git 
distribution, then making them Pythonic would definitely be on the agenda.

>> What is your plan for testing this code, and later the C version?
>> For example, you might want to have a suite of index files with
>> various contents, and compare the "git ls-files --debug" output with
>> the output that is expected.  How would you create index files like
>> this?  Via git commands?  Or should one of your Python scripts be
>> taught how to do it?
>
> I thought of using real world examples for this, for example the
> WebKit index, which is pretty large, and some others, for example the
> git index and the linux kernel index.

It is good to do such manual tests, but you probably also want small, 
hand-constructed, automatable tests to make sure that all of the 
codepaths are exercised.  For example,

* A directory containing only files

* A directory containing only subdirectories

* A mixed directory whose last element is a file / last element is a 
subdirectory

* Various types of conflicts

...etc.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-15 13:49                   ` Thomas Gummerer
  2012-05-15 15:02                     ` Michael Haggerty
@ 2012-05-16  5:01                     ` Michael Haggerty
  2012-05-16 21:54                       ` Thomas Gummerer
  1 sibling, 1 reply; 49+ messages in thread
From: Michael Haggerty @ 2012-05-16  5:01 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git, trast, gitster, peff, spearce, davidbarr

On 05/15/2012 03:49 PM, Thomas Gummerer wrote:
> Thanks again for your feedback. I've refactored the code again,
> thanks to your suggestions.

I just reviewed version 1369bd855b86 of your script, and it is MUCH 
better.  It's easy to read and review.  The functions that it defines 
are now self-contained and could therefore be reused for other purposes. 
  There are fewer magic numbers (though there are still a few; I wonder 
if there is a way to get rid of those?)  You've done a nice job 
polishing up the code.

I have only a few remaining niggles::

1. The struct module can handle fixed-length strings, so you could read 
and parse the SHA1s as part of FILE_DATA_STRUCT and DIR_DATA_STRUCT 
rather than handling them separately.

2. At least some of the functions deserve docstrings, especially when 
they are nontrivial.  For example, what arguments read_files() needs and 
how they are used is far from obvious.

3. It would be easier to read the multiline string formatting templates 
if they are written as multiline strings (even though this kindof 
requires that they be made file-level constants); e.g.,

FILE_FORMAT = """\
%(name)s (%(objhash)s)
mtime: %(mtimes)s:%(mtimens)s
mode: %(mode)s flags: %(flags)s
statcrc: """

In the future, please try to commit one self-contained change at a time 
and make your commit messages really describe what is changed in the 
commit.  For example, commit fb2654c648a does at least three things, 
only one of which is mentioned in its commit message.  It would be 
better to break this into three commits with three commit messages.  Use 
"git rebase -i" and the other commit-rewriting tools liberally to tidy 
up commits before publishing them (but not after publishing them!); for 
example, commit be8d01c22c should really have been squashed on top of 
"Changed main to a function" so that the rest of the world doesn't have 
to see the broken latter commit.

But I don't want to detract from the fact that altogether you have made 
very nice improvements to the script, and I hope to see more like this 
from you in the future!

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-16  5:01                     ` Michael Haggerty
@ 2012-05-16 21:54                       ` Thomas Gummerer
  2012-05-19  5:40                         ` Michael Haggerty
  0 siblings, 1 reply; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-16 21:54 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, trast, gitster, peff, spearce, davidbarr



On 05/16, Michael Haggerty wrote:
> I just reviewed version 1369bd855b86 of your script, and it is MUCH
> better.  It's easy to read and review.  The functions that it
> defines are now self-contained and could therefore be reused for
> other purposes.  There are fewer magic numbers (though there are
> still a few; I wonder if there is a way to get rid of those?)
> You've done a nice job polishing up the code.

Thanks for the feedback! I could get rid of the magic numbers for
the crc code, but I'm not sure if it makes sense to replace the
others with constants, since they only occur once in the file. I
added comments instead explaining where those numbers come from
instead.

> I have only a few remaining niggles::
> 
> 1. The struct module can handle fixed-length strings, so you could
> read and parse the SHA1s as part of FILE_DATA_STRUCT and
> DIR_DATA_STRUCT rather than handling them separately.
> 
> 2. At least some of the functions deserve docstrings, especially
> when they are nontrivial.  For example, what arguments read_files()
> needs and how they are used is far from obvious.
> 
> 3. It would be easier to read the multiline string formatting
> templates if they are written as multiline strings (even though this
> kindof requires that they be made file-level constants); e.g.,
> 
> FILE_FORMAT = """\
> %(name)s (%(objhash)s)
> mtime: %(mtimes)s:%(mtimens)s
> mode: %(mode)s flags: %(flags)s
> statcrc: """

Thanks, I have changed those. I added the docstrings for all read
functions, they however don't seem to make sense for the print
functions, since you're probably faster just reading the code for
them.

One since I changed in addition is to those changes, is that I
gave the exceptions names to make them more meaningful.

> In the future, please try to commit one self-contained change at a
> time and make your commit messages really describe what is changed
> in the commit.  For example, commit fb2654c648a does at least three
> things, only one of which is mentioned in its commit message.  It
> would be better to break this into three commits with three commit
> messages.  Use "git rebase -i" and the other commit-rewriting tools
> liberally to tidy up commits before publishing them (but not after
> publishing them!); for example, commit be8d01c22c should really have
> been squashed on top of "Changed main to a function" so that the
> rest of the world doesn't have to see the broken latter commit.

Ok, I'll try my best to do that.

--
Thomas

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-15 15:02                     ` Michael Haggerty
@ 2012-05-18 15:38                       ` Thomas Gummerer
  2012-05-19 13:00                         ` Michael Haggerty
  0 siblings, 1 reply; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-18 15:38 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, trast, gitster, peff, spearce, davidbarr

> I suggest that you apply the same kinds of cleanups to
> git-convert-index.py (which I personally haven't looked at yet at
> all).  If you want my feedback on that script, please let me know
> when you think it is ready.

That would be great, if you have the time to do it. I'm not
completely finished with it (docstrings and conflicted data writing
are still missing).

I'm not sure about the read_tree_extensiondata method, if I should
extract a method, which only reads one entry, but I'm not sure that
would make any sense, since there would be a lot of parameters and
return values to the function.

The same thing is in the main method, where I'm not sure if it's
better to extract the read_index and write_index functions, or
just leave the code in the main method. My guess is that it makes
sense in the main method, since there are less calls, but it
doesn't make sense in the read_tree_extensiondata method?

Another thing I'm unsure about is the write_directory_data method,
if there is any way to replace the try/except with something
simpler?

--
Thomas

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-16 21:54                       ` Thomas Gummerer
@ 2012-05-19  5:40                         ` Michael Haggerty
  2012-05-21 20:30                           ` Thomas Gummerer
  0 siblings, 1 reply; 49+ messages in thread
From: Michael Haggerty @ 2012-05-19  5:40 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git, trast, gitster, peff, spearce, davidbarr

On 05/16/2012 11:54 PM, Thomas Gummerer wrote:
> On 05/16, Michael Haggerty wrote:
>> I just reviewed version 1369bd855b86 of your script, and it is MUCH
>> better.  It's easy to read and review.  The functions that it
>> defines are now self-contained and could therefore be reused for
>> other purposes.  There are fewer magic numbers (though there are
>> still a few; I wonder if there is a way to get rid of those?)
>> You've done a nice job polishing up the code.
>
> Thanks for the feedback! I could get rid of the magic numbers for
> the crc code, but I'm not sure if it makes sense to replace the
> others with constants, since they only occur once in the file. I
> added comments instead explaining where those numbers come from
> instead.

I think it is possible to remove the last magic number and also to make 
the CRC handling easier.  I have pushed some suggested changes to github 
[1]:

1. With the current code, trying to read a file that is less than 24 
bytes long would result in a struct.error (because it would try to 
unpack a string that is shorter than the struct) whereas the underlying 
error in this case should almost always be reported as a signature 
error.  So it is correct to read the signature separately from the rest 
of the header, but it is even more correct to check the signature 
(including its length) before reading on.

2. I introduce a class CRC to hold checksums, so that (a) the code for 
handling checksums can be encapsulated, and (b) an instance of this 
class can be passed into functions and mutated in-place, which is less 
cumbersome than requiring functions to return (data, crc) tuples.  This, 
in turn, makes possible...

3. ...a new function read_struct(f, s, crc), which reads the data for a 
struct.Struct from f, checksums it, and returns the unpacked data.  This 
function is more convenient to use than the old read_calc_crc().

4. The checksum instance can also be made responsible for verifying that 
the next four bytes in the file agree with the expected checksum.  This 
removes some more code duplication.  (See CRC.matches().)

5. You read the extension offsets using CRC_STRUCT.size, which is 
technically correct but misleading.  In fact, the extension offsets 
should be documented using their own EXTENSION_INDEX_STRUCT.  Also, it 
makes more sense to store the unpacked integer offsets to extoffsets 
rather than the raw 4-byte strings.

6. With a couple of more minor changes it is possible to replace the 
magic number "24" in read_index_entries().  With this change the 
computation is documented very explicitly and is also (somewhat) robust 
against future changes in the format.

Look over my changes and take whatever you want.

> Thanks, I have changed those. I added the docstrings for all read
> functions, they however don't seem to make sense for the print
> functions, since you're probably faster just reading the code for
> them.

That's fine.  If you document nontrivial functions you are already doing 
much better than the git project average.

> One since I changed in addition is to those changes, is that I
> gave the exceptions names to make them more meaningful.

Good.

Michael

[1] https://github.com/mhagger/git/tree/pythonprototype

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-18 15:38                       ` Thomas Gummerer
@ 2012-05-19 13:00                         ` Michael Haggerty
  2012-05-21  7:45                           ` Thomas Gummerer
  0 siblings, 1 reply; 49+ messages in thread
From: Michael Haggerty @ 2012-05-19 13:00 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git, trast, gitster, peff, spearce, davidbarr

On 05/18/2012 05:38 PM, Thomas Gummerer wrote:
>
>> I suggest that you apply the same kinds of cleanups to
>> git-convert-index.py (which I personally haven't looked at yet at
>> all).  If you want my feedback on that script, please let me know
>> when you think it is ready.
>
> That would be great, if you have the time to do it. I'm not
> completely finished with it (docstrings and conflicted data writing
> are still missing).

I've looked over the writing side of git-convert-index.py version
81411fe6c98, and here are my first comments:

* Please remove trailing whitespace from the source code.

* I suggest that you move constants and code shared by
   git-convert-index.py and git-read-index-v5.py into a library.  Though
   actually, given that git doesn't seem to have infrastructure for
   dealing with Python libraries, this might take some improvisation.

* Please use constants for all of the struct formats.  Constants have
   names, making them mostly self-documenting.

* write_directories() currently writes pathnames and fake data and
   stores file offsets in memory.  Later write_directory_data() runs
   through the file again, seek()ing over the filenames and filling in
   real data.

   Wouldn't it be easier for the first pass just to *compute* and
   record the offsets of the entries to RAM, without writing anything
   to disk, and leave all of the writing to the second pass?

* Instead of writing blank data, it is possible to seek() past it and
   start writing the next thing.  The skipped-over file contents are
   logically initialized to zero.

* When working with iteritems(), it is clearer to unpack the item
   pairs and give them names rather than working with d[0] and d[1];
   for example,

     -    for d in sorted(dirdata.iteritems()):
     +    for (pathname,entry) in sorted(dirdata.iteritems()):

* write_directories() returns a "dirdata" that is just an empty
   defaultdict.  This seems pointless.  Do you have future plans to
   change write_directories() to store something into the dictionary?

* The documentation for binascii.crc32() mentions that it gives
   inconsistent results (signed vs. unsigned) for different versions of
   Python.  Please ensure that you are using it in a way that is
   maximally portable.  (That seems to imply using (binascii.crc32(...)
   & 0xffffffff) and treating the result as unsigned.)

* At first I thought it was a little bit odd that you pass data
   structures around as dictionaries, but I didn't object.  But as I
   look at more and more code it seems more and more cumbersome.
   Therefore, I suggest that you define classes to hold the various
   entities that are manipulated by your programs, because:

   * A class definition is a good place to document exactly what fields
     an object is expected to have, and what they mean.

   * Access of instance fields (entry.path) is easier to read and type
     than dictionary access (entry["path"]).

   * The class definitions will translate pretty directly to C structs.

   The fact that class instances use a bit more memory than
   dictionaries is, I think, unimportant.  But if that really bothers
   you, you can use __slots__ to save some of the instance memory.

At a higher level:

* What if the offsets to each section were stored in the header, and
   the offsets recorded for dirs and files were relative to the start
   of the section (rather than relative to the start of the file)?  I
   think that this would leave open the possibility of formatting the
   sections in memory in parallel in a single pass, then dumping the
   sections to disk in a few big writes (though I'm not saying that this
   should be the *default* way of writing).

* Do you plan to write prototypes for some of the cool new
   functionality that v5 is intended to make possible?  For example,

   * reading a few specific entries out of an index file

   * updating single entries

   * adding/removing conflict data to an existing file

   * dealing with all of the issues that will come with supporting the
     mutation of an existing index file (i.e., locking, consistency
     checks, etc)

   As you probably know from discussions on IRC, I think that the last
   of these is the biggest risk to the success of the project.

> I'm not sure about the read_tree_extensiondata method, if I should
> extract a method, which only reads one entry, but I'm not sure that
> would make any sense, since there would be a lot of parameters and
> return values to the function.

If the index were represented by a class instance, then all of the 
information would be grouped together as a coherent whole that is easy 
to pass around.

> The same thing is in the main method, where I'm not sure if it's
> better to extract the read_index and write_index functions, or
> just leave the code in the main method. My guess is that it makes
> sense in the main method, since there are less calls, but it
> doesn't make sense in the read_tree_extensiondata method?

Ditto.

> Another thing I'm unsure about is the write_directory_data method,
> if there is any way to replace the try/except with something
> simpler?

With dictionaries, you can do

-        try:
-            flags = d[1]["flags"]
-        except KeyError:
-            flags = 0
+        flags = d[1].get("flags", 0)

If you convert to class instances, then presumably the constructor would 
set valid default values for all of the fields.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-19 13:00                         ` Michael Haggerty
@ 2012-05-21  7:45                           ` Thomas Gummerer
  0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-21  7:45 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, trast, gitster, peff, spearce, davidbarr


Thanks a lot for your feedback.

On 05/19, Michael Haggerty wrote:
> I've looked over the writing side of git-convert-index.py version
> 81411fe6c98, and here are my first comments:
> 
> * Please remove trailing whitespace from the source code.
> 
> * I suggest that you move constants and code shared by
>   git-convert-index.py and git-read-index-v5.py into a library.  Though
>   actually, given that git doesn't seem to have infrastructure for
>   dealing with Python libraries, this might take some improvisation.

I've created a directory python/lib, where I'll put the python libraries.
I'm not entirely sure this is the correct way to do it, however since
the python code will not be in the main git, but is just a prototype,
I think it's fine.

For now I moved the format strings, the structs, the exceptions and
the new calculate_crc method to the library. Once I go over
git-read-index-v5.py I'll probably move more code there.

> * Please use constants for all of the struct formats.  Constants have
>   names, making them mostly self-documenting.
> 
> * write_directories() currently writes pathnames and fake data and
>   stores file offsets in memory.  Later write_directory_data() runs
>   through the file again, seek()ing over the filenames and filling in
>   real data.
> 
>   Wouldn't it be easier for the first pass just to *compute* and
>   record the offsets of the entries to RAM, without writing anything
>   to disk, and leave all of the writing to the second pass?

I don't think that would be easier, since I have to go over all the
data when writing anyway. It might however be faster.

> * Instead of writing blank data, it is possible to seek() past it and
>   start writing the next thing.  The skipped-over file contents are
>   logically initialized to zero.
> 
> * When working with iteritems(), it is clearer to unpack the item
>   pairs and give them names rather than working with d[0] and d[1];
>   for example,
> 
>     -    for d in sorted(dirdata.iteritems()):
>     +    for (pathname,entry) in sorted(dirdata.iteritems()):
> 
> * write_directories() returns a "dirdata" that is just an empty
>   defaultdict.  This seems pointless.  Do you have future plans to
>   change write_directories() to store something into the dictionary?
> 
> * The documentation for binascii.crc32() mentions that it gives
>   inconsistent results (signed vs. unsigned) for different versions of
>   Python.  Please ensure that you are using it in a way that is
>   maximally portable.  (That seems to imply using (binascii.crc32(...)
>   & 0xffffffff) and treating the result as unsigned.)
> 
> * At first I thought it was a little bit odd that you pass data
>   structures around as dictionaries, but I didn't object.  But as I
>   look at more and more code it seems more and more cumbersome.
>   Therefore, I suggest that you define classes to hold the various
>   entities that are manipulated by your programs, because:
> 
>   * A class definition is a good place to document exactly what fields
>     an object is expected to have, and what they mean.
> 
>   * Access of instance fields (entry.path) is easier to read and type
>     than dictionary access (entry["path"]).
> 
>   * The class definitions will translate pretty directly to C structs.
> 
>   The fact that class instances use a bit more memory than
>   dictionaries is, I think, unimportant.  But if that really bothers
>   you, you can use __slots__ to save some of the instance memory.
> 
> At a higher level:
> 
> * What if the offsets to each section were stored in the header, and
>   the offsets recorded for dirs and files were relative to the start
>   of the section (rather than relative to the start of the file)?  I
>   think that this would leave open the possibility of formatting the
>   sections in memory in parallel in a single pass, then dumping the
>   sections to disk in a few big writes (though I'm not saying that this
>   should be the *default* way of writing).

I'm thinking if there are any drawbacks doing it that way, but until
now no drawbacks came to my mind. Otherwise this sounds like a good
idea, and I'll include it in the next version.

> * Do you plan to write prototypes for some of the cool new
>   functionality that v5 is intended to make possible?  For example,
> 
>   * reading a few specific entries out of an index file
> 
>   * updating single entries
> 
>   * adding/removing conflict data to an existing file
> 
>   * dealing with all of the issues that will come with supporting the
>     mutation of an existing index file (i.e., locking, consistency
>     checks, etc)
> 
>   As you probably know from discussions on IRC, I think that the last
>   of these is the biggest risk to the success of the project.

I thought of implementing prototypes as the proejct goes on, but not
all of them now. I'd rather first start implementing the reader,
because otherwise the time could get a problem for the midterm.

--
Thomas

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Index format v5
  2012-05-19  5:40                         ` Michael Haggerty
@ 2012-05-21 20:30                           ` Thomas Gummerer
  0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-21 20:30 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, trast, gitster, peff, spearce, davidbarr



On 05/19, Michael Haggerty wrote:
> I think it is possible to remove the last magic number and also to
> make the CRC handling easier.  I have pushed some suggested changes
> to github [1]:
> 
> 1. With the current code, trying to read a file that is less than 24
> bytes long would result in a struct.error (because it would try to
> unpack a string that is shorter than the struct) whereas the
> underlying error in this case should almost always be reported as a
> signature error.  So it is correct to read the signature separately
> from the rest of the header, but it is even more correct to check
> the signature (including its length) before reading on.

I looked at the current git code, and the filesize is used for
checking if the index size is big enough. I'll implement it that way
in the prototype for now.

> 2. I introduce a class CRC to hold checksums, so that (a) the code
> for handling checksums can be encapsulated, and (b) an instance of
> this class can be passed into functions and mutated in-place, which
> is less cumbersome than requiring functions to return (data, crc)
> tuples.  This, in turn, makes possible...
> 
> 3. ...a new function read_struct(f, s, crc), which reads the data
> for a struct.Struct from f, checksums it, and returns the unpacked
> data.  This function is more convenient to use than the old
> read_calc_crc().
> 
> 4. The checksum instance can also be made responsible for verifying
> that the next four bytes in the file agree with the expected
> checksum.  This removes some more code duplication.  (See
> CRC.matches().)
>
> 5. You read the extension offsets using CRC_STRUCT.size, which is
> technically correct but misleading.  In fact, the extension offsets
> should be documented using their own EXTENSION_INDEX_STRUCT.  Also,
> it makes more sense to store the unpacked integer offsets to
> extoffsets rather than the raw 4-byte strings.
> 
> 6. With a couple of more minor changes it is possible to replace the
> magic number "24" in read_index_entries().  With this change the
> computation is documented very explicitly and is also (somewhat)
> robust against future changes in the format.
> 
> Look over my changes and take whatever you want.

Thanks, I didn't merge them directly, but I took the basics out and
merged them with my code.

--
Thomas

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2012-05-21 20:31 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-03 17:25 Index format v5 Thomas Gummerer
2012-05-03 18:16 ` Thomas Rast
2012-05-03 19:03   ` Junio C Hamano
2012-05-04  7:12   ` Michael Haggerty
2012-05-07 22:18     ` Robin Rosenberg
2012-05-03 18:21 ` Ronan Keryell
2012-05-03 20:36   ` Thomas Gummerer
2012-05-03 18:54 ` Junio C Hamano
2012-05-03 19:11   ` Thomas Rast
2012-05-03 19:31   ` Thomas Rast
2012-05-03 19:32     ` Thomas Rast
2012-05-03 20:32       ` Junio C Hamano
2012-05-03 21:38   ` Thomas Gummerer
2012-05-07 18:57     ` Robin Rosenberg
2012-05-03 19:38 ` solo-git
2012-05-04 13:20 ` Nguyen Thai Ngoc Duy
2012-05-04 15:44   ` Thomas Gummerer
2012-05-04 13:25 ` Philip Oakley
2012-05-04 15:46   ` Junio C Hamano
2012-05-06 10:23 ` Nguyen Thai Ngoc Duy
2012-05-07 13:44   ` Thomas Gummerer
2012-05-06 16:49 ` Phil Hord
2012-05-07 13:08   ` Thomas Gummerer
2012-05-07 15:15 ` Michael Haggerty
2012-05-08 14:11   ` Thomas Gummerer
2012-05-08 14:25     ` Nguyen Thai Ngoc Duy
2012-05-08 14:34       ` Nguyen Thai Ngoc Duy
2012-05-10  6:53         ` Thomas Gummerer
2012-05-10 11:06           ` Nguyen Thai Ngoc Duy
2012-05-09  8:37     ` Michael Haggerty
2012-05-10 12:19       ` Thomas Gummerer
2012-05-10 18:17         ` Michael Haggerty
2012-05-11 17:12           ` Thomas Gummerer
2012-05-13 19:50             ` Michael Haggerty
2012-05-14 15:01               ` Thomas Gummerer
2012-05-14 21:08                 ` Michael Haggerty
2012-05-14 22:10                   ` Thomas Rast
2012-05-15  6:43                     ` Michael Haggerty
2012-05-15 13:49                   ` Thomas Gummerer
2012-05-15 15:02                     ` Michael Haggerty
2012-05-18 15:38                       ` Thomas Gummerer
2012-05-19 13:00                         ` Michael Haggerty
2012-05-21  7:45                           ` Thomas Gummerer
2012-05-16  5:01                     ` Michael Haggerty
2012-05-16 21:54                       ` Thomas Gummerer
2012-05-19  5:40                         ` Michael Haggerty
2012-05-21 20:30                           ` Thomas Gummerer
2012-05-13 21:01 ` Philip Oakley
2012-05-14 14:54   ` Thomas Gummerer

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).