* Index format v5
@ 2012-05-03 17:25 Thomas Gummerer
2012-05-03 18:16 ` Thomas Rast
` (9 more replies)
0 siblings, 10 replies; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-03 17:25 UTC (permalink / raw)
To: git; +Cc: trast, gitster, mhagger, peff, spearce, davidbarr
I have been drafting the Version 5 of the index format over the past
few days with the help of Thomas Rast, Michael Haggerty, cmn and
barrbrain on IRC. It will save with prefix compression on the path, and
using a crc32 over the stat data, instead of the full data, since it is only
used for checking if the file is changed. (Thanks Michael Haggerty for
this hint. Unless we are missing something this will save another
~4 MB on the Webkit index.
GIT index format
================
= The git index file has the following format
All binary numbers are in network byte order. Version 5 is described
here.
- A 16-byte header consisting of
4-byte signature:
The signature is { 'D', 'I', 'R', 'C' } (stands for "dircache")
4-byte version number:
The current supported versions are 2, 3, 4 and 5.
32-bit number of directories.
32-bit number of file entries.
- Offset to the extensions.
32-bit number of extensions.
32-bit number offset to the extension. (Possibly none, as many as
indicated in the 4-byte number of extensions)
- A number of directory offsets (see below). [1]
- A number of sorted directories (see below). [2]
- 32-bit crc32 checksum for the header, extension offsets and directories.
- A number of file offsets (see below). [1]
- A number of file entries (see below).
- A number of entries for conflicted data/resolved conflicts (see below).
- Extensions
Extensions are identified by signature. Optional extensions can
be ignored if GIT does not understand them.
GIT supports an arbitrary number of extension, but currently none
is implemented. [3]
4-byte extension signature. If the first byte is 'A'..'Z' the
extension is optional and can be ignored.
32-bit size of the extension
32-bit crc32 checksum of the extension signature and size
Extension data
== Directory entry offsets
32-bit offset to the directory.
This part is needed for making the directory entries bisectable and
thus allowing a binary search.
== Directory entry
Directory entries are sorted in lexicographic order by the name
of their path starting with the root.
Path names (variable length) relative to top level directory (without the
leading slash). '/' is used as path separator. '.' indicates the root
directory. The special patch components ".." and ".git" (without quotes)
are disallowed. Trailing slash is also disallowed.
1 nul byte to terminate the path.
32-bit offset to the first file of a directory
32-bit offset to conflicted/resolved data at the end of the index.
0 if there is no such data. [4]
4-byte number of subtrees this tree has
4-byte number of entries in the index that is covered by the tree this
entry represents. (entry_count) (-1 if the entry is invalid)
160-bit object name for the object that would result from writing
this span of index as a tree.
The last 24 bytes are for the cache tree. An entry can be in an
invalidated state which is represented by having -1 in the entry_count
field. If an entry is in invalidated state, the next entry will begin
after the number of subtrees, and the 160-bit object name is dropped.
The entries are written out in the top-down, depth-first order. The
first entry represents the root level of the repository, followed by
the first subtree - let's call it A - of the root level, followed by
the first subtree of A, ...
== File entry offsets
32-bit offset to the directory.
This part is needed for making the directory entries bisectable and
thus allowing a binary search.
== File entry
File entries are sorted in ascending order on the name field, after the
respective offset given by the directory entries.
File name (variable length). Nul bytes are not allowed in file names and
they have no leading slash. They are 7-bit ASCII encoded.
1 nul byte to terminate the filename.
A 16-bit 'flags' field split into (high to low bits)
1-bit assume-valid flag
1-bit conflict flag
2-bit stage (during merge)
2-bit mode (0 = 1000644 (regular file without execution
permission), 1 = 1000755 (regular file with execution
permission), 2 = 1010000 (symbolic link), 3 = 1110
(gitlink)) [5]
1-bit skip-worktree flag (used by sparse checkout)
1-bit intent-to-add flag (used by "git add -N")
8-bit unused, must be zero [6]
32-bit mtime seconds, the last time a file's data changed
this is stat(2) data
32-bit mtime nanosecond fractions
this is stat(2) data
32-bit crc32 checksum over ctime seconds, ctime nanoseconds,
ino, file size, dev, uid, gid (All stat(2) data except mtime) [7]
160-bit SHA-1 for the represented object
32-bit crc32 checksum for the file entry
== Conflicted data
A conflict is represented in the index as a set of higher stage entries.
These entries are stored at the end of the index. When a conflict is
resolved (e.g. with "git add path"). A bit is flipped, to indicate that
the conflict is resolved, but the entries will be kept, so that
conflicts can be recreated (e.g. with "git checkout -m", in case users
want to redo a conflict resolution from scratch.
- NUL-terminated filename of the entry
- A 8-bit 'flags' field split into:
- 1-bit conflicted state (conflicted/resolved) (1 if conflicted)
- 7-bit unused
- Three 4-byte octal numbers, entry mode of entries in stage 1 to 3 (a
missing stage is represented by "0" in this field);
and
- At most three 160-bit object names of the entry in stages from 1 to 3
(nothing is written for a missing stage).
- 32-bit crc32 checksum over one conflicted entry.
== Design explanations
[1] The directory and file offsets are included in the index format
to enable bisectability of the index, for binary searches.Updating
a single entry and partial reading will benefit from this.
[2] The directories are saved in their own block, to be able to
quickly search for a directory in the index. They include a
offset to the (lexically) first file in the directory.
[3] The data of the cache-tree extension and the resolve undo
extension is now part of the index itself, but if other extensions
come up in the future, there is no need to change the index, they
can simply be added at the end.
[4] To avoid rewrites of the whole index when there are conflicts or
conflicts are being resolved, conflicted data will be stored at
the end of the index. To mark the conflict resolved, just a bit
has to be flipped. The data will still be there, if a user wants
to redo the conflict resolution.
[5] Since only 4 modes are effectively allowed in git but 32-bit are
used to store them, having a two bit flag for the mode is enough
and saves 4 byte per entry.
[6] The length of the file name was dropped, since each file name is
nul terminated anyway.
[7] Since all stat data (except mtime and ctime) is just used for
checking if a file has changed a checksum of the data is enough.
In addition to that Thomas Rast suggested ctime could be ditched
completely (core.trustctime=false) and thus included in the
checksum. This would save 24 bytes per index entry, which would
be about 4 MB on the Webkit index.
(Thanks for the suggestion to Michael Haggerty)
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-03 17:25 Index format v5 Thomas Gummerer
@ 2012-05-03 18:16 ` Thomas Rast
2012-05-03 19:03 ` Junio C Hamano
2012-05-04 7:12 ` Michael Haggerty
2012-05-03 18:21 ` Ronan Keryell
` (8 subsequent siblings)
9 siblings, 2 replies; 49+ messages in thread
From: Thomas Rast @ 2012-05-03 18:16 UTC (permalink / raw)
To: Thomas Gummerer; +Cc: git, trast, gitster, mhagger, peff, spearce, davidbarr
Thomas Gummerer <t.gummerer@gmail.com> writes:
> 32-bit crc32 checksum over ctime seconds, ctime nanoseconds,
> ino, file size, dev, uid, gid (All stat(2) data except mtime) [7]
[...]
> [7] Since all stat data (except mtime and ctime) is just used for
> checking if a file has changed a checksum of the data is enough.
> In addition to that Thomas Rast suggested ctime could be ditched
> completely (core.trustctime=false) and thus included in the
> checksum. This would save 24 bytes per index entry, which would
> be about 4 MB on the Webkit index.
> (Thanks for the suggestion to Michael Haggerty)
This is the part I'm most curious about. Are we missing anything?
Michael brought it up on IRC: the stat() results are only used to test
whether they are still the same, with the exception of the mtime (which
also undergoes raciness checks).
As far as I can see, none of st_{ino,dev,uid,gid} are useful for
anything. st_size might conceivably be used as a hint for a buffer
size, but nobody actually does that. The ctime undergoes stricter
checks, but AFAICS it's also all about whether it has changed, and
besides that can be turned off. We think all of those fields can be
replaced by an arbitrary hash/CRC and only tested for equality. 32 bits
should be plenty, probably even if we just xor the values together.
So what's wrong in this thinking?
[The one flaw I found so far is that this makes it impossible to convert
back to v2-4 without at the very least refreshing the index. Do we
care?]
--
Thomas Rast
trast@{inf,student}.ethz.ch
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-03 17:25 Index format v5 Thomas Gummerer
2012-05-03 18:16 ` Thomas Rast
@ 2012-05-03 18:21 ` Ronan Keryell
2012-05-03 20:36 ` Thomas Gummerer
2012-05-03 18:54 ` Junio C Hamano
` (7 subsequent siblings)
9 siblings, 1 reply; 49+ messages in thread
From: Ronan Keryell @ 2012-05-03 18:21 UTC (permalink / raw)
To: Thomas Gummerer; +Cc: git
>>>>> On Thu, 3 May 2012 19:25:12 +0200, Thomas Gummerer <t.gummerer@gmail.com> said:
Thomas> I have been drafting the Version 5 of the index format over
Thomas> the past few days with the help of Thomas Rast, Michael
Thomas> Haggerty, cmn and barrbrain on IRC. It will save with prefix
Thomas> compression on the path, and using a crc32 over the stat
Thomas> data, instead of the full data, since it is only used for
Thomas> checking if the file is changed. (Thanks Michael Haggerty
Thomas> for this hint. Unless we are missing something this will
Thomas> save another ~4 MB on the Webkit index.
Great!
But I wonder whether it may not worth to investigate a 64-bit version for
the offsets and so on, just in case...
--
Ronan KERYELL |\/ Phone: +1 408 658 9453
Wild Systems / HPC Project |/)
5201 Great America Parkway, Suite 320 K Ronan.Keryell@wild-systems.com
Santa Clara, CA 95054 |\ skype:keryell
USA | \ http://wild-systems.com
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-03 17:25 Index format v5 Thomas Gummerer
2012-05-03 18:16 ` Thomas Rast
2012-05-03 18:21 ` Ronan Keryell
@ 2012-05-03 18:54 ` Junio C Hamano
2012-05-03 19:11 ` Thomas Rast
` (2 more replies)
2012-05-03 19:38 ` solo-git
` (6 subsequent siblings)
9 siblings, 3 replies; 49+ messages in thread
From: Junio C Hamano @ 2012-05-03 18:54 UTC (permalink / raw)
To: Thomas Gummerer; +Cc: git, trast, gitster, mhagger, peff, spearce, davidbarr
Thomas Gummerer <t.gummerer@gmail.com> writes:
> I have been drafting the Version 5 of the index format over the past
> few days with the help of Thomas Rast, Michael Haggerty, cmn and
> barrbrain on IRC.
Hrm, so if there is anything glaringly wrong below, should I reduce the
"trustable reviewer karma point" from these people? Or did you forget to
say "but remaining errors are mine" ;-)?
> GIT index format
> ================
>
> = The git index file has the following format
>
> All binary numbers are in network byte order. Version 5 is described
> here.
>
> - A 16-byte header consisting of
>
> 4-byte signature:
> The signature is { 'D', 'I', 'R', 'C' } (stands for "dircache")
>
> 4-byte version number:
> The current supported versions are 2, 3, 4 and 5.
>
> 32-bit number of directories.
> 32-bit number of file entries.
I take these two numbers mean "we have this many directory entries in the
index" and "we have this many file entries in the index"; I found it
unclear during my first scan.
Have you considered expressing these new "directories", "files" and
associated data as a new (mandatory) extension?
> - Offset to the extensions.
>
> 32-bit number of extensions.
Why is this necessary? It means that the writer needs to enumerate how
many extensions it is going to write before starting to write, unless it
is willing to seek back and fill this. And it is not like the reader will
first allocate an array to hold uniformly sized extension and will be
helped to have the number of entries upfront in order to do the
allocation.
What purpose does this serve?
> 32-bit number offset to the extension. (Possibly none, as many as
> indicated in the 4-byte number of extensions)
Why is this needed? It appears that there is no field that points at the
beginning of array that stores "directory offsets", nor "file offsets", so
I am assuming that you will be scanning the index file to find them. Why
can't the extensions be handled exactly the same way?
> - A number of directory offsets (see below). [1]
>
> - A number of sorted directories (see below). [2]
>
> - 32-bit crc32 checksum for the header, extension offsets and directories.
What does "32" in "crc32" stand for ? ;-)
> - A number of file offsets (see below). [1]
>
> - A number of file entries (see below).
>
> - A number of entries for conflicted data/resolved conflicts (see below).
>
> - Extensions
>
> Extensions are identified by signature. Optional extensions can
> be ignored if GIT does not understand them.
>
> GIT supports an arbitrary number of extension, but currently none
> is implemented. [3]
>
> 4-byte extension signature. If the first byte is 'A'..'Z' the
> extension is optional and can be ignored.
>
> 32-bit size of the extension
>
> 32-bit crc32 checksum of the extension signature and size
>
> Extension data
>
>
> == Directory entry offsets
The name "directory entry offset" does not seem to appear anywhere before
this, so it is unclear what this section of the documentation is trying to
describe. Is this the same as "directory offsets" above? In other words,
"A number of directory offsets (see below)." above is an array of something,
and this section is trying to describe what that something is?
> 32-bit offset to the directory.
I take it is "offset relative to the beginning of the file in the index".
> This part is needed for making the directory entries bisectable and
> thus allowing a binary search.
>
> == Directory entry
>
> Directory entries are sorted in lexicographic order by the name
> of their path starting with the root.
>
> Path names (variable length) relative to top level directory (without the
> leading slash). '/' is used as path separator. '.' indicates the root
> directory. The special patch components ".." and ".git" (without quotes)
> are disallowed. Trailing slash is also disallowed.
I understood "the root" to mean "the top level of the directory hierarchy"
(i.e. the directory that corresponds to the top level of the working
tree), but it needs to be explained better. Using '.' for root sounds
somewhat questionable, though. Why not a string of length 0?
When an index represents a D/F conflict, some stages may have a directory
D while others do not have D (but have a regular file at D). Doesn't
directory entry need to have a stage information?
> 1 nul byte to terminate the path.
>
> 32-bit offset to the first file of a directory
What does this point at? Does it point into the array of "file offsets"
or the array of "file entries"?
> 32-bit offset to conflicted/resolved data at the end of the index.
> 0 if there is no such data. [4]
>
> 4-byte number of subtrees this tree has
Which is an undefined number unless you specify which stage you are
talking about.
> 4-byte number of entries in the index that is covered by the tree this
> entry represents. (entry_count) (-1 if the entry is invalid)
>
> 160-bit object name for the object that would result from writing
> this span of index as a tree.
>
> The last 24 bytes are for the cache tree. An entry can be in an
> invalidated state which is represented by having -1 in the entry_count
> field. If an entry is in invalidated state, the next entry will begin
> after the number of subtrees, and the 160-bit object name is dropped.
By "The last 24 bytes", do you mean the "4-byte number of entries..." and
"160-bit object name"?
> The entries are written out in the top-down, depth-first order. The
> first entry represents the root level of the repository, followed by
> the first subtree - let's call it A - of the root level, followed by
> the first subtree of A, ...
>
> == File entry offsets
>
> 32-bit offset to the directory.
What directory? The containing directory? What does this point at? Does
it point into which array?
> This part is needed for making the directory entries bisectable and
> thus allowing a binary search.
>
> == File entry
>
> File entries are sorted in ascending order on the name field, after the
> respective offset given by the directory entries.
>
> File name (variable length). Nul bytes are not allowed in file names and
> they have no leading slash. They are 7-bit ASCII encoded.
Is this a name relative to its containing directory (i.e. without leading
components)? Or is it a full path relative to the top level of the
working tree?
I have some UTF-8 encoded files in my repository. Are they
now disallowed?
> 1 nul byte to terminate the filename.
>
> A 16-bit 'flags' field split into (high to low bits)
>
> 1-bit assume-valid flag
Is this "assume unchanged"?
> 1-bit conflict flag
>
> 2-bit stage (during merge)
Huh? When stage #0 entry exists for a given path, no other stages for the
same path can exist in the index. By definition, that is how a conflicted
path is resolved. What is this separate "conflict flag" for?
> 2-bit mode (0 = 1000644 (regular file without execution
> permission), 1 = 1000755 (regular file with execution
> permission), 2 = 1010000 (symbolic link), 3 = 1110
> (gitlink)) [5]
Don't penny-pinch bits like this.
> 1-bit skip-worktree flag (used by sparse checkout)
>
> 1-bit intent-to-add flag (used by "git add -N")
>
> 8-bit unused, must be zero [6]
>
> 32-bit mtime seconds, the last time a file's data changed
> this is stat(2) data
>
> 32-bit mtime nanosecond fractions
> this is stat(2) data
>
> 32-bit crc32 checksum over ctime seconds, ctime nanoseconds,
> ino, file size, dev, uid, gid (All stat(2) data except mtime) [7]
Giving occassional false positive to "did this change?" is acceptable, but
any false negative is absolutely unacceptable. How does this work with
something like "racy git" situation (i.e. coming from "mtime happens to be
the same as before") but due to crc32 collisions?
If there is no good answer to the above question, I would have to say that
anybody who suggested or passed this through review loses all the
accumulated reviewer karma points (if s/he has accumulated any, that is).
> 160-bit SHA-1 for the represented object
>
> 32-bit crc32 checksum for the file entry
>
> == Conflicted data
I do not think the data described in this section should be conflicting.
It ought to be data that describe conflicted state. Perhaps you meant
"Conflict data"?
> A conflict is represented in the index as a set of higher stage entries.
> These entries are stored at the end of the index. When a conflict is
> resolved (e.g. with "git add path"). A bit is flipped, to indicate that
> the conflict is resolved, but the entries will be kept, so that
> conflicts can be recreated (e.g. with "git checkout -m", in case users
> want to redo a conflict resolution from scratch.
>
> - NUL-terminated filename of the entry
Is this a name relative to its containing directory (i.e. without leading
components)? Or a full path relative to the top-level of the working
tree?
> - A 8-bit 'flags' field split into:
>
> - 1-bit conflicted state (conflicted/resolved) (1 if conflicted)
>
> - 7-bit unused
>
> - Three 4-byte octal numbers, entry mode of entries in stage 1 to 3 (a
> missing stage is represented by "0" in this field);
> and
>
> - At most three 160-bit object names of the entry in stages from 1 to 3
> (nothing is written for a missing stage).
It is allowed to have more than 1 entries in stage #1 to represent
multiple merge-base, so this needs to be rethought.
> - 32-bit crc32 checksum over one conflicted entry.
There is no definition of "one conflicted entry"; be consistent and say
"Conflicted data" as what the section header claims to describe.
> == Design explanations
>
> [1] The directory and file offsets are included in the index format
> to enable bisectability of the index, for binary searches.Updating
> a single entry and partial reading will benefit from this.
>
> [2] The directories are saved in their own block, to be able to
> quickly search for a directory in the index. They include a
> offset to the (lexically) first file in the directory.
>
> [3] The data of the cache-tree extension and the resolve undo
> extension is now part of the index itself, but if other extensions
> come up in the future, there is no need to change the index, they
> can simply be added at the end.
>
> [4] To avoid rewrites of the whole index when there are conflicts or
> conflicts are being resolved, conflicted data will be stored at
> the end of the index. To mark the conflict resolved, just a bit
> has to be flipped. The data will still be there, if a user wants
> to redo the conflict resolution.
>
> [5] Since only 4 modes are effectively allowed in git but 32-bit are
> used to store them, having a two bit flag for the mode is enough
> and saves 4 byte per entry.
>
> [6] The length of the file name was dropped, since each file name is
> nul terminated anyway.
This is micronit, but I think we do this to save one strlen() for each
read of the entry, except for unusually long paths where we fall back to
strlen(). A change like this needs to be justified better than simply
saying "because we _could_ compute in a different way by spending extra
cycles".
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-03 18:16 ` Thomas Rast
@ 2012-05-03 19:03 ` Junio C Hamano
2012-05-04 7:12 ` Michael Haggerty
1 sibling, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2012-05-03 19:03 UTC (permalink / raw)
To: Thomas Rast
Cc: Thomas Gummerer, git, gitster, mhagger, peff, spearce, davidbarr
Thomas Rast <trast@student.ethz.ch> writes:
> So what's wrong in this thinking?
We tolerate giving occasional false positive "Yes" to "Has this changed?",
but do not allow false negative "No". So the value st_{ino,dev,uid,gid}
gives us is strictly "these are not likely to change, but if even a single
bit in them change, we need to suspect that it may have changed." The
primary field that protect us is mtime, and all the rest are more or less
belt-and-suspender safety.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-03 18:54 ` Junio C Hamano
@ 2012-05-03 19:11 ` Thomas Rast
2012-05-03 19:31 ` Thomas Rast
2012-05-03 21:38 ` Thomas Gummerer
2 siblings, 0 replies; 49+ messages in thread
From: Thomas Rast @ 2012-05-03 19:11 UTC (permalink / raw)
To: Junio C Hamano
Cc: Thomas Gummerer, git, trast, mhagger, peff, spearce, davidbarr
Junio C Hamano <gitster@pobox.com> writes:
>> [6] The length of the file name was dropped, since each file name is
>> nul terminated anyway.
>
> This is micronit, but I think we do this to save one strlen() for each
> read of the entry, except for unusually long paths where we fall back to
> strlen(). A change like this needs to be justified better than simply
> saying "because we _could_ compute in a different way by spending extra
> cycles".
(Partially also answering the whole "what are these offsets" confusion)
The bisectability has evolved to the point where we envision the
structure of a "flat list" (the directories, and the files in each
directory) to have the format
offset to entry 1 [1]
....
offset to entry n
entry 1, consisting of:
name, nul-terminated
rest of data:
for dirs: cache-tree sha1, offset to files, etc.
for files: stat data, content sha1, flags, etc.
That makes bisection very easy: the offsets point at the start of each
string, so you just strcmp() and get on with it.
On the other hand, by the time you can look at the flags, it's too late
for the strlen() optimization anyway, so meh. If you think it's
important, we can perhaps lay it out so the rest of the data goes
immediately after the pointer. But so far it wasn't clear whether it's
fixed-size, or uses some smart compression scheme. Tonight's edition
has a fixed length, so perhaps it would be preferable to keep the
length.
Footnotes:
[1] It probably doesn't matter whether this is relative to the position
of the offset, or absolute (in terms of file pointer).
--
Thomas Rast
trast@{inf,student}.ethz.ch
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-03 18:54 ` Junio C Hamano
2012-05-03 19:11 ` Thomas Rast
@ 2012-05-03 19:31 ` Thomas Rast
2012-05-03 19:32 ` Thomas Rast
2012-05-03 21:38 ` Thomas Gummerer
2 siblings, 1 reply; 49+ messages in thread
From: Thomas Rast @ 2012-05-03 19:31 UTC (permalink / raw)
To: Junio C Hamano
Cc: Thomas Gummerer, git, trast, mhagger, peff, spearce, davidbarr
Junio C Hamano <gitster@pobox.com> writes:
> Thomas Gummerer <t.gummerer@gmail.com> writes:
>
>> I have been drafting the Version 5 of the index format over the past
>> few days with the help of Thomas Rast, Michael Haggerty, cmn and
>> barrbrain on IRC.
>
> Hrm, so if there is anything glaringly wrong below, should I reduce the
> "trustable reviewer karma point" from these people? Or did you forget to
> say "but remaining errors are mine" ;-)?
Heh.
Partly it's my fault, I told Thomas to get this out tonight for a single
reason: Michael's CRC-over-stat idea is so radical that I wanted to know
whether there's anything wrong with it. So the misunderstanding is
really that this is not anywhere near the final result.
But yeah:
> > 32-bit crc32 checksum over ctime seconds, ctime nanoseconds,
> > ino, file size, dev, uid, gid (All stat(2) data except mtime) [7]
>
> Giving occassional false positive to "did this change?" is acceptable, but
> any false negative is absolutely unacceptable. How does this work with
> something like "racy git" situation (i.e. coming from "mtime happens to be
> the same as before") but due to crc32 collisions?
>
> If there is no good answer to the above question, I would have to say that
> anybody who suggested or passed this through review loses all the
> accumulated reviewer karma points (if s/he has accumulated any, that is).
If this is a problem, then the fault is with me (and I do hope I have
some karma to lose...).
Note that the scenario you outlined is not an issue. The entries other
than mtime and ctime are really only compared for equality, see
e.g. ce_match_stat_basic(). Comparisons for equality never have false
negatives with any hash function. Collisions are false positives.
Upon closer reading I noticed that the ie_match_stat and ce_match_stat_*
family actually to distinguish between basically all the fields that can
change. But this knowledge is never actually put to use, except that
there's an optimized code path where
* mode/type difference implies changed entry
* size difference implies changed entry
So that does constitute an argument to not put the size in the
stat-hash. Mode and type aren't part of it in the proposed format
anyway.
--
Thomas Rast
trast@{inf,student}.ethz.ch
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-03 19:31 ` Thomas Rast
@ 2012-05-03 19:32 ` Thomas Rast
2012-05-03 20:32 ` Junio C Hamano
0 siblings, 1 reply; 49+ messages in thread
From: Thomas Rast @ 2012-05-03 19:32 UTC (permalink / raw)
To: Thomas Rast
Cc: Junio C Hamano, Thomas Gummerer, git, mhagger, peff, spearce,
davidbarr
Thomas Rast <trast@student.ethz.ch> writes:
> Comparisons for equality never have false negatives with any hash
> function. Collisions are false positives.
Bah, shouldn't hit "send" so fast. There goes my karma :-D
--
Thomas Rast
trast@{inf,student}.ethz.ch
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-03 17:25 Index format v5 Thomas Gummerer
` (2 preceding siblings ...)
2012-05-03 18:54 ` Junio C Hamano
@ 2012-05-03 19:38 ` solo-git
2012-05-04 13:20 ` Nguyen Thai Ngoc Duy
` (5 subsequent siblings)
9 siblings, 0 replies; 49+ messages in thread
From: solo-git @ 2012-05-03 19:38 UTC (permalink / raw)
To: git
On Thu, May 03, 2012 at 07:25:12PM +0200, Thomas Gummerer wrote:
> 160-bit object name for the object that would result from writing
> this span of index as a tree.
Would now be a fun time to bring up the idea of eventually migrating
away from 160-bit hashes? I love to keep reminding people that SHA-1
is showing its age[1].
Even if we don't add support for variable length hashes at this point,
it would be nice to at least have the code written in a relatively
hash-length-independent way, such that someone could come along and
make a new, very similar, format without rewriting much of the code.
If there is going to be support for different hash algorithms, or even
different length hashes, we need space in the file (before the first
hash, obviously) to indicate what we're using; currently reserved.
For this, we need to decide on how we're going to store the algorithm;
my guesses would be either:
* a (1-octet) enum, like tls (0: sha1/160, 1: sha2-256/160 [sha2's
256-bit mode truncated to 160-bits using a recommended truncation mode],
2: keccak-512, etc.); stopping at 127 (i.e. reserving the high-bit for
future use), or
* a lowascii string, with prefixed 1 (or 2?)-octet length, like ssh,
using the above format. If people believe that the chance of this being
implemented (within the lifespan of v5) is tiny, this octet can be
reserved; i.e. a zero-length string means sha1/160.
tl;dr: Please reserve one or more octet(s) early on for the details of
the hash, and think about people who want more than 20/40 octets
available when implementing it.
--
Chris West (FauxFaux).
[1]: http://csrc.nist.gov/groups/ST/hash/policy.html "...and must use
the SHA-2 family of hash functions for these applications after 2010."
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-03 19:32 ` Thomas Rast
@ 2012-05-03 20:32 ` Junio C Hamano
0 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2012-05-03 20:32 UTC (permalink / raw)
To: Thomas Rast; +Cc: Thomas Gummerer, git, mhagger, peff, spearce, davidbarr
Thomas Rast <trast@student.ethz.ch> writes:
> Thomas Rast <trast@student.ethz.ch> writes:
>
>> Comparisons for equality never have false negatives with any hash
>> function. Collisions are false positives.
>
> Bah, shouldn't hit "send" so fast. There goes my karma :-D
Heh, don't take "karma" thing too seriously. I was merely trying to be
funny and I seem to have failed the attempt.
It depends on what question you are asking, and against "Did this change?"
question, saying "No, it didn't" when the right answer is "Yes, it did" is
a false negative. You are asking for a false negative if you substitute
byte-for-byte comparison with comparison of hashed results of the values
that needs to be compared.
But as I said, mtime is the primary thing that protects us, so it may not
be such a big deal.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-03 18:21 ` Ronan Keryell
@ 2012-05-03 20:36 ` Thomas Gummerer
0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-03 20:36 UTC (permalink / raw)
To: Ronan Keryell; +Cc: git
On Thu, May 3, 2012 at 8:21 PM, Ronan Keryell
<Ronan.Keryell@hpc-project.com> wrote:
>>>>>> On Thu, 3 May 2012 19:25:12 +0200, Thomas Gummerer <t.gummerer@gmail.com> said:
>
> Thomas> I have been drafting the Version 5 of the index format over
> Thomas> the past few days with the help of Thomas Rast, Michael
> Thomas> Haggerty, cmn and barrbrain on IRC. It will save with prefix
> Thomas> compression on the path, and using a crc32 over the stat
> Thomas> data, instead of the full data, since it is only used for
> Thomas> checking if the file is changed. (Thanks Michael Haggerty
> Thomas> for this hint. Unless we are missing something this will
> Thomas> save another ~4 MB on the Webkit index.
>
> Great!
>
> But I wonder whether it may not worth to investigate a 64-bit version for
> the offsets and so on, just in case...
64-bit versions of the offsets were taken into consideration, but currently
the Webkit index (the largest I know) has a size of 26 Mb, which is
reduced to about 15 MB or less with the v5 format. With 32-bit we can
address 4GB, which is about 266 times the Webkit index. Therefore there
probably is no use for 64-bit offsets in the years to come.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-03 18:54 ` Junio C Hamano
2012-05-03 19:11 ` Thomas Rast
2012-05-03 19:31 ` Thomas Rast
@ 2012-05-03 21:38 ` Thomas Gummerer
2012-05-07 18:57 ` Robin Rosenberg
2 siblings, 1 reply; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-03 21:38 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, trast, mhagger, peff, spearce, davidbarr
On Thu, May 3, 2012 at 8:54 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Thomas Gummerer <t.gummerer@gmail.com> writes:
>
>> I have been drafting the Version 5 of the index format over the past
>> few days with the help of Thomas Rast, Michael Haggerty, cmn and
>> barrbrain on IRC.
>
> Hrm, so if there is anything glaringly wrong below, should I reduce the
> "trustable reviewer karma point" from these people? Or did you forget to
> say "but remaining errors are mine" ;-)?
Heh yes most of the remaining errors are mine ;-) (And I hope nothing is
*that* wrong ;-) )
>> GIT index format
>> ================
>>
>> = The git index file has the following format
>>
>> All binary numbers are in network byte order. Version 5 is described
>> here.
>>
>> - A 16-byte header consisting of
>>
>> 4-byte signature:
>> The signature is { 'D', 'I', 'R', 'C' } (stands for "dircache")
>>
>> 4-byte version number:
>> The current supported versions are 2, 3, 4 and 5.
>>
>> 32-bit number of directories.
>
>> 32-bit number of file entries.
>
> I take these two numbers mean "we have this many directory entries in the
> index" and "we have this many file entries in the index"; I found it
> unclear during my first scan.
Yes, that's exactly what they mean. I'll write this down more clearly in the
next version.
> Have you considered expressing these new "directories", "files" and
> associated data as a new (mandatory) extension?
Since they are the main part of the index, I don't consider them
extensions, unless we say "everything in the index is a extension"
>> - Offset to the extensions.
>>
>> 32-bit number of extensions.
>
> Why is this necessary? It means that the writer needs to enumerate how
> many extensions it is going to write before starting to write, unless it
> is willing to seek back and fill this. And it is not like the reader will
> first allocate an array to hold uniformly sized extension and will be
> helped to have the number of entries upfront in order to do the
> allocation.
>
> What purpose does this serve?
It's to determine how many offsets the reader has to read (or to skip
to get to the first directory offset).
>> 32-bit number offset to the extension. (Possibly none, as many as
>> indicated in the 4-byte number of extensions)
>
> Why is this needed? It appears that there is no field that points at the
> beginning of array that stores "directory offsets", nor "file offsets", so
> I am assuming that you will be scanning the index file to find them. Why
> can't the extensions be handled exactly the same way?
The offsets are necessary to quickly skip to the extension data, in case it
makes sense without knowing about the index/without reading the whole
index. The directories and files are different in the sense that the reader
needs to read to the first "directory offset" entry in any case, since it needs
to read the header in any case, then the number of extensions (and
if it doesn't know about any extensions skipping them), and then there come
the directory offsets. Those then have the pointer to the actual directory
data, which itself has the pointer to it's files. And since the files don't make
sense without knowing about the directory, there is no need for offsets for
the files.
>> - A number of directory offsets (see below). [1]
>>
>> - A number of sorted directories (see below). [2]
>>
>> - 32-bit crc32 checksum for the header, extension offsets and directories.
>
> What does "32" in "crc32" stand for ? ;-)
32 bit I guess ;-) It's just standing there to make it clear on first
sight that there
is a 32-bit entry, since every other entry starts with the bit/byte count too.
>> - A number of file offsets (see below). [1]
>>
>> - A number of file entries (see below).
>>
>> - A number of entries for conflicted data/resolved conflicts (see below).
>>
>> - Extensions
>>
>> Extensions are identified by signature. Optional extensions can
>> be ignored if GIT does not understand them.
>>
>> GIT supports an arbitrary number of extension, but currently none
>> is implemented. [3]
>>
>> 4-byte extension signature. If the first byte is 'A'..'Z' the
>> extension is optional and can be ignored.
>>
>> 32-bit size of the extension
>>
>> 32-bit crc32 checksum of the extension signature and size
>>
>> Extension data
>>
>>
>> == Directory entry offsets
>
> The name "directory entry offset" does not seem to appear anywhere before
> this, so it is unclear what this section of the documentation is trying to
> describe. Is this the same as "directory offsets" above? In other words,
> "A number of directory offsets (see below)." above is an array of something,
> and this section is trying to describe what that something is?
Yes, exactly, it's the directory offsets above. I'll write this down more
clearly too.
>> 32-bit offset to the directory.
>
> I take it is "offset relative to the beginning of the file in the index".
Yes, that will be written more clearly too.
>> This part is needed for making the directory entries bisectable and
>> thus allowing a binary search.
>>
>> == Directory entry
>>
>> Directory entries are sorted in lexicographic order by the name
>> of their path starting with the root.
>>
>> Path names (variable length) relative to top level directory (without the
>> leading slash). '/' is used as path separator. '.' indicates the root
>> directory. The special patch components ".." and ".git" (without quotes)
>> are disallowed. Trailing slash is also disallowed.
>
> I understood "the root" to mean "the top level of the directory hierarchy"
> (i.e. the directory that corresponds to the top level of the working
> tree), but it needs to be explained better. Using '.' for root sounds
> somewhat questionable, though. Why not a string of length 0?
Yes, that's right. I thought about '.' for root since *nix environments do it
that way too. But the string of length 0 would also work better with the
lexicographic sorting.
> When an index represents a D/F conflict, some stages may have a directory
> D while others do not have D (but have a regular file at D). Doesn't
> directory entry need to have a stage information?
I didn't think about that, I'll include that in the next version.
>> 1 nul byte to terminate the path.
>>
>> 32-bit offset to the first file of a directory
>
> What does this point at? Does it point into the array of "file offsets"
> or the array of "file entries"?
It points to the "file offsets".
>> 32-bit offset to conflicted/resolved data at the end of the index.
>> 0 if there is no such data. [4]
>>
>> 4-byte number of subtrees this tree has
>
> Which is an undefined number unless you specify which stage you are
> talking about.
I'm talking about the number of subtrees, that are in the index. This
is used for further compressing the index, and not being redundant
with path components. E.g: we have a directory A/A then the index
saves it as
A...1Subtree...
A...0Subtrees...
(... is the rest of the data that's stored)
>> 4-byte number of entries in the index that is covered by the tree this
>> entry represents. (entry_count) (-1 if the entry is invalid)
>>
>> 160-bit object name for the object that would result from writing
>> this span of index as a tree.
>>
>> The last 24 bytes are for the cache tree. An entry can be in an
>> invalidated state which is represented by having -1 in the entry_count
>> field. If an entry is in invalidated state, the next entry will begin
>> after the number of subtrees, and the 160-bit object name is dropped.
>
> By "The last 24 bytes", do you mean the "4-byte number of entries..." and
> "160-bit object name"?
Yes, exactly.
>> The entries are written out in the top-down, depth-first order. The
>> first entry represents the root level of the repository, followed by
>> the first subtree - let's call it A - of the root level, followed by
>> the first subtree of A, ...
>>
>> == File entry offsets
>>
>> 32-bit offset to the directory.
>
> What directory? The containing directory? What does this point at? Does
> it point into which array?
That's actually wrong it should say file entry, as it should below. That's what
I get from copy and pasting ;-)
>> This part is needed for making the directory entries bisectable and
>> thus allowing a binary search.
>>
>> == File entry
>>
>> File entries are sorted in ascending order on the name field, after the
>> respective offset given by the directory entries.
>>
>> File name (variable length). Nul bytes are not allowed in file names and
>> they have no leading slash. They are 7-bit ASCII encoded.
>
> Is this a name relative to its containing directory (i.e. without leading
> components)? Or is it a full path relative to the top level of the
> working tree?
The path is relative to its containing directory, to save space, as in your
v4 index format.
> I have some UTF-8 encoded files in my repository. Are they
> now disallowed?
Right I should have taken the old format there, and leave the
files in undefined encoding?
>> 1 nul byte to terminate the filename.
>>
>> A 16-bit 'flags' field split into (high to low bits)
>>
>> 1-bit assume-valid flag
>
> Is this "assume unchanged"?
I have taken this from the old index format documentation, which
describes it as assume-valid flag, but I guess it's assume unchanged then.
>> 1-bit conflict flag
>>
>> 2-bit stage (during merge)
>
> Huh? When stage #0 entry exists for a given path, no other stages for the
> same path can exist in the index. By definition, that is how a conflicted
> path is resolved. What is this separate "conflict flag" for?
I probably got this wrong, and the conflict flag isn't needed. So if I have a
stage #1 entry in the index, I'm sure that it's conflicted?
>> 2-bit mode (0 = 1000644 (regular file without execution
>> permission), 1 = 1000755 (regular file with execution
>> permission), 2 = 1010000 (symbolic link), 3 = 1110
>> (gitlink)) [5]
>
> Don't penny-pinch bits like this.
Heh ok, I thought this saves 4-bytes per entry so it would be good,
but if it's better the other way we'll invest 16-bits for this.
>> 1-bit skip-worktree flag (used by sparse checkout)
>>
>> 1-bit intent-to-add flag (used by "git add -N")
>>
>> 8-bit unused, must be zero [6]
>>
>> 32-bit mtime seconds, the last time a file's data changed
>> this is stat(2) data
>>
>> 32-bit mtime nanosecond fractions
>> this is stat(2) data
>>
>> 32-bit crc32 checksum over ctime seconds, ctime nanoseconds,
>> ino, file size, dev, uid, gid (All stat(2) data except mtime) [7]
>
> Giving occassional false positive to "did this change?" is acceptable, but
> any false negative is absolutely unacceptable. How does this work with
> something like "racy git" situation (i.e. coming from "mtime happens to be
> the same as before") but due to crc32 collisions?
>
> If there is no good answer to the above question, I would have to say that
> anybody who suggested or passed this through review loses all the
> accumulated reviewer karma points (if s/he has accumulated any, that is).
>
>> 160-bit SHA-1 for the represented object
>>
>> 32-bit crc32 checksum for the file entry
>>
>> == Conflicted data
>
> I do not think the data described in this section should be conflicting.
> It ought to be data that describe conflicted state. Perhaps you meant
> "Conflict data"?
Yes, it's conflict data. It's the data that used to have more entries in the
in the index, and it has been moved to the far end to avoid lots of rewrites
of the index, when a conflict is resolved.
>> A conflict is represented in the index as a set of higher stage entries.
>> These entries are stored at the end of the index. When a conflict is
>> resolved (e.g. with "git add path"). A bit is flipped, to indicate that
>> the conflict is resolved, but the entries will be kept, so that
>> conflicts can be recreated (e.g. with "git checkout -m", in case users
>> want to redo a conflict resolution from scratch.
>>
>> - NUL-terminated filename of the entry
>
> Is this a name relative to its containing directory (i.e. without leading
> components)? Or a full path relative to the top-level of the working
> tree?
This is again relative to it's containing directory. That's what the second
offsets in the directory entry part is about. (The offset is pointing to the
filename)
>> - A 8-bit 'flags' field split into:
>>
>> - 1-bit conflicted state (conflicted/resolved) (1 if conflicted)
>>
>> - 7-bit unused
>>
>> - Three 4-byte octal numbers, entry mode of entries in stage 1 to 3 (a
>> missing stage is represented by "0" in this field);
>> and
>>
>> - At most three 160-bit object names of the entry in stages from 1 to 3
>> (nothing is written for a missing stage).
>
> It is allowed to have more than 1 entries in stage #1 to represent
> multiple merge-base, so this needs to be rethought.
Ok, I'll rethink this. Thought it's this way, because it currently
works this way
in the resolve undo extension.
>> - 32-bit crc32 checksum over one conflicted entry.
>
> There is no definition of "one conflicted entry"; be consistent and say
> "Conflicted data" as what the section header claims to describe.
Ok, I'll change that.
>> == Design explanations
>>
>> [1] The directory and file offsets are included in the index format
>> to enable bisectability of the index, for binary searches.Updating
>> a single entry and partial reading will benefit from this.
>>
>> [2] The directories are saved in their own block, to be able to
>> quickly search for a directory in the index. They include a
>> offset to the (lexically) first file in the directory.
>>
>> [3] The data of the cache-tree extension and the resolve undo
>> extension is now part of the index itself, but if other extensions
>> come up in the future, there is no need to change the index, they
>> can simply be added at the end.
>>
>> [4] To avoid rewrites of the whole index when there are conflicts or
>> conflicts are being resolved, conflicted data will be stored at
>> the end of the index. To mark the conflict resolved, just a bit
>> has to be flipped. The data will still be there, if a user wants
>> to redo the conflict resolution.
>>
>> [5] Since only 4 modes are effectively allowed in git but 32-bit are
>> used to store them, having a two bit flag for the mode is enough
>> and saves 4 byte per entry.
>>
>> [6] The length of the file name was dropped, since each file name is
>> nul terminated anyway.
>
> This is micronit, but I think we do this to save one strlen() for each
> read of the entry, except for unusually long paths where we fall back to
> strlen(). A change like this needs to be justified better than simply
> saying "because we _could_ compute in a different way by spending extra
> cycles".
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-03 18:16 ` Thomas Rast
2012-05-03 19:03 ` Junio C Hamano
@ 2012-05-04 7:12 ` Michael Haggerty
2012-05-07 22:18 ` Robin Rosenberg
1 sibling, 1 reply; 49+ messages in thread
From: Michael Haggerty @ 2012-05-04 7:12 UTC (permalink / raw)
To: Thomas Rast; +Cc: Thomas Gummerer, git, gitster, peff, spearce, davidbarr
On 05/03/2012 08:16 PM, Thomas Rast wrote:
> Thomas Gummerer<t.gummerer@gmail.com> writes:
>
>> 32-bit crc32 checksum over ctime seconds, ctime nanoseconds,
>> ino, file size, dev, uid, gid (All stat(2) data except mtime) [7]
> [...]
>> [7] Since all stat data (except mtime and ctime) is just used for
>> checking if a file has changed a checksum of the data is enough.
>> In addition to that Thomas Rast suggested ctime could be ditched
>> completely (core.trustctime=false) and thus included in the
>> checksum. This would save 24 bytes per index entry, which would
>> be about 4 MB on the Webkit index.
>> (Thanks for the suggestion to Michael Haggerty)
>
> This is the part I'm most curious about. Are we missing anything?
> Michael brought it up on IRC: the stat() results are only used to test
> whether they are still the same, with the exception of the mtime (which
> also undergoes raciness checks).
>
> As far as I can see, none of st_{ino,dev,uid,gid} are useful for
> anything. st_size might conceivably be used as a hint for a buffer
> size, but nobody actually does that. The ctime undergoes stricter
> checks, but AFAICS it's also all about whether it has changed, and
> besides that can be turned off. We think all of those fields can be
> replaced by an arbitrary hash/CRC and only tested for equality. 32 bits
> should be plenty, probably even if we just xor the values together.
XOR is definitely *not* adequate; for example, changing uid=gid="you" to
uid=gid="me" would not affect the XOR of the values (assuming, as is
often the case, that each user has his own uid/gid with the same
numerical values).
Which hash to use depends on some estimate of the likelihood that the
hashes collide and simultaneously that the other metadata coincide. It
seems to me that CRC-32 would be adequate. But if not, a longer hash
could be used (albeit with less space savings).
Michael
--
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-03 17:25 Index format v5 Thomas Gummerer
` (3 preceding siblings ...)
2012-05-03 19:38 ` solo-git
@ 2012-05-04 13:20 ` Nguyen Thai Ngoc Duy
2012-05-04 15:44 ` Thomas Gummerer
2012-05-04 13:25 ` Philip Oakley
` (4 subsequent siblings)
9 siblings, 1 reply; 49+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-05-04 13:20 UTC (permalink / raw)
To: Thomas Gummerer; +Cc: git, trast, gitster, mhagger, peff, spearce, davidbarr
On Fri, May 4, 2012 at 12:25 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> GIT index format
> ================
>
> = The git index file has the following format
>
> All binary numbers are in network byte order. Version 5 is described
> here.
> ...
> - A number of directory offsets (see below). [1]
>
> - A number of sorted directories (see below). [2]
>
> - 32-bit crc32 checksum for the header, extension offsets and directories.
So we use one checksum for all dirs? I thought we could do checksum
per dir, so if I'm interested in path/to/here only, I only need to
verify data of three directories.
> == Directory entry offsets
>
> 32-bit offset to the directory.
>
> This part is needed for making the directory entries bisectable and
> thus allowing a binary search.
How is this (I assume) array ordered? The same top-down depth-first
with "Directory entry" section below? I can see ordering as
top-down/breadth-first help bsearch though.
> == Directory entry
>
> Directory entries are sorted in lexicographic order by the name
> of their path starting with the root.
>
> Path names (variable length) relative to top level directory (without the
> leading slash). '/' is used as path separator. '.' indicates the root
> directory. The special patch components ".." and ".git" (without quotes)
> are disallowed. Trailing slash is also disallowed.
>
> 1 nul byte to terminate the path.
I don't see it mention prefix compression here, nor in "file entry"
section. Does it use it here? If so I don't think prefix compression
plays well with bsearch (on path name). In the worst case you may have
to process up to the first entry in order to get a path name (e.g. a
directory with entries "a", "aa", "aaa", "aaaa"...)
> The entries are written out in the top-down, depth-first order. The
> first entry represents the root level of the repository, followed by
> the first subtree - let's call it A - of the root level, followed by
> the first subtree of A, ...
So depth-first traversal becomes natural even without the help of
directory offset table above. Nice.
> == File entry
>
> File entries are sorted in ascending order on the name field, after the
> respective offset given by the directory entries.
I wonder if we need to keep file entry table separate from directory
entry. It feels more natural to put the sequence of file entries of a
directory right after the directory entry, might help read-ahead too
during traversal. You save 4 bytes (for file entry offset) in each
directory entry. You still have file offset table for random access.
> File name (variable length). Nul bytes are not allowed in file names and
> they have no leading slash. They are 7-bit ASCII encoded.
Why can't it be 8-bit? I suppose file name is also prefix compressed?
--
Duy
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-03 17:25 Index format v5 Thomas Gummerer
` (4 preceding siblings ...)
2012-05-04 13:20 ` Nguyen Thai Ngoc Duy
@ 2012-05-04 13:25 ` Philip Oakley
2012-05-04 15:46 ` Junio C Hamano
2012-05-06 10:23 ` Nguyen Thai Ngoc Duy
` (3 subsequent siblings)
9 siblings, 1 reply; 49+ messages in thread
From: Philip Oakley @ 2012-05-04 13:25 UTC (permalink / raw)
To: Thomas Gummerer, git; +Cc: trast, gitster, mhagger, peff, spearce, davidbarr
From: "Thomas Gummerer" <t.gummerer@gmail.com> Sent: Thursday, May 03, 2012
6:25 PM
>I have been drafting the Version 5 of the index format over the past
> few days with the help of Thomas Rast, Michael Haggerty, cmn and
> barrbrain on IRC. It will save with prefix compression on the path, and
> using a crc32 over the stat data, instead of the full data, since it is
> only
> used for checking if the file is changed. (Thanks Michael Haggerty for
> this hint. Unless we are missing something this will save another
> ~4 MB on the Webkit index.
>
>
>
> GIT index format
> ================
>
> = The git index file has the following format
>
xxx
>
> == Directory entry
>
> Directory entries are sorted in lexicographic order by the name
> of their path starting with the root.
>
> Path names (variable length) relative to top level directory (without the
> leading slash). '/' is used as path separator. '.' indicates the root
> directory. The special patch components ".." and ".git" (without
> quotes)
> are disallowed. Trailing slash is also disallowed.
>
Does the prohibition of ".git" prevent the potential for versioning of the
.git directory itself (e.g. .gitignore the pack & objects themselves)?
It should be possible for one's current repo status to be tracked and
summarised by a single sha1, say as an indepenedent branch, as you would the
code itself.
[my use case is in a managed environment where one has plenty of networked
project storage, but no real availability of a network server, so anybody
and his mate could corrupt ones network repo with badly thought through
tweaks. The lack of a server is the root cause, but that's not something the
project can fix, so thinks ... 'version the meta data...']
I can see that the lack of the leading "/", and the ".." are a policy
statement about paths being relative to top level directory, with direct
path referencing. The typical avoidance of ".git" should be a note about how
regular git works, rather than an absolute prohibition.
Philip
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-04 13:20 ` Nguyen Thai Ngoc Duy
@ 2012-05-04 15:44 ` Thomas Gummerer
0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-04 15:44 UTC (permalink / raw)
To: Nguyen Thai Ngoc Duy
Cc: git, trast, gitster, mhagger, peff, spearce, davidbarr
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=unknown-8bit, Size: 4205 bytes --]
On 05/04, Nguyen Thai Ngoc Duy wrote:
> On Fri, May 4, 2012 at 12:25 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> > GIT index format
> > ================
> >
> > = The git index file has the following format
> >
> > All binary numbers are in network byte order. Version 5 is described
> > here.
> > ...
> > - A number of directory offsets (see below). [1]
> >
> > - A number of sorted directories (see below). [2]
> >
> > - 32-bit crc32 checksum for the header, extension offsets and directories.
>
> So we use one checksum for all dirs? I thought we could do checksum
> per dir, so if I'm interested in path/to/here only, I only need to
> verify data of three directories.
Good point. Not sure how they could exactly be implemented, but probably
one checksum for offset + directory data. I'll definitely think about this.
> > == Directory entry offsets
> >
> > 32-bit offset to the directory.
> >
> > This part is needed for making the directory entries bisectable and
> > thus allowing a binary search.
>
> How is this (I assume) array ordered? The same top-down depth-first
> with "Directory entry" section below? I can see ordering as
> top-down/breadth-first help bsearch though.
True, the breadth-first approach might be better, since we are using
prefix compression for the pathname. It will need some more offsets
(or calculation, but should still be faster)
> > == Directory entry
> >
> > Directory entries are sorted in lexicographic order by the name
> > of their path starting with the root.
> >
> > Path names (variable length) relative to top level directory (without the
> > leading slash). '/' is used as path separator. '.' indicates the root
> > directory. The special patch components ".." and ".git" (without quotes)
> > are disallowed. Trailing slash is also disallowed.
> >
> > 1 nul byte to terminate the path.
>
> I don't see it mention prefix compression here, nor in "file entry"
> section. Does it use it here? If so I don't think prefix compression
> plays well with bsearch (on path name). In the worst case you may have
> to process up to the first entry in order to get a path name (e.g. a
> directory with entries "a", "aa", "aaa", "aaaa"...)
I planned to use prefix compression here, which would benefit especially
the reader (we're reading more often then writing). By designing the
offsets carefully we should still be able to get log(n) (n = number of
directories in the index) search time for a directory.
> > The entries are written out in the top-down, depth-first order. The
> > first entry represents the root level of the repository, followed by
> > the first subtree - let's call it A - of the root level, followed by
> > the first subtree of A, ...
>
> So depth-first traversal becomes natural even without the help of
> directory offset table above. Nice.
>
> > == File entry
> >
> > File entries are sorted in ascending order on the name field, after the
> > respective offset given by the directory entries.
>
> I wonder if we need to keep file entry table separate from directory
> entry. It feels more natural to put the sequence of file entries of a
> directory right after the directory entry, might help read-ahead too
> during traversal. You save 4 bytes (for file entry offset) in each
> directory entry. You still have file offset table for random access.
The reason for this design choice is the fast searching of a directory,
(for partial reading or changing a single file in the index). Keeping
them separate also simplifies the reading of the cache-tree, which will
be included in the directory section. Instead of offsets to the first file
we'd need offsets to the next directory to enable fast reading of the
cache-tree.
> > File name (variable length). Nul bytes are not allowed in file names and
> > they have no leading slash. They are 7-bit ASCII encoded.
>
> Why can't it be 8-bit? I suppose file name is also prefix compressed?
I changed that, the file name can have UTF8 or ASCII encoding, as it was
allowed in the old index.
--
Thomas
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-04 13:25 ` Philip Oakley
@ 2012-05-04 15:46 ` Junio C Hamano
0 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2012-05-04 15:46 UTC (permalink / raw)
To: Philip Oakley
Cc: Thomas Gummerer, git, trast, mhagger, peff, spearce, davidbarr
"Philip Oakley" <philipoakley@iee.org> writes:
> Does the prohibition of ".git" prevent the potential for versioning of
> the .git directory itself (e.g. .gitignore the pack & objects
> themselves)?
It is a good thing that you are forbidden from adding .git/anything to the
index and the trees. The information inside your .git/ repository does
have to change from time to time. For example, you may change your e-mail
address and update user.email in your .git/config file.
But the history of them and the history of the evolution of the project
itself are distinct [*1*]. When you clone git.git, you have no business
learning what is in .git/config in my repository. The project should not
be able to overwrite what is in your .git/hooks/. These are only a few
reasons why an attempt to "git add .git/something" is always a mistake.
And it is irrelevant to this discussion, as the rest of .git does not
allow putting .git anyway. In that sense, the index "format" does not
have to enforce it. For that reason, I would suggest dropping the mention
of ".git" (but not "..") from the format description.
[Footnote]
*1* You could choose to track the contents of .git/ in another repository
with creative uses of GIT_DIR/GIT_WORK_TREE/core.worktree and friends.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-03 17:25 Index format v5 Thomas Gummerer
` (5 preceding siblings ...)
2012-05-04 13:25 ` Philip Oakley
@ 2012-05-06 10:23 ` Nguyen Thai Ngoc Duy
2012-05-07 13:44 ` Thomas Gummerer
2012-05-06 16:49 ` Phil Hord
` (2 subsequent siblings)
9 siblings, 1 reply; 49+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-05-06 10:23 UTC (permalink / raw)
To: Thomas Gummerer; +Cc: git, trast, gitster, mhagger, peff, spearce, davidbarr
On Fri, May 4, 2012 at 12:25 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> == Directory entry
>
> Directory entries are sorted in lexicographic order by the name
> of their path starting with the root.
>
> Path names (variable length) relative to top level directory (without the
> leading slash). '/' is used as path separator. '.' indicates the root
> directory. The special patch components ".." and ".git" (without quotes)
> are disallowed. Trailing slash is also disallowed.
>
> 1 nul byte to terminate the path.
>
> 32-bit offset to the first file of a directory
>
> 32-bit offset to conflicted/resolved data at the end of the index.
> 0 if there is no such data. [4]
If it's non-zero, how do we know how many conflict entries we have?
> 4-byte number of subtrees this tree has
let's name this nr_subtrees
> 4-byte number of entries in the index that is covered by the tree this
> entry represents. (entry_count) (-1 if the entry is invalid)
and this nr_entries.
So how do we know how many entries (including all dirs, files, staged
files) this directory has? I assume if enry_count != -1, the number
would be nr_subtrees + nr_entries (or just nr_entries, depending on
your definition). When entry_count == -1, how do we calculate this
number?
> 160-bit object name for the object that would result from writing
> this span of index as a tree.
>
> The last 24 bytes are for the cache tree. An entry can be in an
> invalidated state which is represented by having -1 in the entry_count
> field. If an entry is in invalidated state, the next entry will begin
> after the number of subtrees, and the 160-bit object name is dropped.
>
> The entries are written out in the top-down, depth-first order. The
> first entry represents the root level of the repository, followed by
> the first subtree - let's call it A - of the root level, followed by
> the first subtree of A, ...
Assume the command is "git diff -- path/to/h*", we don't need full
index, just stuff in "path/to/h*" from the index. I'm trying to see
how to load just those paths from index, not full index.
I assume again that you won't invent a new function and use
tree_entry_interesting() to do tree pruning while loading index.
t_e_i() is designed to read tree objects. But I think we can make it
read on-disk directory/file entries with a few small changes. t_e_i()
is recursive and fits quite well with depth-first directory layout in
the proposed index format.
I have difficulties figuring out how you skip subtrees though. Assume
we are at "path" and we are not interested in anything there until we
meet "path/to", how do you skip subtrees "path/abc" and "path/def"?
Processing directory entries sequentially will eventually get us to
"path/to", but that could be a lot of entries if "path/abc" is deep. A
file offset pointer to the next sibling directory entry might help.
Does such a pointer exist but I did not see it, or you have other
means to do this?
Also the file/dir separation makes it more difficult to match the last
"h*" part, if there are both "here" directory and "howto" file.
--
Duy
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-03 17:25 Index format v5 Thomas Gummerer
` (6 preceding siblings ...)
2012-05-06 10:23 ` Nguyen Thai Ngoc Duy
@ 2012-05-06 16:49 ` Phil Hord
2012-05-07 13:08 ` Thomas Gummerer
2012-05-07 15:15 ` Michael Haggerty
2012-05-13 21:01 ` Philip Oakley
9 siblings, 1 reply; 49+ messages in thread
From: Phil Hord @ 2012-05-06 16:49 UTC (permalink / raw)
To: Thomas Gummerer; +Cc: git, trast, gitster, mhagger, peff, spearce, davidbarr
On Thu, May 3, 2012 at 1:25 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> I have been drafting the Version 5 of the index format over the past
> few days with the help of Thomas Rast, Michael Haggerty, cmn and
> barrbrain on IRC. It will save with prefix compression on the path, and
> using a crc32 over the stat data, instead of the full data, since it is only
> used for checking if the file is changed. (Thanks Michael Haggerty for
> this hint. Unless we are missing something this will save another
> ~4 MB on the Webkit index.
...
> == Directory entry
>
> Directory entries are sorted in lexicographic order by the name
> of their path starting with the root.
>
> Path names (variable length) relative to top level directory (without the
> leading slash). '/' is used as path separator. '.' indicates the root
> directory. The special patch components ".." and ".git" (without quotes)
> are disallowed. Trailing slash is also disallowed.
typo: The special _path_ components ".." and ".git" ...
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-06 16:49 ` Phil Hord
@ 2012-05-07 13:08 ` Thomas Gummerer
0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-07 13:08 UTC (permalink / raw)
To: Phil Hord; +Cc: git, trast, gitster, mhagger, peff, spearce, davidbarr
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=unknown-8bit, Size: 386 bytes --]
> > Path names (variable length) relative to top level directory (without the
> > leading slash). '/' is used as path separator. '.' indicates the root
> > directory. The special patch components ".." and ".git" (without quotes)
> > are disallowed. Trailing slash is also disallowed.
>
> typo: The special _path_ components ".." and ".git" ...
Thanks, I changed it.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-06 10:23 ` Nguyen Thai Ngoc Duy
@ 2012-05-07 13:44 ` Thomas Gummerer
0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-07 13:44 UTC (permalink / raw)
To: Nguyen Thai Ngoc Duy
Cc: git, trast, gitster, mhagger, peff, spearce, davidbarr
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=unknown-8bit, Size: 4265 bytes --]
On 05/06, Nguyen Thai Ngoc Duy wrote:
> On Fri, May 4, 2012 at 12:25 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> > == Directory entry
> >
> > Directory entries are sorted in lexicographic order by the name
> > of their path starting with the root.
> >
> > Path names (variable length) relative to top level directory (without the
> > leading slash). '/' is used as path separator. '.' indicates the root
> > directory. The special patch components ".." and ".git" (without quotes)
> > are disallowed. Trailing slash is also disallowed.
> >
> > 1 nul byte to terminate the path.
> >
> > 32-bit offset to the first file of a directory
> >
> > 32-bit offset to conflicted/resolved data at the end of the index.
> > 0 if there is no such data. [4]
>
> If it's non-zero, how do we know how many conflict entries we have?
Right, I'll add this to the next version.
(32-bit number of conflicted/resolved data entries at the end of the
index if the offset is non 0.)
> > 4-byte number of subtrees this tree has
>
> let's name this nr_subtrees
>
> > 4-byte number of entries in the index that is covered by the tree this
> > entry represents. (entry_count) (-1 if the entry is invalid)
>
> and this nr_entries.
>
> So how do we know how many entries (including all dirs, files, staged
> files) this directory has? I assume if enry_count != -1, the number
> would be nr_subtrees + nr_entries (or just nr_entries, depending on
> your definition). When entry_count == -1, how do we calculate this
> number?
You're right. In order to maintain the bisectability we'll have to
keep the entry count up to date.
> > 160-bit object name for the object that would result from writing
> > this span of index as a tree.
> >
> > The last 24 bytes are for the cache tree. An entry can be in an
> > invalidated state which is represented by having -1 in the entry_count
> > field. If an entry is in invalidated state, the next entry will begin
> > after the number of subtrees, and the 160-bit object name is dropped.
> >
> > The entries are written out in the top-down, depth-first order. The
> > first entry represents the root level of the repository, followed by
> > the first subtree - let's call it A - of the root level, followed by
> > the first subtree of A, ...
>
> Assume the command is "git diff -- path/to/h*", we don't need full
> index, just stuff in "path/to/h*" from the index. I'm trying to see
> how to load just those paths from index, not full index.
>
> I assume again that you won't invent a new function and use
> tree_entry_interesting() to do tree pruning while loading index.
> t_e_i() is designed to read tree objects. But I think we can make it
> read on-disk directory/file entries with a few small changes. t_e_i()
> is recursive and fits quite well with depth-first directory layout in
> the proposed index format.
>
> I have difficulties figuring out how you skip subtrees though. Assume
> we are at "path" and we are not interested in anything there until we
> meet "path/to", how do you skip subtrees "path/abc" and "path/def"?
> Processing directory entries sequentially will eventually get us to
> "path/to", but that could be a lot of entries if "path/abc" is deep. A
> file offset pointer to the next sibling directory entry might help.
> Does such a pointer exist but I did not see it, or you have other
> means to do this?
I have changed the index format slightly, not to use prefix compression
on the directory entries, so that binary search through the index gets
simple. Using the directory offsets, we can binary search to the
path/to. We always have log(n) (n is the number of directories) search
time for each path.
> Also the file/dir separation makes it more difficult to match the last
> "h*" part, if there are both "here" directory and "howto" file.
It doesn't make it more difficult to find, since we can first just
check if there exists the directory with a binary search (and
possibly more directories, around it) and then search for the file
in the superdirectory (can you say that?) with another binary search.
--
Thomas
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-03 17:25 Index format v5 Thomas Gummerer
` (7 preceding siblings ...)
2012-05-06 16:49 ` Phil Hord
@ 2012-05-07 15:15 ` Michael Haggerty
2012-05-08 14:11 ` Thomas Gummerer
2012-05-13 21:01 ` Philip Oakley
9 siblings, 1 reply; 49+ messages in thread
From: Michael Haggerty @ 2012-05-07 15:15 UTC (permalink / raw)
To: Thomas Gummerer; +Cc: git, trast, gitster, peff, spearce, davidbarr
On 05/03/2012 07:25 PM, Thomas Gummerer wrote:
> I have been drafting the Version 5 of the index format over the past
> few days with the help of Thomas Rast, Michael Haggerty, cmn and
> barrbrain on IRC. It will save with prefix compression on the path, and
> using a crc32 over the stat data, instead of the full data, since it is only
> used for checking if the file is changed. (Thanks Michael Haggerty for
> this hint. Unless we are missing something this will save another
> ~4 MB on the Webkit index.
>
>
>
> GIT index format
> ================
> [...]
Here are some comments about the format document, version 55047b3d.
They probably overlap with other feedback that you have gotten, but I
don't have time to cross-correlate everything. Hope it helps.
Overall
=======
I find the format definition pretty difficult to read. The following
suggestions might make it easier to follow.
* You might want to give each of the elements C-like names; e.g.,
"ndir (32 bits): number of directories in the index". These names
could be used as unambiguous references from other parts of the
documentation. The names could also be used in the implementing
code, to make the correspondence with the format definition easier
to follow.
* Name things consistently. For example, "A number of sorted
directories (see below)" should be changed to "A number of directory
entries (see below), sorted by path name".
* Putting the above two things together, the description "A number of
sorted directories" might become "direntries (ndir * directory
entries): one directory entry for each of the ndir directories in
the index, sorted by path name".
* Are all "offsets" relative to the start of the index file? If so,
it would be good to state this explicitly at the start (and maybe
rename them "file positions"?). If not, please be very explicit
about where each offset is measured from, and whether it is signed
or unsigned.
* You seem to switch randomly between counting sizes in bits vs bytes.
Please try to be more consistent. BTW, I think the size of an SHA1
object name is more commonly described as "20 bytes" rather than
"160 bits".
* The details of the extension data blocks are described in the first
(overview) section, whereas it seems like they should be described
in their own section following the "conflict data" section. But
wouldn't the presence of extension data blocks prevent the addition
of conflict data?
* Are there situations (for example during conflicts?) when it is
allowed to have directory and file entries with the same name?
* Does the index file include directory entries for empty directories?
What about directories that contain only other directories?
Overview
========
* Does "32-bit size of the extension" include the whole extension data
block (including header) or only the "extension data"?
Directory entry
===============
* "4-byte number of entries in the index that is covered by the tree
this entry represents." What does this include?
Files/directories/both? Recursive or non-recursive?
* "160-bit object name for the object that would result from writing
this span of index as a tree." Is this always valid?
* It might be convenient to store directory names with trailing '/'
characters (except for the top-level directory, which can be stored
as ""). That way (1) it is trivial to concatenate the directory
name and the filename to produce the file path; (2) directories and
files can be distinguished by name and therefore intermingled in
lists; (3) such lists can be sorted using strcmp() without having to
simulate an invisible '/' character.
File entry
==========
* I believe that only the basename is stored, not the whole path. But
then why is the encoding for '/' specified (there should be no '/'
characters)?
* Why is the encoding of '.' specified? Is it somehow significant for
the index file format?
* Are file entries sorted by entire path or only by the basename?
Flat loading
============
* I found the explanation pretty incomprehensible. Perhaps some
pseudo-code would make it clearer?
* Since I can't understand the explanation, I'm not sure if this
comment makes any sense. But when traversing into a subdirectory,
don't *all* of the remaining files from the parent directory need to
be tucked away somewhere?
* At an even higher level of abstraction, your directory entry
contains a count "number of entries in the index that is covered by
the tree this entry represents". If this count is recursive, then
it seems like it would be enough to know how many entries will come
from traversing a whole subdirectory tree. So it should be possible
to skip that many entries in the in-memory array and continue
reading the file entries for the parent subdirectory. For example,
suppose our files are [A, B/1, B/2/a, B/2/b, B/3, C]. If I start by
reading the files in the root directory, then I can fill the index
array entries
[A, -, -, -, -, C]
because when I see that "B" is a directory containing a total of
four entries, I just leave fours spaces for them in the array and
continue with the next file, "C". Of course I would have to
remember (e.g., on a queue) that directory "B" still needs to be
processed, and the starting index in the array where its entries
have to be filled in. Still, this jumping around would be in the
RAM holding the index array pointers rather than in the file
positions.
Michael
--
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-03 21:38 ` Thomas Gummerer
@ 2012-05-07 18:57 ` Robin Rosenberg
0 siblings, 0 replies; 49+ messages in thread
From: Robin Rosenberg @ 2012-05-07 18:57 UTC (permalink / raw)
To: Thomas Gummerer
Cc: Junio C Hamano, git, trast, mhagger, peff, spearce, davidbarr
Thomas Gummerer skrev 2012-05-03 23.38:
> On Thu, May 3, 2012 at 8:54 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> Thomas Gummerer <t.gummerer@gmail.com> writes:
>>> 1 nul byte to terminate the filename.
>>>
>>> A 16-bit 'flags' field split into (high to low bits)
>>>
>>> 1-bit assume-valid flag
>>
>> Is this "assume unchanged"?
>
> I have taken this from the old index format documentation, which
> describes it as assume-valid flag, but I guess it's assume unchanged then.
The bit for "assume unchanged" is called CE_VALID in the code.
-- robin
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-04 7:12 ` Michael Haggerty
@ 2012-05-07 22:18 ` Robin Rosenberg
0 siblings, 0 replies; 49+ messages in thread
From: Robin Rosenberg @ 2012-05-07 22:18 UTC (permalink / raw)
To: Michael Haggerty
Cc: Thomas Rast, Thomas Gummerer, git, gitster, peff, spearce,
davidbarr
Michael Haggerty skrev 2012-05-04 09.12:
> On 05/03/2012 08:16 PM, Thomas Rast wrote:
>> Thomas Gummerer<t.gummerer@gmail.com> writes:
>>
>>> 32-bit crc32 checksum over ctime seconds, ctime nanoseconds,
>>> ino, file size, dev, uid, gid (All stat(2) data except mtime) [7]
>> [...]
>>> [7] Since all stat data (except mtime and ctime) is just used for
>>> checking if a file has changed a checksum of the data is enough.
>>> In addition to that Thomas Rast suggested ctime could be ditched
>>> completely (core.trustctime=false) and thus included in the
>>> checksum. This would save 24 bytes per index entry, which would
>>> be about 4 MB on the Webkit index.
>>> (Thanks for the suggestion to Michael Haggerty)
>>
>> This is the part I'm most curious about. Are we missing anything?
>> Michael brought it up on IRC: the stat() results are only used to test
>> whether they are still the same, with the exception of the mtime (which
>> also undergoes raciness checks).
>>
>> As far as I can see, none of st_{ino,dev,uid,gid} are useful for
>> anything. st_size might conceivably be used as a hint for a buffer
>> size, but nobody actually does that. The ctime undergoes stricter
>> checks, but AFAICS it's also all about whether it has changed, and
>> besides that can be turned off. We think all of those fields can be
>> replaced by an arbitrary hash/CRC and only tested for equality. 32 bits
>> should be plenty, probably even if we just xor the values together.
>
> XOR is definitely *not* adequate; for example, changing uid=gid="you" to uid=gid="me"
> would not affect the XOR of the values (assuming, as is often the case, that each user
> has his own uid/gid with the same numerical values).
If you change uid/gid, that has no relevance for the content that git tracks. If the CRC
is equal you have to check the content. Ideally a change that does not change the content
should not change the CRC either, so there is really no absolute need to see that change.
I assume the idea is that if you do "tar xvf" or something like that, then changes in file,
mtime etc could be picked up by looking at these attributes, but it seems that those that
mess with mtime such that it goes back in time are out of luck with git anyway.
> Which hash to use depends on some estimate of the likelihood that the hashes collide and
> simultaneously that the other metadata coincide. It seems to me that CRC-32 would
> be adequate. But if not, a longer hash could be used (albeit with less space savings).
>
> Michael
>
JGit simply ignores ctime, ino, dev, uid and gid. The real reason is of course that
standard Java does not have an API for these extra attributes. On the the other hand
nobody is going to fix this bug. The reason is that if you follow the rule that mtime
must always change to "now" if content change, then all changes will be found simply
by looking at mtime or performing a content check for the racy case. Those that mess
with mtime tend to be unhappy anyway.
Then there is the issue of how often we can detect changes without checking content. Ino
usually changes, but when it changes mtime usually does too, so how often does it speed
up.
Has anyone instrumented git to see how much the different attributes actually
contribute to performance and accuracy?
I'd like to extend the size field to 64 bits. We rarely need the extra bits, but we
cannot differ between 3 bytes and 4294967299 bytes so avoiding the very expensive
content check there would be welcome, even it it's a rare event. I haven't thought
too much about this though. I just felt uncomfortable when looking at the code and
knowing that performing a content check of a 4 GB file could take a minute or two.
-- robin
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-07 15:15 ` Michael Haggerty
@ 2012-05-08 14:11 ` Thomas Gummerer
2012-05-08 14:25 ` Nguyen Thai Ngoc Duy
2012-05-09 8:37 ` Michael Haggerty
0 siblings, 2 replies; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-08 14:11 UTC (permalink / raw)
To: Michael Haggerty; +Cc: git, trast, gitster, peff, spearce, davidbarr
On 05/07, Michael Haggerty wrote:
> Here are some comments about the format document, version 55047b3d.
> They probably overlap with other feedback that you have gotten, but
> I don't have time to cross-correlate everything. Hope it helps.
Thanks for the feedback!
For those who may not know yet, I've created a wiki on github, since
this file is going through a lot of revisions, it may not be a good
idea to post every revision on the mailing list.
https://github.com/tgummerer/git/wiki/Index-format-v5
> Overall
> =======
>
> I find the format definition pretty difficult to read. The following
> suggestions might make it easier to follow.
Thanks, I've incorporated your feedback. I hope it's easier readable now.
> [...]
> * You seem to switch randomly between counting sizes in bits vs bytes.
> Please try to be more consistent. BTW, I think the size of an SHA1
> object name is more commonly described as "20 bytes" rather than
> "160 bits".
All sizes are now in bits. I have taken the 160 bits for the SHA1 from
the old index documentation in Documentation/technical. I chose to
leave it as 160bits for now, as all the other sizes were changed to
bits too.
> * The details of the extension data blocks are described in the first
> (overview) section, whereas it seems like they should be described
> in their own section following the "conflict data" section. But
> wouldn't the presence of extension data blocks prevent the addition
> of conflict data?
Only the details that should be there for every extension are described
in the overview (the header of the extension), to make sure every
extension has the same header format, and thus a reader which doesn't
understand a specific extension still can read its header and know
what's going on.
They won't prevent the addition of conflicted data, since when a
conflict is created, other files were probably added and the index has
to be rewritten anyway. Once the conflict is resolved however only a
bit has to be flipped, so there is no rewrite necessary.
> * Are there situations (for example during conflicts?) when it is
> allowed to have directory and file entries with the same name?
Yes, that's why I have added the stage data to the directory.
> * Does the index file include directory entries for empty directories?
> What about directories that contain only other directories?
In theory the index is able to include empty directories. I'm however
not sure if this should be implemented. I'd be happy to get more
feedback there.
> Overview
> ========
>
> * Does "32-bit size of the extension" include the whole extension data
> block (including header) or only the "extension data"?
It includes only the extension data. It's clarified in the documentation
now.
> Directory entry
> ===============
>
> * "4-byte number of entries in the index that is covered by the tree
> this entry represents." What does this include?
> Files/directories/both? Recursive or non-recursive?
This is from the cache-tree. I'm not sure but I think it includes both
files and directories, recursively.
> * "160-bit object name for the object that would result from writing
> this span of index as a tree." Is this always valid?
No, this is only valid if the entry count is not -1. It's clarified
now.
> * It might be convenient to store directory names with trailing '/'
> characters (except for the top-level directory, which can be stored
> as ""). That way (1) it is trivial to concatenate the directory
> name and the filename to produce the file path; (2) directories and
> files can be distinguished by name and therefore intermingled in
> lists; (3) such lists can be sorted using strcmp() without having to
> simulate an invisible '/' character.
Good point. Changed this in the documentation.
> File entry
> ==========
>
> * I believe that only the basename is stored, not the whole path. But
> then why is the encoding for '/' specified (there should be no '/'
> characters)?
>
> * Why is the encoding of '.' specified? Is it somehow significant for
> the index file format?
Yes, you are right, only the basename is stored. '.' and '/' don't need
a specific encoding, it's removed from the documentation.
> * Are file entries sorted by entire path or only by the basename?
They are sorted by the basename, in the respective block of their
directories.
Example: paths: a/a a/z b/b
File entries in the index:
a ...
z ...
b ...
> Flat loading
> ============
>
> * I found the explanation pretty incomprehensible. Perhaps some
> pseudo-code would make it clearer?
>
> * Since I can't understand the explanation, I'm not sure if this
> comment makes any sense. But when traversing into a subdirectory,
> don't *all* of the remaining files from the parent directory need to
> be tucked away somewhere?
>
> * At an even higher level of abstraction, your directory entry
> contains a count "number of entries in the index that is covered by
> the tree this entry represents". If this count is recursive, then
> it seems like it would be enough to know how many entries will come
> from traversing a whole subdirectory tree. So it should be possible
> to skip that many entries in the in-memory array and continue
> reading the file entries for the parent subdirectory. For example,
> suppose our files are [A, B/1, B/2/a, B/2/b, B/3, C]. If I start by
> reading the files in the root directory, then I can fill the index
> array entries
>
> [A, -, -, -, -, C]
>
> because when I see that "B" is a directory containing a total of
> four entries, I just leave fours spaces for them in the array and
> continue with the next file, "C". Of course I would have to
> remember (e.g., on a queue) that directory "B" still needs to be
> processed, and the starting index in the array where its entries
> have to be filled in. Still, this jumping around would be in the
> RAM holding the index array pointers rather than in the file
> positions.
The entry_count in the index is only valid, if the cache-tree is valid,
which is not always the case. Therefore it's impossible to rely on that
for the reading. I have changed the flat loading in the documentation,
hope it's more understandable now.
--
Thomas
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-08 14:11 ` Thomas Gummerer
@ 2012-05-08 14:25 ` Nguyen Thai Ngoc Duy
2012-05-08 14:34 ` Nguyen Thai Ngoc Duy
2012-05-09 8:37 ` Michael Haggerty
1 sibling, 1 reply; 49+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-05-08 14:25 UTC (permalink / raw)
To: Thomas Gummerer
Cc: Michael Haggerty, git, trast, gitster, peff, spearce, davidbarr
On Tue, May 8, 2012 at 9:11 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> * "160-bit object name for the object that would result from writing
>> this span of index as a tree." Is this always valid?
>
> No, this is only valid if the entry count is not -1. It's clarified
> now.
..and..
> The entry_count in the index is only valid, if the cache-tree is valid,
> which is not always the case.
I think your trees are the cache-trees already. For invalid
cache-trees, you can just use all-zero sha-1 as the indicator. Then
entry_count can go away.
--
Duy
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-08 14:25 ` Nguyen Thai Ngoc Duy
@ 2012-05-08 14:34 ` Nguyen Thai Ngoc Duy
2012-05-10 6:53 ` Thomas Gummerer
0 siblings, 1 reply; 49+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-05-08 14:34 UTC (permalink / raw)
To: Thomas Gummerer
Cc: Michael Haggerty, git, trast, gitster, peff, spearce, davidbarr
Sorry I replied too fast.
On Tue, May 8, 2012 at 9:25 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Tue, May 8, 2012 at 9:11 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>>> * "160-bit object name for the object that would result from writing
>>> this span of index as a tree." Is this always valid?
>>
>> No, this is only valid if the entry count is not -1. It's clarified
>> now.
>
> ..and..
>
>> The entry_count in the index is only valid, if the cache-tree is valid,
>> which is not always the case.
>
> I think your trees are the cache-trees already. For invalid
> cache-trees, you can just use all-zero sha-1 as the indicator. Then
> entry_count can go away.
Furthermore, in directory entry format:
The last 24 bytes (4-byte number of entries + 160-bit object name) are
for the cache tree. An entry can be in an invalidated state which is
represented by having -1 in the entry_count field. If an entry is in
invalidated state, the next entry will begin after the number of
subtrees, and the 160-bit object name is dropped.
Dropping objname out of invalid (cache-)trees is a bad idea. When you
generate tree objects (aka cache_tree_update), you'll need objname
field again, which means structure change and directory entry rewrite.
If objname is always there, you can just overwrite objname with new
value. Though this may bring race condition issue back to directory
entries. The same approach on file entries might be reused.
--
Duy
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-08 14:11 ` Thomas Gummerer
2012-05-08 14:25 ` Nguyen Thai Ngoc Duy
@ 2012-05-09 8:37 ` Michael Haggerty
2012-05-10 12:19 ` Thomas Gummerer
1 sibling, 1 reply; 49+ messages in thread
From: Michael Haggerty @ 2012-05-09 8:37 UTC (permalink / raw)
To: Thomas Gummerer; +Cc: git, trast, gitster, peff, spearce, davidbarr
On 05/08/2012 04:11 PM, Thomas Gummerer wrote:
>> * The details of the extension data blocks are described in the first
>> (overview) section, whereas it seems like they should be described
>> in their own section following the "conflict data" section. But
>> wouldn't the presence of extension data blocks prevent the addition
>> of conflict data?
>
> Only the details that should be there for every extension are described
> in the overview (the header of the extension), to make sure every
> extension has the same header format, and thus a reader which doesn't
> understand a specific extension still can read its header and know
> what's going on.
>
> They won't prevent the addition of conflicted data, since when a
> conflict is created, other files were probably added and the index has
> to be rewritten anyway. Once the conflict is resolved however only a
> bit has to be flipped, so there is no rewrite necessary.
In other words, the presence of extensions *does indeed* prevent the
addition of conflict data, but you don't think that it is a problem.
Moving the conflict data to after the extensions, on the other hand,
would mean that conflict data can sometimes be added without a rewrite.
I cannot judge whether this would be useful.
Handling conflict data *as* an extension would allow the conflict data
to be added at any time without rewriting. I cannot judge whether this
would be useful.
>> * Does the index file include directory entries for empty directories?
>> What about directories that contain only other directories?
>
> In theory the index is able to include empty directories. I'm however
> not sure if this should be implemented. I'd be happy to get more
> feedback there.
Currently git does not keep track of empty directories. Even though
there have been proposals to fix this, it is far beyond the scope of
your project to implement the handling of empty directories. The
question is whether your format definition *forbids* the presence of
empty directories in the index file (in the interest of definiteness,
and it might make the reader implementation a little bit simpler, but it
imposes a constraint on the writer). Obviously empty directories, even
if present, mustn't have an effect on the SHA1 of the trees containing them.
>> Directory entry
>> ===============
>>
>> * "4-byte number of entries in the index that is covered by the tree
>> this entry represents." What does this include?
>> Files/directories/both? Recursive or non-recursive?
>
> This is from the cache-tree. I'm not sure but I think it includes both
> files and directories, recursively.
Please figure this out for the final spec.
>> File entry
>> ==========
>> [...]
>
>> * Are file entries sorted by entire path or only by the basename?
>
> They are sorted by the basename, in the respective block of their
> directories.
> Example: paths: a/a a/z b/b
> File entries in the index:
> a ...
> z ...
> b ...
OK, so in other words, the file entries of all files in a directory (not
including files in subdirectories) are stored contiguously, sorted by
basename. (The thing that wasn't immediately clear is whether files
from subdirectories are intermingled with those of the parent directory.)
>> Flat loading
>> ============
>>
>> * I found the explanation pretty incomprehensible. Perhaps some
>> pseudo-code would make it clearer?
>> [...]
> [...] I have changed the flat loading in the documentation,
> hope it's more understandable now.
Maybe it's just be, but I still don't think it is very clear. Here is
version fbf8add1b026:
> == Flat loading
>
> Since internally git expects and works with lexicografic ordering,
> a simple linear scan throught the subdirectories doesn't give
> the right internal sorting. To achieve the right internal sorting
> the loading will be done in the following way:
>
> 1. Start with the root directory, and read also the name of the
> first subdirectory (=next directory in the list).
>
> 1a. Use the next directory (the one against which the filenames
> were checked previously), and read the next directory name,
> to check the files against.
>
> 2. Check the stack if the element at the top is < then the current
> directoryname.
>
> If it's < then current directory name, add files from the stack
> to the entry list, until the file name is > then the
> directory name.
>
> 2. While filename < directoryname add the filenames to the entry
> list
>
> 3. Add the rest of the files to a stack.
>
> 4. Continue with 1a, if there are more directories left.
>
> 5. Add the rest of the files from the stack to the end of the
> entry list.
Aside from the fact that there are two number (2)s,
* What does "Use the next directory (the one against which the filenames
were checked previously)" mean? What does it mean to "use a directory"?
Does it mean to recurse into the directory? Is the stack preserved
passed down to the recursive function calls, or does each level of the
recursion have its own stack? What does "against which the filenames
were checked previously" mean (there are no filenames mentioned in the
earlier steps)?
* You talk about a stack, and "Add the rest of the files to a stack".
But when you retrieve entries from a stack, they come out in reverse
order. So are you imagining that each element of the stack is an array
of file entries? Or do you push the files onto the stack in reverse
order? Or do you really mean a queue rather than a stack?
* Are the file entries read before they are put on the stack, or does
the stack just remember where to read them from when their turn comes?
* "Continue with 1a, if there are more directories left": I assume you
mean subdirectories of the current directory, but maybe you are talking
about all directories?
There is a reason that I asked for pseudocode, namely because it forces
you to be more precise in your description. I can certainly imagine
several workable algorithms for reading the index file, and the
different algorithms have different tradeoffs particularly regarding the
amount of temporary space needed and locality of reference in the index
file (which, I understand, will be mmapped when practical but it is not
practical on all platforms). Once you express the algorithm in
pseudocode it is possible to be sure which variant you have chosen and
consider whether it is really workable.
Michael
--
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-08 14:34 ` Nguyen Thai Ngoc Duy
@ 2012-05-10 6:53 ` Thomas Gummerer
2012-05-10 11:06 ` Nguyen Thai Ngoc Duy
0 siblings, 1 reply; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-10 6:53 UTC (permalink / raw)
To: Nguyen Thai Ngoc Duy
Cc: Michael Haggerty, git, trast, gitster, peff, spearce, davidbarr
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=unknown-8bit, Size: 1925 bytes --]
On 05/08, Nguyen Thai Ngoc Duy wrote:
> Sorry I replied too fast.
>
> On Tue, May 8, 2012 at 9:25 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> > On Tue, May 8, 2012 at 9:11 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> >>> * "160-bit object name for the object that would result from writing
> >>> this span of index as a tree." Is this always valid?
> >>
> >> No, this is only valid if the entry count is not -1. It's clarified
> >> now.
> >
> > ..and..
> >
> >> The entry_count in the index is only valid, if the cache-tree is valid,
> >> which is not always the case.
> >
> > I think your trees are the cache-trees already. For invalid
> > cache-trees, you can just use all-zero sha-1 as the indicator. Then
> > entry_count can go away.
How is it a cache-tree already? The subtree is covered, but the
entry_count is calculated recursively, while nfiles only keeps track of
the files directly in the directory, which is used for bisectability.
> Furthermore, in directory entry format:
>
> The last 24 bytes (4-byte number of entries + 160-bit object name) are
> for the cache tree. An entry can be in an invalidated state which is
> represented by having -1 in the entry_count field. If an entry is in
> invalidated state, the next entry will begin after the number of
> subtrees, and the 160-bit object name is dropped.
>
> Dropping objname out of invalid (cache-)trees is a bad idea. When you
> generate tree objects (aka cache_tree_update), you'll need objname
> field again, which means structure change and directory entry rewrite.
> If objname is always there, you can just overwrite objname with new
> value. Though this may bring race condition issue back to directory
> entries. The same approach on file entries might be reused.
Yes you're right, at least the field for the object name (even if 0)
should always be there.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-10 6:53 ` Thomas Gummerer
@ 2012-05-10 11:06 ` Nguyen Thai Ngoc Duy
0 siblings, 0 replies; 49+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-05-10 11:06 UTC (permalink / raw)
To: Thomas Gummerer
Cc: Michael Haggerty, git, trast, gitster, peff, spearce, davidbarr
On Thu, May 10, 2012 at 1:53 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>
>
> On 05/08, Nguyen Thai Ngoc Duy wrote:
>> Sorry I replied too fast.
>>
>> On Tue, May 8, 2012 at 9:25 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
>> > On Tue, May 8, 2012 at 9:11 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> >>> * "160-bit object name for the object that would result from writing
>> >>> this span of index as a tree." Is this always valid?
>> >>
>> >> No, this is only valid if the entry count is not -1. It's clarified
>> >> now.
>> >
>> > ..and..
>> >
>> >> The entry_count in the index is only valid, if the cache-tree is valid,
>> >> which is not always the case.
>> >
>> > I think your trees are the cache-trees already. For invalid
>> > cache-trees, you can just use all-zero sha-1 as the indicator. Then
>> > entry_count can go away.
>
> How is it a cache-tree already? The subtree is covered, but the
> entry_count is calculated recursively, while nfiles only keeps track of
> the files directly in the directory, which is used for bisectability.
Cache-tree maintains the information that what part of the index
already has corresponding tree objects (and what sha-1). If a tree is
changed, the tree all the way up must be invalidated (i.e. throw out
the old sha-1 of trees). You have trees and objname for each tree.
Invalidating a tree is simply erasing objname. Before performing a
commit, we just need to traverse the trees from leaves up, generate
tree objects for trees that do not have valid objname yet.
--
Duy
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-09 8:37 ` Michael Haggerty
@ 2012-05-10 12:19 ` Thomas Gummerer
2012-05-10 18:17 ` Michael Haggerty
0 siblings, 1 reply; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-10 12:19 UTC (permalink / raw)
To: Michael Haggerty; +Cc: git, trast, gitster, peff, spearce, davidbarr
On 05/09, Michael Haggerty wrote:
> On 05/08/2012 04:11 PM, Thomas Gummerer wrote:
> >>* The details of the extension data blocks are described in the first
> >> (overview) section, whereas it seems like they should be described
> >> in their own section following the "conflict data" section. But
> >> wouldn't the presence of extension data blocks prevent the addition
> >> of conflict data?
> >
> >Only the details that should be there for every extension are described
> >in the overview (the header of the extension), to make sure every
> >extension has the same header format, and thus a reader which doesn't
> >understand a specific extension still can read its header and know
> >what's going on.
> >
> >They won't prevent the addition of conflicted data, since when a
> >conflict is created, other files were probably added and the index has
> >to be rewritten anyway. Once the conflict is resolved however only a
> >bit has to be flipped, so there is no rewrite necessary.
>
> In other words, the presence of extensions *does indeed* prevent the
> addition of conflict data, but you don't think that it is a problem.
Exactly.
> Moving the conflict data to after the extensions, on the other hand,
> would mean that conflict data can sometimes be added without a
> rewrite. I cannot judge whether this would be useful.
>
> Handling conflict data *as* an extension would allow the conflict
> data to be added at any time without rewriting. I cannot judge
> whether this would be useful.
Since there are offsets in the directory data to the conflicted data
I don't think it's good to call this data extension data. It may
however be beneficial to have the conflict data after the extension.
I'll investigate this.
> >>* Does the index file include directory entries for empty directories?
> >> What about directories that contain only other directories?
> >
> >In theory the index is able to include empty directories. I'm however
> >not sure if this should be implemented. I'd be happy to get more
> >feedback there.
>
> Currently git does not keep track of empty directories. Even though
> there have been proposals to fix this, it is far beyond the scope of
> your project to implement the handling of empty directories. The
> question is whether your format definition *forbids* the presence of
> empty directories in the index file (in the interest of
> definiteness, and it might make the reader implementation a little
> bit simpler, but it imposes a constraint on the writer). Obviously
> empty directories, even if present, mustn't have an effect on the
> SHA1 of the trees containing them.
No, the index format doesn't forbid the presence of empty directories.
Empty directories will have a fileoffset of 0, and the reader will
just ignore them as long as there is no empty directory tracking.
> >>Directory entry
> >>===============
> >>
> >>* "4-byte number of entries in the index that is covered by the tree
> >> this entry represents." What does this include?
> >> Files/directories/both? Recursive or non-recursive?
> >
> >This is from the cache-tree. I'm not sure but I think it includes both
> >files and directories, recursively.
>
> Please figure this out for the final spec.
It includes only files, in a recursive manner. I've written this down
in the spec.
> >>File entry
> >>==========
> >>[...]
> >
> >>* Are file entries sorted by entire path or only by the basename?
> >
> >They are sorted by the basename, in the respective block of their
> >directories.
> >Example: paths: a/a a/z b/b
> >File entries in the index:
> >a ...
> >z ...
> >b ...
>
> OK, so in other words, the file entries of all files in a directory
> (not including files in subdirectories) are stored contiguously,
> sorted by basename. (The thing that wasn't immediately clear is
> whether files from subdirectories are intermingled with those of the
> parent directory.)
Yes, exactly.
> >>Flat loading
> >>============
> >>
> >>* I found the explanation pretty incomprehensible. Perhaps some
> >> pseudo-code would make it clearer?
> >>[...]
> >[...] I have changed the flat loading in the documentation,
> >hope it's more understandable now.
>
> Maybe it's just be, but I still don't think it is very clear. Here
> is version fbf8add1b026:
>
> >== Flat loading
> >
> >Since internally git expects and works with lexicografic ordering,
> >a simple linear scan throught the subdirectories doesn't give
> >the right internal sorting. To achieve the right internal sorting
> >the loading will be done in the following way:
> >
> >1. Start with the root directory, and read also the name of the
> > first subdirectory (=next directory in the list).
> >
> >1a. Use the next directory (the one against which the filenames
> > were checked previously), and read the next directory name,
> > to check the files against.
> >
> >2. Check the stack if the element at the top is < then the current
> > directoryname.
> >
> > If it's < then current directory name, add files from the stack
> > to the entry list, until the file name is > then the
> > directory name.
> >
> >2. While filename < directoryname add the filenames to the entry
> > list
> >
> >3. Add the rest of the files to a stack.
> >
> >4. Continue with 1a, if there are more directories left.
> >
> >5. Add the rest of the files from the stack to the end of the
> > entry list.
>
> [..]
> There is a reason that I asked for pseudocode, namely because it
> forces you to be more precise in your description. I can certainly
> imagine several workable algorithms for reading the index file, and
> the different algorithms have different tradeoffs particularly
> regarding the amount of temporary space needed and locality of
> reference in the index file (which, I understand, will be mmapped
> when practical but it is not practical on all platforms). Once you
> express the algorithm in pseudocode it is possible to be sure which
> variant you have chosen and consider whether it is really workable.
Ok, here is the variant in pseudo code. I hope it's understandable
this way. It needs some temporary space, but never more then the
actual entries will need in the end anyway.
== Flat loading
Since internally git expects and works with lexicografic ordering,
a simple linear scan throught the subdirectories doesn't give
the right internal sorting. To achieve the right internal sorting
the loading will be done in the following way:
The data structure is a stack of queues, to allow continous reading
of the file.
s -> queue1
t -> queue2
a -> queue3
c -> queue4
k -> queue5
dirs = read_all_directories
foreach dir in dirs do
file = read_next_file
while element_on_top_of_stack.first_element < nextdir
indexentries.append(dequeue(element_on_top_of_stack))
if element_on_top_of_stack == emtpy:
remove_element_on_top_of_stack
if file[filename] < nextdir
indexentries.append(file)
else
queue.add(file)
foreach f in rest_of_files_in_directory:
queue.add(f)
stack.push(queue)
foreach queue in stack:
foreach entry in queue:
indexentry.append(entry)
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-10 12:19 ` Thomas Gummerer
@ 2012-05-10 18:17 ` Michael Haggerty
2012-05-11 17:12 ` Thomas Gummerer
0 siblings, 1 reply; 49+ messages in thread
From: Michael Haggerty @ 2012-05-10 18:17 UTC (permalink / raw)
To: Thomas Gummerer; +Cc: git, trast, gitster, peff, spearce, davidbarr
On 05/10/2012 02:19 PM, Thomas Gummerer wrote:
> == Flat loading
>
> Since internally git expects and works with lexicografic ordering,
> a simple linear scan throught the subdirectories doesn't give
> the right internal sorting. To achieve the right internal sorting
> the loading will be done in the following way:
>
> The data structure is a stack of queues, to allow continous reading
> of the file.
>
> s -> queue1
> t -> queue2
> a -> queue3
> c -> queue4
> k -> queue5
>
> dirs = read_all_directories
>
> foreach dir in dirs do
> file = read_next_file
>
> while element_on_top_of_stack.first_element< nextdir
> indexentries.append(dequeue(element_on_top_of_stack))
> if element_on_top_of_stack == emtpy:
> remove_element_on_top_of_stack
>
> if file[filename]< nextdir
> indexentries.append(file)
> else
> queue.add(file)
> foreach f in rest_of_files_in_directory:
> queue.add(f)
> stack.push(queue)
>
> foreach queue in stack:
> foreach entry in queue:
> indexentry.append(entry)
1. It seems to me that the first time that a file with a filename before
nextdir is encountered, the reading of the directory containing the file
will be terminated. This would, for example, be the case for any
directory that contains multiple files but no subdirectories.
2. There is still a lot that is unnecessarily obscure. For example, I
suppose (but you don't say) that "rest_of_files_in_directory" means to
read the files at that moment. It would be more explicit (and no more
verbose) to write
while (f = read_next_file()) != NULL:
queue.add(f)
3. You don't cover corner cases, like when read_next_file() is called
but there are no files left in the directory, or when there is no
nextdir (which itself is not defined). OK, this pseudocode is only
meant to be illustrative, so I guess we can wait until your real
implementation to see such details. On the other hand, you probably
want to get all the details straight in pseudocode or Python before you
start translating it into C.
4. I think the algorithm would be easier to understand and implement if
it were written recursively. The call stack would replace your explicit
stack (but you would still need one queue per directory level). A key
observation is that when "nextdir" precedes the next file, then all of
the files in subdirectories of nextdir do as well. Thus an obvious
recursion would be to call a function like
read_all_files_under_dir(indexentries, dirs, dirindex) at this point,
which would consume all of the directories that are subdirectories of
dirs[dirindex] (i.e., all directories whose names have the string
dirs[dirindex] as a prefix). Using this method would mean that there is
no need to compare each file under dirs[dirindex] against the next file
in the outer directory.
Michael
--
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-10 18:17 ` Michael Haggerty
@ 2012-05-11 17:12 ` Thomas Gummerer
2012-05-13 19:50 ` Michael Haggerty
0 siblings, 1 reply; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-11 17:12 UTC (permalink / raw)
To: Michael Haggerty; +Cc: git, trast, gitster, peff, spearce, davidbarr
On 05/10, Michael Haggerty wrote:
> On 05/10/2012 02:19 PM, Thomas Gummerer wrote:
> >== Flat loading
> >
> >Since internally git expects and works with lexicografic ordering,
> >a simple linear scan throught the subdirectories doesn't give
> >the right internal sorting. To achieve the right internal sorting
> >the loading will be done in the following way:
> >
> >The data structure is a stack of queues, to allow continous reading
> >of the file.
> >
> >s -> queue1
> >t -> queue2
> >a -> queue3
> >c -> queue4
> >k -> queue5
> >
> >dirs = read_all_directories
> >
> >foreach dir in dirs do
> > file = read_next_file
> >
> > while element_on_top_of_stack.first_element< nextdir
> > indexentries.append(dequeue(element_on_top_of_stack))
> > if element_on_top_of_stack == emtpy:
> > remove_element_on_top_of_stack
> >
> > if file[filename]< nextdir
> > indexentries.append(file)
> > else
> > queue.add(file)
> > foreach f in rest_of_files_in_directory:
> > queue.add(f)
> > stack.push(queue)
> >
> >foreach queue in stack:
> > foreach entry in queue:
> > indexentry.append(entry)
>
> 1. It seems to me that the first time that a file with a filename
> before nextdir is encountered, the reading of the directory
> containing the file will be terminated. This would, for example, be
> the case for any directory that contains multiple files but no
> subdirectories.
>
> 2. There is still a lot that is unnecessarily obscure. For example,
> I suppose (but you don't say) that "rest_of_files_in_directory"
> means to read the files at that moment. It would be more explicit
> (and no more verbose) to write
>
> while (f = read_next_file()) != NULL:
> queue.add(f)
>
> 3. You don't cover corner cases, like when read_next_file() is
> called but there are no files left in the directory, or when there
> is no nextdir (which itself is not defined). OK, this pseudocode is
> only meant to be illustrative, so I guess we can wait until your
> real implementation to see such details. On the other hand, you
> probably want to get all the details straight in pseudocode or
> Python before you start translating it into C.
>
> 4. I think the algorithm would be easier to understand and implement
> if it were written recursively. The call stack would replace your
> explicit stack (but you would still need one queue per directory
> level). A key observation is that when "nextdir" precedes the next
> file, then all of the files in subdirectories of nextdir do as well.
> Thus an obvious recursion would be to call a function like
> read_all_files_under_dir(indexentries, dirs, dirindex) at this
> point, which would consume all of the directories that are
> subdirectories of dirs[dirindex] (i.e., all directories whose names
> have the string dirs[dirindex] as a prefix). Using this method
> would mean that there is no need to compare each file under
> dirs[dirindex] against the next file in the outer directory.
>
> Michael
Thanks for your feedback! To get clearer code I've now written a
working reader for the v5 index format in Python. The full reader
would probably be to long for the mailing list, but here is the
interesting part:
def readfiles(directories, dirnr, entries):
global filedata
f.seek(directories[dirnr]["foffset"])
offset = struct.unpack("!I", fread(4))[0]
f.seek(offset)
filedata = list()
queue = list()
i = 0
while i < directories[dirnr]["nfiles"]:
filedata.append(struct.pack("!I", f.tell()))
filename = ""
byte = fread(1)
while byte != '\0':
filename += byte
byte = fread(1)
data = struct.unpack("!HHIII", fread(16))
objhash = fread(20)
readcrc = struct.pack("!i", binascii.crc32("".join(filedata)))
crc = f.read(4)
if readcrc != crc:
print "Wrong CRC: " + filename
filedata = list()
i += 1
queue.append(dict({"name": directories[dirnr]["pathname"] + filename, "flags": data[0], "mode": data[1], "mtimes": data[2], "mtimens": data[3], "statcrc": data[4], "objhash": binascii.hexlify(objhash)}))
if len(directories) > dirnr:
i = 0
while i < len(queue):
if len(directories) - 1 > dirnr and queue[i]["name"] > directories[dirnr + 1]["pathname"]:
entries, dirnr = readfiles(directories, dirnr + 1, entries)
else:
entries.append(queue[i])
i += 1
return entries, dirnr
The full reader can be found here:
https://github.com/tgummerer/git/blob/pythonprototype/git-read-index-v5.py
--
Thomas
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-11 17:12 ` Thomas Gummerer
@ 2012-05-13 19:50 ` Michael Haggerty
2012-05-14 15:01 ` Thomas Gummerer
0 siblings, 1 reply; 49+ messages in thread
From: Michael Haggerty @ 2012-05-13 19:50 UTC (permalink / raw)
To: Thomas Gummerer; +Cc: git, trast, gitster, peff, spearce, davidbarr
On 05/11/2012 07:12 PM, Thomas Gummerer wrote:
> Thanks for your feedback! To get clearer code I've now written a
> working reader for the v5 index format in Python. The full reader
> would probably be to long for the mailing list, but here is the
> interesting part:
>
> [...]
> The full reader can be found here:
> https://github.com/tgummerer/git/blob/pythonprototype/git-read-index-v5.py
Good.
I tried to review your code 3fe08f9b:git-read-index-v5.py and compare it
to your file spec f858cf6a9:Index-format-v5.textile. I have the
following comments (some of them already discussed in IRC):
1. Your script seems to be reading a different version of the file than
described in the spec. [When I mentioned this on IRC you pushed a new
version a4ee558ea of the spec.]
2. Your script seems to assume that the index file has no extensions.
It would be better (for documentation purposes and to ensure that there
are no surprises) to make sure that the code knows how to handle extensions.
3. Please document briefly how the scripts should be used.
4. Please limit line length to 80 columns (like the main git project).
5. Python has a nicer way to initialize dictionaries whose keys are all
valid identifiers; for example:
- return dict({"signature": signature, "vnr": header[0], "ndir":
header[1], "nfile": header[2], "next": header[3]})
+ return dict(signature=signature, vnr=header[0], ndir=header[1],
+ nfile=header[2], next=header[3])
6. Some of your print statements are just begging to be written using
string interpolation; e.g.,
- print d["pathname"] + " " + str(d["flags"]) + " " +
str(d["foffset"]) + " " + str(d["cr"]) + " " + str(d["ncr"]) + " " +
str(d["nsubtrees"]) + " " + str(d["nfiles"]) + " " + str(d["nentries"])
+ " " + str(binascii.hexlify(d["objname"]))
+ print ("%(pathname)s %(flags)s %(foffset)s %(cr)s %(ncr)s "
+ "%(nsubtrees)s %(nfiles)s %(nentries)s " % d
+ + str(binascii.hexlify(d["objname"])))
printheader() can be rewritten similarly.
7. You have a couple of while loops that would be easier to read if
written as for loops.
8. There is no need to use global variables. Global variables have lots
of disadvantages, one of which is that it is hard to tell what functions
have side effects via the global variables. It is better to pass the
needed variables explicitly to functions that need them.
9. ...after you eliminate the global variables, you will see that the
checksums are mostly needed over limited areas of code then can be
discarded. Rewriting the checksum handling in this way would make it
easier to see exactly what range of bytes is included in a particular
checksum.
10. There is no need to keep track of all of the data that will go into
a checksum. The CRC32 checksum can be computed incrementally via the
second argument of binascii.crc32(data, crc). Therefore, you only need
to retain a 32-bit running checksum instead of the filedata array of
data strings.
11. It is bad style to generate output from within the
readindexentries() function. Given that it reads the whole array of
file entries anyway, it would be cleaner to return the array to the
caller and let the caller print out what it wants.
12. Your handling of checksum errors is inconsistent. In some places
you generate exceptions; in another you simply print an error to stdout
(not stderr!) and proceed to use the corrupt data.
13. It is probably clearer to unpack the tuples returned by
struct.unpack() directly into local variables with meaningful names
instead of carrying them around as a tuple; e.g.,
- header = struct.unpack('!IIII', checksum.add(f.read(16)))
+ (vnr, ndir, nfile, next) = struct.unpack('!IIII', fread(16))
14. It is more correct to check the file signature and version
explicitly before plowing into the rest of the file (that's what they're
there for!)
That's as far as I've got.
Michael
--
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-03 17:25 Index format v5 Thomas Gummerer
` (8 preceding siblings ...)
2012-05-07 15:15 ` Michael Haggerty
@ 2012-05-13 21:01 ` Philip Oakley
2012-05-14 14:54 ` Thomas Gummerer
9 siblings, 1 reply; 49+ messages in thread
From: Philip Oakley @ 2012-05-13 21:01 UTC (permalink / raw)
To: Thomas Gummerer, Git List
Cc: trast, gitster, mhagger, peff, spearce, davidbarr
From: "Thomas Gummerer" <t.gummerer@gmail.com> Sent: Thursday, May 03, 2012 6:25 PM
> I have been drafting the Version 5 of the index format over the past
> few days with the help of Thomas Rast, Michael Haggerty, cmn and
> barrbrain on IRC.
>
> GIT index format
> ================
>
> = The git index file has the following format
>
Given the discussions on the list about the general naming of Staging vs Index [1],
would a careful change to the title, and the adding of an introductory line
help in putting the index file format in the appropriate (implementation) context?
I'm thinking that perhaps -
Title: "GIT index file format (V5)", i.e. add the 'file' qualifier.
Introduction line:
"The git index file (.git/index) documents the status of the files in the git staging area."
i.e. this is an implementation document for this particular file, but using the terms
suggested in [1]. Followed by
"The staging area is used for preparing commits, merging, etc.".
i.e. show the purpose of this index relative to the overall 'staging area'.
IIRC the use of the staging area for merging was one of Linus's key features;-)
By crafting the title and the introduction line(s) the confusion e.g. [2], between implementation
details (this document) and conceptual operation can be clearly separated.
Philip
[1] http://article.gmane.org/gmane.comp.version-control.git/197111
[1.8.0] use 'stage' term consistently
[2] http://raflabs.com/blogs/silence-is-foo/2011/04/07/staging-area-index-cache-git/
... Git it's a little bit confusing to undersand some of its terminology
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-13 21:01 ` Philip Oakley
@ 2012-05-14 14:54 ` Thomas Gummerer
0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-14 14:54 UTC (permalink / raw)
To: Philip Oakley; +Cc: Git List, trast, gitster, mhagger, peff, spearce, davidbarr
On 05/13, Philip Oakley wrote:
> From: "Thomas Gummerer" <t.gummerer@gmail.com> Sent: Thursday, May 03, 2012 6:25 PM
> >I have been drafting the Version 5 of the index format over the past
> >few days with the help of Thomas Rast, Michael Haggerty, cmn and
> >barrbrain on IRC.
>
> >
> >GIT index format
> >================
> >
> >= The git index file has the following format
> >
>
> Given the discussions on the list about the general naming of
> Staging vs Index [1], would a careful change to the title, and the
> adding of an introductory line help in putting the index file format
> in the appropriate (implementation) context?
>
> I'm thinking that perhaps -
> Title: "GIT index file format (V5)", i.e. add the 'file' qualifier.
>
> Introduction line:
> "The git index file (.git/index) documents the status of the files in the git staging area."
> i.e. this is an implementation document for this particular file, but using the terms
> suggested in [1]. Followed by
> "The staging area is used for preparing commits, merging, etc.".
> i.e. show the purpose of this index relative to the overall
> 'staging area'. IIRC the use of the staging area for merging was
> one of Linus's key features;-)
>
> By crafting the title and the introduction line(s) the confusion
> e.g. [2], between implementation details (this document) and
> conceptual operation can be clearly separated.
>
> Philip
>
> [1] http://article.gmane.org/gmane.comp.version-control.git/197111
> [1.8.0] use 'stage' term consistently
>
> [2] http://raflabs.com/blogs/silence-is-foo/2011/04/07/staging-area-index-cache-git/
> ... Git it's a little bit confusing to undersand some of its terminology
Thanks for your suggestion. I've changed the documentation to take this
into account.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-13 19:50 ` Michael Haggerty
@ 2012-05-14 15:01 ` Thomas Gummerer
2012-05-14 21:08 ` Michael Haggerty
0 siblings, 1 reply; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-14 15:01 UTC (permalink / raw)
To: Michael Haggerty; +Cc: git, trast, gitster, peff, spearce, davidbarr
On 05/13, Michael Haggerty wrote:
> On 05/11/2012 07:12 PM, Thomas Gummerer wrote:
> >Thanks for your feedback! To get clearer code I've now written a
> >working reader for the v5 index format in Python. The full reader
> >would probably be to long for the mailing list, but here is the
> >interesting part:
> >
> >[...]
> >The full reader can be found here:
> >https://github.com/tgummerer/git/blob/pythonprototype/git-read-index-v5.py
>
> Good.
>
> I tried to review your code 3fe08f9b:git-read-index-v5.py and
> compare it to your file spec f858cf6a9:Index-format-v5.textile. I
> have the following comments (some of them already discussed in IRC):
>
> 1. Your script seems to be reading a different version of the file
> than described in the spec. [When I mentioned this on IRC you
> pushed a new version a4ee558ea of the spec.]
>
> 2. Your script seems to assume that the index file has no
> extensions. It would be better (for documentation purposes and to
> ensure that there are no surprises) to make sure that the code knows
> how to handle extensions.
>
> 3. Please document briefly how the scripts should be used.
>
> 4. Please limit line length to 80 columns (like the main git project).
>
> 5. Python has a nicer way to initialize dictionaries whose keys are
> all valid identifiers; for example:
>
> - return dict({"signature": signature, "vnr": header[0],
> "ndir": header[1], "nfile": header[2], "next": header[3]})
> + return dict(signature=signature, vnr=header[0], ndir=header[1],
> + nfile=header[2], next=header[3])
>
> 6. Some of your print statements are just begging to be written
> using string interpolation; e.g.,
>
> - print d["pathname"] + " " + str(d["flags"]) + " " +
> str(d["foffset"]) + " " + str(d["cr"]) + " " + str(d["ncr"]) + " " +
> str(d["nsubtrees"]) + " " + str(d["nfiles"]) + " " +
> str(d["nentries"]) + " " + str(binascii.hexlify(d["objname"]))
> + print ("%(pathname)s %(flags)s %(foffset)s %(cr)s %(ncr)s "
> + "%(nsubtrees)s %(nfiles)s %(nentries)s " % d
> + + str(binascii.hexlify(d["objname"])))
>
> printheader() can be rewritten similarly.
>
> 7. You have a couple of while loops that would be easier to read if
> written as for loops.
>
> 8. There is no need to use global variables. Global variables have
> lots of disadvantages, one of which is that it is hard to tell what
> functions have side effects via the global variables. It is better
> to pass the needed variables explicitly to functions that need them.
>
> 9. ...after you eliminate the global variables, you will see that
> the checksums are mostly needed over limited areas of code then can
> be discarded. Rewriting the checksum handling in this way would
> make it easier to see exactly what range of bytes is included in a
> particular checksum.
>
> 10. There is no need to keep track of all of the data that will go
> into a checksum. The CRC32 checksum can be computed incrementally
> via the second argument of binascii.crc32(data, crc). Therefore,
> you only need to retain a 32-bit running checksum instead of the
> filedata array of data strings.
>
> 11. It is bad style to generate output from within the
> readindexentries() function. Given that it reads the whole array of
> file entries anyway, it would be cleaner to return the array to the
> caller and let the caller print out what it wants.
>
> 12. Your handling of checksum errors is inconsistent. In some
> places you generate exceptions; in another you simply print an error
> to stdout (not stderr!) and proceed to use the corrupt data.
>
> 13. It is probably clearer to unpack the tuples returned by
> struct.unpack() directly into local variables with meaningful names
> instead of carrying them around as a tuple; e.g.,
>
> - header = struct.unpack('!IIII', checksum.add(f.read(16)))
> + (vnr, ndir, nfile, next) = struct.unpack('!IIII', fread(16))
>
> 14. It is more correct to check the file signature and version
> explicitly before plowing into the rest of the file (that's what
> they're there for!)
>
> That's as far as I've got.
>
> Michael
Thanks a lot for your feedback. I've now refactored the code and
thanks to your suggestions hopefully made it simpler and easier
to read. The reader should now read exactly the data from the
spec.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-14 15:01 ` Thomas Gummerer
@ 2012-05-14 21:08 ` Michael Haggerty
2012-05-14 22:10 ` Thomas Rast
2012-05-15 13:49 ` Thomas Gummerer
0 siblings, 2 replies; 49+ messages in thread
From: Michael Haggerty @ 2012-05-14 21:08 UTC (permalink / raw)
To: Thomas Gummerer; +Cc: git, trast, gitster, peff, spearce, davidbarr
On 05/14/2012 05:01 PM, Thomas Gummerer wrote:
> Thanks a lot for your feedback. I've now refactored the code and
> thanks to your suggestions hopefully made it simpler and easier
> to read. The reader should now read exactly the data from the
> spec.
Yes, the style is much better now. Here is the next round of feedback:
1. f is still a global variable. This is unnecessary.
2. read_name() is indented incorrectly.
3. The signature and version number should be checked within
read_header() *before* reading anything else. Otherwise, if some random
file is given to the script, it will read some random, probably very
large number for "nextensions" and then try to read that number of
extension offsets into RAM. It would be a pretty innocent error to have
the wrong version of the index file lying around, and the script should
detect such an error quickly and painlessly rather than triggering a lot
of pointless disk activity (and likely an out-of-memory error) before
figuring out that it shouldn't be reading the file in the first place.
4. For readability reasons, it is better to raise an exception
immediately in an error situation rather than put the error-handling
code in the "else" part of an if statement; i.e., instead of
if not error_condition:
non-error-actions
else:
raise exception
use
if error_condition:
raise exception
non-error-actions
The reasons are: (a) the reader can see immediately what will happen in
the error case, then forget it and continue on with the "normal" flow
instead of having to remember the "if" condition through the whole long
"then" block before finally seeing the point in the "else" block; (b)
then the "normal" case doesn't have to be within the if statement at
all, thereby requiring less block nesting and less indentation. For
example:
> - if crc == datacrc:
> - return dict(signature=signature, vnr=vnr, ndir=ndir, nfile=nfile,
> - nextensions=nextensions, extoffsets=extoffsets)
> - else:
> - raise Exception("Wrong header crc")
> + if crc != datacrc:
> + raise Exception("Wrong header crc")
> +
> + return dict(signature=signature, vnr=vnr, ndir=ndir, nfile=nfile,
> + nextensions=nextensions, extoffsets=extoffsets)
5. If the first limit of a range() or xrange() is zero, it is usually
omitted; e.g., "xrange(0, nextensions)" -> "xrange(nextensions)".
6. It is possible to precompile "struct" patterns, which should be
faster and also allows you to ask the Struct object its size, reducing
the number of magic numbers needed in the code. And fixed-length
strings can also be read via struct. For example:
> DIR_DATA_STRUCT = struct.Struct("!HIIIIII 20s")
>
>
> def read_dirs(f, ndir):
> dirs = list()
> for i in xrange(0, ndir):
> (pathname, partialcrc) = read_name(f)
>
> (filedata, partialcrc) = read_calc_crc(f, DIR_DATA_STRUCT.size, partialcrc)
> (flags, foffset, cr, ncr, nsubtrees, nfiles,
> nentries, objname) = DIR_DATA_STRUCT.unpack(filedata)
> # ...
7. The "if dirnr == 0" stuff in read_files() should probably be done in
read_index_entries() (the first invocation is the only place dirnr==0,
right?)
8. read_files() only returns a result "if len(directories) > dirnr".
But actually I don't see the purpose for the check. Earlier in the
function you dereference directories[dirnr] several times, so it *must*
be that dirnr < len(directories). I think you can take out this test
and unindent its block.
9. read_files() doesn't need to return "entries". Since entries is an
array that is only mutated in place, the return value will always be the
same as the "entries" argument (albeit fuller).
10. It would make sense to extract a function read_file(), which reads a
single file entry from the current position in the current file (using
code from read_files()). Similarly for read_dir()/read_dirs().
11. It is good form to move the file-level code into a main() function,
then call that from the bottom of the file, something like this:
> def main(args):
> ....
>
> main(sys.argv[1:])
This avoids creating global variables that are accidentally used within
functions.
What is your plan for testing this code, and later the C version? For
example, you might want to have a suite of index files with various
contents, and compare the "git ls-files --debug" output with the output
that is expected. How would you create index files like this? Via git
commands? Or should one of your Python scripts be taught how to do it?
To make testing easier, you probably don't want to hard-code the name of
the input file in git-read-index-v5.py, so that you can use it to read
arbitrary files. For example, you might want to honor the
GIT_INDEX_FILES environment variable in some form, or to take the name
of the index file as a command-line argument.
Michael
--
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-14 21:08 ` Michael Haggerty
@ 2012-05-14 22:10 ` Thomas Rast
2012-05-15 6:43 ` Michael Haggerty
2012-05-15 13:49 ` Thomas Gummerer
1 sibling, 1 reply; 49+ messages in thread
From: Thomas Rast @ 2012-05-14 22:10 UTC (permalink / raw)
To: Michael Haggerty
Cc: Thomas Gummerer, git, trast, gitster, peff, spearce, davidbarr
Michael Haggerty <mhagger@alum.mit.edu> writes:
First of all, many thanks for taking up this time-consuming job while I
was away! There's not much I can add at this point, just a few minor
points:
> 9. read_files() doesn't need to return "entries". Since entries is an
> array that is only mutated in place, the return value will always be
> the same as the "entries" argument (albeit fuller).
(Ab)using an array in this fashion is somewhat iffy. It seems
unavoidable in this case (while still retaining the runtime), but try
not to do it too often, and perhaps name the parameter something that
makes this clear (such as 'out'). Usually changing it to use a
generator function (with 'yield') helps.
> 11. It is good form to move the file-level code into a main()
> function, then call that from the bottom of the file, something like
> this:
>
>> def main(args):
>> ....
>>
>> main(sys.argv[1:])
It's customary to wrap it as
if __name__ == '__main__':
main(sys.argv[1:])
That way your script becomes 'import'-able, which can be handy (if only
for testing).
Cheers,
Thomas
--
Thomas Rast
trast@{inf,student}.ethz.ch
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-14 22:10 ` Thomas Rast
@ 2012-05-15 6:43 ` Michael Haggerty
0 siblings, 0 replies; 49+ messages in thread
From: Michael Haggerty @ 2012-05-15 6:43 UTC (permalink / raw)
To: Thomas Rast; +Cc: Thomas Gummerer, git, gitster, peff, spearce, davidbarr
On 05/15/2012 12:10 AM, Thomas Rast wrote:
> Michael Haggerty<mhagger@alum.mit.edu> writes:
>> 9. read_files() doesn't need to return "entries". Since entries is an
>> array that is only mutated in place, the return value will always be
>> the same as the "entries" argument (albeit fuller).
>
> (Ab)using an array in this fashion is somewhat iffy. It seems
> unavoidable in this case (while still retaining the runtime), but try
> not to do it too often, and perhaps name the parameter something that
> makes this clear (such as 'out'). Usually changing it to use a
> generator function (with 'yield') helps.
If the goal were an ideal Python program, then by all means generators
are the way to go. But since the goal is a prototype for a C program,
then a change to using generators would just have to be undone when
converting to C.
>> 11. It is good form to move the file-level code into a main()
>> function, then call that from the bottom of the file, something like
>> this:
>>
>>> def main(args):
>>> ....
>>>
>>> main(sys.argv[1:])
>
> It's customary to wrap it as
>
> if __name__ == '__main__':
> main(sys.argv[1:])
>
> That way your script becomes 'import'-able, which can be handy (if only
> for testing).
+1
Michael
--
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-14 21:08 ` Michael Haggerty
2012-05-14 22:10 ` Thomas Rast
@ 2012-05-15 13:49 ` Thomas Gummerer
2012-05-15 15:02 ` Michael Haggerty
2012-05-16 5:01 ` Michael Haggerty
1 sibling, 2 replies; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-15 13:49 UTC (permalink / raw)
To: Michael Haggerty; +Cc: git, trast, gitster, peff, spearce, davidbarr
On 05/14, Michael Haggerty wrote:
> On 05/14/2012 05:01 PM, Thomas Gummerer wrote:
> >Thanks a lot for your feedback. I've now refactored the code and
> >thanks to your suggestions hopefully made it simpler and easier
> >to read. The reader should now read exactly the data from the
> >spec.
>
> Yes, the style is much better now. Here is the next round of feedback:
> [...]
>
> >DIR_DATA_STRUCT = struct.Struct("!HIIIIII 20s")
> >
> >
> >def read_dirs(f, ndir):
> > dirs = list()
> > for i in xrange(0, ndir):
> > (pathname, partialcrc) = read_name(f)
> >
> > (filedata, partialcrc) = read_calc_crc(f, DIR_DATA_STRUCT.size, partialcrc)
> > (flags, foffset, cr, ncr, nsubtrees, nfiles,
> > nentries, objname) = DIR_DATA_STRUCT.unpack(filedata)
> > # ...
>
> [...]
Thanks again for your feedback. I've refactored the code again,
thanks to your suggestions. If I'm correct it's fine to have the
compiled structs global?
> What is your plan for testing this code, and later the C version?
> For example, you might want to have a suite of index files with
> various contents, and compare the "git ls-files --debug" output with
> the output that is expected. How would you create index files like
> this? Via git commands? Or should one of your Python scripts be
> taught how to do it?
I thought of using real world examples for this, for example the
WebKit index, which is pretty large, and some others, for example the
git index and the linux kernel index.
There would be some changes necessary to the output format of
git ls-files --debug, to work with the new index format, but those
should be fairly simple.
> To make testing easier, you probably don't want to hard-code the
> name of the input file in git-read-index-v5.py, so that you can use
> it to read arbitrary files. For example, you might want to honor
> the GIT_INDEX_FILES environment variable in some form, or to take
> the name of the index file as a command-line argument.
I've changed this in the script, which now takes a --file argument
with the file name of the index file that should be read.
(git-read-index-v5.py --file=FILENAME)
--
Thomas
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-15 13:49 ` Thomas Gummerer
@ 2012-05-15 15:02 ` Michael Haggerty
2012-05-18 15:38 ` Thomas Gummerer
2012-05-16 5:01 ` Michael Haggerty
1 sibling, 1 reply; 49+ messages in thread
From: Michael Haggerty @ 2012-05-15 15:02 UTC (permalink / raw)
To: Thomas Gummerer; +Cc: git, trast, gitster, peff, spearce, davidbarr
On 05/15/2012 03:49 PM, Thomas Gummerer wrote:
> Thanks again for your feedback. I've refactored the code again,
> thanks to your suggestions.
Good. I'll try to review the new version as soon as possible.
I suggest that you apply the same kinds of cleanups to
git-convert-index.py (which I personally haven't looked at yet at all).
If you want my feedback on that script, please let me know when you
think it is ready.
> If I'm correct it's fine to have the
> compiled structs global?
Yes, that's OK because they are constants so there is no risk of them
propagating side-effects. (Of course, Python doesn't enforce the
constness of identifiers, but by convention ALL_CAPS identifiers are
constants and it would be an obvious no-no to modify one.)
A real Pythonic solution would probably encapsulate all of your code
(including the constants) in classes. But since you will eventually
translate the code to C, I don't think that step is crucial. If, on the
other hand, you propose to include Python scripts in the git
distribution, then making them Pythonic would definitely be on the agenda.
>> What is your plan for testing this code, and later the C version?
>> For example, you might want to have a suite of index files with
>> various contents, and compare the "git ls-files --debug" output with
>> the output that is expected. How would you create index files like
>> this? Via git commands? Or should one of your Python scripts be
>> taught how to do it?
>
> I thought of using real world examples for this, for example the
> WebKit index, which is pretty large, and some others, for example the
> git index and the linux kernel index.
It is good to do such manual tests, but you probably also want small,
hand-constructed, automatable tests to make sure that all of the
codepaths are exercised. For example,
* A directory containing only files
* A directory containing only subdirectories
* A mixed directory whose last element is a file / last element is a
subdirectory
* Various types of conflicts
...etc.
Michael
--
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-15 13:49 ` Thomas Gummerer
2012-05-15 15:02 ` Michael Haggerty
@ 2012-05-16 5:01 ` Michael Haggerty
2012-05-16 21:54 ` Thomas Gummerer
1 sibling, 1 reply; 49+ messages in thread
From: Michael Haggerty @ 2012-05-16 5:01 UTC (permalink / raw)
To: Thomas Gummerer; +Cc: git, trast, gitster, peff, spearce, davidbarr
On 05/15/2012 03:49 PM, Thomas Gummerer wrote:
> Thanks again for your feedback. I've refactored the code again,
> thanks to your suggestions.
I just reviewed version 1369bd855b86 of your script, and it is MUCH
better. It's easy to read and review. The functions that it defines
are now self-contained and could therefore be reused for other purposes.
There are fewer magic numbers (though there are still a few; I wonder
if there is a way to get rid of those?) You've done a nice job
polishing up the code.
I have only a few remaining niggles::
1. The struct module can handle fixed-length strings, so you could read
and parse the SHA1s as part of FILE_DATA_STRUCT and DIR_DATA_STRUCT
rather than handling them separately.
2. At least some of the functions deserve docstrings, especially when
they are nontrivial. For example, what arguments read_files() needs and
how they are used is far from obvious.
3. It would be easier to read the multiline string formatting templates
if they are written as multiline strings (even though this kindof
requires that they be made file-level constants); e.g.,
FILE_FORMAT = """\
%(name)s (%(objhash)s)
mtime: %(mtimes)s:%(mtimens)s
mode: %(mode)s flags: %(flags)s
statcrc: """
In the future, please try to commit one self-contained change at a time
and make your commit messages really describe what is changed in the
commit. For example, commit fb2654c648a does at least three things,
only one of which is mentioned in its commit message. It would be
better to break this into three commits with three commit messages. Use
"git rebase -i" and the other commit-rewriting tools liberally to tidy
up commits before publishing them (but not after publishing them!); for
example, commit be8d01c22c should really have been squashed on top of
"Changed main to a function" so that the rest of the world doesn't have
to see the broken latter commit.
But I don't want to detract from the fact that altogether you have made
very nice improvements to the script, and I hope to see more like this
from you in the future!
Michael
--
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-16 5:01 ` Michael Haggerty
@ 2012-05-16 21:54 ` Thomas Gummerer
2012-05-19 5:40 ` Michael Haggerty
0 siblings, 1 reply; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-16 21:54 UTC (permalink / raw)
To: Michael Haggerty; +Cc: git, trast, gitster, peff, spearce, davidbarr
On 05/16, Michael Haggerty wrote:
> I just reviewed version 1369bd855b86 of your script, and it is MUCH
> better. It's easy to read and review. The functions that it
> defines are now self-contained and could therefore be reused for
> other purposes. There are fewer magic numbers (though there are
> still a few; I wonder if there is a way to get rid of those?)
> You've done a nice job polishing up the code.
Thanks for the feedback! I could get rid of the magic numbers for
the crc code, but I'm not sure if it makes sense to replace the
others with constants, since they only occur once in the file. I
added comments instead explaining where those numbers come from
instead.
> I have only a few remaining niggles::
>
> 1. The struct module can handle fixed-length strings, so you could
> read and parse the SHA1s as part of FILE_DATA_STRUCT and
> DIR_DATA_STRUCT rather than handling them separately.
>
> 2. At least some of the functions deserve docstrings, especially
> when they are nontrivial. For example, what arguments read_files()
> needs and how they are used is far from obvious.
>
> 3. It would be easier to read the multiline string formatting
> templates if they are written as multiline strings (even though this
> kindof requires that they be made file-level constants); e.g.,
>
> FILE_FORMAT = """\
> %(name)s (%(objhash)s)
> mtime: %(mtimes)s:%(mtimens)s
> mode: %(mode)s flags: %(flags)s
> statcrc: """
Thanks, I have changed those. I added the docstrings for all read
functions, they however don't seem to make sense for the print
functions, since you're probably faster just reading the code for
them.
One since I changed in addition is to those changes, is that I
gave the exceptions names to make them more meaningful.
> In the future, please try to commit one self-contained change at a
> time and make your commit messages really describe what is changed
> in the commit. For example, commit fb2654c648a does at least three
> things, only one of which is mentioned in its commit message. It
> would be better to break this into three commits with three commit
> messages. Use "git rebase -i" and the other commit-rewriting tools
> liberally to tidy up commits before publishing them (but not after
> publishing them!); for example, commit be8d01c22c should really have
> been squashed on top of "Changed main to a function" so that the
> rest of the world doesn't have to see the broken latter commit.
Ok, I'll try my best to do that.
--
Thomas
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-15 15:02 ` Michael Haggerty
@ 2012-05-18 15:38 ` Thomas Gummerer
2012-05-19 13:00 ` Michael Haggerty
0 siblings, 1 reply; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-18 15:38 UTC (permalink / raw)
To: Michael Haggerty; +Cc: git, trast, gitster, peff, spearce, davidbarr
> I suggest that you apply the same kinds of cleanups to
> git-convert-index.py (which I personally haven't looked at yet at
> all). If you want my feedback on that script, please let me know
> when you think it is ready.
That would be great, if you have the time to do it. I'm not
completely finished with it (docstrings and conflicted data writing
are still missing).
I'm not sure about the read_tree_extensiondata method, if I should
extract a method, which only reads one entry, but I'm not sure that
would make any sense, since there would be a lot of parameters and
return values to the function.
The same thing is in the main method, where I'm not sure if it's
better to extract the read_index and write_index functions, or
just leave the code in the main method. My guess is that it makes
sense in the main method, since there are less calls, but it
doesn't make sense in the read_tree_extensiondata method?
Another thing I'm unsure about is the write_directory_data method,
if there is any way to replace the try/except with something
simpler?
--
Thomas
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-16 21:54 ` Thomas Gummerer
@ 2012-05-19 5:40 ` Michael Haggerty
2012-05-21 20:30 ` Thomas Gummerer
0 siblings, 1 reply; 49+ messages in thread
From: Michael Haggerty @ 2012-05-19 5:40 UTC (permalink / raw)
To: Thomas Gummerer; +Cc: git, trast, gitster, peff, spearce, davidbarr
On 05/16/2012 11:54 PM, Thomas Gummerer wrote:
> On 05/16, Michael Haggerty wrote:
>> I just reviewed version 1369bd855b86 of your script, and it is MUCH
>> better. It's easy to read and review. The functions that it
>> defines are now self-contained and could therefore be reused for
>> other purposes. There are fewer magic numbers (though there are
>> still a few; I wonder if there is a way to get rid of those?)
>> You've done a nice job polishing up the code.
>
> Thanks for the feedback! I could get rid of the magic numbers for
> the crc code, but I'm not sure if it makes sense to replace the
> others with constants, since they only occur once in the file. I
> added comments instead explaining where those numbers come from
> instead.
I think it is possible to remove the last magic number and also to make
the CRC handling easier. I have pushed some suggested changes to github
[1]:
1. With the current code, trying to read a file that is less than 24
bytes long would result in a struct.error (because it would try to
unpack a string that is shorter than the struct) whereas the underlying
error in this case should almost always be reported as a signature
error. So it is correct to read the signature separately from the rest
of the header, but it is even more correct to check the signature
(including its length) before reading on.
2. I introduce a class CRC to hold checksums, so that (a) the code for
handling checksums can be encapsulated, and (b) an instance of this
class can be passed into functions and mutated in-place, which is less
cumbersome than requiring functions to return (data, crc) tuples. This,
in turn, makes possible...
3. ...a new function read_struct(f, s, crc), which reads the data for a
struct.Struct from f, checksums it, and returns the unpacked data. This
function is more convenient to use than the old read_calc_crc().
4. The checksum instance can also be made responsible for verifying that
the next four bytes in the file agree with the expected checksum. This
removes some more code duplication. (See CRC.matches().)
5. You read the extension offsets using CRC_STRUCT.size, which is
technically correct but misleading. In fact, the extension offsets
should be documented using their own EXTENSION_INDEX_STRUCT. Also, it
makes more sense to store the unpacked integer offsets to extoffsets
rather than the raw 4-byte strings.
6. With a couple of more minor changes it is possible to replace the
magic number "24" in read_index_entries(). With this change the
computation is documented very explicitly and is also (somewhat) robust
against future changes in the format.
Look over my changes and take whatever you want.
> Thanks, I have changed those. I added the docstrings for all read
> functions, they however don't seem to make sense for the print
> functions, since you're probably faster just reading the code for
> them.
That's fine. If you document nontrivial functions you are already doing
much better than the git project average.
> One since I changed in addition is to those changes, is that I
> gave the exceptions names to make them more meaningful.
Good.
Michael
[1] https://github.com/mhagger/git/tree/pythonprototype
--
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-18 15:38 ` Thomas Gummerer
@ 2012-05-19 13:00 ` Michael Haggerty
2012-05-21 7:45 ` Thomas Gummerer
0 siblings, 1 reply; 49+ messages in thread
From: Michael Haggerty @ 2012-05-19 13:00 UTC (permalink / raw)
To: Thomas Gummerer; +Cc: git, trast, gitster, peff, spearce, davidbarr
On 05/18/2012 05:38 PM, Thomas Gummerer wrote:
>
>> I suggest that you apply the same kinds of cleanups to
>> git-convert-index.py (which I personally haven't looked at yet at
>> all). If you want my feedback on that script, please let me know
>> when you think it is ready.
>
> That would be great, if you have the time to do it. I'm not
> completely finished with it (docstrings and conflicted data writing
> are still missing).
I've looked over the writing side of git-convert-index.py version
81411fe6c98, and here are my first comments:
* Please remove trailing whitespace from the source code.
* I suggest that you move constants and code shared by
git-convert-index.py and git-read-index-v5.py into a library. Though
actually, given that git doesn't seem to have infrastructure for
dealing with Python libraries, this might take some improvisation.
* Please use constants for all of the struct formats. Constants have
names, making them mostly self-documenting.
* write_directories() currently writes pathnames and fake data and
stores file offsets in memory. Later write_directory_data() runs
through the file again, seek()ing over the filenames and filling in
real data.
Wouldn't it be easier for the first pass just to *compute* and
record the offsets of the entries to RAM, without writing anything
to disk, and leave all of the writing to the second pass?
* Instead of writing blank data, it is possible to seek() past it and
start writing the next thing. The skipped-over file contents are
logically initialized to zero.
* When working with iteritems(), it is clearer to unpack the item
pairs and give them names rather than working with d[0] and d[1];
for example,
- for d in sorted(dirdata.iteritems()):
+ for (pathname,entry) in sorted(dirdata.iteritems()):
* write_directories() returns a "dirdata" that is just an empty
defaultdict. This seems pointless. Do you have future plans to
change write_directories() to store something into the dictionary?
* The documentation for binascii.crc32() mentions that it gives
inconsistent results (signed vs. unsigned) for different versions of
Python. Please ensure that you are using it in a way that is
maximally portable. (That seems to imply using (binascii.crc32(...)
& 0xffffffff) and treating the result as unsigned.)
* At first I thought it was a little bit odd that you pass data
structures around as dictionaries, but I didn't object. But as I
look at more and more code it seems more and more cumbersome.
Therefore, I suggest that you define classes to hold the various
entities that are manipulated by your programs, because:
* A class definition is a good place to document exactly what fields
an object is expected to have, and what they mean.
* Access of instance fields (entry.path) is easier to read and type
than dictionary access (entry["path"]).
* The class definitions will translate pretty directly to C structs.
The fact that class instances use a bit more memory than
dictionaries is, I think, unimportant. But if that really bothers
you, you can use __slots__ to save some of the instance memory.
At a higher level:
* What if the offsets to each section were stored in the header, and
the offsets recorded for dirs and files were relative to the start
of the section (rather than relative to the start of the file)? I
think that this would leave open the possibility of formatting the
sections in memory in parallel in a single pass, then dumping the
sections to disk in a few big writes (though I'm not saying that this
should be the *default* way of writing).
* Do you plan to write prototypes for some of the cool new
functionality that v5 is intended to make possible? For example,
* reading a few specific entries out of an index file
* updating single entries
* adding/removing conflict data to an existing file
* dealing with all of the issues that will come with supporting the
mutation of an existing index file (i.e., locking, consistency
checks, etc)
As you probably know from discussions on IRC, I think that the last
of these is the biggest risk to the success of the project.
> I'm not sure about the read_tree_extensiondata method, if I should
> extract a method, which only reads one entry, but I'm not sure that
> would make any sense, since there would be a lot of parameters and
> return values to the function.
If the index were represented by a class instance, then all of the
information would be grouped together as a coherent whole that is easy
to pass around.
> The same thing is in the main method, where I'm not sure if it's
> better to extract the read_index and write_index functions, or
> just leave the code in the main method. My guess is that it makes
> sense in the main method, since there are less calls, but it
> doesn't make sense in the read_tree_extensiondata method?
Ditto.
> Another thing I'm unsure about is the write_directory_data method,
> if there is any way to replace the try/except with something
> simpler?
With dictionaries, you can do
- try:
- flags = d[1]["flags"]
- except KeyError:
- flags = 0
+ flags = d[1].get("flags", 0)
If you convert to class instances, then presumably the constructor would
set valid default values for all of the fields.
Michael
--
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-19 13:00 ` Michael Haggerty
@ 2012-05-21 7:45 ` Thomas Gummerer
0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-21 7:45 UTC (permalink / raw)
To: Michael Haggerty; +Cc: git, trast, gitster, peff, spearce, davidbarr
Thanks a lot for your feedback.
On 05/19, Michael Haggerty wrote:
> I've looked over the writing side of git-convert-index.py version
> 81411fe6c98, and here are my first comments:
>
> * Please remove trailing whitespace from the source code.
>
> * I suggest that you move constants and code shared by
> git-convert-index.py and git-read-index-v5.py into a library. Though
> actually, given that git doesn't seem to have infrastructure for
> dealing with Python libraries, this might take some improvisation.
I've created a directory python/lib, where I'll put the python libraries.
I'm not entirely sure this is the correct way to do it, however since
the python code will not be in the main git, but is just a prototype,
I think it's fine.
For now I moved the format strings, the structs, the exceptions and
the new calculate_crc method to the library. Once I go over
git-read-index-v5.py I'll probably move more code there.
> * Please use constants for all of the struct formats. Constants have
> names, making them mostly self-documenting.
>
> * write_directories() currently writes pathnames and fake data and
> stores file offsets in memory. Later write_directory_data() runs
> through the file again, seek()ing over the filenames and filling in
> real data.
>
> Wouldn't it be easier for the first pass just to *compute* and
> record the offsets of the entries to RAM, without writing anything
> to disk, and leave all of the writing to the second pass?
I don't think that would be easier, since I have to go over all the
data when writing anyway. It might however be faster.
> * Instead of writing blank data, it is possible to seek() past it and
> start writing the next thing. The skipped-over file contents are
> logically initialized to zero.
>
> * When working with iteritems(), it is clearer to unpack the item
> pairs and give them names rather than working with d[0] and d[1];
> for example,
>
> - for d in sorted(dirdata.iteritems()):
> + for (pathname,entry) in sorted(dirdata.iteritems()):
>
> * write_directories() returns a "dirdata" that is just an empty
> defaultdict. This seems pointless. Do you have future plans to
> change write_directories() to store something into the dictionary?
>
> * The documentation for binascii.crc32() mentions that it gives
> inconsistent results (signed vs. unsigned) for different versions of
> Python. Please ensure that you are using it in a way that is
> maximally portable. (That seems to imply using (binascii.crc32(...)
> & 0xffffffff) and treating the result as unsigned.)
>
> * At first I thought it was a little bit odd that you pass data
> structures around as dictionaries, but I didn't object. But as I
> look at more and more code it seems more and more cumbersome.
> Therefore, I suggest that you define classes to hold the various
> entities that are manipulated by your programs, because:
>
> * A class definition is a good place to document exactly what fields
> an object is expected to have, and what they mean.
>
> * Access of instance fields (entry.path) is easier to read and type
> than dictionary access (entry["path"]).
>
> * The class definitions will translate pretty directly to C structs.
>
> The fact that class instances use a bit more memory than
> dictionaries is, I think, unimportant. But if that really bothers
> you, you can use __slots__ to save some of the instance memory.
>
> At a higher level:
>
> * What if the offsets to each section were stored in the header, and
> the offsets recorded for dirs and files were relative to the start
> of the section (rather than relative to the start of the file)? I
> think that this would leave open the possibility of formatting the
> sections in memory in parallel in a single pass, then dumping the
> sections to disk in a few big writes (though I'm not saying that this
> should be the *default* way of writing).
I'm thinking if there are any drawbacks doing it that way, but until
now no drawbacks came to my mind. Otherwise this sounds like a good
idea, and I'll include it in the next version.
> * Do you plan to write prototypes for some of the cool new
> functionality that v5 is intended to make possible? For example,
>
> * reading a few specific entries out of an index file
>
> * updating single entries
>
> * adding/removing conflict data to an existing file
>
> * dealing with all of the issues that will come with supporting the
> mutation of an existing index file (i.e., locking, consistency
> checks, etc)
>
> As you probably know from discussions on IRC, I think that the last
> of these is the biggest risk to the success of the project.
I thought of implementing prototypes as the proejct goes on, but not
all of them now. I'd rather first start implementing the reader,
because otherwise the time could get a problem for the midterm.
--
Thomas
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Index format v5
2012-05-19 5:40 ` Michael Haggerty
@ 2012-05-21 20:30 ` Thomas Gummerer
0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gummerer @ 2012-05-21 20:30 UTC (permalink / raw)
To: Michael Haggerty; +Cc: git, trast, gitster, peff, spearce, davidbarr
On 05/19, Michael Haggerty wrote:
> I think it is possible to remove the last magic number and also to
> make the CRC handling easier. I have pushed some suggested changes
> to github [1]:
>
> 1. With the current code, trying to read a file that is less than 24
> bytes long would result in a struct.error (because it would try to
> unpack a string that is shorter than the struct) whereas the
> underlying error in this case should almost always be reported as a
> signature error. So it is correct to read the signature separately
> from the rest of the header, but it is even more correct to check
> the signature (including its length) before reading on.
I looked at the current git code, and the filesize is used for
checking if the index size is big enough. I'll implement it that way
in the prototype for now.
> 2. I introduce a class CRC to hold checksums, so that (a) the code
> for handling checksums can be encapsulated, and (b) an instance of
> this class can be passed into functions and mutated in-place, which
> is less cumbersome than requiring functions to return (data, crc)
> tuples. This, in turn, makes possible...
>
> 3. ...a new function read_struct(f, s, crc), which reads the data
> for a struct.Struct from f, checksums it, and returns the unpacked
> data. This function is more convenient to use than the old
> read_calc_crc().
>
> 4. The checksum instance can also be made responsible for verifying
> that the next four bytes in the file agree with the expected
> checksum. This removes some more code duplication. (See
> CRC.matches().)
>
> 5. You read the extension offsets using CRC_STRUCT.size, which is
> technically correct but misleading. In fact, the extension offsets
> should be documented using their own EXTENSION_INDEX_STRUCT. Also,
> it makes more sense to store the unpacked integer offsets to
> extoffsets rather than the raw 4-byte strings.
>
> 6. With a couple of more minor changes it is possible to replace the
> magic number "24" in read_index_entries(). With this change the
> computation is documented very explicitly and is also (somewhat)
> robust against future changes in the format.
>
> Look over my changes and take whatever you want.
Thanks, I didn't merge them directly, but I took the basics out and
merged them with my code.
--
Thomas
^ permalink raw reply [flat|nested] 49+ messages in thread
end of thread, other threads:[~2012-05-21 20:31 UTC | newest]
Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-03 17:25 Index format v5 Thomas Gummerer
2012-05-03 18:16 ` Thomas Rast
2012-05-03 19:03 ` Junio C Hamano
2012-05-04 7:12 ` Michael Haggerty
2012-05-07 22:18 ` Robin Rosenberg
2012-05-03 18:21 ` Ronan Keryell
2012-05-03 20:36 ` Thomas Gummerer
2012-05-03 18:54 ` Junio C Hamano
2012-05-03 19:11 ` Thomas Rast
2012-05-03 19:31 ` Thomas Rast
2012-05-03 19:32 ` Thomas Rast
2012-05-03 20:32 ` Junio C Hamano
2012-05-03 21:38 ` Thomas Gummerer
2012-05-07 18:57 ` Robin Rosenberg
2012-05-03 19:38 ` solo-git
2012-05-04 13:20 ` Nguyen Thai Ngoc Duy
2012-05-04 15:44 ` Thomas Gummerer
2012-05-04 13:25 ` Philip Oakley
2012-05-04 15:46 ` Junio C Hamano
2012-05-06 10:23 ` Nguyen Thai Ngoc Duy
2012-05-07 13:44 ` Thomas Gummerer
2012-05-06 16:49 ` Phil Hord
2012-05-07 13:08 ` Thomas Gummerer
2012-05-07 15:15 ` Michael Haggerty
2012-05-08 14:11 ` Thomas Gummerer
2012-05-08 14:25 ` Nguyen Thai Ngoc Duy
2012-05-08 14:34 ` Nguyen Thai Ngoc Duy
2012-05-10 6:53 ` Thomas Gummerer
2012-05-10 11:06 ` Nguyen Thai Ngoc Duy
2012-05-09 8:37 ` Michael Haggerty
2012-05-10 12:19 ` Thomas Gummerer
2012-05-10 18:17 ` Michael Haggerty
2012-05-11 17:12 ` Thomas Gummerer
2012-05-13 19:50 ` Michael Haggerty
2012-05-14 15:01 ` Thomas Gummerer
2012-05-14 21:08 ` Michael Haggerty
2012-05-14 22:10 ` Thomas Rast
2012-05-15 6:43 ` Michael Haggerty
2012-05-15 13:49 ` Thomas Gummerer
2012-05-15 15:02 ` Michael Haggerty
2012-05-18 15:38 ` Thomas Gummerer
2012-05-19 13:00 ` Michael Haggerty
2012-05-21 7:45 ` Thomas Gummerer
2012-05-16 5:01 ` Michael Haggerty
2012-05-16 21:54 ` Thomas Gummerer
2012-05-19 5:40 ` Michael Haggerty
2012-05-21 20:30 ` Thomas Gummerer
2012-05-13 21:01 ` Philip Oakley
2012-05-14 14:54 ` Thomas Gummerer
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).