Re: Index format v5 - Michael Haggerty

From: Michael Haggerty <mhagger@alum.mit.edu>
To: Thomas Gummerer <t.gummerer@gmail.com>
Cc: git@vger.kernel.org, trast@student.ethz.ch, gitster@pobox.com,
	peff@peff.net, spearce@spearce.org, davidbarr@google.com
Subject: Re: Index format v5
Date: Sat, 19 May 2012 15:00:59 +0200
Message-ID: <4FB7998B.2030305@alum.mit.edu>
In-Reply-To: <20120518153826.GB1738@tgummerer.surfnet.iacbox>

On 05/18/2012 05:38 PM, Thomas Gummerer wrote:
>
>> I suggest that you apply the same kinds of cleanups to
>> git-convert-index.py (which I personally haven't looked at yet at
>> all).  If you want my feedback on that script, please let me know
>> when you think it is ready.
>
> That would be great, if you have the time to do it. I'm not
> completely finished with it (docstrings and conflicted data writing
> are still missing).

I've looked over the writing side of git-convert-index.py version
81411fe6c98, and here are my first comments:

* Please remove trailing whitespace from the source code.

* I suggest that you move constants and code shared by
   git-convert-index.py and git-read-index-v5.py into a library.  Though
   actually, given that git doesn't seem to have infrastructure for
   dealing with Python libraries, this might take some improvisation.

* Please use constants for all of the struct formats.  Constants have
   names, making them mostly self-documenting.

* write_directories() currently writes pathnames and fake data and
   stores file offsets in memory.  Later write_directory_data() runs
   through the file again, seek()ing over the filenames and filling in
   real data.

   Wouldn't it be easier for the first pass just to *compute* the
   offsets of the entries and record them in RAM, without writing
   anything to disk, and leave all of the writing to the second pass?

* Instead of writing blank data, it is possible to seek() past it and
   start writing the next thing.  The skipped-over file contents are
   logically initialized to zero.

* When working with iteritems(), it is clearer to unpack the item
   pairs and give them names rather than working with d[0] and d[1];
   for example,

     -    for d in sorted(dirdata.iteritems()):
     +    for (pathname,entry) in sorted(dirdata.iteritems()):

* write_directories() returns a "dirdata" that is just an empty
   defaultdict.  This seems pointless.  Do you have future plans to
   change write_directories() to store something into the dictionary?

* The documentation for binascii.crc32() mentions that it gives
   inconsistent results (signed vs. unsigned) for different versions of
   Python.  Please ensure that you are using it in a way that is
   maximally portable.  (That seems to imply using (binascii.crc32(...)
   & 0xffffffff) and treating the result as unsigned.)

* At first I thought it was a little bit odd that you pass data
   structures around as dictionaries, but I didn't object.  But as I
   look at more and more code it seems more and more cumbersome.
   Therefore, I suggest that you define classes to hold the various
   entities that are manipulated by your programs, because:

   * A class definition is a good place to document exactly what fields
     an object is expected to have, and what they mean.

   * Access of instance fields (entry.path) is easier to read and type
     than dictionary access (entry["path"]).

   * The class definitions will translate pretty directly to C structs.

   The fact that class instances use a bit more memory than
   dictionaries is, I think, unimportant.  But if that really bothers
   you, you can use __slots__ to save some of the instance memory.
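
To make a few of the points above concrete (named struct formats, an
unsigned CRC, and a class instead of a dictionary), here is a rough
sketch.  The class name, the field names, and the record layout are all
made up for illustration; they are not the real v5 layout and not what
git-convert-index.py does today.

```python
import binascii
import struct

# Hypothetical layouts, NOT the real v5 format.  "!" selects network
# byte order with no padding.
FILE_ENTRY_STRUCT = struct.Struct("!HI")  # 16-bit flags, 32-bit size
CRC_STRUCT = struct.Struct("!I")          # unsigned 32-bit checksum


class FileEntry(object):
    """One file entry (illustrative fields only).

    path  -- bytes, name relative to the containing directory
    flags -- integer flag bits
    size  -- file size in bytes
    """

    # Optional: avoids a per-instance __dict__ and rejects misspelled
    # attribute names.
    __slots__ = ["path", "flags", "size"]

    def __init__(self, path, flags=0, size=0):
        self.path = path
        self.flags = flags
        self.size = size

    def pack(self):
        """Return path, NUL, the fixed-size fields, then their CRC-32."""
        data = (self.path + b"\0"
                + FILE_ENTRY_STRUCT.pack(self.flags, self.size))
        # The mask makes the result unsigned on every Python version.
        crc = binascii.crc32(data) & 0xffffffff
        return data + CRC_STRUCT.pack(crc)
```

Something like FileEntry(b"README", size=120).pack() then yields one
complete record, and the class docstring is the single place where the
fields are described.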

At a higher level:

* What if the offsets to each section were stored in the header, and
   the offsets recorded for dirs and files were relative to the start
   of the section (rather than relative to the start of the file)?  I
   think that this would leave open the possibility of formatting the
   sections in memory in parallel in a single pass, then dumping the
   sections to disk in a few big writes (though I'm not saying that this
   should be the *default* way of writing).

* Do you plan to write prototypes for some of the cool new
   functionality that v5 is intended to make possible?  For example,

   * reading a few specific entries out of an index file

   * updating single entries

   * adding/removing conflict data to an existing file

   * dealing with all of the issues that will come with supporting the
     mutation of an existing index file (i.e., locking, consistency
     checks, etc.)

   As you probably know from discussions on IRC, I think that the last
   of these is the biggest risk to the success of the project.
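
Regarding the section-relative offsets suggested above, here is a
minimal sketch of what I have in mind.  The two-section header and the
function names are assumptions made up for this example, not part of
the current format draft.

```python
import struct

# Hypothetical header: absolute start offsets of the directory section
# and of the file section, as two unsigned 32-bit integers.
HEADER_STRUCT = struct.Struct("!II")


def build_section(records):
    """Concatenate records (bytes objects) into one section.

    Return (section_bytes, offsets), where each offset is relative to
    the start of the section, not to the start of the file.
    """
    offsets = []
    pos = 0
    for record in records:
        offsets.append(pos)
        pos += len(record)
    return b"".join(records), offsets


def assemble(dir_records, file_records):
    """Format both sections in memory and return the whole file image."""
    # Neither call needs to know where the other section will end up,
    # so the sections can be built independently of each other.
    dirs, dir_offsets = build_section(dir_records)
    files, file_offsets = build_section(file_records)

    dirs_start = HEADER_STRUCT.size
    files_start = dirs_start + len(dirs)
    header = HEADER_STRUCT.pack(dirs_start, files_start)
    return header + dirs + files, dir_offsets, file_offsets
```

A reader finds an entry at the section start taken from the header plus
the entry's relative offset.  Because the relative offsets do not change
when another section grows, only the header has to be fixed up at the
end.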

> I'm not sure about the read_tree_extensiondata method, if I should
> extract a method, which only reads one entry, but I'm not sure that
> would make any sense, since there would be a lot of parameters and
> return values to the function.

If the index were represented by a class instance, then all of the
information would be grouped together as a coherent whole that is easy
to pass around.

> The same thing is in the main method, where I'm not sure if it's
> better to extract the read_index and write_index functions, or
> just leave the code in the main method. My guess is that it makes
> sense in the main method, since there are less calls, but it
> doesn't make sense in the read_tree_extensiondata method?

Ditto.

> Another thing I'm unsure about is the write_directory_data method,
> if there is any way to replace the try/except with something
> simpler?

With dictionaries, you can do

-        try:
-            flags = d[1]["flags"]
-        except KeyError:
-            flags = 0
+        flags = d[1].get("flags", 0)

If you convert to class instances, then presumably the constructor would
set valid default values for all of the fields.
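
As a sketch of that (DirEntry, its fields, and summarize() are invented
names, and I use items() rather than iteritems() so that it also runs
under Python 3):

```python
class DirEntry(object):
    """Per-directory data; every field gets a valid default."""

    def __init__(self, flags=0, nfiles=0):
        self.flags = flags
        self.nfiles = nfiles


def summarize(dirdata):
    """Return (pathname, flags, nfiles) tuples in pathname order."""
    result = []
    # Unpack each item into named variables instead of using d[0]/d[1].
    for pathname, entry in sorted(dirdata.items()):
        # No try/except and no .get(): the constructor already set flags.
        result.append((pathname, entry.flags, entry.nfiles))
    return result


dirdata = {"src/": DirEntry(nfiles=2), "": DirEntry(flags=1)}
rows = summarize(dirdata)
```

With that, the code that writes an entry never has to ask whether a
field is present.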

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/
