git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: git@vger.kernel.org, git@jeffhostetler.com, peff@peff.net,
	jonathantanmy@google.com, szeder.dev@gmail.com,
	sbeller@google.com, Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH v4 04/13] commit-graph: implement write_commit_graph()
Date: Tue, 20 Feb 2018 14:57:42 -0800	[thread overview]
Message-ID: <xmqqmv031d7d.fsf@gitster-ct.c.googlers.com> (raw)
In-Reply-To: <1519066406-81663-5-git-send-email-dstolee@microsoft.com> (Derrick Stolee's message of "Mon, 19 Feb 2018 13:53:17 -0500")

Derrick Stolee <stolee@gmail.com> writes:

> +#define GRAPH_OID_VERSION_SHA1 1
> +#define GRAPH_OID_LEN_SHA1 20

This hardcoded 20 on the right hand side of this #define is probably
problematic.   Unless you are planning to possibly store truncated
hash value for some future hash algorithm, GRAPH_OID_LEN_$HASH should
always be the same as GIT_$HASH_RAWSZ, I would think.  IOW

    #define GRAPH_OID_LEN_SHA1 GIT_SHA1_RAWSZ

perhaps?

> +static void write_graph_chunk_fanout(struct sha1file *f,
> +				     struct commit **commits,
> +				     int nr_commits)
> +{
> +	uint32_t i, count = 0;
> +	struct commit **list = commits;
> +	struct commit **last = commits + nr_commits;
> +
> +	/*
> +	 * Write the first-level table (the list is sorted,
> +	 * but we use a 256-entry lookup to be able to avoid
> +	 * having to do eight extra binary search iterations).
> +	 */
> +	for (i = 0; i < 256; i++) {
> +		while (list < last) {
> +			if ((*list)->object.oid.hash[0] != i)
> +				break;
> +			count++;
> +			list++;
> +		}

If count and list are always incremented in unison, perhaps you do
not need an extra variable "last".  If typeof(nr_commits) is wider
than typeof(count), this loop and the next write-be32 is screwed
anyway ;-)

This comment probably applies equally to some other uses of the same
"compute last pointer to compare with running pointer for
termination" pattern in this patch.

> +		sha1write_be32(f, count);
> +	}
> +}

> +static int commit_pos(struct commit **commits, int nr_commits,
> +		      const struct object_id *oid, uint32_t *pos)
> +{

It is a bit unusual to see something_pos() that returns an integer
that does *NOT* return the position as its return value.  Dropping
the *pos parameter, and returning "mid" when commits[mid] is what we
wanted to see, and otherwise returning "-1 - first" to signal the
position at which we _would_ have found the object, if it were in
the table, would make it more consistent with the usual convention.

Don't we even have such a generalized binary search helper already
somewhere in the system?

> +static void write_graph_chunk_data(struct sha1file *f, int hash_len,
> +				   struct commit **commits, int nr_commits)
> +{
> +	struct commit **list = commits;
> +	struct commit **last = commits + nr_commits;
> +	uint32_t num_large_edges = 0;
> +
> +	while (list < last) {
> +		struct commit_list *parent;
> +		uint32_t int_id;
> +		uint32_t packedDate[2];
> +
> +...
> +		if (!parent)
> +			int_id = GRAPH_PARENT_NONE;
> +		else if (parent->next)
> +			int_id = GRAPH_LARGE_EDGES_NEEDED | num_large_edges;
> +		else if (!commit_pos(commits, nr_commits,
> +				    &(parent->item->object.oid), &int_id))
> +			int_id = GRAPH_PARENT_MISSING;
> +
> +		sha1write_be32(f, int_id);
> +
> +		if (parent && parent->next) {

This is equivalent to checking "int_id & GRAPH_LARGE_EDGES_NEEDED",
right?  Not suggesting to use the other form of checks, but trying
to see what's the best way to express it in the most readable way.

> +			do {
> +				num_large_edges++;
> +				parent = parent->next;
> +			} while (parent);

It feels somewhat wasteful to traverse the commit's parents list
only to count, without populating the octopus table (which I
understand is assumed to be minority case under this design).

> +		}
> +
> +		if (sizeof((*list)->date) > 4)
> +			packedDate[0] = htonl(((*list)->date >> 32) & 0x3);
> +		else
> +			packedDate[0] = 0;

OK, the undefined pattern in the previous round is now gone ;-)  Good.

> +		packedDate[1] = htonl((*list)->date);
> +		sha1write(f, packedDate, 8);
> +
> +		list++;
> +	}
> +}
> +
> +static void write_graph_chunk_large_edges(struct sha1file *f,
> +					  struct commit **commits,
> +					  int nr_commits)
> +{
> +	struct commit **list = commits;
> +	struct commit **last = commits + nr_commits;
> +	struct commit_list *parent;
> +
> +	while (list < last) {
> +		int num_parents = 0;
> +		for (parent = (*list)->parents; num_parents < 3 && parent;
> +		     parent = parent->next)
> +			num_parents++;
> +
> +		if (num_parents <= 2) {
> +			list++;
> +			continue;
> +		}
> +
> +		/* Since num_parents > 2, this initializer is safe. */
> +		for (parent = (*list)->parents->next; parent; parent = parent->next) {
> +			uint32_t int_id, swap_int_id;
> +			uint32_t last_edge = 0;
> +			if (!parent->next)
> +				last_edge |= GRAPH_LAST_EDGE;
> +
> +			if (commit_pos(commits, nr_commits,
> +				       &(parent->item->object.oid),
> +				       &int_id))
> +				swap_int_id = htonl(int_id | last_edge);
> +			else
> +				swap_int_id = htonl(GRAPH_PARENT_MISSING | last_edge);
> +			sha1write(f, &swap_int_id, 4);

What does "swap_" in the name of this variable mean?  For some
archs, there is no swap.  The only difference between int_id and the
variable is that its MSB may possibly be smudged with last_edge bit.

This is a tangent, but after having seen many instances of "int_id",
I started to feel that it is grossly misnamed.  We do not care about
its "int" ness---what's more significant about it is that we use can
it as a short identifier in place for a full object name, given the
table of known OIDs.  "oid_table_index" may be a better name (but
others may be able to suggest even better one).

	int pos;
	pos = commit_pos(commits, nr_commits, parent->item->object.oid);
	oid_table_pos = (pos < 0) ? GRAPH_PARENT_MISSING : pos;
	if (!parent->net)
		oid_table_pos |= GRAPH_LAST_EDGE;
	oid_table_pos = htonl(oid_table_pos);
	sha1write(f, &oid_table_pos, sizeof(oid_table_pos));

or something like that, perhaps?

> +static int commit_compare(const void *_a, const void *_b)
> +{
> +	struct object_id *a = (struct object_id *)_a;
> +	struct object_id *b = (struct object_id *)_b;
> +	return oidcmp(a, b);
> +}

I think oidcmp() takes const pointers, so there is no need to
discard constness from the parameter like this code does.  Also I
think we tend to prefer writing a_/b_ (instead of _a/_b) to appease
language lawyers who do not want us mere mortals to use names that
begin with underscore.

> +static int if_packed_commit_add_to_list(const struct object_id *oid,
> +					struct packed_git *pack,
> +					uint32_t pos,
> +					void *data)

That is a strange name.  "collect packed commits", perhaps?

> +char *write_commit_graph(const char *obj_dir)
> +{
> +	struct packed_oid_list oids;
> +	struct packed_commit_list commits;
> +	struct sha1file *f;
> +	int i, count_distinct = 0;
> +	DIR *info_dir;
> +	struct strbuf tmp_file = STRBUF_INIT;
> +	struct strbuf graph_file = STRBUF_INIT;
> +	unsigned char final_hash[GIT_MAX_RAWSZ];
> +	char *graph_name;
> +	int fd;
> +	uint32_t chunk_ids[5];
> +	uint64_t chunk_offsets[5];
> +	int num_chunks;
> +	int num_long_edges;
> +	struct commit_list *parent;
> +
> +	oids.nr = 0;
> +	oids.alloc = (int)(0.15 * approximate_object_count());

Heh, traditionalist would probably avoid unnecessary use of float
and use something like 1/4 or 1/8 ;-)  After all, it is merely a
ballpark guestimate.

> +	num_long_edges = 0;

This again is about naming, but I find it a bit unnatural to call
the edge between a chind and its octopus parents "long".  Individual
edges are not long--the only thing that is long is your "list of
edges".  Some other codepaths in this patch seems to call the same
concept with s/long/large/, which I found somewhat puzzling.

> +	for (i = 0; i < oids.nr; i++) {
> +		int num_parents = 0;
> +		if (i > 0 && !oidcmp(&oids.list[i-1], &oids.list[i]))
> +			continue;
> +
> +		commits.list[commits.nr] = lookup_commit(&oids.list[i]);
> +		parse_commit(commits.list[commits.nr]);
> +
> +		for (parent = commits.list[commits.nr]->parents;
> +		     parent; parent = parent->next)
> +			num_parents++;
> +
> +		if (num_parents > 2)
> +			num_long_edges += num_parents - 1;

OK, so we count how many entries we will record in the overflow
parent table, and...

> +
> +		commits.nr++;
> +	}
> +	num_chunks = num_long_edges ? 4 : 3;

... if we do not have any octopus commit, we do not need the chunk
for the overflow parent table.  Makes sense.

> +	strbuf_addf(&tmp_file, "%s/info", obj_dir);
> +	info_dir = opendir(tmp_file.buf);
> +
> +	if (!info_dir && mkdir(tmp_file.buf, 0777) < 0)
> +		die_errno(_("cannot mkdir %s"), tmp_file.buf);
> +	if (info_dir)
> +		closedir(info_dir);
> +	strbuf_addstr(&tmp_file, "/tmp_graph_XXXXXX");
> +
> +	fd = git_mkstemp_mode(tmp_file.buf, 0444);
> +	if (fd < 0)
> +		die_errno("unable to create '%s'", tmp_file.buf);

It is not performance critical, but it feels a bit wasteful to
opendir merely to see if something exists as a directory, and it is
misleading to the readers (it looks as if we care about what files
we already have in the directory).

The approach that optimizes for the most common case would be to

	- prepare full path to the tempfile first
	- try create with mkstemp
	  - if successful, you do not have to worry about creating
	    the directory at all, which is the most common case
        - see why mkstemp step above failed.  Was it because you
	  did not have the surrounding directory?
          - if not, there is no point continuing.  Just error out.
	  - if it was due to missing directory, try creating one.
	- try create with mkstemp
	  - if successful, all is well.
        - otherwise there isn't anything more we can do here.



> +
> +	f = sha1fd(fd, tmp_file.buf);
> +
> +	sha1write_be32(f, GRAPH_SIGNATURE);
> +
> +	sha1write_u8(f, GRAPH_VERSION);
> +	sha1write_u8(f, GRAPH_OID_VERSION);
> +	sha1write_u8(f, num_chunks);
> +	sha1write_u8(f, 0); /* unused padding byte */
> +
> +	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
> +	chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
> +	chunk_ids[2] = GRAPH_CHUNKID_DATA;
> +	if (num_long_edges)
> +		chunk_ids[3] = GRAPH_CHUNKID_LARGEEDGES;
> +	else
> +		chunk_ids[3] = 0;
> +	chunk_ids[4] = 0;
> +
> +	chunk_offsets[0] = 8 + GRAPH_CHUNKLOOKUP_SIZE;
> +	chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
> +	chunk_offsets[2] = chunk_offsets[1] + GRAPH_OID_LEN * commits.nr;
> +	chunk_offsets[3] = chunk_offsets[2] + (GRAPH_OID_LEN + 16) * commits.nr;
> +	chunk_offsets[4] = chunk_offsets[3] + 4 * num_long_edges;

Do we have to care about overflowing any of the above?  For example,
the format allows only up to (1<<31)-1 commits, but did something
actually check if commits.nr at this point stayed under that limit?


  reply	other threads:[~2018-02-20 22:57 UTC|newest]

Thread overview: 146+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
2018-01-30 21:39 ` [PATCH v2 01/14] commit-graph: add format document Derrick Stolee
2018-02-01 21:44   ` Jonathan Tan
2018-01-30 21:39 ` [PATCH v2 02/14] graph: add commit graph design document Derrick Stolee
2018-01-31  2:19   ` Stefan Beller
2018-01-30 21:39 ` [PATCH v2 03/14] commit-graph: create git-commit-graph builtin Derrick Stolee
2018-02-02  0:53   ` SZEDER Gábor
2018-01-30 21:39 ` [PATCH v2 04/14] commit-graph: implement construct_commit_graph() Derrick Stolee
2018-02-01 22:23   ` Jonathan Tan
2018-02-01 23:46   ` SZEDER Gábor
2018-02-02 15:32   ` SZEDER Gábor
2018-02-05 16:06     ` Derrick Stolee
2018-02-07 15:08       ` SZEDER Gábor
2018-02-07 15:10         ` Derrick Stolee
2018-01-30 21:39 ` [PATCH v2 05/14] commit-graph: implement git-commit-graph --write Derrick Stolee
2018-02-01 23:33   ` Jonathan Tan
2018-02-02 18:36     ` Stefan Beller
2018-02-02 22:48       ` Junio C Hamano
2018-02-03  1:58         ` Derrick Stolee
2018-02-03  9:28           ` Jeff King
2018-02-05 18:48             ` Junio C Hamano
2018-02-06 18:55               ` Derrick Stolee
2018-02-01 23:48   ` SZEDER Gábor
2018-02-05 18:07     ` Derrick Stolee
2018-02-02  1:47   ` SZEDER Gábor
2018-01-30 21:39 ` [PATCH v2 06/14] commit-graph: implement git-commit-graph --read Derrick Stolee
2018-01-31  2:22   ` Stefan Beller
2018-02-02  0:02   ` SZEDER Gábor
2018-02-02  0:23   ` Jonathan Tan
2018-02-05 19:29     ` Derrick Stolee
2018-01-30 21:39 ` [PATCH v2 07/14] commit-graph: implement git-commit-graph --update-head Derrick Stolee
2018-02-02  1:35   ` SZEDER Gábor
2018-02-05 21:01     ` Derrick Stolee
2018-02-02  2:45   ` SZEDER Gábor
2018-01-30 21:39 ` [PATCH v2 08/14] commit-graph: implement git-commit-graph --clear Derrick Stolee
2018-02-02  4:01   ` SZEDER Gábor
2018-01-30 21:39 ` [PATCH v2 09/14] commit-graph: teach git-commit-graph --delete-expired Derrick Stolee
2018-02-02 15:04   ` SZEDER Gábor
2018-01-30 21:39 ` [PATCH v2 10/14] commit-graph: add core.commitgraph setting Derrick Stolee
2018-01-31 22:44   ` Igor Djordjevic
2018-02-02 16:01   ` SZEDER Gábor
2018-01-30 21:39 ` [PATCH v2 11/14] commit: integrate commit graph with commit parsing Derrick Stolee
2018-02-02  1:51   ` Jonathan Tan
2018-02-06 14:53     ` Derrick Stolee
2018-01-30 21:39 ` [PATCH v2 12/14] commit-graph: read only from specific pack-indexes Derrick Stolee
2018-01-30 21:39 ` [PATCH v2 13/14] commit-graph: close under reachability Derrick Stolee
2018-01-30 21:39 ` [PATCH v2 14/14] commit-graph: build graph from starting commits Derrick Stolee
2018-01-30 21:47 ` [PATCH v2 00/14] Serialized Git Commit Graph Stefan Beller
2018-02-01  2:34   ` Stefan Beller
2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
2018-02-08 20:37   ` [PATCH v3 01/14] commit-graph: add format document Derrick Stolee
2018-02-08 21:21     ` Junio C Hamano
2018-02-08 21:33       ` Derrick Stolee
2018-02-08 23:16         ` Junio C Hamano
2018-02-08 20:37   ` [PATCH v3 02/14] graph: add commit graph design document Derrick Stolee
2018-02-08 20:37   ` [PATCH v3 03/14] commit-graph: create git-commit-graph builtin Derrick Stolee
2018-02-08 21:27     ` Junio C Hamano
2018-02-08 21:36       ` Derrick Stolee
2018-02-08 23:21         ` Junio C Hamano
2018-02-08 20:37   ` [PATCH v3 04/14] commit-graph: implement write_commit_graph() Derrick Stolee
2018-02-08 22:14     ` Junio C Hamano
2018-02-15 18:19     ` Junio C Hamano
2018-02-15 18:23       ` Derrick Stolee
2018-02-08 20:37   ` [PATCH v3 05/14] commit-graph: implement 'git-commit-graph write' Derrick Stolee
2018-02-13 21:57     ` Jonathan Tan
2018-02-08 20:37   ` [PATCH v3 06/14] commit-graph: implement 'git-commit-graph read' Derrick Stolee
2018-02-08 23:38     ` Junio C Hamano
2018-02-08 20:37   ` [PATCH v3 07/14] commit-graph: update graph-head during write Derrick Stolee
2018-02-12 18:56     ` Junio C Hamano
2018-02-12 20:37       ` Junio C Hamano
2018-02-12 21:24         ` Derrick Stolee
2018-02-13 22:38     ` Jonathan Tan
2018-02-08 20:37   ` [PATCH v3 08/14] commit-graph: implement 'git-commit-graph clear' Derrick Stolee
2018-02-13 22:49     ` Jonathan Tan
2018-02-08 20:37   ` [PATCH v3 09/14] commit-graph: implement --delete-expired Derrick Stolee
2018-02-08 20:37   ` [PATCH v3 10/14] commit-graph: add core.commitGraph setting Derrick Stolee
2018-02-08 20:37   ` [PATCH v3 11/14] commit: integrate commit graph with commit parsing Derrick Stolee
2018-02-14  0:12     ` Jonathan Tan
2018-02-14 18:08       ` Derrick Stolee
2018-02-15 18:25     ` Junio C Hamano
2018-02-08 20:37   ` [PATCH v3 12/14] commit-graph: close under reachability Derrick Stolee
2018-02-08 20:37   ` [PATCH v3 13/14] commit-graph: read only from specific pack-indexes Derrick Stolee
2018-02-08 20:37   ` [PATCH v3 14/14] commit-graph: build graph from starting commits Derrick Stolee
2018-02-09 13:02     ` SZEDER Gábor
2018-02-09 13:45       ` Derrick Stolee
2018-02-14 18:15   ` [PATCH v3 00/14] Serialized Git Commit Graph Derrick Stolee
2018-02-14 18:27     ` Stefan Beller
2018-02-14 19:11       ` Derrick Stolee
2018-02-19 18:53     ` [PATCH v4 00/13] " Derrick Stolee
2018-02-19 18:53       ` [PATCH v4 01/13] commit-graph: add format document Derrick Stolee
2018-02-20 20:49         ` Junio C Hamano
2018-02-21 19:23         ` Stefan Beller
2018-02-21 19:45           ` Derrick Stolee
2018-02-21 19:48             ` Stefan Beller
2018-03-30 13:25         ` Jakub Narebski
2018-04-02 13:09           ` Derrick Stolee
2018-04-02 14:09             ` Jakub Narebski
2018-02-19 18:53       ` [PATCH v4 02/13] graph: add commit graph design document Derrick Stolee
2018-02-20 21:42         ` Junio C Hamano
2018-02-23 15:44           ` Derrick Stolee
2018-02-21 19:34         ` Stefan Beller
2018-02-19 18:53       ` [PATCH v4 03/13] commit-graph: create git-commit-graph builtin Derrick Stolee
2018-02-20 21:51         ` Junio C Hamano
2018-02-21 18:58           ` Junio C Hamano
2018-02-23 16:07             ` Derrick Stolee
2018-02-26 16:25         ` SZEDER Gábor
2018-02-26 17:08           ` Derrick Stolee
2018-02-19 18:53       ` [PATCH v4 04/13] commit-graph: implement write_commit_graph() Derrick Stolee
2018-02-20 22:57         ` Junio C Hamano [this message]
2018-02-23 17:23           ` Derrick Stolee
2018-02-23 19:30             ` Junio C Hamano
2018-02-23 19:48               ` Junio C Hamano
2018-02-23 20:02               ` Derrick Stolee
2018-02-26 16:10         ` SZEDER Gábor
2018-02-28 18:47         ` Junio C Hamano
2018-02-19 18:53       ` [PATCH v4 05/13] commit-graph: implement 'git-commit-graph write' Derrick Stolee
2018-02-21 19:25         ` Junio C Hamano
2018-02-19 18:53       ` [PATCH v4 06/13] commit-graph: implement git commit-graph read Derrick Stolee
2018-02-21 20:11         ` Junio C Hamano
2018-02-22 18:25           ` Junio C Hamano
2018-02-19 18:53       ` [PATCH v4 07/13] commit-graph: implement --set-latest Derrick Stolee
2018-02-22 18:31         ` Junio C Hamano
2018-02-23 17:53           ` Derrick Stolee
2018-02-19 18:53       ` [PATCH v4 08/13] commit-graph: implement --delete-expired Derrick Stolee
2018-02-21 21:34         ` Stefan Beller
2018-02-23 17:43           ` Derrick Stolee
2018-02-22 18:48         ` Junio C Hamano
2018-02-23 17:59           ` Derrick Stolee
2018-02-23 19:33             ` Junio C Hamano
2018-02-23 19:41               ` Derrick Stolee
2018-02-23 19:51                 ` Junio C Hamano
2018-02-19 18:53       ` [PATCH v4 09/13] commit-graph: add core.commitGraph setting Derrick Stolee
2018-02-19 18:53       ` [PATCH v4 10/13] commit-graph: close under reachability Derrick Stolee
2018-02-19 18:53       ` [PATCH v4 11/13] commit: integrate commit graph with commit parsing Derrick Stolee
2018-02-19 18:53       ` [PATCH v4 12/13] commit-graph: read only from specific pack-indexes Derrick Stolee
2018-02-21 22:25         ` Stefan Beller
2018-02-23 19:19           ` Derrick Stolee
2018-02-19 18:53       ` [PATCH v4 13/13] commit-graph: build graph from starting commits Derrick Stolee
2018-03-30 11:10       ` [PATCH v4 00/13] Serialized Git Commit Graph Jakub Narebski
2018-04-02 13:02         ` Derrick Stolee
2018-04-02 14:46           ` Jakub Narebski
2018-04-02 15:02             ` Derrick Stolee
2018-04-02 17:35               ` Stefan Beller
2018-04-02 17:54                 ` Derrick Stolee
2018-04-02 18:02                   ` Stefan Beller
2018-04-07 22:37               ` Jakub Narebski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqmv031d7d.fsf@gitster-ct.c.googlers.com \
    --to=gitster@pobox.com \
    --cc=dstolee@microsoft.com \
    --cc=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    --cc=peff@peff.net \
    --cc=sbeller@google.com \
    --cc=stolee@gmail.com \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).