Re: [PATCH 5.5/22] Add documentation for the index api

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Duy Nguyen <pclouds@gmail.com>
To: Thomas Gummerer <t.gummerer@gmail.com>
Cc: Git Mailing List <git@vger.kernel.org>,
	Thomas Rast <trast@inf.ethz.ch>,
	Michael Haggerty <mhagger@alum.mit.edu>,
	Junio C Hamano <gitster@pobox.com>,
	Robin Rosenberg <robin.rosenberg@dewire.com>
Subject: Re: [PATCH 5.5/22] Add documentation for the index api
Date: Tue, 9 Jul 2013 22:42:59 +0700	[thread overview]
Message-ID: <CACsJy8A9+1O_em=FtV_TUKags4FrSggV76dd1h6gsJ+JHfjZKw@mail.gmail.com> (raw)
In-Reply-To: <871u78rcw0.fsf@gmail.com>

On Tue, Jul 9, 2013 at 3:54 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> As promised, a draft for a documentation for the index api as it is in
> this series.

First of all, it may be a good idea to acknowledge
index_state->cache[] as part of the API for now. Not hiding it
simplifies a few things (no need for new next_ce field, no worries
about rewinding in unpack-trees..). Supporting partial loading (and
maybe partial update in some cases) with this API and
index_state->cache[] part of the API are good enough for me. We can do
another tree-based API or something update later when it's formed (I
looked at your index-v5api branch but I don't think a tree-based api
was there, my concern is how much extra head pre-v5 has to pay to use
tree-based api).

> +`read_index_filtered(opts)`::
> +       This method behaves differently for index-v2 and index-v5.
> +
> +       For index-v2 it simply reads the whole index as read_index()
> +       does, so we are sure we don't have to reload anything if the
> +       user wants a different filter.  It also sets the filter_opts
> +       in the index_state, which is used to limit the results when
> +       iterating over the index with for_each_index_entry().
> +
> +       The whole index is read to avoid the need to eventually
> +       re-read the index later, because the performance is no
> +       different when reading it partially.
> +
> +       For index-v5 it creates an adjusted_pathspec to filter the
> +       reading.  First all the directory entries are read and then
> +       the cache_entries in the directories that match the adjusted
> +       pathspec are read.  The filter_opts in the index_state are set
> +       to filter out the rest of the cache_entries that are matched
> +       by the adjusted pathspec but not by the pathspec given.  The
> +       rest of the index entries are filtered out when iterating over
> +       the cache with for_each_index_entries.

You can state in the API that the input pathspec is used as a hint to
load only a portion of the index. read_index_filtered may load _more_
than necessary. It's the caller's responsibility to verify again which
is matched and which is not. That's how read_directory is done. I
think it gives you more liberty in loading strategy. It's already true
for v2 because full index is loaded regardless of the given pathspec.
In the end, we have a linear list (from public view) of cache entries,
accessible via index_state->cache[].

If you happen to know that certain entries match the given pathspec,
you could help the caller avoid match_pathspec'ing again by set a bit
in ce_flags.  To know which entry exists in the index and which is
new, use another flag. Most reader code won't change if we do it this
way, all match_pathspec() remain where they are.

> +`for_each_index_entry(fn, cb_data)`::
> +       Iterates over all cache_entries in the index filtered by
> +       filter_opts in the index_stat.  For each cache entry fn is
> +       executed with cb_data as callback data.  From within the loop
> +       do `return 0` to continue, or `return 1` to break the loop.

Because we don't attempt to hide index_state->cache[], this one may be
for convenience, the user is not required to convert to it. Actually I
think this may be slower because of the cost of calling function
pointer.

> +`next_index_entry(ce)`::
> +       Returns the cache_entry that follows after ce

next_ce field and this method may be gone too, just access index_state->cache[]

> +`index_change_filter_opts(opts)`::
> +       This function again has a slightly different functionality for
> +       index-v2 and index-v5.
> +
> +       For index-v2 it simply changes the filter_opts, so
> +       for_each_index_entry uses the changed index_opts, to iterate
> +       over a different set of cache entries.
> +
> +       For index-v5 it refreshes the index if the filter_opts have
> +       changed and sets the new filter_opts in the index state, again
> +       to iterate over a different set of cache entries as with
> +       index-v2.
> +
> +       This has some optimization potential, in the case that the
> +       opts get stricter (less of the index should be read) it
> +       doesn't have to reload anything, but currently does.

The only use case I see so far is converting a partial index_state
back to a full one. Apart from doing so in order to write the new
index, I think some operation (like rename tracking in diff or
unpack-trees) may expect full index. I think we should support that. I
doubt we need to change pathspec to something different than the one
we used to load the index. When a user passes a pathspec to a command,
the user expects the command to operate on that set only, not outside.

If you take the input pathspec at loading just as a hint, you could
load all related directory blocks and all files in those blocks, so
that expanding to full index is simply adding more files from missing
directory blocks (and their files). An advantage of not strictly
follow the input pathspec.

Some thoughts about the writing api.

In think we should avoid automatically converting partial index into a
full one before writing. Push that back to the caller and die() when
asked to update partial index. They know at what point the index may
be updated and even what part of it may be updated. I think all
commands fall into two categories, tree-wide updates (merge,
checkout...) and limited by the user-given pathspec. "what part to be
updated" is not so hard to determine.

If the caller promises not to update or read outside certain pathspec
(part of API), we don't really need to load full index until
write_index is called. At that point I think we also know what
directory blocks are completely untouched by the caller (by promise)
and could copy them over from the old index byte-by-byte (or something
close, some offsets may be recalculated). That may keep tree compiling
cost proportional to the number of changed entries, not the number of
entries in index.

There is another partial write case that's not covered this round (or
was it discussed and discarded?): refreshing the index. This operation
could be treated as a standalone, special one: refresh and update the
index file directly without waiting until write_index is called. There
are some commands that follow this scheme by doing
update_index_if_able() after refresh_index(). Those will be cheaper
with v5 because we don't need write a full new index.
--
Duy

next prev parent reply	other threads:[~2013-07-09 15:43 UTC|newest]

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
2013-07-07  8:11 ` [PATCH 01/22] t2104: Don't fail for index versions other than [23] Thomas Gummerer
2013-07-07  8:11 ` [PATCH 02/22] read-cache: split index file version specific functionality Thomas Gummerer
2013-07-07  8:11 ` [PATCH 03/22] read-cache: move index v2 specific functions to their own file Thomas Gummerer
2013-07-07  8:11 ` [PATCH 04/22] read-cache: Re-read index if index file changed Thomas Gummerer
2013-07-07  8:11 ` [PATCH 05/22] read-cache: add index reading api Thomas Gummerer
2013-07-08  2:01   ` Duy Nguyen
2013-07-08 11:40     ` Thomas Gummerer
2013-07-08  2:19   ` Duy Nguyen
2013-07-08 11:20     ` Thomas Gummerer
2013-07-08 12:45       ` Duy Nguyen
2013-07-08 13:37         ` Thomas Gummerer
2013-07-08 20:54         ` [PATCH 5.5/22] Add documentation for the index api Thomas Gummerer
2013-07-09 15:42           ` Duy Nguyen [this message]
2013-07-09 20:10             ` Thomas Gummerer
2013-07-10  5:28               ` Duy Nguyen
2013-07-11 11:30                 ` Thomas Gummerer
2013-07-11 11:42                   ` Duy Nguyen
2013-07-11 12:27                     ` Duy Nguyen
2013-07-08 16:36   ` [PATCH 05/22] read-cache: add index reading api Junio C Hamano
2013-07-08 20:10     ` Thomas Gummerer
2013-07-08 23:09       ` Junio C Hamano
2013-07-09 20:13         ` Thomas Gummerer
2013-07-07  8:11 ` [PATCH 06/22] make sure partially read index is not changed Thomas Gummerer
2013-07-08 16:31   ` Junio C Hamano
2013-07-08 18:33     ` Thomas Gummerer
2013-07-07  8:11 ` [PATCH 07/22] dir.c: use index api Thomas Gummerer
2013-07-07  8:11 ` [PATCH 08/22] tree.c: " Thomas Gummerer
2013-07-07  8:11 ` [PATCH 09/22] name-hash.c: " Thomas Gummerer
2013-07-07  8:11 ` [PATCH 10/22] grep.c: Use " Thomas Gummerer
2013-07-07  8:11 ` [PATCH 11/22] ls-files.c: use the " Thomas Gummerer
2013-07-07  8:11 ` [PATCH 12/22] read-cache: make read_blob_data_from_index use " Thomas Gummerer
2013-07-07  8:11 ` [PATCH 13/22] documentation: add documentation of the index-v5 file format Thomas Gummerer
2013-07-11 10:39   ` Duy Nguyen
2013-07-11 11:39     ` Thomas Gummerer
2013-07-11 11:47       ` Duy Nguyen
2013-07-11 12:26         ` Thomas Gummerer
2013-07-11 12:50           ` Duy Nguyen
2013-07-07  8:11 ` [PATCH 14/22] read-cache: make in-memory format aware of stat_crc Thomas Gummerer
2013-07-07  8:11 ` [PATCH 15/22] read-cache: read index-v5 Thomas Gummerer
2013-07-07 20:18   ` Eric Sunshine
2013-07-08 11:40     ` Thomas Gummerer
2013-07-07  8:11 ` [PATCH 16/22] read-cache: read resolve-undo data Thomas Gummerer
2013-07-07  8:11 ` [PATCH 17/22] read-cache: read cache-tree in index-v5 Thomas Gummerer
2013-07-07 20:41   ` Eric Sunshine
2013-07-07  8:11 ` [PATCH 18/22] read-cache: write index-v5 Thomas Gummerer
2013-07-07 20:43   ` Eric Sunshine
2013-07-07  8:11 ` [PATCH 19/22] read-cache: write index-v5 cache-tree data Thomas Gummerer
2013-07-07  8:11 ` [PATCH 20/22] read-cache: write resolve-undo data for index-v5 Thomas Gummerer
2013-07-07  8:11 ` [PATCH 21/22] update-index.c: rewrite index when index-version is given Thomas Gummerer
2013-07-07  8:12 ` [PATCH 22/22] p0003-index.sh: add perf test for the index formats Thomas Gummerer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CACsJy8A9+1O_em=FtV_TUKags4FrSggV76dd1h6gsJ+JHfjZKw@mail.gmail.com' \
    --to=pclouds@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=mhagger@alum.mit.edu \
    --cc=robin.rosenberg@dewire.com \
    --cc=t.gummerer@gmail.com \
    --cc=trast@inf.ethz.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).