git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Thomas Gummerer <t.gummerer@gmail.com>
To: Duy Nguyen <pclouds@gmail.com>
Cc: Git Mailing List <git@vger.kernel.org>,
	Thomas Rast <trast@inf.ethz.ch>,
	Michael Haggerty <mhagger@alum.mit.edu>,
	Junio C Hamano <gitster@pobox.com>,
	Robin Rosenberg <robin.rosenberg@dewire.com>,
	Eric Sunshine <sunshine@sunshineco.com>
Subject: Re: [PATCH v2 12/19] read-cache: read index-v5
Date: Wed, 07 Aug 2013 10:23:50 +0200	[thread overview]
Message-ID: <87k3jxrjs9.fsf@gmail.com> (raw)
In-Reply-To: <CACsJy8DDRN5aTOzxuiGSa6UjbBiVUYwHjPcnN+8mw4xKmvkx-A@mail.gmail.com>

Duy Nguyen <pclouds@gmail.com> writes:

> A little bit more..
>
> On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> +static void ce_queue_push(struct cache_entry **head,
>> +                            struct cache_entry **tail,
>> +                            struct cache_entry *ce)
>> +{
>> +       if (!*head) {
>> +               *head = *tail = ce;
>> +               (*tail)->next = NULL;
>> +               return;
>> +       }
>> +
>> +       (*tail)->next = ce;
>> +       ce->next = NULL;
>> +       *tail = (*tail)->next;
>
> No no no. "next" is for name-hash.c don't "reuse" it here. And I don't
> think you really need to maintain a linked list of cache_entries by
> the way. More on read_entries comments below..

You're right, I don't need it for the reading code.  I do need to keep a
list of cache-entries for writing though, where a linked list is best
suited for the task.  I'll use a next_ce pointer for that.

>> +}
>> +
>> +static struct cache_entry *ce_queue_pop(struct cache_entry **head)
>> +{
>> +       struct cache_entry *ce;
>> +
>> +       ce = *head;
>> +       *head = (*head)->next;
>> +       return ce;
>> +}
>
> Same as ce_queue_push.
>
>> +static int read_entries(struct index_state *istate, struct directory_entry **de,
>> +                       unsigned int *entry_offset, void **mmap,
>> +                       unsigned long mmap_size, unsigned int *nr,
>> +                       unsigned int *foffsetblock)
>> +{
>> +       struct cache_entry *head = NULL, *tail = NULL;
>> +       struct conflict_entry *conflict_queue;
>> +       struct cache_entry *ce;
>> +       int i;
>> +
>> +       conflict_queue = NULL;
>> +       if (read_conflicts(&conflict_queue, *de, mmap, mmap_size) < 0)
>> +               return -1;
>> +       for (i = 0; i < (*de)->de_nfiles; i++) {
>> +               if (read_entry(&ce,
>> +                              *de,
>> +                              entry_offset,
>> +                              mmap,
>> +                              mmap_size,
>> +                              foffsetblock) < 0)
>> +                       return -1;
>> +               ce_queue_push(&head, &tail, ce);
>> +               *foffsetblock += 4;
>> +
>> +               /*
>> +                * Add the conflicted entries at the end of the index file
>> +                * to the in memory format
>> +                */
>> +               if (conflict_queue &&
>> +                   (conflict_queue->entries->flags & CONFLICT_CONFLICTED) != 0 &&
>> +                   !cache_name_compare(conflict_queue->name, conflict_queue->namelen,
>> +                                       ce->name, ce_namelen(ce))) {
>> +                       struct conflict_part *cp;
>> +                       cp = conflict_queue->entries;
>> +                       cp = cp->next;
>> +                       while (cp) {
>> +                               ce = convert_conflict_part(cp,
>> +                                               conflict_queue->name,
>> +                                               conflict_queue->namelen);
>> +                               ce_queue_push(&head, &tail, ce);
>> +                               conflict_part_head_remove(&cp);
>> +                       }
>> +                       conflict_entry_head_remove(&conflict_queue);
>> +               }
>
> I start to wonder if separating staged entries is a good idea. It
> seems to make the code more complicated. The good point about conflict
> section at the end of the file is you can just truncate() it out.
> Another way is putting staged entries in fileentries, sorted
> alphabetically then by stage number, and a flag indicating if the
> entry is valid. When you remove resolve an entry, just set the flag to
> invalid (partial write), so that read code will skip it.
>
> I think this approach is reasonably cheap (unless there are a lot of
> conflicts) and it simplifies this piece of code. truncate() may be
> overrated anyway. In my experience, I "git add <path>" as soon as I
> resolve <path> (so that "git diff" shrinks). One entry at a time, one
> index write at a time. I don't think I ever resolve everything then
> "git add -u .", which is where truncate() shines because staged
> entries are removed all at once. We should optimize for one file
> resolution at a time, imo.
>
>> +       }
>> +
>> +       *de = (*de)->next;
>> +
>> +       while (head) {
>> +               if (*de != NULL
>> +                   && strcmp(head->name, (*de)->pathname) > 0) {
>> +                       read_entries(istate,
>> +                                    de,
>> +                                    entry_offset,
>> +                                    mmap,
>> +                                    mmap_size,
>> +                                    nr,
>> +                                    foffsetblock);
>> +               } else {
>> +                       ce = ce_queue_pop(&head);
>> +                       set_index_entry(istate, *nr, ce);
>> +                       (*nr)++;
>> +                       ce->next = NULL;
>> +               }
>
> In this loop, you do some sort of merge sort combining a list of
> sorted files and a list of sorted directories (which will be expanded
> to bunches of files by the recursive read_entries). strcmp() does two
> things. Assume we're in directory 'a', it makes sure that subdirectory
> 'a/b' is only inserted after file 'a/a' is inserted to maintain index
> sort order. It also makes sure that 'de' of an outside directory 'b'
> will not be expanded here. strcmp will be called for every file,
> checking _full_ path. Sounds expensive.
>
> If you maintain two pointers first_subdir and next_sibling in "de" as
> in my previous mail, you know "de" 'b' is out without string
> comparison (de->next_sibling would be NULL). With that, the purpose of
> strcmp is simply making sure that 'a/b' is inserted after 'a/a'.
> Because we know all possible "de" is a subdirectory of 'a', we don't
> need to compare the prefix 'a/'. We still need to strcmp on every
> file, but we only need to compare the file name, not the leading
> directory anymore. Cheaper.
>
> Going further, I wonder if we could eliminate strcmp completely. We
> could store dummy entries in fileentries to mark the positions of the
> subdirectories. When we encounter a dummy entry, we call
> read_entries() instead of set_index_entry.

I've tried implementing both versions with the internal tree structure,
see below for timings.

simpleapi is the code that was posted to the list, HEAD uses a
tree-structure for the directories internally, and replaces strcmp with
cache_name_compare and leaves out the prefix.  This tree uses dummy
entries for the sub-directories.

Dummy entries are single NUL-bytes mixed with the cache-entries, which
should give optimal performance.  I haven't thought much about
corruption of the index, but I don't think we would need any additional
crc checksums or similar.

The performance advantage seems very slim to none when using the dummy
entries to avoid the comparison though, so I don't think it makes sense
to make the index more complicated for a very small speed-up.

Test                                        simpleapi         HEAD                     this tree             
-------------------------------------------------------------------------------------------------------------
0003.2: v[23]: update-index                 3.22(2.61+0.58)   3.20(2.56+0.61) -0.6%    3.22(2.67+0.53) +0.0% 
0003.3: v[23]: grep nonexistent -- subdir   1.65(1.36+0.27)   1.65(1.34+0.30) +0.0%    1.67(1.34+0.31) +1.2% 
0003.4: v[23]: ls-files -- subdir           1.50(1.20+0.29)   1.54(1.18+0.34) +2.7%    1.53(1.22+0.30) +2.0% 
0003.6: v4: update-index                    2.69(2.28+0.39)   2.70(2.21+0.47) +0.4%    2.70(2.27+0.41) +0.4% 
0003.7: v4: grep nonexistent -- subdir      1.33(0.98+0.34)   1.36(1.01+0.33) +2.3%    1.34(1.03+0.30) +0.8% 
0003.8: v4: ls-files -- subdir              1.21(0.86+0.33)   1.23(0.91+0.30) +1.7%    1.22(0.90+0.30) +0.8% 
0003.10: v5: update-index                   2.12(1.58+0.52)   2.20(1.67+0.52) +3.8%    2.19(1.66+0.51) +3.3% 
0003.11: v5: ls-files                       1.57(1.21+0.34)   1.55(1.20+0.33) -1.3%    1.52(1.21+0.29) -3.2% 
0003.12: v5: grep nonexistent -- subdir     0.08(0.06+0.01)   0.07(0.04+0.02) -12.5%   0.07(0.04+0.02) -12.5%
0003.13: v5: ls-files -- subdir             0.07(0.04+0.01)   0.06(0.05+0.00) -14.3%   0.07(0.05+0.01) +0.0% 


> If you merge the "read_entry" for loop in this while loop, you don't
> need to maintain a linked list of ce to pop out one by one here.

Makes sense, I'll do that.

>> +       }
>> +       return 0;
>> +}
>> +
> --
> Duy

  parent reply	other threads:[~2013-08-07  8:24 UTC|newest]

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 01/19] t2104: Don't fail for index versions other than [23] Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 02/19] read-cache: split index file version specific functionality Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 03/19] read-cache: move index v2 specific functions to their own file Thomas Gummerer
2013-07-14  3:10   ` Duy Nguyen
2013-07-19 14:53     ` Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 04/19] read-cache: Re-read index if index file changed Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 05/19] Add documentation for the index api Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 06/19] read-cache: add index reading api Thomas Gummerer
2013-07-14  3:21   ` Duy Nguyen
2013-07-12 17:26 ` [PATCH v2 07/19] make sure partially read index is not changed Thomas Gummerer
2013-07-14  3:29   ` Duy Nguyen
2013-07-17 12:56     ` Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 08/19] grep.c: Use index api Thomas Gummerer
2013-07-14  3:32   ` Duy Nguyen
2013-07-15  9:51     ` Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 09/19] ls-files.c: use " Thomas Gummerer
2013-07-14  3:39   ` Duy Nguyen
2013-07-17  8:07     ` Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 10/19] documentation: add documentation of the index-v5 file format Thomas Gummerer
2013-07-14  3:59   ` Duy Nguyen
2013-07-17  8:09     ` Thomas Gummerer
2013-08-04 11:26   ` Duy Nguyen
2013-08-04 17:58     ` Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 11/19] read-cache: make in-memory format aware of stat_crc Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 12/19] read-cache: read index-v5 Thomas Gummerer
2013-07-14  4:42   ` Duy Nguyen
2013-08-07  8:13     ` Thomas Gummerer
2013-07-15 10:12   ` Duy Nguyen
2013-07-17  8:11     ` Thomas Gummerer
2013-08-08  2:00       ` Duy Nguyen
2013-08-08 13:28         ` Thomas Gummerer
2013-08-09 13:10         ` Thomas Gummerer
2013-08-07  8:23     ` Thomas Gummerer [this message]
2013-08-08  2:09       ` Duy Nguyen
2013-07-12 17:26 ` [PATCH v2 13/19] read-cache: read resolve-undo data Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 14/19] read-cache: read cache-tree in index-v5 Thomas Gummerer
2013-07-12 17:27 ` [PATCH v2 15/19] read-cache: write index-v5 Thomas Gummerer
2013-07-12 17:27 ` [PATCH v2 16/19] read-cache: write index-v5 cache-tree data Thomas Gummerer
2013-07-12 17:27 ` [PATCH v2 17/19] read-cache: write resolve-undo data for index-v5 Thomas Gummerer
2013-07-12 17:27 ` [PATCH v2 18/19] update-index.c: rewrite index when index-version is given Thomas Gummerer
2013-07-12 17:27 ` [PATCH v2 19/19] p0003-index.sh: add perf test for the index formats Thomas Gummerer
2013-07-14  2:59 ` [PATCH v2 00/19] Index-v5 Duy Nguyen
2013-07-15  9:30   ` Thomas Gummerer
2013-07-15  9:38     ` Duy Nguyen
2013-07-17  8:12       ` Thomas Gummerer
2013-07-17 23:58         ` Junio C Hamano
2013-07-19 17:37           ` Thomas Gummerer
2013-07-19 18:25             ` Junio C Hamano
2013-07-16 21:03 ` Ramsay Jones
2013-07-17  8:04   ` Thomas Gummerer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87k3jxrjs9.fsf@gmail.com \
    --to=t.gummerer@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=mhagger@alum.mit.edu \
    --cc=pclouds@gmail.com \
    --cc=robin.rosenberg@dewire.com \
    --cc=sunshine@sunshineco.com \
    --cc=trast@inf.ethz.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).