From: Patrick Steinhardt <ps@pks.im>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 2/6] dir-iterator: support iteration in sorted order
Date: Tue, 20 Feb 2024 09:34:09 +0100 [thread overview]
Message-ID: <ZdRkAe5zajmGb95q@tanuki> (raw)
In-Reply-To: <xmqq8r3g10tf.fsf@gitster.g>
[-- Attachment #1: Type: text/plain, Size: 9982 bytes --]
On Mon, Feb 19, 2024 at 03:39:08PM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
> > The `struct dir_iterator` is a helper that allows us to iterate through
> > directory entries. This iterator returns entries in the exact same order
> > as readdir(3P) does -- or in other words, it guarantees no specific
> > order at all.
> >
> > This is about to become problematic as we are introducing a new reflog
> > subcommand to list reflogs. As the "files" backend uses the directory
> > iterator to enumerate reflogs, returning reflog names and exposing them
> > to the user would inherit the indeterministic ordering. Naturally, it
> > would make for a terrible user interface to show a list with no
> > discernible order. While this could be handled at a higher level by the
> > new subcommand itself by collecting and ordering the reflogs, this would
> > be inefficient and introduce latency when there are many reflogs.
>
> I do not quite understand this argument. Why is sorting at higher
> level less (or more, for that matter) efficient than doing so at
> lower level? We'd need to sort somewhere no matter what, and I of
> course have no problem in listing in a deterministic order.
By sorting at a lower level we only need to sort the respective
directory entries and can then return them without having to recurse
into all subdirectories yet. Sorting at a higher level would require us
to first collect _all_ reflogs and then sort them.
Will rephrase a bit.
> > Instead, introduce a new option into the directory iterator that asks
> > for its entries to be yielded in lexicographical order. If set, the
> > iterator will read all directory entries greedily end sort them before
> > we start to iterate over them.
>
> "end" -> "and". And of course without such sorting option, this
> codepath is allowed to yield entries in any order that is the
> easiest to produce? That makes sense.
>
> > While this will of course also incur overhead as we cannot yield the
> > directory entries immediately, it should at least be more efficient than
> > having to sort the complete list of reflogs as we only need to sort one
> > directory at a time.
>
> True. The initial latency before we see the first byte of the
> output often matters more in perceived performance the throughput.
> As we need to sort to give a reasonable output, that cannot be
> avoided.
>
> > This functionality will be used in a follow-up commit.
> >
> > Signed-off-by: Patrick Steinhardt <ps@pks.im>
> > ---
> > dir-iterator.c | 87 ++++++++++++++++++++++++++++++++++++++++----------
> > dir-iterator.h | 3 ++
> > 2 files changed, 73 insertions(+), 17 deletions(-)
> >
> > diff --git a/dir-iterator.c b/dir-iterator.c
> > index f58a97e089..396c28178f 100644
> > --- a/dir-iterator.c
> > +++ b/dir-iterator.c
> > @@ -2,9 +2,12 @@
> > #include "dir.h"
> > #include "iterator.h"
> > #include "dir-iterator.h"
> > +#include "string-list.h"
> >
> > struct dir_iterator_level {
> > DIR *dir;
> > + struct string_list entries;
> > + size_t entries_idx;
>
> Does it deserve a comment that "dir == NULL" is used as a signal
> that we have read the level and sorted its contents into the
> "entries" list (and also we have already called closedir(), of
> course)?
Yeah, probably.
> > @@ -72,6 +75,40 @@ static int push_level(struct dir_iterator_int *iter)
> > return -1;
> > }
> >
> > + string_list_init_dup(&level->entries);
> > + level->entries_idx = 0;
> > +
> > + /*
> > + * When the iterator is sorted we read and sort all directory entries
> > + * directly.
> > + */
> > + if (iter->flags & DIR_ITERATOR_SORTED) {
> > + while (1) {
> > + struct dirent *de;
> > +
> > + errno = 0;
> > + de = readdir(level->dir);
> > + if (!de) {
> > + if (errno && errno != ENOENT) {
> > + warning_errno("error reading directory '%s'",
> > + iter->base.path.buf);
> > + return -1;
> > + }
> > +
> > + break;
> > + }
> > +
> > + if (is_dot_or_dotdot(de->d_name))
> > + continue;
>
> The condition to skip an entry currently is simple enough that "."
> and ".." are the only ones that are skipped, but it must be kept in
> sync with the condition in dir_iterator_advance().
>
> If it becomes more complex than it is now (e.g., we may start to
> skip any name that begins with a dot, like ".git" or ".dummy"), it
> probably is a good idea *not* to add the same filtering logic here
> and in dir_iterator_advance(). Instead, keep the filtering here to
> an absolute minumum, and filter the name, whether it came from
> readdir() or from the .entries string list, in a single copy of
> filtering logic in dir_iterator_advance() function.
>
> We could drop the dot-or-dotdot filter here, too, if we want to
> ensure that unified filtering will be correctly done over there.
Fair point. As you mention further down below, there are two ways to
approach it:
- Just filter at the later stage and accept that we'll allocate memory
for entries that are about to be discarded.
- Create a function `should_include_entry()` that gets called at both
code sites.
I don't think the allocation overhead should matter much, but neither
does it hurt to create a common `should_include_entry()` function. And
as both are trivial to implement I rather lean towards the more
efficient variant, even though the efficiency gain should be negligible.
> > + string_list_append(&level->entries, de->d_name);
> > + }
> > + string_list_sort(&level->entries);
> > +
> > + closedir(level->dir);
> > + level->dir = NULL;
> > + }
> > +
> > return 0;
> > }
> >
> > @@ -88,6 +125,7 @@ static int pop_level(struct dir_iterator_int *iter)
> > warning_errno("error closing directory '%s'",
> > iter->base.path.buf);
> > level->dir = NULL;
> > + string_list_clear(&level->entries, 0);
> >
> > return --iter->levels_nr;
> > }
>
> It is somewhat interesting that the original code already has
> conditional call to closedir() and prepares .dir to be NULL,
> so that we do not have to make it conditional here.
>
> > @@ -136,30 +174,43 @@ int dir_iterator_advance(struct dir_iterator *dir_iterator)
> >
> > /* Loop until we find an entry that we can give back to the caller. */
> > while (1) {
> > - struct dirent *de;
> > struct dir_iterator_level *level =
> > &iter->levels[iter->levels_nr - 1];
> > + struct dirent *de;
> > + const char *name;
>
> Not a huge deal but this is an unnecessary reordering, right?
Right.
> > strbuf_setlen(&iter->base.path, level->prefix_len);
> > +
> > + if (level->dir) {
> > + errno = 0;
> > + de = readdir(level->dir);
> > + if (!de) {
> > + if (errno) {
> > + warning_errno("error reading directory '%s'",
> > + iter->base.path.buf);
> > + if (iter->flags & DIR_ITERATOR_PEDANTIC)
> > + goto error_out;
> > + } else if (pop_level(iter) == 0) {
> > + return dir_iterator_abort(dir_iterator);
> > + }
> > + continue;
> > }
> >
> > + if (is_dot_or_dotdot(de->d_name))
> > + continue;
>
> This is the target of the "if we will end up filtering even more in
> the future, it would probably be a good idea not to duplicate the
> logic to decide what gets filtered in this function and in
> push_level()" comment. If we wanted to go that route, we can get
> rid of the filtering from push_level(), and move this filter code
> outside this if/else before calling prepare_next_entry_data().
>
> The fact that .entries.nr represents the number of entries that are
> shown is unusable (because there is an unsorted codepath that does
> not even populate .entries), so I am not worried about correctness
> gotchas caused by including names in .entries to be filtered out.
> But an obvious downside is that the size of the list to be sorted
> will become larger.
>
> Or we could introduce a shared helper function that takes a name and
> decides if it is to be included, and replace the is_dot_or_dotdot()
> call here and in the push_level() with calls to that helper.
>
> In any case, that is primarily a maintainability issue. The code
> posted as-is is correct.
Yeah, let's use the proposed helper function. In fact, I think we can
share even more code than merely the filtering part: the errno handling
is a bit special, and the warning is the same across both code sites,
too.
Patrick
> > + name = de->d_name;
> > + } else {
> > + if (level->entries_idx >= level->entries.nr) {
> > + if (pop_level(iter) == 0)
> > + return dir_iterator_abort(dir_iterator);
> > + continue;
> > + }
> > +
> > + name = level->entries.items[level->entries_idx++].string;
> > + }
> > +
> > + if (prepare_next_entry_data(iter, name)) {
> > if (errno != ENOENT && iter->flags & DIR_ITERATOR_PEDANTIC)
> > goto error_out;
> > continue;
> > @@ -188,6 +239,8 @@ int dir_iterator_abort(struct dir_iterator *dir_iterator)
> > warning_errno("error closing directory '%s'",
> > iter->base.path.buf);
> > }
> > +
> > + string_list_clear(&level->entries, 0);
> > }
> >
> > free(iter->levels);
> > diff --git a/dir-iterator.h b/dir-iterator.h
> > index 479e1ec784..6d438809b6 100644
> > --- a/dir-iterator.h
> > +++ b/dir-iterator.h
> > @@ -54,8 +54,11 @@
> > * and ITER_ERROR is returned immediately. In both cases, a meaningful
> > * warning is emitted. Note: ENOENT errors are always ignored so that
> > * the API users may remove files during iteration.
> > + *
> > + * - DIR_ITERATOR_SORTED: sort directory entries alphabetically.
> > */
> > #define DIR_ITERATOR_PEDANTIC (1 << 0)
> > +#define DIR_ITERATOR_SORTED (1 << 1)
> >
> > struct dir_iterator {
> > /* The current path: */
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
next prev parent reply other threads:[~2024-02-20 8:34 UTC|newest]
Thread overview: 39+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-19 14:35 [PATCH 0/6] reflog: introduce subcommand to list reflogs Patrick Steinhardt
2024-02-19 14:35 ` [PATCH 1/6] dir-iterator: pass name to `prepare_next_entry_data()` directly Patrick Steinhardt
2024-02-19 14:35 ` [PATCH 2/6] dir-iterator: support iteration in sorted order Patrick Steinhardt
2024-02-19 23:39 ` Junio C Hamano
2024-02-20 8:34 ` Patrick Steinhardt [this message]
2024-02-19 14:35 ` [PATCH 3/6] refs/files: sort reflogs returned by the reflog iterator Patrick Steinhardt
2024-02-20 0:04 ` Junio C Hamano
2024-02-20 8:34 ` Patrick Steinhardt
2024-02-19 14:35 ` [PATCH 4/6] refs: drop unused params from the reflog iterator callback Patrick Steinhardt
2024-02-20 0:14 ` Junio C Hamano
2024-02-20 8:34 ` Patrick Steinhardt
2024-02-19 14:35 ` [PATCH 5/6] refs: stop resolving ref corresponding to reflogs Patrick Steinhardt
2024-02-20 0:14 ` Junio C Hamano
2024-02-20 8:34 ` Patrick Steinhardt
2024-02-19 14:35 ` [PATCH 6/6] builtin/reflog: introduce subcommand to list reflogs Patrick Steinhardt
2024-02-20 0:32 ` Junio C Hamano
2024-02-20 8:34 ` Patrick Steinhardt
2024-02-20 9:06 ` [PATCH v2 0/7] reflog: " Patrick Steinhardt
2024-02-20 9:06 ` [PATCH v2 1/7] dir-iterator: pass name to `prepare_next_entry_data()` directly Patrick Steinhardt
2024-02-20 9:06 ` [PATCH v2 2/7] dir-iterator: support iteration in sorted order Patrick Steinhardt
2024-02-20 9:06 ` [PATCH v2 3/7] refs/files: sort reflogs returned by the reflog iterator Patrick Steinhardt
2024-02-20 9:06 ` [PATCH v2 4/7] refs: always treat iterators as ordered Patrick Steinhardt
2024-02-20 9:06 ` [PATCH v2 5/7] refs: drop unused params from the reflog iterator callback Patrick Steinhardt
2024-02-20 9:06 ` [PATCH v2 6/7] refs: stop resolving ref corresponding to reflogs Patrick Steinhardt
2024-02-20 9:06 ` [PATCH v2 7/7] builtin/reflog: introduce subcommand to list reflogs Patrick Steinhardt
2024-04-24 7:30 ` Teng Long
2024-04-24 8:01 ` Patrick Steinhardt
2024-04-24 14:53 ` Junio C Hamano
2024-02-20 17:22 ` [PATCH v2 0/7] reflog: " Junio C Hamano
2024-02-21 11:48 ` Patrick Steinhardt
2024-02-21 12:37 ` [PATCH v3 0/8] " Patrick Steinhardt
2024-02-21 12:37 ` [PATCH v3 1/8] dir-iterator: pass name to `prepare_next_entry_data()` directly Patrick Steinhardt
2024-02-21 12:37 ` [PATCH v3 2/8] dir-iterator: support iteration in sorted order Patrick Steinhardt
2024-02-21 12:37 ` [PATCH v3 3/8] refs/files: sort reflogs returned by the reflog iterator Patrick Steinhardt
2024-02-21 12:37 ` [PATCH v3 4/8] refs/files: sort merged worktree and common reflogs Patrick Steinhardt
2024-02-21 12:37 ` [PATCH v3 5/8] refs: always treat iterators as ordered Patrick Steinhardt
2024-02-21 12:37 ` [PATCH v3 6/8] refs: drop unused params from the reflog iterator callback Patrick Steinhardt
2024-02-21 12:37 ` [PATCH v3 7/8] refs: stop resolving ref corresponding to reflogs Patrick Steinhardt
2024-02-21 12:37 ` [PATCH v3 8/8] builtin/reflog: introduce subcommand to list reflogs Patrick Steinhardt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZdRkAe5zajmGb95q@tanuki \
--to=ps@pks.im \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).