git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* RFC: New reference iteration paradigm
@ 2016-03-31 16:13 Michael Haggerty
  2016-03-31 18:01 ` Junio C Hamano
  2016-03-31 20:15 ` David Turner
  0 siblings, 2 replies; 7+ messages in thread
From: Michael Haggerty @ 2016-03-31 16:13 UTC (permalink / raw)
  To: git discussion list; +Cc: David Turner, Junio C Hamano, Jeff King

Currently the way to iterate over references is via a family of
for_each_ref()-style functions. You pass some arguments plus a callback
function and cb_data to the function, and your callback is called for
each reference that is selected.

This works, but it has two big disadvantages:

1. It is cumbersome for callers. The caller's logic has to be split
   into two functions, the one that calls for_each_ref() and the
   callback function. Any data that have to be passed between the
   functions has to be stuck in a separate data structure.

2. This interface is not composable. For example, you can't write a
   single function that iterates over references from two sources,
   as is interesting for combining packed plus loose references,
   shared plus worktree-specific references, symbolic plus normal
   references, etc. The current code for combining packed and loose
   references needs to walk the two reference trees in lockstep,
   using intimate knowledge about how references are stored [1,2,3].

I'm currently working on a patch series to transition the reference code
from using for_each_ref()-style iteration to using proper iterators.

The main point of this change is to change the base iteration paradigm
that has to be supported by reference backends. So instead of

> int do_for_each_ref_fn(const char *submodule, const char *base,
>                        each_ref_fn fn, int trim, int flags,
>                        void *cb_data);

the backend now has to implement

> struct ref_iterator *ref_iterator_begin_fn(const char *submodule,
>                                            const char *prefix,
>                                            unsigned int flags);

The ref_iterator itself has to implement two main methods:

> int iterator_advance_fn(struct ref_iterator *ref_iterator);
> void iterator_free_fn(struct ref_iterator *ref_iterator);

A loop over references now looks something like

> struct ref_iterator *iter = each_ref_in_iterator("refs/tags/");
> while (ref_iterator_advance(iter)) {
>         /* code using iter->refname, iter->oid, iter->flags */
> }

I built quite a bit of ref_iterator infrastructure to make it easy to
plug things together quite flexibly. For example, there is an
overlay_ref_iterator which takes two other iterators (e.g., one for
packed and one for loose refs) and overlays them, presenting the result
via the same iterator interface. But the same overlay_ref_iterator can
be used to overlay any two other iterators on top of each other.

If you are interested, check out my branch wip/ref-iterators on my
GitHub repo [4]. That branch is based off of a version of David Turner's
patch series (i.e., it will have to be rebased at some point). But it
all works and the early part of the patch series is pretty well polished
I think. In fact, the later patches are optional; there is no special
reason to rewrite client code wholesale to use the new reference
iteration API, because the old API continues to be supported (but is now
built on the new API).

Feedback is welcome!

Michael

[1]
https://github.com/git/git/blob/90f7b16b3adc78d4bbabbd426fb69aa78c714f71/refs/files-backend.c#L1665-L1719
[2]
https://github.com/git/git/blob/90f7b16b3adc78d4bbabbd426fb69aa78c714f71/refs/files-backend.c#L582-L608
[3]
https://github.com/git/git/blob/90f7b16b3adc78d4bbabbd426fb69aa78c714f71/refs/files-backend.c#L610-L680
[4] https://github.com/mhagger/git

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RFC: New reference iteration paradigm
  2016-03-31 16:13 RFC: New reference iteration paradigm Michael Haggerty
@ 2016-03-31 18:01 ` Junio C Hamano
  2016-03-31 19:31   ` Jeff King
  2016-03-31 20:15 ` David Turner
  1 sibling, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2016-03-31 18:01 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git discussion list, David Turner, Jeff King

Michael Haggerty <mhagger@alum.mit.edu> writes:

> the backend now has to implement
>
>> struct ref_iterator *ref_iterator_begin_fn(const char *submodule,
>>                                            const char *prefix,
>>                                            unsigned int flags);
>
> The ref_iterator itself has to implement two main methods:
>
>> int iterator_advance_fn(struct ref_iterator *ref_iterator);
>> void iterator_free_fn(struct ref_iterator *ref_iterator);
>
> A loop over references now looks something like
>
>> struct ref_iterator *iter = each_ref_in_iterator("refs/tags/");
>> while (ref_iterator_advance(iter)) {
>>         /* code using iter->refname, iter->oid, iter->flags */
>> }

We'd want to take advantage of the tree-like organization of the
refs (i.e. refs/tags/a and refs/tags/b sit next to each other and
they are closer to each other than they are to refs/heads/a) so that
a request "I want to iterate only over tags, even though I may have
millions of other kinds of refs" can be done with cost that is
proportional to how many tags you have.

The current implementation of for_each_tag_ref() that goes down to
do_for_each_entry() in files-backend.c has that propertly, and the
new iteration mechanism with the above design seems to keep it,
which is very nice.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RFC: New reference iteration paradigm
  2016-03-31 18:01 ` Junio C Hamano
@ 2016-03-31 19:31   ` Jeff King
  2016-03-31 20:08     ` Junio C Hamano
  2018-05-26 17:25     ` Jakub Narebski
  0 siblings, 2 replies; 7+ messages in thread
From: Jeff King @ 2016-03-31 19:31 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Michael Haggerty, git discussion list, David Turner

On Thu, Mar 31, 2016 at 11:01:44AM -0700, Junio C Hamano wrote:

> Michael Haggerty <mhagger@alum.mit.edu> writes:
> 
> > the backend now has to implement
> >
> >> struct ref_iterator *ref_iterator_begin_fn(const char *submodule,
> >>                                            const char *prefix,
> >>                                            unsigned int flags);
> >
> > The ref_iterator itself has to implement two main methods:
> >
> >> int iterator_advance_fn(struct ref_iterator *ref_iterator);
> >> void iterator_free_fn(struct ref_iterator *ref_iterator);
> >
> > A loop over references now looks something like
> >
> >> struct ref_iterator *iter = each_ref_in_iterator("refs/tags/");
> >> while (ref_iterator_advance(iter)) {
> >>         /* code using iter->refname, iter->oid, iter->flags */
> >> }
> 
> We'd want to take advantage of the tree-like organization of the
> refs (i.e. refs/tags/a and refs/tags/b sit next to each other and
> they are closer to each other than they are to refs/heads/a) so that
> a request "I want to iterate only over tags, even though I may have
> millions of other kinds of refs" can be done with cost that is
> proportional to how many tags you have.
> 
> The current implementation of for_each_tag_ref() that goes down to
> do_for_each_entry() in files-backend.c has that propertly, and the
> new iteration mechanism with the above design seems to keep it,
> which is very nice.

Actually, that is a slight fiction. :)

We traverse only the loose ref directories we need, but we populate the
entire packed-refs tree in one go. That's fine if you have a reasonable
number of refs and care about the cold-cache case (where you care a lot
about hitting directories you don't need to, but reading the entire
packed-refs file isn't a big deal). But it's really bad if you have an
800MB packed-refs file, as looking up one tiny subset of the entries
wastes a lot of RAM and CPU pulling that into our internal
representation[1].

At one point I wrote a patch to binary search the packed-refs file, find
the first "refs/tags/" entry, and then walk linearly through there. What
stopped me is that the current refs.c code (I guess file-backend.c these
days) was not happy with me partially filling in the ref_dir structs in
this "inside out" way.

That is, they assume we may have iterated "refs/", but not yet dove into
"refs/tags/" (because that's what makes sense for traversing loose
refs). But they are not happy if you already dove into "refs/tags/", but
don't know what else might be present in "refs/".

I wonder if, while Michael is fiddling with the iterator code, it might
be possible to fix that (it's not a fundamental problem, just one with
the way the ref-caching works right now).

-Peff

[1] As you probably guessed, I run into these giant packed-refs files as
    part of our "alternates" storage for multiple forks of the same
    repository. So all of the refs match the pattern
    "refs/remotes/$fork_id/*", and I really would like to be able to
    iterate the refs for just one fork at a time.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RFC: New reference iteration paradigm
  2016-03-31 19:31   ` Jeff King
@ 2016-03-31 20:08     ` Junio C Hamano
  2018-05-26 17:25     ` Jakub Narebski
  1 sibling, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2016-03-31 20:08 UTC (permalink / raw)
  To: Jeff King; +Cc: Michael Haggerty, git discussion list, David Turner

Jeff King <peff@peff.net> writes:

> On Thu, Mar 31, 2016 at 11:01:44AM -0700, Junio C Hamano wrote:
>
>> Michael Haggerty <mhagger@alum.mit.edu> writes:
>> 
>> > the backend now has to implement
>> >
>> >> struct ref_iterator *ref_iterator_begin_fn(const char *submodule,
>> >>                                            const char *prefix,
>> >>                                            unsigned int flags);
>> >
>> > The ref_iterator itself has to implement two main methods:
>> >
>> >> int iterator_advance_fn(struct ref_iterator *ref_iterator);
>> >> void iterator_free_fn(struct ref_iterator *ref_iterator);
>> >
>> > A loop over references now looks something like
>> >
>> >> struct ref_iterator *iter = each_ref_in_iterator("refs/tags/");
>> >> while (ref_iterator_advance(iter)) {
>> >>         /* code using iter->refname, iter->oid, iter->flags */
>> >> }
>> 
>> We'd want to take advantage of the tree-like organization of the
>> refs (i.e. refs/tags/a and refs/tags/b sit next to each other and
>> they are closer to each other than they are to refs/heads/a) so that
>> a request "I want to iterate only over tags, even though I may have
>> millions of other kinds of refs" can be done with cost that is
>> proportional to how many tags you have.
>> 
>> The current implementation of for_each_tag_ref() that goes down to
>> do_for_each_entry() in files-backend.c has that propertly, and the
>> new iteration mechanism with the above design seems to keep it,
>> which is very nice.
>
> Actually, that is a slight fiction. :)

I know.  My first draft had "(at least for the loose ref side)"
there, but I omitted it for brevity.

> We traverse only the loose ref directories we need, but we populate the
> entire packed-refs tree in one go.
> ...
> 800MB packed-refs file, as looking up one tiny subset of the entries
> wastes a lot of RAM and CPU pulling that into our internal
> representation[1].

Yes, that is an important use case that needs to be kept in mind for
any restructure of this machinery.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RFC: New reference iteration paradigm
  2016-03-31 16:13 RFC: New reference iteration paradigm Michael Haggerty
  2016-03-31 18:01 ` Junio C Hamano
@ 2016-03-31 20:15 ` David Turner
  1 sibling, 0 replies; 7+ messages in thread
From: David Turner @ 2016-03-31 20:15 UTC (permalink / raw)
  To: Michael Haggerty, git discussion list; +Cc: Junio C Hamano, Jeff King

On Thu, 2016-03-31 at 18:13 +0200, Michael Haggerty wrote:
> Currently the way to iterate over references is via a family of
> for_each_ref()-style functions. You pass some arguments plus a
> callback
> function and cb_data to the function, and your callback is called for
> each reference that is selected.
> 
> This works, but it has two big disadvantages:
> 
> 1. It is cumbersome for callers. The caller's logic has to be split
>    into two functions, the one that calls for_each_ref() and the
>    callback function. Any data that have to be passed between the
>    functions has to be stuck in a separate data structure.
> 
> 2. This interface is not composable. For example, you can't write a
>    single function that iterates over references from two sources,
>    as is interesting for combining packed plus loose references,
>    shared plus worktree-specific references, symbolic plus normal
>    references, etc. The current code for combining packed and loose
>    references needs to walk the two reference trees in lockstep,
>    using intimate knowledge about how references are stored [1,2,3].
> 
> I'm currently working on a patch series to transition the reference
> code
> from using for_each_ref()-style iteration to using proper iterators.
> 
> The main point of this change is to change the base iteration
> paradigm
> that has to be supported by reference backends. So instead of
> 
> > int do_for_each_ref_fn(const char *submodule, const char *base,
> >                        each_ref_fn fn, int trim, int flags,
> >                        void *cb_data);
> 
> the backend now has to implement
> 
> > struct ref_iterator *ref_iterator_begin_fn(const char *submodule,
> >                                            const char *prefix,
> >                                            unsigned int flags);
> 
> The ref_iterator itself has to implement two main methods:
> 
> > int iterator_advance_fn(struct ref_iterator *ref_iterator);
> > void iterator_free_fn(struct ref_iterator *ref_iterator);
> 
> A loop over references now looks something like
> 
> > struct ref_iterator *iter = each_ref_in_iterator("refs/tags/");
> > while (ref_iterator_advance(iter)) {
> >         /* code using iter->refname, iter->oid, iter->flags */
> > }
> 
> I built quite a bit of ref_iterator infrastructure to make it easy to
> plug things together quite flexibly. For example, there is an
> overlay_ref_iterator which takes two other iterators (e.g., one for
> packed and one for loose refs) and overlays them, presenting the
> result
> via the same iterator interface. But the same overlay_ref_iterator
> can
> be used to overlay any two other iterators on top of each other.

I haven't looked at the code yet, but this makes sense to me.  In
general, the major reason to supply a callback style of API is when
iteration is more complicated than whatever will be consuming the data
(I can't remember where I heard this argument, but I found it pretty
convincing).  It seems like this is increasingly not the case, so we
should move towards the iterator style.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RFC: New reference iteration paradigm
  2016-03-31 19:31   ` Jeff King
  2016-03-31 20:08     ` Junio C Hamano
@ 2018-05-26 17:25     ` Jakub Narebski
  2018-05-29 16:52       ` Jeff King
  1 sibling, 1 reply; 7+ messages in thread
From: Jakub Narebski @ 2018-05-26 17:25 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Michael Haggerty, git discussion list,
	David Turner

Jeff King <peff@peff.net> writes:
> On Thu, Mar 31, 2016 at 11:01:44AM -0700, Junio C Hamano wrote:
>> Michael Haggerty <mhagger@alum.mit.edu> writes:
>> 
>>> the backend now has to implement
>>>
>>>> struct ref_iterator *ref_iterator_begin_fn(const char *submodule,
>>>>                                            const char *prefix,
>>>>                                            unsigned int flags);

The problem with callbacks, including their lack of composability, is
often solved in high-level languages by using Promises / Futures (or
Channels / Supplies).

But I think in this case implementing a straightforward Iterator pattern
would be a better and simpler solution, especially in C.

>>> The ref_iterator itself has to implement two main methods:
>>>
>>>> int iterator_advance_fn(struct ref_iterator *ref_iterator);
>>>> void iterator_free_fn(struct ref_iterator *ref_iterator);
>>>
>>> A loop over references now looks something like
>>>
>>>> struct ref_iterator *iter = each_ref_in_iterator("refs/tags/");
>>>> while (ref_iterator_advance(iter)) {
>>>>         /* code using iter->refname, iter->oid, iter->flags */
>>>> }
>> 
>> We'd want to take advantage of the tree-like organization of the
>> refs (i.e. refs/tags/a and refs/tags/b sit next to each other and
>> they are closer to each other than they are to refs/heads/a) so that
>> a request "I want to iterate only over tags, even though I may have
>> millions of other kinds of refs" can be done with cost that is
>> proportional to how many tags you have.
>> 
>> The current implementation of for_each_tag_ref() that goes down to
>> do_for_each_entry() in files-backend.c has that propertly, and the
>> new iteration mechanism with the above design seems to keep it,
>> which is very nice.
>
> Actually, that is a slight fiction. :)
>
> We traverse only the loose ref directories we need, but we populate the
> entire packed-refs tree in one go. That's fine if you have a reasonable
> number of refs and care about the cold-cache case (where you care a lot
> about hitting directories you don't need to, but reading the entire
> packed-refs file isn't a big deal). But it's really bad if you have an
> 800MB packed-refs file, as looking up one tiny subset of the entries
> wastes a lot of RAM and CPU pulling that into our internal
> representation[1].
>
> At one point I wrote a patch to binary search the packed-refs file, find
> the first "refs/tags/" entry, and then walk linearly through there. What
> stopped me is that the current refs.c code (I guess file-backend.c these
> days) was not happy with me partially filling in the ref_dir structs in
> this "inside out" way.
[...]

Isn't this what reftable - an alternative way of storing refs in Git,
tested by being used by JGit - would solve?  See Christian Couder post
"Implementing reftable in Git"

  https://public-inbox.org/git/CAP8UFD0PPZSjBnxCA7ez91vBuatcHKQ+JUWvTD1iHcXzPBjPBg@mail.gmail.com/t/#u

'Efficient lookup of an entire namespace, such as refs/tags/' is
allegedly one of the objectives of this format.

Regards,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RFC: New reference iteration paradigm
  2018-05-26 17:25     ` Jakub Narebski
@ 2018-05-29 16:52       ` Jeff King
  0 siblings, 0 replies; 7+ messages in thread
From: Jeff King @ 2018-05-29 16:52 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Junio C Hamano, Michael Haggerty, git discussion list

On Sat, May 26, 2018 at 07:25:32PM +0200, Jakub Narebski wrote:

> > At one point I wrote a patch to binary search the packed-refs file, find
> > the first "refs/tags/" entry, and then walk linearly through there. What
> > stopped me is that the current refs.c code (I guess file-backend.c these
> > days) was not happy with me partially filling in the ref_dir structs in
> > this "inside out" way.
> [...]
> 
> Isn't this what reftable - an alternative way of storing refs in Git,
> tested by being used by JGit - would solve?  See Christian Couder post
> "Implementing reftable in Git"
> 
>   https://public-inbox.org/git/CAP8UFD0PPZSjBnxCA7ez91vBuatcHKQ+JUWvTD1iHcXzPBjPBg@mail.gmail.com/t/#u
> 
> 'Efficient lookup of an entire namespace, such as refs/tags/' is
> allegedly one of the objectives of this format.

The thread you are responding to is over 2 years old. ;)

Since then, Michael rewrote the packed-refs code to handle this case,
and we mmap and binary-search the file since v2.15.

Reftables would also have (amortized) log-n lookup, and also fix some
other problems (like lack of atomic transactions). So yes, I hope we do
eventually move to reftable, too.

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2018-05-29 16:52 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-31 16:13 RFC: New reference iteration paradigm Michael Haggerty
2016-03-31 18:01 ` Junio C Hamano
2016-03-31 19:31   ` Jeff King
2016-03-31 20:08     ` Junio C Hamano
2018-05-26 17:25     ` Jakub Narebski
2018-05-29 16:52       ` Jeff King
2016-03-31 20:15 ` David Turner

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).