git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* Proposal: object negotiation for partial clones
@ 2019-04-28 15:55 Matthew DeVore
  2019-05-06 18:25 ` Jonathan Nieder
  2019-05-06 19:28 ` Jonathan Tan
  0 siblings, 2 replies; 23+ messages in thread
From: Matthew DeVore @ 2019-04-28 15:55 UTC (permalink / raw)
  To: git; +Cc: Jonathan Nieder, Jonathan Tan

Hello,

I'm considering implementing a feature in the Git protocol which would
enable efficient and accurate object negotiation when the client is a
partial clone. I'd like to refine and get some validation of my
approach before I start to write any code, so I've written a proposal
for anyone interested to review. Your comments would be appreciated.

Remember this is a publicly-accessible document so be sure to not
discuss any confidential topics in the comments!

Tiny URL: http://tinyurl.com/yxz747cy
Full URL: https://docs.google.com/document/d/1bcDKCgd2Dw5Cl6H9TrNi0ekqzaT8rbyK8EpPE3RcvPA/edit#

Thank you,
Matt

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Proposal: object negotiation for partial clones
  2019-04-28 15:55 Proposal: object negotiation for partial clones Matthew DeVore
@ 2019-05-06 18:25 ` Jonathan Nieder
  2019-05-06 19:28 ` Jonathan Tan
  1 sibling, 0 replies; 23+ messages in thread
From: Jonathan Nieder @ 2019-05-06 18:25 UTC (permalink / raw)
  To: Matthew DeVore; +Cc: git, Jonathan Tan

Hi,

Matthew DeVore wrote:

> I'm considering implementing a feature in the Git protocol which would
> enable efficient and accurate object negotiation when the client is a
> partial clone. I'd like to refine and get some validation of my
> approach before I start to write any code, so I've written a proposal
> for anyone interested to review. Your comments would be appreciated.

Yay!  Thanks for looking into this, and sorry I didn't respond sooner.

I know the doc has a "use case" section, but I suppose I am not sure
that I understand the use case yet.  Is this about improving the
filter syntax to handle features like directory listing?  Or is this
about being able to make better use of deltas in a partial clone, to
decrease bandwidth consumption and overhead that is proportional to
size?

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Proposal: object negotiation for partial clones
  2019-04-28 15:55 Proposal: object negotiation for partial clones Matthew DeVore
  2019-05-06 18:25 ` Jonathan Nieder
@ 2019-05-06 19:28 ` Jonathan Tan
  2019-05-06 19:46   ` Jonathan Nieder
  2019-05-06 22:47   ` Matthew DeVore
  1 sibling, 2 replies; 23+ messages in thread
From: Jonathan Tan @ 2019-05-06 19:28 UTC (permalink / raw)
  To: matvore; +Cc: git, jrn, jonathantanmy

> I'm considering implementing a feature in the Git protocol which would
> enable efficient and accurate object negotiation when the client is a
> partial clone. I'd like to refine and get some validation of my
> approach before I start to write any code, so I've written a proposal
> for anyone interested to review. Your comments would be appreciated.

Thanks. Let me try to summarize: The issue is that, during a fetch,
normally the client can say "have" to inform the server that it has a
commit and all its referenced objects (barring shallow lines), but we
can't do the same if the client is a partial clone (because having a
commit doesn't necessarily mean that we have all referenced objects).
And not doing this means that the server sends a lot of unnecessary
objects in the sent packfile. The solution is to do the fetch in 2
parts: one to get the list of objects that would be sent, and after the
client filters that, one to get the objects themselves.

It was unclear to me whether this is meant for (1) fetches directly
initiated by the user that fetch commits (e.g. "git fetch origin",
reusing the configured "core.partialclonefilter") and/or for (2) lazy
fetching of missing objects. My assumption is that this is only for (2).

My main question is: we can get the same list of objects (in the form of
tree objects) if we fetch with "blob:none" filter. Admittedly, we will
get extra data (file names, etc.) - if the extra bandwidth saving is
necessary, this should be called out. (And some of the savings will be
offset by the fact that we will actually need some of those tree
objects.)

Assuming that we do need that bandwidth saving, here's my review of that
document.

The document describes the 1st request exactly as I envision - a
specific parameter sent by the client, and the server responds with a
list of object names.

For the 2nd request, the document describes it as repeating the original
query of the 1st request while also giving the full list of objects
wanted as "choose-refs". I'm still not convinced that repeating the
original query is necessary - I would just give the list of objects as
wants. The rationale given for repeating the original query is:

> The original query is helpful because it means the server only needs
> to do a single reachability check, rather than many separate ones.

But this omits the fact that, if doing it the document's way, the server
needs to perform an object walk in addition to the "single reachability
check", and it is not true that if doing it my way, "many separate ones"
need to be done because the server can check reachability of all objects
at once.

Also, my way means that supporting the 2nd request does not require any
code or protocol change - it already works today. Assuming we follow my
approach, the discussion thus lies in supporting the 1st request.

Some more thoughts:

- Changes in server and client scalability: Currently, the server checks
  reachability of all wants, then enumerates, then sends all objects.
  With this change, the server checks reachability of all wants, then
  enumerates, then sends an object list, then checks reachability of all
  objects in the filtered list, then sends some objects. There is
  additional overhead in the extra reachability check and lists of
  objects being sent twice (once by server and once by client), but
  sending fewer objects means that I/O (server, network, client) and
  disk space usage (client) is reduced.

- Usefulness outside partial clone: If the user ever wants a list of
  objects referenced by an object but without their file names, the user
  could use this, but I can't think of such a scenario.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Proposal: object negotiation for partial clones
  2019-05-06 19:28 ` Jonathan Tan
@ 2019-05-06 19:46   ` Jonathan Nieder
  2019-05-06 23:20     ` Matthew DeVore
  2019-05-06 22:47   ` Matthew DeVore
  1 sibling, 1 reply; 23+ messages in thread
From: Jonathan Nieder @ 2019-05-06 19:46 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: matvore, git

Hi,

Jonathan Tan wrote:
> Matthew DeVore wrote:

>> I'm considering implementing a feature in the Git protocol which would
>> enable efficient and accurate object negotiation when the client is a
>> partial clone. I'd like to refine and get some validation of my
>> approach before I start to write any code, so I've written a proposal
>> for anyone interested to review. Your comments would be appreciated.
>
> Thanks. Let me try to summarize: The issue is that, during a fetch,
> normally the client can say "have" to inform the server that it has a
> commit and all its referenced objects (barring shallow lines), but we
> can't do the same if the client is a partial clone (because having a
> commit doesn't necessarily mean that we have all referenced objects).

Ah, interesting.  When this was discussed before, the proposal has been
that the client can say "have" anyway.  They don't have the commit and
all referenced objects, but they have the commit and a *promise* that
they can obtain all referenced objects, which is almost as good.
That's what "git fetch" currently implements.

But there's a hitch: when doing the fetch-on-demand for an object
access, the client currently does not say "have".  Sure, even there,
they have a *promise* that they can obtain all referenced objects, but
this could get out of hand: the first pack may contain a delta against
an object the client doesn't have, triggering another fetch which
contains a delta against another object they don't have, and so on.
Too many round trips.

> And not doing this means that the server sends a lot of unnecessary
> objects in the sent packfile. The solution is to do the fetch in 2
> parts: one to get the list of objects that would be sent, and after the
> client filters that, one to get the objects themselves.

This helps with object selection but not with delta base selection.

For object selection, I think the current approach already works okay,
at least where tree and blob filters are involved.  For commit
filters, in the current approach the fetch-on-demand sends way too
much because there's no "filter=commit:none" option to pass.  Is that
what this proposal aims to address?

For blob filters, if I ignore the capability advertisements (there's
an optimization that hasn't yet been implemented to allow
single-round-trip fetches), the current behavior takes the same number
of round trips as this proposal.  Where the current approach has been
lacking is in delta base selection during fetch-on-demand.  Ideas for
improving that?

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Proposal: object negotiation for partial clones
  2019-05-06 19:28 ` Jonathan Tan
  2019-05-06 19:46   ` Jonathan Nieder
@ 2019-05-06 22:47   ` Matthew DeVore
  2019-05-07 18:34     ` Jonathan Tan
  1 sibling, 1 reply; 23+ messages in thread
From: Matthew DeVore @ 2019-05-06 22:47 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Jonathan Nieder

On Mon, May 6, 2019 at 12:28 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > I'm considering implementing a feature in the Git protocol which would
> > enable efficient and accurate object negotiation when the client is a
> > partial clone. I'd like to refine and get some validation of my
> > approach before I start to write any code, so I've written a proposal
> > for anyone interested to review. Your comments would be appreciated.
>
> Thanks. Let me try to summarize: The issue is that, during a fetch,
> normally the client can say "have" to inform the server that it has a
> commit and all its referenced objects (barring shallow lines), but we
> can't do the same if the client is a partial clone (because having a
> commit doesn't necessarily mean that we have all referenced objects).
> And not doing this means that the server sends a lot of unnecessary
> objects in the sent packfile. The solution is to do the fetch in 2
> parts: one to get the list of objects that would be sent, and after the
> client filters that, one to get the objects themselves.
>
> It was unclear to me whether this is meant for (1) fetches directly
> initiated by the user that fetch commits (e.g. "git fetch origin",
> reusing the configured "core.partialclonefilter") and/or for (2) lazy
> fetching of missing objects. My assumption is that this is only for (2).
Yes, that was my intention. The client doesn't really know anything
about the hashes reported, so it can't really make an informed
selection from the candidate list given by the server after the first
request. I guess if we wanted to just reject *all* objects on the
initial clone, this feature would make that possible. But that can
also be achieved more embracively with a better filter system.

>
> My main question is: we can get the same list of objects (in the form of
> tree objects) if we fetch with "blob:none" filter. Admittedly, we will
> get extra data (file names, etc.) - if the extra bandwidth saving is
> necessary, this should be called out. (And some of the savings will be
> offset by the fact that we will actually need some of those tree
> objects.)
That's a very good point. The data the first request gives us is
basically the tree objects minus file names and modes. So I think a
better feature to implement would be combining of multiple filters.
That way, the client can combine "tree:<some small number>" and
"blob:none" and basically get an "enumeration" of available objects.

>
> Assuming that we do need that bandwidth saving, here's my review of that
> document.
>
> The document describes the 1st request exactly as I envision - a
> specific parameter sent by the client, and the server responds with a
> list of object names.
>
> For the 2nd request, the document describes it as repeating the original
> query of the 1st request while also giving the full list of objects
> wanted as "choose-refs". I'm still not convinced that repeating the
> original query is necessary - I would just give the list of objects as
> wants. The rationale given for repeating the original query is:
>
> > The original query is helpful because it means the server only needs
> > to do a single reachability check, rather than many separate ones.
>
> But this omits the fact that, if doing it the document's way, the server
> needs to perform an object walk in addition to the "single reachability
> check", and it is not true that if doing it my way, "many separate ones"
> need to be done because the server can check reachability of all objects
> at once.
After considering more carefully how reachability works (and getting
your explanation of it out-of-band), I would assume that my approach
is no better than marginally faster, and possibly worse, than just
doing a plain reachability check of multiple objects using the current
implementation. My current priorities preclude this kind of
benchmarking+micro-optimization. So I believe what is more important
to me is to simply enable combining multiple filters.

>
> Also, my way means that supporting the 2nd request does not require any
> code or protocol change - it already works today. Assuming we follow my
> approach, the discussion thus lies in supporting the 1st request.
>
> Some more thoughts:
>
> - Changes in server and client scalability: Currently, the server checks
>   reachability of all wants, then enumerates, then sends all objects.
>   With this change, the server checks reachability of all wants, then
>   enumerates, then sends an object list, then checks reachability of all
>   objects in the filtered list, then sends some objects. There is
>   additional overhead in the extra reachability check and lists of
>   objects being sent twice (once by server and once by client), but
>   sending fewer objects means that I/O (server, network, client) and
>   disk space usage (client) is reduced.
Agreed, and this is still true in the new approach we've agreed on.

>
> - Usefulness outside partial clone: If the user ever wants a list of
>   objects referenced by an object but without their file names, the user
>   could use this, but I can't think of such a scenario.
Neither can I.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Proposal: object negotiation for partial clones
  2019-05-06 19:46   ` Jonathan Nieder
@ 2019-05-06 23:20     ` Matthew DeVore
  2019-05-07  0:02       ` Jonathan Nieder
  0 siblings, 1 reply; 23+ messages in thread
From: Matthew DeVore @ 2019-05-06 23:20 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jonathan Tan, matvore, git



> On 2019/05/06, at 12:46, Jonathan Nieder <jrnieder@gmail.com> wrote:
> 
> Hi,
> 
> Jonathan Tan wrote:
>> Matthew DeVore wrote:
> 
>>> I'm considering implementing a feature in the Git protocol which would
>>> enable efficient and accurate object negotiation when the client is a
>>> partial clone. I'd like to refine and get some validation of my
>>> approach before I start to write any code, so I've written a proposal
>>> for anyone interested to review. Your comments would be appreciated.
>> 
>> Thanks. Let me try to summarize: The issue is that, during a fetch,
>> normally the client can say "have" to inform the server that it has a
>> commit and all its referenced objects (barring shallow lines), but we
>> can't do the same if the client is a partial clone (because having a
>> commit doesn't necessarily mean that we have all referenced objects).
> 
> Ah, interesting.  When this was discussed before, the proposal has been
> that the client can say "have" anyway.  They don't have the commit and
> all referenced objects, but they have the commit and a *promise* that
> they can obtain all referenced objects, which is almost as good.
> That's what "git fetch" currently implements.
Doesn’t that mean the “have” may indicate that the client has the entire repository already, even though it’s only a partial clone? If so, then the client intends to ask for some tree plus trees and blobs 2-3 levels down deeper, how would the server distinguish between those objects the client *really* has and those that were just promised to them? Because the whole purpose of this hypothetical request is to get a bunch of promises fulfilled of which 0-99% are fulfilled already.

> 
> For blob filters, if I ignore the capability advertisements (there's
> an optimization that hasn't yet been implemented to allow
> single-round-trip fetches), the current behavior takes the same number
> of round trips as this proposal.  Where the current approach has been
> lacking is in delta base selection during fetch-on-demand.  Ideas for
> improving that?

Maybe something like this (conceptually based on original proposal) ?

1. Client sends request for an object or objects with an extra flag which means “I can’t really tell you what I already have since it’s a chaotic subset of the object database of the repo”

2. Server responds back with set of objects, represented by deltas if that is how the server has them on disk, along with a list of object-IDs needed in order to resolve the content of all the objects. These object-IDs can go several layers of deltas back, and they go back as far as it takes to get to an object stored in its entirety by the server.

3. Client responds back with another request (this time the extra flag sent from step 1 is not necessary) which has “want”s for every object the server named which the client already has.

Very hand-wavey, but I think you see my idea.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Proposal: object negotiation for partial clones
  2019-05-06 23:20     ` Matthew DeVore
@ 2019-05-07  0:02       ` Jonathan Nieder
  0 siblings, 0 replies; 23+ messages in thread
From: Jonathan Nieder @ 2019-05-07  0:02 UTC (permalink / raw)
  To: Matthew DeVore; +Cc: Jonathan Tan, matvore, git

Matthew DeVore wrote:
> On 2019/05/06, at 12:46, Jonathan Nieder <jrnieder@gmail.com> wrote:

>> Ah, interesting.  When this was discussed before, the proposal has been
>> that the client can say "have" anyway.  They don't have the commit and
>> all referenced objects, but they have the commit and a *promise* that
>> they can obtain all referenced objects, which is almost as good.
>> That's what "git fetch" currently implements.
>
> Doesn’t that mean the “have” may indicate that the client has the
> entire repository already, even though it’s only a partial clone? If
> so, then the client intends to ask for some tree plus trees and
> blobs 2-3 levels down deeper, how would the server distinguish
> between those objects the client *really* has and those that were
> just promised to them? Because the whole purpose of this
> hypothetical request is to get a bunch of promises fulfilled of
> which 0-99% are fulfilled already.

For blobs, the answer is simple: the server returns any object
explicitly named in a "want", even if the client already should have
it.

For trees, the current behavior is the same: if you declare that you
"have" everything, then if you "want" a tree with filter tree:2, you
only get that tree.  So here there's already room for improvement.

[...]
> Maybe something like this (conceptually based on original proposal) ?
>
> 1. Client sends request for an object or objects with an extra flag
> which means “I can’t really tell you what I already have since it’s
> a chaotic subset of the object database of the repo”
>
> 2. Server responds back with set of objects, represented by deltas
> if that is how the server has them on disk, along with a list of
> object-IDs needed in order to resolve the content of all the
> objects. These object-IDs can go several layers of deltas back, and
> they go back as far as it takes to get to an object stored in its
> entirety by the server.
>
> 3. Client responds back with another request (this time the extra
> flag sent from step 1 is not necessary) which has “want”s for every
> object the server named which the client already has.
>
> Very hand-wavey, but I think you see my idea.

The only downside I see is that the list of objects may itself be
large, and the server has to check reachability for each one.  But
maybe that's fine.

Perhaps after that initial response, instead of sending the list of
individual objects the client wants, it could send a list of relevant
objects it has (combined with the original set of "want"s).  That
could be a smaller request and it means less work for the server to
check each "want" for reachability.

What do you think?

[...]
> That's a very good point. The data the first request gives us is
> basically the tree objects minus file names and modes. So I think a
> better feature to implement would be combining of multiple filters.
> That way, the client can combine "tree:<some small number>" and
> "blob:none" and basically get an "enumeration" of available objects.

This might be simpler.

Combining filters would be useful for other uses, too.

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Proposal: object negotiation for partial clones
  2019-05-06 22:47   ` Matthew DeVore
@ 2019-05-07 18:34     ` Jonathan Tan
  2019-05-07 21:57       ` Matthew DeVore
  0 siblings, 1 reply; 23+ messages in thread
From: Jonathan Tan @ 2019-05-07 18:34 UTC (permalink / raw)
  To: matvore; +Cc: jonathantanmy, git, jrn

> > My main question is: we can get the same list of objects (in the form of
> > tree objects) if we fetch with "blob:none" filter. Admittedly, we will
> > get extra data (file names, etc.) - if the extra bandwidth saving is
> > necessary, this should be called out. (And some of the savings will be
> > offset by the fact that we will actually need some of those tree
> > objects.)
> That's a very good point. The data the first request gives us is
> basically the tree objects minus file names and modes. So I think a
> better feature to implement would be combining of multiple filters.
> That way, the client can combine "tree:<some small number>" and
> "blob:none" and basically get an "enumeration" of available objects.

To get an enumeration of available objects, don't you need to use only
"blob:none"? Combining filters (once that's implemented) will get all
objects only up to a certain depth.

Combining "tree:<n>" and "blob:none" would allow us to reduce the number
of trees transmitted, but I would imagine that the savings would be
significant only for very large repositories. Do you have a specific use
case in mind that isn't solved by "blob:none"?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Proposal: object negotiation for partial clones
  2019-05-07 18:34     ` Jonathan Tan
@ 2019-05-07 21:57       ` Matthew DeVore
  2019-05-09 18:00         ` Jonathan Tan
  0 siblings, 1 reply; 23+ messages in thread
From: Matthew DeVore @ 2019-05-07 21:57 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: matvore, git, jrn



> On 2019/05/07, at 11:34, Jonathan Tan <jonathantanmy@google.com> wrote:
> 
> To get an enumeration of available objects, don't you need to use only
> "blob:none"? Combining filters (once that's implemented) will get all
> objects only up to a certain depth.
> 
> Combining "tree:<n>" and "blob:none" would allow us to reduce the number
> of trees transmitted, but I would imagine that the savings would be
> significant only for very large repositories. Do you have a specific use
> case in mind that isn't solved by "blob:none"?

I am interested in supporting large repositories. The savings seem to be larger than one may expect. I tried the following command on two huge repos to find out how much it costs to fetch “blob:none” for a single commit:

$ git rev-list --objects --filter=blob:none HEAD: | xargs -n 2 bash -c 'git cat-file -s $1' | awk '{ total += $1; print total }'

Note the “:” after HEAD - this limits it to the current commit.

And the results were:
 - Linux: 2 684 054 bytes
 - Chromium: > 16 139 570 bytes (then I got tired of waiting for it to finish)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Proposal: object negotiation for partial clones
  2019-05-07 21:57       ` Matthew DeVore
@ 2019-05-09 18:00         ` Jonathan Tan
  2019-05-14  0:09           ` Matthew DeVore
  0 siblings, 1 reply; 23+ messages in thread
From: Jonathan Tan @ 2019-05-09 18:00 UTC (permalink / raw)
  To: matvore; +Cc: jonathantanmy, matvore, git, jrn

> > On 2019/05/07, at 11:34, Jonathan Tan <jonathantanmy@google.com> wrote:
> >
> > To get an enumeration of available objects, don't you need to use only
> > "blob:none"? Combining filters (once that's implemented) will get all
> > objects only up to a certain depth.
> >
> > Combining "tree:<n>" and "blob:none" would allow us to reduce the number
> > of trees transmitted, but I would imagine that the savings would be
> > significant only for very large repositories. Do you have a specific use
> > case in mind that isn't solved by "blob:none"?
> 
> I am interested in supporting large repositories. The savings seem to be larger than one may expect. I tried the following command on two huge repos to find out how much it costs to fetch “blob:none” for a single commit:
> 
> $ git rev-list --objects --filter=blob:none HEAD: | xargs -n 2 bash -c 'git cat-file -s $1' | awk '{ total += $1; print total }'
> 
> Note the “:” after HEAD - this limits it to the current commit.
> 
> And the results were:
>  - Linux: 2 684 054 bytes
>  - Chromium: > 16 139 570 bytes (then I got tired of waiting for it to finish)

Thanks for the numbers. Let me think about it some more, but I'm still
reluctant to introduce multiple filter support in the protocol and the
implementation for the following reasons:

- For large projects like Linux and Chromium, it may be reasonable to
  expect that an infrequent checkout would result in a few-megabyte
  download.
- (After some in-office discussion) It may be possible to mitigate much
  of that by sending root trees that we have as "have" (e.g. by
  consulting the reflog), and that wouldn't need any protocol change.
- Supporting any combination of filter means that we have more to
  implement and test, especially if we want to support more filters in
  the future. In particular, the different filters (e.g. blob, tree)
  have different code paths now in Git. One way to solve it would be to
  combine everything into one monolith, but I would like to avoid it if
  possible (after having to deal with revision walking a few times...)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Proposal: object negotiation for partial clones
  2019-05-09 18:00         ` Jonathan Tan
@ 2019-05-14  0:09           ` Matthew DeVore
  2019-05-14  0:16             ` Jonathan Nieder
  0 siblings, 1 reply; 23+ messages in thread
From: Matthew DeVore @ 2019-05-14  0:09 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: matvore, git, jrn



> On 2019/05/09, at 11:00, Jonathan Tan <jonathantanmy@google.com> wrote:
> 
> Thanks for the numbers. Let me think about it some more, but I'm still
> reluctant to introduce multiple filter support in the protocol and the
> implementation for the following reasons:

Correction to the original command - I was tweaking it in the middle of running it, and introduced an error that I didn’t notice. Here is one that will work for an entire repo:

$ git rev-list --objects --filter=blob:none HEAD: | awk '{print $1}' | xargs -n 1 git cat-file -s  | awk '{ total += $1; print total }'

When run to completion, Chromium totaled 17 301 144 bytes.

> 
> - For large projects like Linux and Chromium, it may be reasonable to
>  expect that an infrequent checkout would result in a few-megabyte
>  download.

Anyone developing on Chromium would definitely consider a 17 MB original clone to be an improvement over the status quo, but it is still not ideal.

And the 17MB initial download is only incurred once *assuming* the next idea is implemented:

> - (After some in-office discussion) It may be possible to mitigate much
>  of that by sending root trees that we have as "have" (e.g. by
>  consulting the reflog), and that wouldn't need any protocol change.

This would complicate the code - not in Git itself, but in my FUSE-related logic. We would have to explore the reflog and try to find the closest commits in history to the target commit being checked out. This is sounding a bit hacky and round-about, and it assumes that at the FUSE layer we can detect when a checkout is happening cleanly and sufficiently early (rather than when one of the sub-sub-trees is being accessed).

> - Supporting any combination of filter means that we have more to
>  implement and test, especially if we want to support more filters in
>  the future. In particular, the different filters (e.g. blob, tree)
>  have different code paths now in Git. One way to solve it would be to
>  combine everything into one monolith, but I would like to avoid it if
>  possible (after having to deal with revision walking a few times...)

I don’t believe there is any need to introduce monolithic code. The bulk of the filter implementation is in list-objects-filter.c, and I don’t think the file will get much longer with an additional filter that “combines” the existing filter. The new filter is likely simpler than the sparse filter. Once I add the new filter and send out the initial patch set, we can discuss splitting up the file, if it appears to be necessary.

My idea - if it is not clear already - is to add another OO-like interface to list-objects-filter.c which parallels the 5 that are already there.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Proposal: object negotiation for partial clones
  2019-05-14  0:09           ` Matthew DeVore
@ 2019-05-14  0:16             ` Jonathan Nieder
  2019-05-16 18:56               ` [RFC PATCH 0/3] implement composite filters Matthew DeVore
  0 siblings, 1 reply; 23+ messages in thread
From: Jonathan Nieder @ 2019-05-14  0:16 UTC (permalink / raw)
  To: Matthew DeVore; +Cc: Jonathan Tan, matvore, git, Jeff Hostetler, Derrick Stolee

Hi,

Matthew DeVore wrote:
> On 2019/05/09, at 11:00, Jonathan Tan <jonathantanmy@google.com> wrote:

>> - Supporting any combination of filter means that we have more to
>>  implement and test, especially if we want to support more filters in
>>  the future. In particular, the different filters (e.g. blob, tree)
>>  have different code paths now in Git. One way to solve it would be to
>>  combine everything into one monolith, but I would like to avoid it if
>>  possible (after having to deal with revision walking a few times...)
>
> I don’t believe there is any need to introduce monolithic code. The
> bulk of the filter implementation is in list-objects-filter.c, and I
> don’t think the file will get much longer with an additional filter
> that “combines” the existing filter. The new filter is likely
> simpler than the sparse filter. Once I add the new filter and send
> out the initial patch set, we can discuss splitting up the file, if
> it appears to be necessary.
>
> My idea - if it is not clear already - is to add another OO-like
> interface to list-objects-filter.c which parallels the 5 that are
> already there.

Sounds good to me.

For what it's worth, my assumption has always been that we would
eventually want the filters to be stackable.  So I'm glad you're
looking into it.

Jonathan's reminder to clean up as you go is a welcome one.

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH 0/3] implement composite filters
  2019-05-14  0:16             ` Jonathan Nieder
@ 2019-05-16 18:56               ` Matthew DeVore
  2019-05-16 18:56                 ` [RFC PATCH 1/3] list-objects-filter: refactor into a context struct Matthew DeVore
                                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Matthew DeVore @ 2019-05-16 18:56 UTC (permalink / raw)
  To: jonathantanmy, jrn, git, dstolee, jeffhost, jrnieder
  Cc: Matthew DeVore, matvore

Here is a first stab at composite filters. It does not actually support omits,
but the biggest difficulties of the implementation are already addressed. So I
decided to send out an early version to give interested people an idea of just
what is needed to implement it, and then give them a chance to change my mind
(Jonathan T. especially was concerned about code complexity).

Thank you,

Matthew DeVore (3):
  list-objects-filter: refactor into a context struct
  list-objects-filter-options: error is localizeable
  list-objects-filter: implement composite filters

 Documentation/rev-list-options.txt     |   8 +
 contrib/completion/git-completion.bash |   2 +-
 list-objects-filter-options.c          | 132 +++++++++++-
 list-objects-filter-options.h          |  14 +-
 list-objects-filter.c                  | 272 ++++++++++++++++---------
 list-objects-filter.h                  |  31 ++-
 list-objects.c                         |  45 ++--
 t/t6112-rev-list-filters-objects.sh    | 108 +++++++++-
 8 files changed, 470 insertions(+), 142 deletions(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH 1/3] list-objects-filter: refactor into a context struct
  2019-05-16 18:56               ` [RFC PATCH 0/3] implement composite filters Matthew DeVore
@ 2019-05-16 18:56                 ` Matthew DeVore
  2019-05-16 18:56                 ` [RFC PATCH 2/3] list-objects-filter-options: error is localizeable Matthew DeVore
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 23+ messages in thread
From: Matthew DeVore @ 2019-05-16 18:56 UTC (permalink / raw)
  To: jonathantanmy, jrn, git, dstolee, jeffhost, jrnieder
  Cc: Matthew DeVore, matvore

The next patch will create and manage filters in a new way, which means
that this bundle of data will have to be managed at a new callsite. Make
this bundle of data more manageable by putting it in a struct and
making it part of the list-objects-filter module's API.

Signed-off-by: Matthew DeVore <matvore@google.com>
---
 list-objects-filter.c | 156 +++++++++++++++---------------------------
 list-objects-filter.h |  31 ++++++---
 list-objects.c        |  45 ++++++------
 3 files changed, 96 insertions(+), 136 deletions(-)

diff --git a/list-objects-filter.c b/list-objects-filter.c
index ee449de3f7..8e8616b9b8 100644
--- a/list-objects-filter.c
+++ b/list-objects-filter.c
@@ -19,82 +19,60 @@
  * FILTER_SHOWN_BUT_REVISIT -- we set this bit on tree objects
  * that have been shown, but should be revisited if they appear
  * in the traversal (until we mark it SEEN).  This is a way to
  * let us silently de-dup calls to show() in the caller.  This
  * is subtly different from the "revision.h:SHOWN" and the
  * "sha1-name.c:ONELINE_SEEN" bits.  And also different from
  * the non-de-dup usage in pack-bitmap.c
  */
 #define FILTER_SHOWN_BUT_REVISIT (1<<21)
 
-/*
- * A filter for list-objects to omit ALL blobs from the traversal.
- * And to OPTIONALLY collect a list of the omitted OIDs.
- */
-struct filter_blobs_none_data {
-	struct oidset *omits;
-};
-
 static enum list_objects_filter_result filter_blobs_none(
 	struct repository *r,
 	enum list_objects_filter_situation filter_situation,
 	struct object *obj,
 	const char *pathname,
 	const char *filename,
-	void *filter_data_)
+	struct filter_context *ctx)
 {
-	struct filter_blobs_none_data *filter_data = filter_data_;
-
 	switch (filter_situation) {
 	default:
 		BUG("unknown filter_situation: %d", filter_situation);
 
 	case LOFS_BEGIN_TREE:
 		assert(obj->type == OBJ_TREE);
 		/* always include all tree objects */
 		return LOFR_MARK_SEEN | LOFR_DO_SHOW;
 
 	case LOFS_END_TREE:
 		assert(obj->type == OBJ_TREE);
 		return LOFR_ZERO;
 
 	case LOFS_BLOB:
 		assert(obj->type == OBJ_BLOB);
 		assert((obj->flags & SEEN) == 0);
 
-		if (filter_data->omits)
-			oidset_insert(filter_data->omits, &obj->oid);
+		if (ctx->omits)
+			oidset_insert(ctx->omits, &obj->oid);
 		return LOFR_MARK_SEEN; /* but not LOFR_DO_SHOW (hard omit) */
 	}
 }
 
-static void *filter_blobs_none__init(
-	struct oidset *omitted,
+static void filter_blobs_none__init(
 	struct list_objects_filter_options *filter_options,
-	filter_object_fn *filter_fn,
-	filter_free_fn *filter_free_fn)
+	struct filter_context *ctx)
 {
-	struct filter_blobs_none_data *d = xcalloc(1, sizeof(*d));
-	d->omits = omitted;
-
-	*filter_fn = filter_blobs_none;
-	*filter_free_fn = free;
-	return d;
+	ctx->filter_fn = filter_blobs_none;
 }
 
-/*
- * A filter for list-objects to omit ALL trees and blobs from the traversal.
- * Can OPTIONALLY collect a list of the omitted OIDs.
- */
+/* A filter for list-objects to omit ALL trees and blobs from the traversal. */
 struct filter_trees_depth_data {
-	struct oidset *omits;
-
 	/*
 	 * Maps trees to the minimum depth at which they were seen. It is not
 	 * necessary to re-traverse a tree at deeper or equal depths than it has
 	 * already been traversed.
 	 *
 	 * We can't use LOFR_MARK_SEEN for tree objects since this will prevent
 	 * it from being traversed at shallower depths.
 	 */
 	struct oidmap seen_at_depth;
 
@@ -103,41 +81,41 @@ struct filter_trees_depth_data {
 };
 
 struct seen_map_entry {
 	struct oidmap_entry base;
 	size_t depth;
 };
 
 /* Returns 1 if the oid was in the omits set before it was invoked. */
 static int filter_trees_update_omits(
 	struct object *obj,
-	struct filter_trees_depth_data *filter_data,
+	struct filter_context *ctx,
 	int include_it)
 {
-	if (!filter_data->omits)
+	if (!ctx->omits)
 		return 0;
 
 	if (include_it)
-		return oidset_remove(filter_data->omits, &obj->oid);
+		return oidset_remove(ctx->omits, &obj->oid);
 	else
-		return oidset_insert(filter_data->omits, &obj->oid);
+		return oidset_insert(ctx->omits, &obj->oid);
 }
 
 static enum list_objects_filter_result filter_trees_depth(
 	struct repository *r,
 	enum list_objects_filter_situation filter_situation,
 	struct object *obj,
 	const char *pathname,
 	const char *filename,
-	void *filter_data_)
+	struct filter_context *ctx)
 {
-	struct filter_trees_depth_data *filter_data = filter_data_;
+	struct filter_trees_depth_data *filter_data = ctx->data;
 	struct seen_map_entry *seen_info;
 	int include_it = filter_data->current_depth <
 		filter_data->exclude_depth;
 	int filter_res;
 	int already_seen;
 
 	/*
 	 * Note that we do not use _MARK_SEEN in order to allow re-traversal in
 	 * case we encounter a tree or blob again at a shallower depth.
 	 */
@@ -145,47 +123,47 @@ static enum list_objects_filter_result filter_trees_depth(
 	switch (filter_situation) {
 	default:
 		BUG("unknown filter_situation: %d", filter_situation);
 
 	case LOFS_END_TREE:
 		assert(obj->type == OBJ_TREE);
 		filter_data->current_depth--;
 		return LOFR_ZERO;
 
 	case LOFS_BLOB:
-		filter_trees_update_omits(obj, filter_data, include_it);
+		filter_trees_update_omits(obj, ctx, include_it);
 		return include_it ? LOFR_MARK_SEEN | LOFR_DO_SHOW : LOFR_ZERO;
 
 	case LOFS_BEGIN_TREE:
 		seen_info = oidmap_get(
 			&filter_data->seen_at_depth, &obj->oid);
 		if (!seen_info) {
 			seen_info = xcalloc(1, sizeof(*seen_info));
 			oidcpy(&seen_info->base.oid, &obj->oid);
 			seen_info->depth = filter_data->current_depth;
 			oidmap_put(&filter_data->seen_at_depth, seen_info);
 			already_seen = 0;
 		} else {
 			already_seen =
 				filter_data->current_depth >= seen_info->depth;
 		}
 
 		if (already_seen) {
 			filter_res = LOFR_SKIP_TREE;
 		} else {
 			int been_omitted = filter_trees_update_omits(
-				obj, filter_data, include_it);
+				obj, ctx, include_it);
 			seen_info->depth = filter_data->current_depth;
 
 			if (include_it)
 				filter_res = LOFR_DO_SHOW;
-			else if (filter_data->omits && !been_omitted)
+			else if (ctx->omits && !been_omitted)
 				/*
 				 * Must update omit information of children
 				 * recursively; they have not been omitted yet.
 				 */
 				filter_res = LOFR_ZERO;
 			else
 				filter_res = LOFR_SKIP_TREE;
 		}
 
 		filter_data->current_depth++;
@@ -194,55 +172,48 @@ static enum list_objects_filter_result filter_trees_depth(
 }
 
 static void filter_trees_free(void *filter_data) {
 	struct filter_trees_depth_data *d = filter_data;
 	if (!d)
 		return;
 	oidmap_free(&d->seen_at_depth, 1);
 	free(d);
 }
 
-static void *filter_trees_depth__init(
-	struct oidset *omitted,
+static void filter_trees_depth__init(
 	struct list_objects_filter_options *filter_options,
-	filter_object_fn *filter_fn,
-	filter_free_fn *filter_free_fn)
+	struct filter_context *ctx)
 {
 	struct filter_trees_depth_data *d = xcalloc(1, sizeof(*d));
-	d->omits = omitted;
 	oidmap_init(&d->seen_at_depth, 0);
 	d->exclude_depth = filter_options->tree_exclude_depth;
 	d->current_depth = 0;
 
-	*filter_fn = filter_trees_depth;
-	*filter_free_fn = filter_trees_free;
-	return d;
+	ctx->filter_fn = filter_trees_depth;
+	ctx->free_fn = filter_trees_free;
+	ctx->data = d;
 }
 
-/*
- * A filter for list-objects to omit large blobs.
- * And to OPTIONALLY collect a list of the omitted OIDs.
- */
+/* A filter for list-objects to omit large blobs. */
 struct filter_blobs_limit_data {
-	struct oidset *omits;
 	unsigned long max_bytes;
 };
 
 static enum list_objects_filter_result filter_blobs_limit(
 	struct repository *r,
 	enum list_objects_filter_situation filter_situation,
 	struct object *obj,
 	const char *pathname,
 	const char *filename,
-	void *filter_data_)
+	struct filter_context *ctx)
 {
-	struct filter_blobs_limit_data *filter_data = filter_data_;
+	struct filter_blobs_limit_data *filter_data = ctx->data;
 	unsigned long object_length;
 	enum object_type t;
 
 	switch (filter_situation) {
 	default:
 		BUG("unknown filter_situation: %d", filter_situation);
 
 	case LOFS_BEGIN_TREE:
 		assert(obj->type == OBJ_TREE);
 		/* always include all tree objects */
@@ -263,44 +234,41 @@ static enum list_objects_filter_result filter_blobs_limit(
 			 * apply the size filter criteria.  Be conservative
 			 * and force show it (and let the caller deal with
 			 * the ambiguity).
 			 */
 			goto include_it;
 		}
 
 		if (object_length < filter_data->max_bytes)
 			goto include_it;
 
-		if (filter_data->omits)
-			oidset_insert(filter_data->omits, &obj->oid);
+		if (ctx->omits)
+			oidset_insert(ctx->omits, &obj->oid);
 		return LOFR_MARK_SEEN; /* but not LOFR_DO_SHOW (hard omit) */
 	}
 
 include_it:
-	if (filter_data->omits)
-		oidset_remove(filter_data->omits, &obj->oid);
+	if (ctx->omits)
+		oidset_remove(ctx->omits, &obj->oid);
 	return LOFR_MARK_SEEN | LOFR_DO_SHOW;
 }
 
-static void *filter_blobs_limit__init(
-	struct oidset *omitted,
+static void filter_blobs_limit__init(
 	struct list_objects_filter_options *filter_options,
-	filter_object_fn *filter_fn,
-	filter_free_fn *filter_free_fn)
+	struct filter_context *ctx)
 {
 	struct filter_blobs_limit_data *d = xcalloc(1, sizeof(*d));
-	d->omits = omitted;
 	d->max_bytes = filter_options->blob_limit_value;
 
-	*filter_fn = filter_blobs_limit;
-	*filter_free_fn = free;
-	return d;
+	ctx->filter_fn = filter_blobs_limit;
+	ctx->free_fn = free;
+	ctx->data = d;
 }
 
 /*
  * A filter driven by a sparse-checkout specification to only
  * include blobs that a sparse checkout would populate.
  *
  * The sparse-checkout spec can be loaded from a blob with the
  * given OID or from a local pathname.  We allow an OID because
  * the repo may be bare or we may be doing the filtering on the
  * server.
@@ -319,36 +287,35 @@ struct frame {
 	 * omitted objects.
 	 *
 	 * 0 if everything (recursively) contained in this directory
 	 * has been explicitly included (SHOWN) in the result and
 	 * the directory may be short-cut later in the traversal.
 	 */
 	unsigned child_prov_omit : 1;
 };
 
 struct filter_sparse_data {
-	struct oidset *omits;
 	struct exclude_list el;
 
 	size_t nr, alloc;
 	struct frame *array_frame;
 };
 
 static enum list_objects_filter_result filter_sparse(
 	struct repository *r,
 	enum list_objects_filter_situation filter_situation,
 	struct object *obj,
 	const char *pathname,
 	const char *filename,
-	void *filter_data_)
+	struct filter_context *ctx)
 {
-	struct filter_sparse_data *filter_data = filter_data_;
+	struct filter_sparse_data *filter_data = ctx->data;
 	int val, dtype;
 	struct frame *frame;
 
 	switch (filter_situation) {
 	default:
 		BUG("unknown filter_situation: %d", filter_situation);
 
 	case LOFS_BEGIN_TREE:
 		assert(obj->type == OBJ_TREE);
 		dtype = DT_DIR;
@@ -414,128 +381,117 @@ static enum list_objects_filter_result filter_sparse(
 
 		frame = &filter_data->array_frame[filter_data->nr];
 
 		dtype = DT_REG;
 		val = is_excluded_from_list(pathname, strlen(pathname),
 					    filename, &dtype, &filter_data->el,
 					    r->index);
 		if (val < 0)
 			val = frame->defval;
 		if (val > 0) {
-			if (filter_data->omits)
-				oidset_remove(filter_data->omits, &obj->oid);
+			if (ctx->omits)
+				oidset_remove(ctx->omits, &obj->oid);
 			return LOFR_MARK_SEEN | LOFR_DO_SHOW;
 		}
 
 		/*
 		 * Provisionally omit it.  We've already established that
 		 * this pathname is not in the sparse-checkout specification
 		 * with the CURRENT pathname, so we *WANT* to omit this blob.
 		 *
 		 * However, a pathname elsewhere in the tree may also
 		 * reference this same blob, so we cannot reject it yet.
 		 * Leave the LOFR_ bits unset so that if the blob appears
 		 * again in the traversal, we will be asked again.
 		 */
-		if (filter_data->omits)
-			oidset_insert(filter_data->omits, &obj->oid);
+		if (ctx->omits)
+			oidset_insert(ctx->omits, &obj->oid);
 
 		/*
 		 * Remember that at least 1 blob in this tree was
 		 * provisionally omitted.  This prevents us from short
 		 * cutting the tree in future iterations.
 		 */
 		frame->child_prov_omit = 1;
 		return LOFR_ZERO;
 	}
 }
 
 
 static void filter_sparse_free(void *filter_data)
 {
 	struct filter_sparse_data *d = filter_data;
 	/* TODO free contents of 'd' */
 	free(d);
 }
 
-static void *filter_sparse_oid__init(
-	struct oidset *omitted,
+static void filter_sparse_oid__init(
 	struct list_objects_filter_options *filter_options,
-	filter_object_fn *filter_fn,
-	filter_free_fn *filter_free_fn)
+	struct filter_context *ctx)
 {
 	struct filter_sparse_data *d = xcalloc(1, sizeof(*d));
-	d->omits = omitted;
 	if (add_excludes_from_blob_to_list(filter_options->sparse_oid_value,
 					   NULL, 0, &d->el) < 0)
 		die("could not load filter specification");
 
 	ALLOC_GROW(d->array_frame, d->nr + 1, d->alloc);
 	d->array_frame[d->nr].defval = 0; /* default to include */
 	d->array_frame[d->nr].child_prov_omit = 0;
 
-	*filter_fn = filter_sparse;
-	*filter_free_fn = filter_sparse_free;
-	return d;
+	ctx->filter_fn = filter_sparse;
+	ctx->free_fn = filter_sparse_free;
+	ctx->data = d;
 }
 
-static void *filter_sparse_path__init(
-	struct oidset *omitted,
+static void filter_sparse_path__init(
 	struct list_objects_filter_options *filter_options,
-	filter_object_fn *filter_fn,
-	filter_free_fn *filter_free_fn)
+	struct filter_context *ctx)
 {
 	struct filter_sparse_data *d = xcalloc(1, sizeof(*d));
-	d->omits = omitted;
 	if (add_excludes_from_file_to_list(filter_options->sparse_path_value,
 					   NULL, 0, &d->el, NULL) < 0)
 		die("could not load filter specification");
 
 	ALLOC_GROW(d->array_frame, d->nr + 1, d->alloc);
 	d->array_frame[d->nr].defval = 0; /* default to include */
 	d->array_frame[d->nr].child_prov_omit = 0;
 
-	*filter_fn = filter_sparse;
-	*filter_free_fn = filter_sparse_free;
-	return d;
+	ctx->filter_fn = filter_sparse;
+	ctx->free_fn = filter_sparse_free;
+	ctx->data = d;
 }
 
-typedef void *(*filter_init_fn)(
-	struct oidset *omitted,
+typedef void (*filter_init_fn)(
 	struct list_objects_filter_options *filter_options,
-	filter_object_fn *filter_fn,
-	filter_free_fn *filter_free_fn);
+	struct filter_context *ctx);
 
 /*
  * Must match "enum list_objects_filter_choice".
  */
 static filter_init_fn s_filters[] = {
 	NULL,
 	filter_blobs_none__init,
 	filter_blobs_limit__init,
 	filter_trees_depth__init,
 	filter_sparse_oid__init,
 	filter_sparse_path__init,
 };
 
-void *list_objects_filter__init(
+void list_objects_filter__init(
 	struct oidset *omitted,
 	struct list_objects_filter_options *filter_options,
-	filter_object_fn *filter_fn,
-	filter_free_fn *filter_free_fn)
+	struct filter_context *ctx)
 {
 	filter_init_fn init_fn;
 
 	assert((sizeof(s_filters) / sizeof(s_filters[0])) == LOFC__COUNT);
 
 	if (filter_options->choice >= LOFC__COUNT)
 		BUG("invalid list-objects filter choice: %d",
 		    filter_options->choice);
 
+	memset(ctx, 0, sizeof(*ctx));
+	ctx->omits = omitted;
 	init_fn = s_filters[filter_options->choice];
 	if (init_fn)
-		return init_fn(omitted, filter_options,
-			       filter_fn, filter_free_fn);
-	*filter_fn = NULL;
-	*filter_free_fn = NULL;
-	return NULL;
+		init_fn(filter_options, ctx);
 }
diff --git a/list-objects-filter.h b/list-objects-filter.h
index 1d45a4ad57..ee807f5d9b 100644
--- a/list-objects-filter.h
+++ b/list-objects-filter.h
@@ -53,37 +53,46 @@ enum list_objects_filter_result {
 	LOFR_DO_SHOW   = 1<<1,
 	LOFR_SKIP_TREE = 1<<2,
 };
 
 enum list_objects_filter_situation {
 	LOFS_BEGIN_TREE,
 	LOFS_END_TREE,
 	LOFS_BLOB
 };
 
-typedef enum list_objects_filter_result (*filter_object_fn)(
-	struct repository *r,
-	enum list_objects_filter_situation filter_situation,
-	struct object *obj,
-	const char *pathname,
-	const char *filename,
-	void *filter_data);
+struct filter_context {
+	enum list_objects_filter_result (*filter_fn)(
+		struct repository *r,
+		enum list_objects_filter_situation filter_situation,
+		struct object *obj,
+		const char *pathname,
+		const char *filename,
+		struct filter_context *ctx);
+	void (*free_fn)(void *filter_data);
 
-typedef void (*filter_free_fn)(void *filter_data);
+	struct oidset *omits;
+	void *data;
+};
 
 /*
  * Constructor for the set of defined list-objects filters.
  * Returns a generic "void *filter_data".
  *
  * The returned "filter_fn" will be used by traverse_commit_list()
  * to filter the results.
  *
  * The returned "filter_free_fn" is a destructor for the
  * filter_data.
  */
-void *list_objects_filter__init(
+void list_objects_filter__init(
 	struct oidset *omitted,
 	struct list_objects_filter_options *filter_options,
-	filter_object_fn *filter_fn,
-	filter_free_fn *filter_free_fn);
+	struct filter_context *ctx);
+
+static inline void list_objects_filter__release(struct filter_context *ctx) {
+	if (ctx->data && ctx->free_fn)
+		ctx->free_fn(ctx->data);
+	memset(ctx, 0, sizeof(*ctx));
+}
 
 #endif /* LIST_OBJECTS_FILTER_H */
diff --git a/list-objects.c b/list-objects.c
index b5651ddd5b..7a73f7deee 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -11,22 +11,21 @@
 #include "list-objects-filter-options.h"
 #include "packfile.h"
 #include "object-store.h"
 #include "trace.h"
 
 struct traversal_context {
 	struct rev_info *revs;
 	show_object_fn show_object;
 	show_commit_fn show_commit;
 	void *show_data;
-	filter_object_fn filter_fn;
-	void *filter_data;
+	struct filter_context filter_ctx;
 };
 
 static void process_blob(struct traversal_context *ctx,
 			 struct blob *blob,
 			 struct strbuf *path,
 			 const char *name)
 {
 	struct object *obj = &blob->object;
 	size_t pathlen;
 	enum list_objects_filter_result r = LOFR_MARK_SEEN | LOFR_DO_SHOW;
@@ -47,25 +46,25 @@ static void process_blob(struct traversal_context *ctx,
 	 * may cause the actual filter to report an incomplete list
 	 * of missing objects.
 	 */
 	if (ctx->revs->exclude_promisor_objects &&
 	    !has_object_file(&obj->oid) &&
 	    is_promisor_object(&obj->oid))
 		return;
 
 	pathlen = path->len;
 	strbuf_addstr(path, name);
-	if ((obj->flags & NOT_USER_GIVEN) && ctx->filter_fn)
-		r = ctx->filter_fn(ctx->revs->repo,
-				   LOFS_BLOB, obj,
-				   path->buf, &path->buf[pathlen],
-				   ctx->filter_data);
+	if ((obj->flags & NOT_USER_GIVEN) && ctx->filter_ctx.filter_fn)
+		r = ctx->filter_ctx.filter_fn(ctx->revs->repo,
+					      LOFS_BLOB, obj,
+					      path->buf, &path->buf[pathlen],
+					      &ctx->filter_ctx);
 	if (r & LOFR_MARK_SEEN)
 		obj->flags |= SEEN;
 	if (r & LOFR_DO_SHOW)
 		ctx->show_object(obj, path->buf, ctx->show_data);
 	strbuf_setlen(path, pathlen);
 }
 
 /*
  * Processing a gitlink entry currently does nothing, since
  * we do not recurse into the subproject.
@@ -179,42 +178,42 @@ static void process_tree(struct traversal_context *ctx,
 		 */
 		if (revs->exclude_promisor_objects &&
 		    is_promisor_object(&obj->oid))
 			return;
 
 		if (!revs->do_not_die_on_missing_tree)
 			die("bad tree object %s", oid_to_hex(&obj->oid));
 	}
 
 	strbuf_addstr(base, name);
-	if ((obj->flags & NOT_USER_GIVEN) && ctx->filter_fn)
-		r = ctx->filter_fn(ctx->revs->repo,
-				   LOFS_BEGIN_TREE, obj,
-				   base->buf, &base->buf[baselen],
-				   ctx->filter_data);
+	if ((obj->flags & NOT_USER_GIVEN) && ctx->filter_ctx.filter_fn)
+		r = ctx->filter_ctx.filter_fn(ctx->revs->repo,
+					      LOFS_BEGIN_TREE, obj,
+					      base->buf, &base->buf[baselen],
+					      &ctx->filter_ctx);
 	if (r & LOFR_MARK_SEEN)
 		obj->flags |= SEEN;
 	if (r & LOFR_DO_SHOW)
 		ctx->show_object(obj, base->buf, ctx->show_data);
 	if (base->len)
 		strbuf_addch(base, '/');
 
 	if (r & LOFR_SKIP_TREE)
 		trace_printf("Skipping contents of tree %s...\n", base->buf);
 	else if (!failed_parse)
 		process_tree_contents(ctx, tree, base);
 
-	if ((obj->flags & NOT_USER_GIVEN) && ctx->filter_fn) {
-		r = ctx->filter_fn(ctx->revs->repo,
-				   LOFS_END_TREE, obj,
-				   base->buf, &base->buf[baselen],
-				   ctx->filter_data);
+	if ((obj->flags & NOT_USER_GIVEN) && ctx->filter_ctx.filter_fn) {
+		r = ctx->filter_ctx.filter_fn(ctx->revs->repo,
+					      LOFS_END_TREE, obj,
+					      base->buf, &base->buf[baselen],
+					      &ctx->filter_ctx);
 		if (r & LOFR_MARK_SEEN)
 			obj->flags |= SEEN;
 		if (r & LOFR_DO_SHOW)
 			ctx->show_object(obj, base->buf, ctx->show_data);
 	}
 
 	strbuf_setlen(base, baselen);
 	free_tree_buffer(tree);
 }
 
@@ -395,38 +394,34 @@ static void do_traverse(struct traversal_context *ctx)
 void traverse_commit_list(struct rev_info *revs,
 			  show_commit_fn show_commit,
 			  show_object_fn show_object,
 			  void *show_data)
 {
 	struct traversal_context ctx;
 	ctx.revs = revs;
 	ctx.show_commit = show_commit;
 	ctx.show_object = show_object;
 	ctx.show_data = show_data;
-	ctx.filter_fn = NULL;
-	ctx.filter_data = NULL;
+	memset(&ctx.filter_ctx, 0, sizeof(ctx.filter_ctx));
 	do_traverse(&ctx);
 }
 
 void traverse_commit_list_filtered(
 	struct list_objects_filter_options *filter_options,
 	struct rev_info *revs,
 	show_commit_fn show_commit,
 	show_object_fn show_object,
 	void *show_data,
 	struct oidset *omitted)
 {
 	struct traversal_context ctx;
-	filter_free_fn filter_free_fn = NULL;
+	memset(&ctx, 0, sizeof(ctx));
 
 	ctx.revs = revs;
 	ctx.show_object = show_object;
 	ctx.show_commit = show_commit;
 	ctx.show_data = show_data;
-	ctx.filter_fn = NULL;
 
-	ctx.filter_data = list_objects_filter__init(omitted, filter_options,
-						    &ctx.filter_fn, &filter_free_fn);
+	list_objects_filter__init(omitted, filter_options, &ctx.filter_ctx);
 	do_traverse(&ctx);
-	if (ctx.filter_data && filter_free_fn)
-		filter_free_fn(ctx.filter_data);
+	list_objects_filter__release(&ctx.filter_ctx);
 }
-- 
2.21.0


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH 2/3] list-objects-filter-options: error is localizeable
  2019-05-16 18:56               ` [RFC PATCH 0/3] implement composite filters Matthew DeVore
  2019-05-16 18:56                 ` [RFC PATCH 1/3] list-objects-filter: refactor into a context struct Matthew DeVore
@ 2019-05-16 18:56                 ` Matthew DeVore
  2019-05-16 18:56                 ` [RFC PATCH 3/3] list-objects-filter: implement composite filters Matthew DeVore
  2019-05-16 22:41                 ` [RFC PATCH 0/3] " Jonathan Tan
  3 siblings, 0 replies; 23+ messages in thread
From: Matthew DeVore @ 2019-05-16 18:56 UTC (permalink / raw)
  To: jonathantanmy, jrn, git, dstolee, jeffhost, jrnieder
  Cc: Matthew DeVore, matvore

The "invalid filter-spec" message is user-facing and not a BUG, so make
it localizeable.

Signed-off-by: Matthew DeVore <matvore@google.com>
---
 list-objects-filter-options.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
index c0036f7378..e46ea467bc 100644
--- a/list-objects-filter-options.c
+++ b/list-objects-filter-options.c
@@ -81,21 +81,21 @@ static int gently_parse_list_objects_filter(
 		filter_options->choice = LOFC_SPARSE_PATH;
 		filter_options->sparse_path_value = strdup(v0);
 		return 0;
 	}
 	/*
 	 * Please update _git_fetch() in git-completion.bash when you
 	 * add new filters
 	 */
 
 	if (errbuf)
-		strbuf_addf(errbuf, "invalid filter-spec '%s'", arg);
+		strbuf_addf(errbuf, _("invalid filter-spec '%s'"), arg);
 
 	memset(filter_options, 0, sizeof(*filter_options));
 	return 1;
 }
 
 int parse_list_objects_filter(struct list_objects_filter_options *filter_options,
 			      const char *arg)
 {
 	struct strbuf buf = STRBUF_INIT;
 	if (gently_parse_list_objects_filter(filter_options, arg, &buf))
-- 
2.21.0


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH 3/3] list-objects-filter: implement composite filters
  2019-05-16 18:56               ` [RFC PATCH 0/3] implement composite filters Matthew DeVore
  2019-05-16 18:56                 ` [RFC PATCH 1/3] list-objects-filter: refactor into a context struct Matthew DeVore
  2019-05-16 18:56                 ` [RFC PATCH 2/3] list-objects-filter-options: error is localizeable Matthew DeVore
@ 2019-05-16 18:56                 ` Matthew DeVore
  2019-05-17  3:25                   ` Junio C Hamano
  2019-05-16 22:41                 ` [RFC PATCH 0/3] " Jonathan Tan
  3 siblings, 1 reply; 23+ messages in thread
From: Matthew DeVore @ 2019-05-16 18:56 UTC (permalink / raw)
  To: jonathantanmy, jrn, git, dstolee, jeffhost, jrnieder
  Cc: Matthew DeVore, matvore

Allow combining filters such that only objects accepted by all filters
are shown. The motivation for this is to allow getting directory
listings without also fetching blobs. This can be done by combining
blob:none with tree:<depth>. There are massive repositories that have
larger-than-expected trees - even if you include only a single commit.

The current usage requires passing the filter to rev-list, or sending
it over the wire, as:

	combine:<FILTER1>+<FILTER2>

(i.e.: git rev-list --filter=combine:tree:2+blob:limit=32k). This is
potentially awkward because individual filters must be URL-encoded if
they contain + or %. This can potentially be improved by supporting a
repeated flag syntax, e.g.:

	$ git rev-list --filter=tree:2 --filter:blob:limit=32k

Such usage is currently an error, so giving it a meaning is backwards-
compatible.

Signed-off-by: Matthew DeVore <matvore@google.com>
---
 Documentation/rev-list-options.txt     |   8 ++
 contrib/completion/git-completion.bash |   2 +-
 list-objects-filter-options.c          | 130 +++++++++++++++++++++++++
 list-objects-filter-options.h          |  14 ++-
 list-objects-filter.c                  | 116 ++++++++++++++++++++++
 t/t6112-rev-list-filters-objects.sh    | 108 +++++++++++++++++++-
 6 files changed, 373 insertions(+), 5 deletions(-)

diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index ddbc1de43f..a2e14d3753 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -730,20 +730,28 @@ specification contained in <path>.
 +
 The form '--filter=tree:<depth>' omits all blobs and trees whose depth
 from the root tree is >= <depth> (minimum depth if an object is located
 at multiple depths in the commits traversed). <depth>=0 will not include
 any trees or blobs unless included explicitly in the command-line (or
 standard input when --stdin is used). <depth>=1 will include only the
 tree and blobs which are referenced directly by a commit reachable from
 <commit> or an explicitly-given object. <depth>=2 is like <depth>=1
 while also including trees and blobs one more level removed from an
 explicitly-given commit or tree.
++
+The form '--filter=combine:<filter1>+<filter2>+...<filterN>' combines
+several filters. Only objects which are accepted by every filter are
+included. Filters are joined by "+" and individual filters are %-encoded
+(i.e. URL-encoded). Only the "%" and "+" characters must be encoded. For
+instance, 'combine:tree:3+blob:none' and
+'combine:tree%3A2+blob%3Anone' are equivalent, and result in only tree
+objects whose depth from the root is >= 3 being included.
 
 --no-filter::
 	Turn off any previous `--filter=` argument.
 
 --filter-print-omitted::
 	Only useful with `--filter=`; prints a list of the objects omitted
 	by the filter.  Object IDs are prefixed with a ``~'' character.
 
 --missing=<missing-action>::
 	A debug option to help with future "partial clone" development.
diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash
index 3eefbabdb1..0fd0a10d0c 100644
--- a/contrib/completion/git-completion.bash
+++ b/contrib/completion/git-completion.bash
@@ -1529,21 +1529,21 @@ _git_difftool ()
 __git_fetch_recurse_submodules="yes on-demand no"
 
 _git_fetch ()
 {
 	case "$cur" in
 	--recurse-submodules=*)
 		__gitcomp "$__git_fetch_recurse_submodules" "" "${cur##--recurse-submodules=}"
 		return
 		;;
 	--filter=*)
-		__gitcomp "blob:none blob:limit= sparse:oid= sparse:path=" "" "${cur##--filter=}"
+		__gitcomp "blob:none blob:limit= sparse:oid= sparse:path= combine: tree:" "" "${cur##--filter=}"
 		return
 		;;
 	--*)
 		__gitcomp_builtin fetch
 		return
 		;;
 	esac
 	__git_complete_remote_or_refspec
 }
 
diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
index e46ea467bc..e9d0de17e0 100644
--- a/list-objects-filter-options.c
+++ b/list-objects-filter-options.c
@@ -1,19 +1,24 @@
 #include "cache.h"
 #include "commit.h"
 #include "config.h"
 #include "revision.h"
 #include "argv-array.h"
 #include "list-objects.h"
 #include "list-objects-filter.h"
 #include "list-objects-filter-options.h"
 
+static int parse_combine_filter(
+	struct list_objects_filter_options *filter_options,
+	const char *arg,
+	struct strbuf *errbuf);
+
 /*
  * Parse value of the argument to the "filter" keyword.
  * On the command line this looks like:
  *       --filter=<arg>
  * and in the pack protocol as:
  *       "filter" SP <arg>
  *
  * The filter keyword will be used by many commands.
  * See Documentation/rev-list-options.txt for allowed values for <arg>.
  *
@@ -74,33 +79,152 @@ static int gently_parse_list_objects_filter(
 		if (!get_oid_with_context(the_repository, v0, GET_OID_BLOB,
 					  &sparse_oid, &oc))
 			filter_options->sparse_oid_value = oiddup(&sparse_oid);
 		filter_options->choice = LOFC_SPARSE_OID;
 		return 0;
 
 	} else if (skip_prefix(arg, "sparse:path=", &v0)) {
 		filter_options->choice = LOFC_SPARSE_PATH;
 		filter_options->sparse_path_value = strdup(v0);
 		return 0;
+
+	} else if (skip_prefix(arg, "combine:", &v0)) {
+		int sub_parse_res = parse_combine_filter(
+			filter_options, v0, errbuf);
+		if (sub_parse_res)
+			return sub_parse_res;
+		return 0;
+
 	}
 	/*
 	 * Please update _git_fetch() in git-completion.bash when you
 	 * add new filters
 	 */
 
 	if (errbuf)
 		strbuf_addf(errbuf, _("invalid filter-spec '%s'"), arg);
 
 	memset(filter_options, 0, sizeof(*filter_options));
 	return 1;
 }
 
+static int digit_value(int c, struct strbuf *errbuf) {
+	if (c >= '0' && c <= '9')
+		return c - '0';
+	if (c >= 'a' && c <= 'f')
+		return c - 'a' + 10;
+	if (c >= 'A' && c <= 'F')
+		return c - 'A' + 10;
+
+	if (!errbuf)
+		return -1;
+
+	strbuf_addf(errbuf, _("error in filter-spec - "));
+	if (c)
+		strbuf_addf(
+			errbuf,
+			_("expect two hex digits after %%, but got: '%c'"),
+			c);
+	else
+		strbuf_addf(
+			errbuf,
+			_("not enough hex digits after %%; expected two"));
+
+	return -1;
+}
+
+static int url_decode(struct strbuf *s, struct strbuf *errbuf) {
+	char *dest = s->buf;
+	char *src = s->buf;
+	while (*src) {
+		int digit_value_0, digit_value_1;
+
+		if (src[0] != '%') {
+			*dest++ = *src++;
+			continue;
+		}
+		src++;
+
+		digit_value_0 = digit_value(*src++, errbuf);
+		if (digit_value_0 < 0)
+			return 1;
+		digit_value_1 = digit_value(*src++, errbuf);
+		if (digit_value_1 < 0)
+			return 1;
+		*dest++ = digit_value_0 * 16 + digit_value_1;
+	}
+	size_t new_len = dest - s->buf;
+	strbuf_remove(s, new_len, s->len - new_len);
+
+	return 0;
+}
+
+static int parse_combine_filter(
+	struct list_objects_filter_options *filter_options,
+	const char *arg,
+	struct strbuf *errbuf)
+{
+	struct strbuf **sub_specs = strbuf_split_str(arg, '+', 2);
+	int result;
+
+	if (!sub_specs[0]) {
+		if (errbuf)
+			strbuf_addf(errbuf,
+				    _("expected something after combine:"));
+		result = 1;
+		goto cleanup;
+	}
+
+	/*
+	 * Only decode the first sub-filter, since the rest will be decoded on
+	 * the recursive call.
+	 */
+	result = url_decode(sub_specs[0], errbuf);
+	if (result)
+		goto cleanup;
+
+	if (!sub_specs[1]) {
+		/*
+		 * There is only one sub-filter, so we don't need the
+		 * combine: - just parse it as a non-composite filter.
+		 */
+		result = gently_parse_list_objects_filter(
+			filter_options, sub_specs[0]->buf, errbuf);
+		goto cleanup;
+	}
+
+	/* Remove trailing "+" so we can parse it. */
+	assert(sub_specs[0]->buf[sub_specs[0]->len - 1] == '+');
+	strbuf_remove(sub_specs[0], sub_specs[0]->len - 1, 1);
+
+	filter_options->choice = LOFC_COMBINE;
+	filter_options->lhs = xcalloc(1, sizeof(*filter_options->lhs));
+	result = gently_parse_list_objects_filter(filter_options->lhs,
+						  sub_specs[0]->buf,
+						  errbuf);
+	if (result)
+		goto cleanup;
+
+	filter_options->rhs = xcalloc(1, sizeof(*filter_options->rhs));
+	result = parse_combine_filter(filter_options->rhs,
+				      sub_specs[1]->buf,
+				      errbuf);
+
+cleanup:
+	strbuf_list_free(sub_specs);
+	if (result) {
+		list_objects_filter_release(filter_options);
+		memset(filter_options, 0, sizeof(*filter_options));
+	}
+	return result;
+}
+
 int parse_list_objects_filter(struct list_objects_filter_options *filter_options,
 			      const char *arg)
 {
 	struct strbuf buf = STRBUF_INIT;
 	if (gently_parse_list_objects_filter(filter_options, arg, &buf))
 		die("%s", buf.buf);
 	return 0;
 }
 
 int opt_parse_list_objects_filter(const struct option *opt,
@@ -127,23 +251,29 @@ void expand_list_objects_filter_spec(
 	else if (filter->choice == LOFC_TREE_DEPTH)
 		strbuf_addf(expanded_spec, "tree:%lu",
 			    filter->tree_exclude_depth);
 	else
 		strbuf_addstr(expanded_spec, filter->filter_spec);
 }
 
 void list_objects_filter_release(
 	struct list_objects_filter_options *filter_options)
 {
+	if (!filter_options)
+		return;
 	free(filter_options->filter_spec);
 	free(filter_options->sparse_oid_value);
 	free(filter_options->sparse_path_value);
+	list_objects_filter_release(filter_options->lhs);
+	free(filter_options->lhs);
+	list_objects_filter_release(filter_options->rhs);
+	free(filter_options->rhs);
 	memset(filter_options, 0, sizeof(*filter_options));
 }
 
 void partial_clone_register(
 	const char *remote,
 	const struct list_objects_filter_options *filter_options)
 {
 	/*
 	 * Record the name of the partial clone remote in the
 	 * config and in the global variable -- the latter is
diff --git a/list-objects-filter-options.h b/list-objects-filter-options.h
index e3adc78ebf..6c0f0ecd08 100644
--- a/list-objects-filter-options.h
+++ b/list-objects-filter-options.h
@@ -7,20 +7,21 @@
 /*
  * The list of defined filters for list-objects.
  */
 enum list_objects_filter_choice {
 	LOFC_DISABLED = 0,
 	LOFC_BLOB_NONE,
 	LOFC_BLOB_LIMIT,
 	LOFC_TREE_DEPTH,
 	LOFC_SPARSE_OID,
 	LOFC_SPARSE_PATH,
+	LOFC_COMBINE,
 	LOFC__COUNT /* must be last */
 };
 
 struct list_objects_filter_options {
 	/*
 	 * 'filter_spec' is the raw argument value given on the command line
 	 * or protocol request.  (The part after the "--keyword=".)  For
 	 * commands that launch filtering sub-processes, or for communication
 	 * over the network, don't use this value; use the result of
 	 * expand_list_objects_filter_spec() instead.
@@ -32,28 +33,35 @@ struct list_objects_filter_options {
 	 * the filtering algorithm to use.
 	 */
 	enum list_objects_filter_choice choice;
 
 	/*
 	 * Choice is LOFC_DISABLED because "--no-filter" was requested.
 	 */
 	unsigned int no_filter : 1;
 
 	/*
-	 * Parsed values (fields) from within the filter-spec.  These are
-	 * choice-specific; not all values will be defined for any given
-	 * choice.
+	 * BEGIN choice-specific parsed values from within the filter-spec. Only
+	 * some values will be defined for any given choice.
 	 */
+
 	struct object_id *sparse_oid_value;
 	char *sparse_path_value;
 	unsigned long blob_limit_value;
 	unsigned long tree_exclude_depth;
+
+	/* LOFC_COMBINE values */
+	struct list_objects_filter_options *lhs, *rhs;
+
+	/*
+	 * END choice-specific parsed values.
+	 */
 };
 
 /* Normalized command line arguments */
 #define CL_ARG__FILTER "filter"
 
 int parse_list_objects_filter(
 	struct list_objects_filter_options *filter_options,
 	const char *arg);
 
 int opt_parse_list_objects_filter(const struct option *opt,
diff --git a/list-objects-filter.c b/list-objects-filter.c
index 8e8616b9b8..28496f31f7 100644
--- a/list-objects-filter.c
+++ b/list-objects-filter.c
@@ -453,34 +453,150 @@ static void filter_sparse_path__init(
 
 	ALLOC_GROW(d->array_frame, d->nr + 1, d->alloc);
 	d->array_frame[d->nr].defval = 0; /* default to include */
 	d->array_frame[d->nr].child_prov_omit = 0;
 
 	ctx->filter_fn = filter_sparse;
 	ctx->free_fn = filter_sparse_free;
 	ctx->data = d;
 }
 
+struct filter_combine_data {
+	/* sub[0] corresponds to lhs, sub[1] to rhs. */
+	struct {
+		struct filter_context ctx;
+		struct oidset seen;
+		struct object_id skip_tree;
+		unsigned is_skipping_tree : 1;
+	} sub[2];
+};
+
+static void filter_combine_free(void *filter_data)
+{
+	struct filter_combine_data *d = filter_data;
+	int i;
+	for (i = 0; i < 2; i++) {
+		list_objects_filter__release(&d->sub[i].ctx);
+		oidset_clear(&d->sub[i].seen);
+	}
+	free(d);
+}
+
+static int should_delegate(enum list_objects_filter_situation filter_situation,
+			   struct object *obj,
+			   struct filter_combine_data *d,
+			   int side)
+{
+	if (!d->sub[side].is_skipping_tree)
+		return 1;
+	if (filter_situation == LOFS_END_TREE &&
+		oideq(&obj->oid, &d->sub[side].skip_tree)) {
+		d->sub[side].is_skipping_tree = 0;
+		return 1;
+	}
+	return 0;
+}
+
+static enum list_objects_filter_result filter_combine(
+	struct repository *r,
+	enum list_objects_filter_situation filter_situation,
+	struct object *obj,
+	const char *pathname,
+	const char *filename,
+	struct filter_context *ctx)
+{
+	struct filter_combine_data *d = ctx->data;
+	enum list_objects_filter_result lhs_result = LOFR_ZERO;
+	enum list_objects_filter_result rhs_result = LOFR_ZERO;
+	int lhs_already_seen = oidset_contains(&d->sub[0].seen, &obj->oid);
+	int rhs_already_seen = oidset_contains(&d->sub[1].seen, &obj->oid);
+	int delegate_lhs = !lhs_already_seen &&
+		should_delegate(filter_situation, obj, d, 0);
+	int delegate_rhs = !rhs_already_seen &&
+		should_delegate(filter_situation, obj, d, 1);
+	enum list_objects_filter_result combined_result = LOFR_ZERO;
+
+	if (lhs_already_seen && rhs_already_seen)
+		return LOFR_ZERO;
+
+	if (delegate_lhs)
+		lhs_result = d->sub[0].ctx.filter_fn(
+			r, filter_situation, obj, pathname, filename,
+			&d->sub[0].ctx);
+	if (delegate_rhs)
+		rhs_result = d->sub[1].ctx.filter_fn(
+			r, filter_situation, obj, pathname, filename,
+			&d->sub[1].ctx);
+
+	if (lhs_result & LOFR_MARK_SEEN)
+		oidset_insert(&d->sub[0].seen, &obj->oid);
+
+	if (rhs_result & LOFR_MARK_SEEN)
+		oidset_insert(&d->sub[1].seen, &obj->oid);
+
+	if (lhs_result & LOFR_SKIP_TREE) {
+		d->sub[0].is_skipping_tree = 1;
+		d->sub[0].skip_tree = obj->oid;
+	}
+	if (rhs_result & LOFR_SKIP_TREE) {
+		d->sub[1].is_skipping_tree = 1;
+		d->sub[1].skip_tree = obj->oid;
+	}
+
+	if ((lhs_result & LOFR_DO_SHOW) && (rhs_result & LOFR_DO_SHOW))
+		combined_result |= LOFR_DO_SHOW;
+	if (d->sub[0].is_skipping_tree && d->sub[1].is_skipping_tree)
+		combined_result |= LOFR_SKIP_TREE;
+
+	return combined_result;
+}
+
+static void filter_combine__init(
+	struct list_objects_filter_options *filter_options,
+	struct filter_context *ctx)
+{
+	struct filter_combine_data *d = xcalloc(1, sizeof(*d));
+	struct oidset *lhs_omits = NULL;
+	struct oidset *rhs_omits = NULL;
+
+	if (ctx->omits) {
+		lhs_omits = xcalloc(1, sizeof(*lhs_omits));
+		oidset_init(lhs_omits, 16);
+		rhs_omits = xcalloc(1, sizeof(*rhs_omits));
+		oidset_init(rhs_omits, 16);
+	}
+
+	list_objects_filter__init(lhs_omits, filter_options->lhs,
+				  &d->sub[0].ctx);
+	list_objects_filter__init(rhs_omits, filter_options->rhs,
+				  &d->sub[1].ctx);
+
+	ctx->filter_fn = filter_combine;
+	ctx->free_fn = filter_combine_free;
+	ctx->data = d;
+}
+
 typedef void (*filter_init_fn)(
 	struct list_objects_filter_options *filter_options,
 	struct filter_context *ctx);
 
 /*
  * Must match "enum list_objects_filter_choice".
  */
 static filter_init_fn s_filters[] = {
 	NULL,
 	filter_blobs_none__init,
 	filter_blobs_limit__init,
 	filter_trees_depth__init,
 	filter_sparse_oid__init,
 	filter_sparse_path__init,
+	filter_combine__init,
 };
 
 void list_objects_filter__init(
 	struct oidset *omitted,
 	struct list_objects_filter_options *filter_options,
 	struct filter_context *ctx)
 {
 	filter_init_fn init_fn;
 
 	assert((sizeof(s_filters) / sizeof(s_filters[0])) == LOFC__COUNT);
diff --git a/t/t6112-rev-list-filters-objects.sh b/t/t6112-rev-list-filters-objects.sh
index 9c11427719..63c1524f6f 100755
--- a/t/t6112-rev-list-filters-objects.sh
+++ b/t/t6112-rev-list-filters-objects.sh
@@ -284,21 +284,33 @@ test_expect_success 'verify tree:0 includes trees in "filtered" output' '
 # Make sure tree:0 does not iterate through any trees.
 
 test_expect_success 'verify skipping tree iteration when not collecting omits' '
 	GIT_TRACE=1 git -C r3 rev-list \
 		--objects --filter=tree:0 HEAD 2>filter_trace &&
 	grep "Skipping contents of tree [.][.][.]" filter_trace >actual &&
 	# One line for each commit traversed.
 	test_line_count = 2 actual &&
 
 	# Make sure no other trees were considered besides the root.
-	! grep "Skipping contents of tree [^.]" filter_trace
+	! grep "Skipping contents of tree [^.]" filter_trace &&
+
+	# Try this again with "combine:". If both sub-filters are skipping
+	# trees, the composite filter should also skip trees. This is not
+	# important unless the user does combine:tree:X+tree:Y or another filter
+	# besides "tree:" is implemented in the future which can skip trees.
+	GIT_TRACE=1 git -C r3 rev-list \
+		--objects --filter=combine:tree:1+tree:3 HEAD 2>filter_trace &&
+
+	# Only skip the dir1/ tree, which is shared between the two commits.
+	grep "Skipping contents of tree " filter_trace >actual &&
+	test_write_lines "Skipping contents of tree dir1/..." >expected &&
+	test_cmp expected actual
 '
 
 # Test tree:# filters.
 
 expect_has () {
 	commit=$1 &&
 	name=$2 &&
 
 	hash=$(git -C r3 rev-parse $commit:$name) &&
 	grep "^$hash $name$" actual
@@ -336,20 +348,114 @@ test_expect_success 'verify tree:3 includes everything expected' '
 	expect_has HEAD dir1/sparse1 &&
 	expect_has HEAD dir1/sparse2 &&
 	expect_has HEAD pattern &&
 	expect_has HEAD sparse1 &&
 	expect_has HEAD sparse2 &&
 
 	# There are also 2 commit objects
 	test_line_count = 10 actual
 '
 
+test_expect_success 'combine:... for a simple combination' '
+	git -C r3 rev-list --objects --filter=combine:tree:2+blob:none HEAD \
+		>actual &&
+
+	expect_has HEAD "" &&
+	expect_has HEAD~1 "" &&
+	expect_has HEAD dir1 &&
+
+	# There are also 2 commit objects
+	test_line_count = 5 actual
+'
+
+test_expect_success 'combine:... with URL encoding' '
+	git -C r3 rev-list --objects \
+		--filter=combine:tree%3a2+blob:%6Eon%65 HEAD >actual &&
+
+	expect_has HEAD "" &&
+	expect_has HEAD~1 "" &&
+	expect_has HEAD dir1 &&
+
+	# There are also 2 commit objects
+	test_line_count = 5 actual
+'
+
+expect_invalid_filter_spec () {
+	spec="$1" &&
+	err="$2" &&
+
+	test_must_fail git -C r3 rev-list --objects --filter="$spec" HEAD \
+		>actual 2>actual_stderr &&
+	test_must_be_empty actual &&
+	test_i18ngrep "$err" actual_stderr && cat actual_stderr >/dev/stderr
+}
+
+test_expect_success 'combine:... while URL-encoding things that should not be' '
+	expect_invalid_filter_spec combine%3Atree:2+blob:none \
+		"invalid filter-spec"
+'
+
+test_expect_success 'combine:... with invalid URL-encoded sequences' '
+	expect_invalid_filter_spec combine:tree:2+blob:non%a \
+		"error in filter-spec - not enough hex digits after %" &&
+	# Edge cases for non-hex chars: "Gg/:@`"
+	expect_invalid_filter_spec combine:tree:2+blob%G5none \
+		"error in filter-spec - expect two hex digits .*: .G." &&
+	expect_invalid_filter_spec combine:tree:2+blob%5/none \
+		"error in filter-spec - expect two hex digits .*: ./." &&
+	expect_invalid_filter_spec combine:%:5tree:2+blob:none \
+		"error in filter-spec - expect two hex digits .*: .:." &&
+	expect_invalid_filter_spec combine:tree:2+blob:none%f@ \
+		"error in filter-spec - expect two hex digits .*: .@." &&
+	expect_invalid_filter_spec combine:tree:2+blob:none%f\` \
+		"error in filter-spec - expect two hex digits .*: .\`."
+'
+
+test_expect_success 'combine:... with edge-case hex digits: Ff Aa 0 9' '
+	git -C r3 rev-list --objects --filter="combine:tree:2+bl%6Fb:n%6fne" \
+		HEAD >actual &&
+	test_line_count = 5 actual &&
+	git -C r3 rev-list --objects --filter="combine:tree%3A2+blob%3anone" \
+		HEAD >actual &&
+	test_line_count = 5 actual &&
+	git -C r3 rev-list --objects --filter="combine:tree:%30" HEAD >actual &&
+	test_line_count = 2 actual &&
+	git -C r3 rev-list --objects --filter="combine:tree:%39+blob:none" \
+		HEAD >actual &&
+	test_line_count = 5 actual
+'
+
+test_expect_success 'combine:... with more than two sub-filters' '
+	git -C r3 rev-list --objects \
+		--filter=combine:tree:3+blob:limit=40+sparse:path=../pattern1 \
+		HEAD >actual &&
+
+	expect_has HEAD "" &&
+	expect_has HEAD~1 "" &&
+	expect_has HEAD dir1 &&
+	expect_has HEAD dir1/sparse1 &&
+	expect_has HEAD dir1/sparse2 &&
+
+	# Should also have 2 commits
+	test_line_count = 7 actual &&
+
+	# Try again, this time making sure the last sub-filter is only
+	# URL-decoded once.
+	cp pattern1 pattern1+renamed% &&
+	cp actual expect &&
+
+	git -C r3 rev-list --objects \
+		--filter=combine:tree:3+blob:limit=40+sparse:path=../pattern1%2brenamed%25 \
+		HEAD >actual &&
+	test_cmp expect actual
+'
+
 # Test provisional omit collection logic with a repo that has objects appearing
 # at multiple depths - first deeper than the filter's threshold, then shallow.
 
 test_expect_success 'setup r4' '
 	git init r4 &&
 
 	echo foo > r4/foo &&
 	mkdir r4/subdir &&
 	echo bar > r4/subdir/bar &&
 
-- 
2.21.0


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/3] implement composite filters
  2019-05-16 18:56               ` [RFC PATCH 0/3] implement composite filters Matthew DeVore
                                   ` (2 preceding siblings ...)
  2019-05-16 18:56                 ` [RFC PATCH 3/3] list-objects-filter: implement composite filters Matthew DeVore
@ 2019-05-16 22:41                 ` Jonathan Tan
  2019-05-17  0:01                   ` Matthew DeVore
  3 siblings, 1 reply; 23+ messages in thread
From: Jonathan Tan @ 2019-05-16 22:41 UTC (permalink / raw)
  To: matvore; +Cc: jonathantanmy, git

> Here is a first stab at composite filters. It does not actually support omits,
> but the biggest difficulties of the implementation are already addressed. So I
> decided to send out an early version to give interested people an idea of just
> what is needed to implement it, and then give them a chance to change my mind
> (Jonathan T. especially was concerned about code complexity).

Thanks - seeing these patches reduces my concerns significantly. A
composite filter has LHS and RHS filters, either of which can be
composite themselves, so we support compositing arbitrary numbers of
filters. I see that the approach is to run LHS and RHS in lockstep, with
our own handling of the result flags:

- LOFR_MARK_SEEN is tracked for LHS and RHS separately. To support an
  arbitrary number of filters, we don't use object flags to track this,
  so we use oidsets instead. I don't think that the extra memory usage
  will be a problem (we already allocate more for all the struct
  object). If this is an issue in the future, we can switch to using
  object flags for the first N filters, and oidsets thereafter.

- LOFR_SKIP_TREE is simulated if only one filter wants to skip the tree.

- I haven't fully figured out LOFR_DO_SHOW yet. It seems to me that if
  an object appears twice in the walk, and the LHS says LOFR_DO_SHOW on
  the first occurrence, if the RHS says LOFR_DO_SHOW on the second
  occurrence, the object will be shown twice. But perhaps this isn't a
  problem - as it is, I think that a filter can call LOFR_DO_SHOW
  multiple times on the same object anyway. In any case, if this turns
  out to be a problem, it seems surmountable to me.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/3] implement composite filters
  2019-05-16 22:41                 ` [RFC PATCH 0/3] " Jonathan Tan
@ 2019-05-17  0:01                   ` Matthew DeVore
  0 siblings, 0 replies; 23+ messages in thread
From: Matthew DeVore @ 2019-05-17  0:01 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: matvore, git



> On 2019/05/16, at 15:41, Jonathan Tan <jonathantanmy@google.com> wrote:
> 
> Thanks - seeing these patches reduces my concerns significantly. A

Thank you for taking a look :)

> 
> - LOFR_MARK_SEEN is tracked for LHS and RHS separately. To support an
>  arbitrary number of filters, we don't use object flags to track this,
>  so we use oidsets instead. I don't think that the extra memory usage
>  will be a problem (we already allocate more for all the struct
>  object). If this is an issue in the future, we can switch to using
>  object flags for the first N filters, and oidsets thereafter.

Yup. Another possibility that comes to mind is that when both the lhs and rhs seen sets contain the same object, we promote it to a combined set, and remove it from the individual ones at that time.

> 
> - LOFR_SKIP_TREE is simulated if only one filter wants to skip the tree.
> 
> - I haven't fully figured out LOFR_DO_SHOW yet. It seems to me that if
>  an object appears twice in the walk, and the LHS says LOFR_DO_SHOW on
>  the first occurrence, if the RHS says LOFR_DO_SHOW on the second
>  occurrence, the object will be shown twice. But perhaps this isn't a

LOFR_DO_SHOW is only propagated upward from the combine: filter if both children indicate LOFR_DO_SHOW for the same object at the same point in the traversal (see the line that has "combined_result |= LOFR_DO_SHOW"). In the scenario you draw out, the object won’t be shown at all, since the first occurrence is filtered out for one reason, and the second is filtered out for a separate reason. This may happen when a sparse: filter includes a "deep" blob but excludes the same blob at a shallower point in the tree, and a tree: filter does the opposite with the same blob. Just thinking about it now for a moment, this seems intuitive and reasonable.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 3/3] list-objects-filter: implement composite filters
  2019-05-16 18:56                 ` [RFC PATCH 3/3] list-objects-filter: implement composite filters Matthew DeVore
@ 2019-05-17  3:25                   ` Junio C Hamano
  2019-05-17 13:17                     ` Matthew DeVore
  0 siblings, 1 reply; 23+ messages in thread
From: Junio C Hamano @ 2019-05-17  3:25 UTC (permalink / raw)
  To: Matthew DeVore
  Cc: jonathantanmy, jrn, git, dstolee, jeffhost, jrnieder, matvore

Matthew DeVore <matvore@google.com> writes:

> Allow combining filters such that only objects accepted by all filters
> are shown. The motivation for this is to allow getting directory
> listings without also fetching blobs. This can be done by combining
> blob:none with tree:<depth>. There are massive repositories that have
> larger-than-expected trees - even if you include only a single commit.
>
> The current usage requires passing the filter to rev-list, or sending
> it over the wire, as:
>
> 	combine:<FILTER1>+<FILTER2>
>
> (i.e.: git rev-list --filter=combine:tree:2+blob:limit=32k). This is
> potentially awkward because individual filters must be URL-encoded if
> they contain + or %. This can potentially be improved by supporting a
> repeated flag syntax, e.g.:
>
> 	$ git rev-list --filter=tree:2 --filter:blob:limit=32k

Shouldn't the second one say "--filter=blob:limit=32k" (i.e. the
first colon should be an equal sign)?

> Such usage is currently an error, so giving it a meaning is backwards-
> compatible.

Two minor comments.  

If combine means "must satisfy all of these", '+' is probably a poor
choice (perhaps we want '&' instead).  Also, it seems to me that
having to worry about url encoding and parsing encoded data
correctly and securely would be far more work than simply taking
multiple command line parameters, accumulating them in a string
list, and then at the end of command line parsing, building a
combined filter out of all of them at once (a degenerate case may
end up attempting to build a combined filter that combines a single
filter), iow just biting the bullet and do the "potentially be
improved" step from the beginning.

> +The form '--filter=combine:<filter1>+<filter2>+...<filterN>' combines
> +several filters. Only objects which are accepted by every filter are
> +included. Filters are joined by "+" and individual filters are %-encoded
> +(i.e. URL-encoded). Only the "%" and "+" characters must be encoded. For
> +instance, 'combine:tree:3+blob:none' and
> +'combine:tree%3A2+blob%3Anone' are equivalent, and result in only tree

I do not quite get this part.  The way I read the justification for
encoding given in the proposed commit log message was because a
filter name or filter parameter might want to include the '+' sign,
so a filter whose name is frotz+nitfol that takes a param may be
invoked standalone as "--filter=frotz+nitfol:param=1" needs to
protect its '+' from getting mistaken when used in the combine
filter, it should not be mistaken as a combination of 'frotz' filter
(with no parameter) and 'nitfol' filter (with param set to one), and
the way to do so is to say

	--filter="combine:frotz%2Bnitfol:param=1"

(this is a degenerate 'combination of a single filter', but you can
add another one by following it with '+' and another URLencoded
string).

So why are we allowing %3A there that does not even have to be
encoded?  Shouldn't it be an error?

In any case, I am not quite convinced that we need to complicate the
parameters with URLencoding, so I'd skip reviewing large part this
patch that is about "decoding".

Once the combined filter definition is built in-core, the code that
evaluates the intersection of all conditions seems to be written
sanely to me.

Thanks.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 3/3] list-objects-filter: implement composite filters
  2019-05-17  3:25                   ` Junio C Hamano
@ 2019-05-17 13:17                     ` Matthew DeVore
  2019-05-19  1:12                       ` Junio C Hamano
                                         ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Matthew DeVore @ 2019-05-17 13:17 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Matthew DeVore, jonathantanmy, jrn, git, dstolee, jeffhost, jrnieder



> On May 16, 2019, at 8:25 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> 
>> 	$ git rev-list --filter=tree:2 --filter:blob:limit=32k
> 
> Shouldn't the second one say "--filter=blob:limit=32k" (i.e. the
> first colon should be an equal sign)?

That's right. Fixed locally.

> 
>> Such usage is currently an error, so giving it a meaning is backwards-
>> compatible.
> 
> Two minor comments.  
> 
> If combine means "must satisfy all of these", '+' is probably a poor
> choice (perhaps we want '&' instead).  Also, it seems to me that

I think I agree. & is more intuitive.

> having to worry about url encoding and parsing encoded data
> correctly and securely would be far more work than simply taking
> multiple command line parameters, accumulating them in a string
> list, and then at the end of command line parsing, building a
> combined filter out of all of them at once (a degenerate case may
> end up attempting to build a combined filter that combines a single
> filter), iow just biting the bullet and do the "potentially be
> improved" step from the beginning.

My intention actually is to support the repeated flag pretty soon, but I only want to write the code if there's agreement on my current approach.

My justification for the URL-encoding scheme is:

1. The combined filters will eventually have to travel over the wire.

2. The Git protocol will either have repeated "filter" lines or it will continue to use a single filter line with an encoding scheme.

3. Continuing to use a single filter line seemed the least disruptive considering both this codebase and Git clones like JGit. Other clones will likely fail saying "unknown filter combine:" or something like that until it gets implemented. A paranoid consideration is that clones and proprietary server implementations may currently allow the "filter" line to be silently overridden if it is repeated.

4. Assuming we *do* use a single filter line over the wire, it makes sense to allow the user to specify the raw filter line as well as have the more friendly UI of repeating --filter flags.

5. If we use repeated "filter" lines over the wire, and later start implementing a more complete DSL for specifying filters (see Mercurial's "revsets") the repeated-filter-line feature in the protocol may end up becoming deprecated and we will end up back-pedaling to allow integration of the "&" operator with whatever new operators we need.

(I very much doubt I will be the one implementing such a DSL for filters or resets, but I think it's a possibility)

> So why are we allowing %3A there that does not even have to be
> encoded?  Shouldn't it be an error?

We do have to require the combine operator (& or +) and % be encoded. For other operators, there are three options:

1. Allow anything to be encoded. I chose this because it's how I usually think of URL encoding working. For instance, if I go to https://public-inbox.org/git/?q=cod%65+coverage in Chrome, the browser automatically decodes the %65 to an e in the address bar. Safari does not automatically decode, but the server apparently interprets the %65 as an e. I am not really attached to this choice.

2. Do not allow or require anything else to be encoded.

3. Require encoding of a couple of "reserved" characters that don't appear in filters now, and don't typically appear in UNIX path names. This would allow for expansion later. For instance, "~&%*+|(){}!\" plus the ASCII range [0, 0x20] and single and double quotes - do not allow encoding of anything else.

4. Same requirements as 3, but permit encoding of other arbitrary characters.

I kind of like 3 now that I've thought it out more.

> 
> In any case, I am not quite convinced that we need to complicate the
> parameters with URLencoding, so I'd skip reviewing large part this
> patch that is about "decoding".

It's fine if we drop the encoding scheme. I intentionally tried to limit the amount of work I stacked on top of it until I got agreement. Please let me know if anything I've said changes your perspective.

> 
> Once the combined filter definition is built in-core, the code that
> evaluates the intersection of all conditions seems to be written
> sanely to me.

Great! I actually did simplify it a bit since I sent the first roll-up.

Thanks.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 3/3] list-objects-filter: implement composite filters
  2019-05-17 13:17                     ` Matthew DeVore
@ 2019-05-19  1:12                       ` Junio C Hamano
  2019-05-20 18:24                       ` Matthew DeVore
  2019-05-20 18:28                       ` Matthew DeVore
  2 siblings, 0 replies; 23+ messages in thread
From: Junio C Hamano @ 2019-05-19  1:12 UTC (permalink / raw)
  To: Matthew DeVore
  Cc: Matthew DeVore, jonathantanmy, jrn, git, dstolee, jeffhost, jrnieder

Matthew DeVore <matvore@comcast.net> writes:

> My justification for the URL-encoding scheme is:
> ...
> 3. Continuing to use a single filter line seemed the least
> disruptive considering both this codebase and Git clones like
> JGit. Other clones will likely fail saying "unknown filter
> combine:" or something like that until it gets implemented. A
> paranoid consideration is that clones and proprietary server
> implementations may currently allow the "filter" line to be
> silently overridden if it is repeated.
>
> 4. Assuming we *do* use a single filter line over the wire, it
> makes sense to allow the user to specify the raw filter line as
> well as have the more friendly UI of repeating --filter flags.
>
> 5. If we use repeated "filter" lines over the wire, and later
> start implementing a more complete DSL for specifying filters (see
> Mercurial's "revsets") the repeated-filter-line feature in the
> protocol may end up becoming deprecated and we will end up
> back-pedaling to allow integration of the "&" operator with
> whatever new operators we need.

OK, that's fair.

> 1. Allow anything to be encoded. I chose this because it's how I
> usually think of URL encoding working. For instance, if I go to
> https://public-inbox.org/git/?q=cod%65+coverage in Chrome, the
> browser automatically decodes the %65 to an e in the address
> bar. Safari does not automatically decode, but the server
> apparently interprets the %65 as an e. I am not really attached to
> this choice.

OK, so the rule is "when you see 'combine:' (this part is never
encoded), take the rest as a single string, separate it at '+' (these
pluses are never encoded), and URLdecode each part---each of these
parts is a filter", which totally makes sense.  I somehow didn't see
that clearly written in your description.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 3/3] list-objects-filter: implement composite filters
  2019-05-17 13:17                     ` Matthew DeVore
  2019-05-19  1:12                       ` Junio C Hamano
@ 2019-05-20 18:24                       ` Matthew DeVore
  2019-05-20 18:28                       ` Matthew DeVore
  2 siblings, 0 replies; 23+ messages in thread
From: Matthew DeVore @ 2019-05-20 18:24 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Matthew DeVore, jonathantanmy, jrn, git, dstolee, jeffhost, jrnieder



> On 2019/05/17, at 6:17, Matthew DeVore <matvore@comcast.net> wrote:
> 
> 
> 
>> On May 16, 2019, at 8:25 PM, Junio C Hamano <gitster@pobox.com> wrote:
>>> 
>>> 	$ git rev-list --filter=tree:2 --filter:blob:limit=32k
>> 
>> Shouldn't the second one say "--filter=blob:limit=32k" (i.e. the
>> first colon should be an equal sign)?
> 
> That's right. Fixed locally.
> 
>> 
>>> Such usage is currently an error, so giving it a meaning is backwards-
>>> compatible.
>> 
>> Two minor comments.  
>> 
>> If combine means "must satisfy all of these", '+' is probably a poor
>> choice (perhaps we want '&' instead).  Also, it seems to me that
> 
> I think I agree. & is more intuitive.

After I tried this in code, I noticed two problems with & which make
me prefer + again:

a. the "&" char must be quoted or escaped in the shell, even if it is
   hugged by alphanumeric characters on either side:

	$ echo a&b
	[1] 17083
	a
	-bash: b: command not found
	[1]+  Done                    echo a
	$

b. visually speaking, "&" doesn't stand out very well unless it's
   surrounded by whitespace, and currently it must *not* be surrounded
   by whitespace:

	--filter=combine:blob:none&tree:3&sparse:../foo

	vs.

	--filter=combine:blob:none+tree:3+sparse:../foo

> 
>> having to worry about url encoding and parsing encoded data
>> correctly and securely would be far more work than simply taking
>> multiple command line parameters, accumulating them in a string
>> list, and then at the end of command line parsing, building a
>> combined filter out of all of them at once (a degenerate case may
>> end up attempting to build a combined filter that combines a single
>> filter), iow just biting the bullet and do the "potentially be
>> improved" step from the beginning.
> 
> My intention actually is to support the repeated flag pretty soon, but I only want to write the code if there's agreement on my current approach.
> 
> My justification for the URL-encoding scheme is:
> 
> 1. The combined filters will eventually have to travel over the wire.
> 
> 2. The Git protocol will either have repeated "filter" lines or it will continue to use a single filter line with an encoding scheme.
> 
> 3. Continuing to use a single filter line seemed the least disruptive considering both this codebase and Git clones like JGit. Other clones will likely fail saying "unknown filter combine:" or something like that until it gets implemented. A paranoid consideration is that clones and proprietary server implementations may currently allow the "filter" line to be silently overridden if it is repeated.
> 
> 4. Assuming we *do* use a single filter line over the wire, it makes sense to allow the user to specify the raw filter line as well as have the more friendly UI of repeating --filter flags.
> 
> 5. If we use repeated "filter" lines over the wire, and later start implementing a more complete DSL for specifying filters (see Mercurial's "revsets") the repeated-filter-line feature in the protocol may end up becoming deprecated and we will end up back-pedaling to allow integration of the "&" operator with whatever new operators we need.
> 
> (I very much doubt I will be the one implementing such a DSL for filters or resets, but I think it's a possibility)
> 
>> So why are we allowing %3A there that does not even have to be
>> encoded?  Shouldn't it be an error?
> 
> We do have to require the combine operator (& or +) and % be encoded. For other operators, there are three options:
> 
> 1. Allow anything to be encoded. I chose this because it's how I usually think of URL encoding working. For instance, if I go to https://public-inbox.org/git/?q=cod%65+coverage in Chrome, the browser automatically decodes the %65 to an e in the address bar. Safari does not automatically decode, but the server apparently interprets the %65 as an e. I am not really attached to this choice.
> 
> 2. Do not allow or require anything else to be encoded.
> 
> 3. Require encoding of a couple of "reserved" characters that don't appear in filters now, and don't typically appear in UNIX path names. This would allow for expansion later. For instance, "~&%*+|(){}!\" plus the ASCII range [0, 0x20] and single and double quotes - do not allow encoding of anything else.
> 
> 4. Same requirements as 3, but permit encoding of other arbitrary characters.
> 
> I kind of like 3 now that I've thought it out more.
> 
>> 
>> In any case, I am not quite convinced that we need to complicate the
>> parameters with URLencoding, so I'd skip reviewing large part this
>> patch that is about "decoding".
> 
> It's fine if we drop the encoding scheme. I intentionally tried to limit the amount of work I stacked on top of it until I got agreement. Please let me know if anything I've said changes your perspective.
> 
>> 
>> Once the combined filter definition is built in-core, the code that
>> evaluates the intersection of all conditions seems to be written
>> sanely to me.
> 
> Great! I actually did simplify it a bit since I sent the first roll-up.
> 
> Thanks.
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 3/3] list-objects-filter: implement composite filters
  2019-05-17 13:17                     ` Matthew DeVore
  2019-05-19  1:12                       ` Junio C Hamano
  2019-05-20 18:24                       ` Matthew DeVore
@ 2019-05-20 18:28                       ` Matthew DeVore
  2 siblings, 0 replies; 23+ messages in thread
From: Matthew DeVore @ 2019-05-20 18:28 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Matthew DeVore, jonathantanmy, jrn, git, dstolee, jeffhost, jrnieder


> On 2019/05/17, at 6:17, Matthew DeVore <matvore@comcast.net> wrote:
> 
> We do have to require the combine operator (& or +) and % be encoded. For other operators, there are three options:
> 
> ...
> 3. Require encoding of a couple of "reserved" characters that don't appear in filters now, and don't typically appear in UNIX path names. This would allow for expansion later. For instance, "~&%*+|(){}!\" plus the ASCII range [0, 0x20] and single and double quotes - do not allow encoding of anything else.
> 
> 4. Same requirements as 3, but permit encoding of other arbitrary characters.

I tried implementing the reserved characters idea, and considering that
it was a small amount of code, I think it's worth it:

+static const char *RESERVED_NON_WS = "\"\\'[]{}!@#$^~*+";
+
+static int has_reserved_character(
+       struct strbuf *sub_spec, struct strbuf *errbuf)
+{
+       const char *c = sub_spec->buf;
+       while (*c) {
+               if (*c <= ' ' || *strchr(RESERVED_NON_WS, *c))
+                       goto found_reserved;
+               c++;
+       }
+
+       return 0;
+
+found_reserved:
+       if (errbuf)
+               strbuf_addf(errbuf, "must escape char in sub-spec: %c", *c);
+       return 1;
+}
+


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2019-05-20 18:29 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-28 15:55 Proposal: object negotiation for partial clones Matthew DeVore
2019-05-06 18:25 ` Jonathan Nieder
2019-05-06 19:28 ` Jonathan Tan
2019-05-06 19:46   ` Jonathan Nieder
2019-05-06 23:20     ` Matthew DeVore
2019-05-07  0:02       ` Jonathan Nieder
2019-05-06 22:47   ` Matthew DeVore
2019-05-07 18:34     ` Jonathan Tan
2019-05-07 21:57       ` Matthew DeVore
2019-05-09 18:00         ` Jonathan Tan
2019-05-14  0:09           ` Matthew DeVore
2019-05-14  0:16             ` Jonathan Nieder
2019-05-16 18:56               ` [RFC PATCH 0/3] implement composite filters Matthew DeVore
2019-05-16 18:56                 ` [RFC PATCH 1/3] list-objects-filter: refactor into a context struct Matthew DeVore
2019-05-16 18:56                 ` [RFC PATCH 2/3] list-objects-filter-options: error is localizeable Matthew DeVore
2019-05-16 18:56                 ` [RFC PATCH 3/3] list-objects-filter: implement composite filters Matthew DeVore
2019-05-17  3:25                   ` Junio C Hamano
2019-05-17 13:17                     ` Matthew DeVore
2019-05-19  1:12                       ` Junio C Hamano
2019-05-20 18:24                       ` Matthew DeVore
2019-05-20 18:28                       ` Matthew DeVore
2019-05-16 22:41                 ` [RFC PATCH 0/3] " Jonathan Tan
2019-05-17  0:01                   ` Matthew DeVore

Code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).