Import/Export as a fast way to purge files from Git?

git@vger.kernel.org mailing list mirror (one of many)

* Import/Export as a fast way to purge files from Git?
@ 2018-09-23 13:04 Lars Schneider
  2018-09-23 14:55 ` Eric Sunshine
                   ` (2 more replies)
  0 siblings, 3 replies; 90+ messages in thread
From: Lars Schneider @ 2018-09-23 13:04 UTC
  To: git; +Cc: Jeff King, Taylor Blau, brian m. carlson

Hi,

I recently had to purge files from large Git repos (many files, many commits). 
The usual recommendation is to use `git filter-branch --index-filter` to purge 
files. However, this is *very* slow for large repos (e.g. it takes 45min to
remove the `builtin` directory from git core). I realized that I can remove
files *way* faster by exporting the repo, removing the file references, 
and then importing the repo (see Perl script below, it takes ~30sec to remove
the `builtin` directory from git core). Do you see any problem with this 
approach?

Thank you,
Lars

#!/usr/bin/perl
#
# Purge paths from Git repositories.
#
# Usage:
#     git-purge-path [path-regex1] [path-regex2] ...
#
# Examples:
#    Remove the file "test.bin" from all directories:
#    git-purge-path "/test.bin$"
#
#    Remove all "*.bin" files from all directories:
#    git-purge-path "\.bin$"
#
#    Remove all files in the "/foo" directory:
#    git-purge-path "^/foo/"
#
# Attention:
#     You want to run this script on a case sensitive file-system (e.g.
#     ext4 on Linux). Otherwise the resulting Git repository will not
#     contain changes that modify the casing of file paths.
#

use strict;
use warnings;

open( my $pipe_in, "git fast-export --progress=100 --no-data HEAD |" ) or die $!;
open( my $pipe_out, "| git fast-import --force --quiet" ) or die $!;

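LOOP: while ( my $cmd = <$pipe_in> ) {
    my $data = "";
    if ( $cmd =~ /^data ([0-9]+)$/ ) {
        # Read the data block (e.g. a commit message) in one go and pass
        # it through untouched, so its content is never parsed as commands.
        my $skip_bytes = $1;
        read($pipe_in, $data, $skip_bytes);
    }
    elsif ( $cmd =~ /^M [0-9]{6} [0-9a-f]{40} (.+)$/ ) {
        # Drop the file modification if its path matches any given regex.
        my $pathname = $1;
        foreach (@ARGV) {
            next LOOP if ("/" . $pathname) =~ /$_/
        }
    }
    print {$pipe_out} $cmd . $data;
}

# close() on a pipe returns false if the child exited with an error.
close($pipe_in)  or die "git fast-export failed";
close($pipe_out) or die "git fast-import failed";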


* Re: Import/Export as a fast way to purge files from Git?
  2018-09-23 13:04 Import/Export as a fast way to purge files from Git? Lars Schneider
@ 2018-09-23 14:55 ` Eric Sunshine
  2018-09-23 15:58   ` Lars Schneider
  2018-09-23 15:53 ` brian m. carlson
  2018-09-24 17:24 ` Elijah Newren
  2 siblings, 1 reply; 90+ messages in thread
From: Eric Sunshine @ 2018-09-23 14:55 UTC
  To: Lars Schneider; +Cc: Git List, Jeff King, Taylor Blau, brian m. carlson

On Sun, Sep 23, 2018 at 9:05 AM Lars Schneider <larsxschneider@gmail.com> wrote:
> I recently had to purge files from large Git repos (many files, many commits).
> The usual recommendation is to use `git filter-branch --index-filter` to purge
> files. However, this is *very* slow for large repos (e.g. it takes 45min to
> remove the `builtin` directory from git core). I realized that I can remove
> files *way* faster by exporting the repo, removing the file references,
> and then importing the repo (see Perl script below, it takes ~30sec to remove
> the `builtin` directory from git core). Do you see any problem with this
> approach?

A couple comments:

For purging files from a history, take a look at BFG[1] which bills
itself as "a simpler, faster alternative to git-filter-branch for
cleansing bad data out of your Git repository history".

The approach of exporting to a fast-import stream, modifying the
stream, and re-importing is quite reasonable. However, rather than
re-inventing, take a look at reposurgeon[2], which allows you to do
major surgery on fast-import streams. Not only can it purge files from
a repository, but it can slice, dice, puree, and saute pretty much any
attribute of a repository.

[1]: https://rtyley.github.io/bfg-repo-cleaner/
[2]: http://www.catb.org/esr/reposurgeon/


* Re: Import/Export as a fast way to purge files from Git?
  2018-09-23 13:04 Import/Export as a fast way to purge files from Git? Lars Schneider
  2018-09-23 14:55 ` Eric Sunshine
@ 2018-09-23 15:53 ` brian m. carlson
  2018-09-23 17:04   ` Jeff King
  2018-09-24 17:24 ` Elijah Newren
  2 siblings, 1 reply; 90+ messages in thread
From: brian m. carlson @ 2018-09-23 15:53 UTC
  To: Lars Schneider; +Cc: git, Jeff King, Taylor Blau


On Sun, Sep 23, 2018 at 03:04:58PM +0200, Lars Schneider wrote:
> Hi,
> 
> I recently had to purge files from large Git repos (many files, many commits).
> The usual recommendation is to use `git filter-branch --index-filter` to purge
> files. However, this is *very* slow for large repos (e.g. it takes 45min to
> remove the `builtin` directory from git core). I realized that I can remove
> files *way* faster by exporting the repo, removing the file references,
> and then importing the repo (see Perl script below, it takes ~30sec to remove
> the `builtin` directory from git core). Do you see any problem with this
> approach?

I don't know of any problems with this approach.  I didn't audit your
specific Perl script for any issues, though.

I suspect you're gaining speed mostly because you're running three
processes total instead of at least one process (sh) per commit.  So I
don't think there's anything that Git can do to make this faster on our
end without a redesign.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204



* Re: Import/Export as a fast way to purge files from Git?
  2018-09-23 14:55 ` Eric Sunshine
@ 2018-09-23 15:58   ` Lars Schneider
  0 siblings, 0 replies; 90+ messages in thread
From: Lars Schneider @ 2018-09-23 15:58 UTC
  To: Eric Sunshine; +Cc: Git List, Jeff King, Taylor Blau, brian m. carlson



> On Sep 23, 2018, at 4:55 PM, Eric Sunshine <sunshine@sunshineco.com> wrote:
> 
> On Sun, Sep 23, 2018 at 9:05 AM Lars Schneider <larsxschneider@gmail.com> wrote:
>> I recently had to purge files from large Git repos (many files, many commits).
>> The usual recommendation is to use `git filter-branch --index-filter` to purge
>> files. However, this is *very* slow for large repos (e.g. it takes 45min to
>> remove the `builtin` directory from git core). I realized that I can remove
>> files *way* faster by exporting the repo, removing the file references,
>> and then importing the repo (see Perl script below, it takes ~30sec to remove
>> the `builtin` directory from git core). Do you see any problem with this
>> approach?
> 
> A couple comments:
> 
> For purging files from a history, take a look at BFG[1] which bills
> itself as "a simpler, faster alternative to git-filter-branch for
> cleansing bad data out of your Git repository history".

Yes, BFG is great. Unfortunately, it requires Java which is not available
on every system I have to work with. I required a solution that would work
in every Git environment. Hence the Perl script :-)


> The approach of exporting to a fast-import stream, modifying the
> stream, and re-importing is quite reasonable.

Thanks for the confirmation!


> However, rather than
> re-inventing, take a look at reposurgeon[2], which allows you to do
> major surgery on fast-import streams. Not only can it purge files from
> a repository, but it can slice, dice, puree, and saute pretty much any
> attribute of a repository.

Wow. Reposurgeon looks very interesting. Thanks a lot for the pointer!

Cheers,
Lars


> [1]: https://rtyley.github.io/bfg-repo-cleaner/
> [2]: http://www.catb.org/esr/reposurgeon/



* Re: Import/Export as a fast way to purge files from Git?
  2018-09-23 15:53 ` brian m. carlson
@ 2018-09-23 17:04   ` Jeff King
  0 siblings, 0 replies; 90+ messages in thread
From: Jeff King @ 2018-09-23 17:04 UTC
  To: brian m. carlson, Lars Schneider, git, Taylor Blau

On Sun, Sep 23, 2018 at 03:53:38PM +0000, brian m. carlson wrote:

> I suspect you're gaining speed mostly because you're running three
> processes total instead of at least one process (sh) per commit.  So I
> don't think there's anything that Git can do to make this faster on our
> end without a redesign.

It's not just the process startup overhead that makes it faster. Using
multiple processes means they have to communicate somehow. In this case,
git-read-tree is writing out the whole index for each commit, which
git-rm reads in and modifies, and then git-commit-tree finally converts
back to a tree. In addition to the raw CPU of that work, there's a bunch
of latency as each step is performed serially.

Whereas in the proposed pipeline, fast-export is writing out a diff and
fast-import is turning that directly back into tree objects. And both
processes are proceeding independently, so you benefit from multiple
cores.
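
The shape of that pipeline can be sketched in Python roughly as follows. This is only an illustration: the two git command lines are the ones from the Perl script at the top of the thread, while `run_pipeline`, `transform` and `main` are hypothetical names. The line-by-line `transform` is a simplification; a real filter has to copy `data` blocks through verbatim, as the Perl script does.

```python
import subprocess
import sys
from typing import Callable, List

# Command lines taken from the Perl script earlier in the thread.
EXPORT_CMD = ["git", "fast-export", "--progress=100", "--no-data", "HEAD"]
IMPORT_CMD = ["git", "fast-import", "--force", "--quiet"]


def run_pipeline(export_cmd: List[str], import_cmd: List[str],
                 transform: Callable[[bytes], bytes]) -> int:
    """Stream export_cmd's stdout through transform() into import_cmd's stdin."""
    exporter = subprocess.Popen(export_cmd, stdout=subprocess.PIPE)
    importer = subprocess.Popen(import_cmd, stdin=subprocess.PIPE)
    # Both children are running now. Each line is forwarded as soon as it
    # arrives, so the exporter and the importer can work at the same time.
    for line in exporter.stdout:
        importer.stdin.write(transform(line))
    importer.stdin.close()
    export_status = exporter.wait()
    import_status = importer.wait()
    return export_status or import_status


def main() -> None:
    # Identity transform: re-imports the history unchanged.
    sys.exit(run_pipeline(EXPORT_CMD, IMPORT_CMD, lambda line: line))
```

Calling `main()` inside a repository would re-import HEAD unchanged; there are only three processes in total, regardless of the number of commits.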

Which isn't to say I really disagree with "Git can't really make this
faster". filter-branch has a ton of power to let you replay arbitrary
commands (including non-Git commands!), so the speed tradeoff in its
approach is very intentional. If we could modify the index in-place that
would probably make it a little faster, but that probably counts as
"redesign" in your statement. ;)

-Peff


* Re: Import/Export as a fast way to purge files from Git?
  2018-09-23 13:04 Import/Export as a fast way to purge files from Git? Lars Schneider
  2018-09-23 14:55 ` Eric Sunshine
  2018-09-23 15:53 ` brian m. carlson
@ 2018-09-24 17:24 ` Elijah Newren
  2018-10-31 19:15   ` Lars Schneider
  2 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-09-24 17:24 UTC
  To: Lars Schneider; +Cc: Git Mailing List, Jeff King, me, brian m. carlson

On Sun, Sep 23, 2018 at 6:08 AM Lars Schneider <larsxschneider@gmail.com> wrote:
>
> Hi,
>
> I recently had to purge files from large Git repos (many files, many commits).
> The usual recommendation is to use `git filter-branch --index-filter` to purge
> files. However, this is *very* slow for large repos (e.g. it takes 45min to
> remove the `builtin` directory from git core). I realized that I can remove
> files *way* faster by exporting the repo, removing the file references,
> and then importing the repo (see Perl script below, it takes ~30sec to remove
> the `builtin` directory from git core). Do you see any problem with this
> approach?

It looks like others have pointed you at other tools, and you're
already shifting to that route.  But I think it's a useful question to
answer more generally, so for those that are really curious...

The basic approach is fine, though if you try to extend it much you
can run into a few possible edge/corner cases (more on that below).
I've been using this basic approach for years and even created a
mini-python library[1] designed specifically to allow people to create
"fast-filters", used as
   git fast-export <options> | your-fast-filter | git fast-import <options>
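
For illustration, a minimal stand-alone "fast-filter" in that position might look like the sketch below. It is not git_fast_filter; the names `filter_stream`, `filter_bytes` and `main` are hypothetical, and it assumes the stream comes from `git fast-export --no-data` with 40-hex object names, as in Lars's Perl script. Paths that fast-export emits in quoted form are matched in that quoted form.

```python
import io
import re
import sys
from typing import BinaryIO, List

# "M <mode> <sha1> <path>" as emitted by `git fast-export --no-data`.
MODIFY = re.compile(rb"^M [0-9]{6} [0-9a-f]{40} (.+)\n$")
DATA = re.compile(rb"^data ([0-9]+)\n$")


def filter_stream(src: BinaryIO, dst: BinaryIO, patterns: List[re.Pattern]) -> None:
    """Copy a fast-export stream, dropping file modifications on matching paths."""
    for line in iter(src.readline, b""):
        data_match = DATA.match(line)
        if data_match:
            # Copy the payload verbatim so it is never parsed as a command.
            dst.write(line)
            dst.write(src.read(int(data_match.group(1))))
            continue
        modify_match = MODIFY.match(line)
        if modify_match:
            # Same convention as the Perl script: match against "/" + path.
            path = b"/" + modify_match.group(1)
            if any(pattern.search(path) for pattern in patterns):
                continue
        dst.write(line)


def filter_bytes(stream: bytes, regexes: List[bytes]) -> bytes:
    """Convenience wrapper for trying the filter on an in-memory stream."""
    out = io.BytesIO()
    filter_stream(io.BytesIO(stream), out, [re.compile(r) for r in regexes])
    return out.getvalue()


def main() -> None:
    # Path regexes come from the command line, the stream from stdin.
    patterns = [re.compile(arg.encode()) for arg in sys.argv[1:]]
    filter_stream(sys.stdin.buffer, sys.stdout.buffer, patterns)
```

With a call to `main()` at the bottom, the script can sit between `git fast-export --no-data HEAD` and `git fast-import --force --quiet`.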

But that library didn't really take off; even I have rarely used it,
often opting for filter-branch despite its horrible performance or a
simple fast-export | long-sed-command | fast-import (with some extra
pre-checking to make sure the sed wouldn't unintentionally munge other
data).  BFG is great, as long as you're only interested in removing a
few big items, but otherwise doesn't seem very useful (to be fair,
it's very upfront about only wanting to solve that problem).
Recently, due to continuing questions on filter-branch and folks still
getting confused with it, I looked at existing tools, decided I didn't
think any quite fit, and started looking into converting
git_fast_filter into a filter-branch-like tool instead of just a
library.  Found some bugs and missing features in fast-export along the
way (and have some patches I still need to send in).  But I kind of
got stuck -- if the tool is in python, will that limit adoption too
much?  It'd be kind of nice to have this tool in core git.  But I kind
of like leaving open the possibility of using it as a tool _or_ as a
library, the latter for the special cases where case-specific
programmatic filtering is needed.  But a developer-convenience library
makes almost no sense unless in a higher level language, such as
python.  I'm still trying to make up my mind about what I want (and
what others might want), and have been kind of blocking on that.  (If
others have opinions, I'm all ears.)

Anyway, the edge/corner cases you can watch out for:

  - Signed tags are a problem; you may need to specify
--signed-tags=strip to fast-export

  - References to other commits in your commit messages will now be
incorrect.  I think a good tool should either default to rewriting
commit ids in commit messages or at least have an option to do so
(BFG does this; filter-branch doesn't; fast-export format makes it
really hard for a filter based on it to do so)

  - If the paths you remove are the only paths modified in a commit,
the commit can become empty.  If you're only filtering a few paths
out, this might be nothing more than a minor inconvenience for you.
However, if you're trying to prune directories (and perhaps several
toplevel ones), then it can be extremely annoying to have a new
history with the vast majority of all commits being empty.
(filter-branch has an option for this; BFG does not; tools based on
fast-export output can do it with sufficient effort).

  - If you start pruning empty commits, you have to worry about
rewriting branches and tags to remaining parents.  This _might_ happen
for free depending on your history's structure and the fast-export
stream, but to be correct in general you will have to specify the new
commit for some branches or tags.

  - If you start pruning empty commits, you have to decide whether to
allow pruning of merge commits.  Your first reaction might be to not
allow it, but if one parent and its entire history are all pruned,
then transforming the merge commit to a normal commit and then
considering whether it is empty and allowing it to be pruned is much
better.

  - If you start pruning empty commits, you also have to worry about
history topology changing, beyond the all-ancestors-empty case above.
For example, the last non-empty commit in the ancestry of a merge on
both sides may be the same commit, making the merge-commit have the
same parent twice.  Should the duplicate parent be de-duped,
transforming the commit into a normal non-merge commit?  (I'd say yes
-- this commit is likely to be empty and prunable once you do so, but
I'm not sure everyone would agree with converting a merge commit to a
non-merge.)  Similarly, what if the rewritten parents of a merge have
one parent that is the direct ancestor of another?  Can the extra
unnecessary parent be removed as a parent?  (And again, such a commit
is likely to become empty and be prunable itself.)

  - If you try to avoid the extra work involved with pruning empty
commits by passing path-specifiers as rev-list-args to fast-export,
and use the --tag-of-filtered-object=rewrite option if needed, then
depending on the topology you can hit any of three bugs: an outright
die() (despite the --tag-of-filtered-object=rewrite), a branch being
reset to a non-existent mark (causing fast-import to die), or find
that a ref which you explicitly requested to be part of the export is
silently omitted from the stream.  (granted, these aren't fundamental
issues; they're just bugs in fast-export that I seem to have been the
first to find.)

  -  filter-branch has a nice ability to rewrite only the last few
commits using a range specifier like HEAD~10..HEAD.  Trying the same
with fast-export will get you a history with only 10 commits, the
first of which squashes all early history together.  Trying to
duplicate the filter-branch behavior can be done, but it requires
multiple exports with different args and usage of --export-marks and
--import-marks; it's cumbersome and somewhat non-obvious.

  - some filters are difficult; e.g. if you want to mimic
filter-branch's --parent-filter, or BFG's --strip-blobs-with-ids, you
run into the issue that the fast-export stream doesn't provide the
original sha1sums for commits or blobs and there's no easy way for you
to associate it with the given mark.
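
The parent rewriting that pruning empty commits requires can be sketched as follows. This is an illustration under stated assumptions, not code from any existing tool: `resolve`, `rewrite_parents` and the `replaced` mapping are hypothetical names, marks are plain strings such as ":3", and a pruned commit is simply replaced by its first surviving parent (or by nothing if its whole history was pruned). The sketch de-dupes identical parents but does not attempt the "one parent is an ancestor of another" check.

```python
from typing import Dict, List, Optional


def resolve(mark: str, replaced: Dict[str, Optional[str]]) -> Optional[str]:
    """Follow the chain of pruned commits to a surviving commit, or None."""
    current: Optional[str] = mark
    while current in replaced:
        current = replaced[current]
        if current is None:
            # The commit and its entire ancestry were pruned.
            return None
    return current


def rewrite_parents(parents: List[str],
                    replaced: Dict[str, Optional[str]]) -> List[str]:
    """Map parents to surviving commits, dropping pruned and duplicate ones."""
    seen = set()
    result = []
    for parent in parents:
        new_parent = resolve(parent, replaced)
        if new_parent is None or new_parent in seen:
            continue
        seen.add(new_parent)
        result.append(new_parent)
    return result
```

A merge whose rewritten parent list shrinks to one entry has become a normal commit, and can then be checked for emptiness like any other.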

Those are the ones I know about.  It's possible that there are others.

Hope that helps or is at least interesting.

Elijah

[1] https://public-inbox.org/git/51419b2c0904072035u1182b507o836a67ac308d32b9@mail.gmail.com/


* Re: Import/Export as a fast way to purge files from Git?
  2018-09-24 17:24 ` Elijah Newren
@ 2018-10-31 19:15   ` Lars Schneider
  2018-11-01  7:12     ` Elijah Newren
  0 siblings, 1 reply; 90+ messages in thread
From: Lars Schneider @ 2018-10-31 19:15 UTC
  To: Elijah Newren; +Cc: Git Mailing List, Jeff King, me, brian m. carlson



> On Sep 24, 2018, at 7:24 PM, Elijah Newren <newren@gmail.com> wrote:
> 
> It looks like others have pointed you at other tools, and you're
> already shifting to that route.  But I think it's a useful question to
> answer more generally, so for those that are really curious...
> 
> 
> The basic approach is fine, though if you try to extend it much you
> can run into a few possible edge/corner cases (more on that below).
> I've been using this basic approach for years and even created a
> mini-python library[1] designed specifically to allow people to create
> "fast-filters", used as
>   git fast-export <options> | your-fast-filter | git fast-import <options>
> 
> But that library didn't really take off; even I have rarely used it,
> often opting for filter-branch despite its horrible performance or a
> simple fast-export | long-sed-command | fast-import (with some extra
> pre-checking to make sure the sed wouldn't unintentionally munge other
> data).  BFG is great, as long as you're only interested in removing a
> few big items, but otherwise doesn't seem very useful (to be fair,
> it's very upfront about only wanting to solve that problem).
> Recently, due to continuing questions on filter-branch and folks still
> getting confused with it, I looked at existing tools, decided I didn't
> think any quite fit, and started looking into converting
> git_fast_filter into a filter-branch-like tool instead of just a
> library.  Found some bugs and missing features in fast-export along the
> way (and have some patches I still need to send in).  But I kind of
> got stuck -- if the tool is in python, will that limit adoption too
> much?  It'd be kind of nice to have this tool in core git.  But I kind
> of like leaving open the possibility of using it as a tool _or_ as a
> library, the latter for the special cases where case-specific
> programmatic filtering is needed.  But a developer-convenience library
> makes almost no sense unless in a higher level language, such as
> python.  I'm still trying to make up my mind about what I want (and
> what others might want), and have been kind of blocking on that.  (If
> others have opinions, I'm all ears.)

That library sounds like a very interesting idea. Unfortunately, the 
referenced repo seems not to be available anymore:
    git://gitorious.org/git_fast_filter/mainline.git

I very much like Python. However, more recently I started to
write Git tools in Perl as they work out of the box on every
machine with Git installed ... and I think Perl can be quite
readable if no shortcuts are used :-). 


> Anyway, the edge/corner cases you can watch out for:
> 
>  - Signed tags are a problem; you may need to specify
> --signed-tags=strip to fast-export
> 
>  - References to other commits in your commit messages will now be
> incorrect.  I think a good tool should either default to rewriting
> commit ids in commit messages or at least have an option to do so
> (BFG does this; filter-branch doesn't; fast-export format makes it
> really hard for a filter based on it to do so)
> 
>  - If the paths you remove are the only paths modified in a commit,
> the commit can become empty.  If you're only filtering a few paths
> out, this might be nothing more than a minor inconvenience for you.
> However, if you're trying to prune directories (and perhaps several
> toplevel ones), then it can be extremely annoying to have a new
> history with the vast majority of all commits being empty.
> (filter-branch has an option for this; BFG does not; tools based on
> fast-export output can do it with sufficient effort).
> 
>  - If you start pruning empty commits, you have to worry about
> rewriting branches and tags to remaining parents.  This _might_ happen
> for free depending on your history's structure and the fast-export
> stream, but to be correct in general you will have to specify the new
> commit for some branches or tags.
> 
>  - If you start pruning empty commits, you have to decide whether to
> allow pruning of merge commits.  Your first reaction might be to not
> allow it, but if one parent and its entire history are all pruned,
> then transforming the merge commit to a normal commit and then
> considering whether it is empty and allowing it to be pruned is much
> better.
> 
>  - If you start pruning empty commits, you also have to worry about
> history topology changing, beyond the all-ancestors-empty case above.
> For example, the last non-empty commit in the ancestry of a merge on
> both sides may be the same commit, making the merge-commit have the
> same parent twice.  Should the duplicate parent be de-duped,
> transforming the commit into a normal non-merge commit?  (I'd say yes
> -- this commit is likely to be empty and prunable once you do so, but
> I'm not sure everyone would agree with converting a merge commit to a
> non-merge.)  Similarly, what if the rewritten parents of a merge have
> one parent that is the direct ancestor of another?  Can the extra
> unnecessary parent be removed as a parent?  (And again, such a commit
> is likely to become empty and be prunable itself.)
> 
>  - If you try to avoid the extra work involved with pruning empty
> commits by passing path-specifiers as rev-list-args to fast-export,
> and use the --tag-of-filtered-object=rewrite option if needed, then
> depending on the topology you can hit any of three bugs: an outright
> die() (despite the --tag-of-filtered-object=rewrite), a branch being
> reset to a non-existent mark (causing fast-import to die), or find
> that a ref which you explicitly requested to be part of the export is
> silently omitted from the stream.  (granted, these aren't fundamental
> issues; they're just bugs in fast-export that I seem to have been the
> first to find.)
> 
>  -  filter-branch has a nice ability to rewrite only the last few
> commits using a range specifier like HEAD~10..HEAD.  Trying the same
> with fast-export will get you a history with only 10 commits, the
> first of which squashes all early history together.  Trying to
> duplicate the filter-branch behavior can be done, but it requires
> multiple exports with different args and usage of --export-marks and
> --import-marks; it's cumbersome and somewhat non-obvious.
> 
>  - some filters are difficult; e.g. if you want to mimic
> filter-branch's --parent-filter, or BFG's --strip-blobs-with-ids, you
> run into the issue that the fast-export stream doesn't provide the
> original sha1sums for commits or blobs and there's no easy way for you
> to associate it with the given mark.

Thanks a lot for these tips and tricks. I was aware of the empty commits
but the signed tags problem was not yet on my radar!

Thanks,
Lars


* Re: Import/Export as a fast way to purge files from Git?
  2018-10-31 19:15   ` Lars Schneider
@ 2018-11-01  7:12     ` Elijah Newren
  2018-11-11  6:23       ` [PATCH 00/10] fast export and import fixes and features Elijah Newren
  2018-11-12  9:17       ` Import/Export as a fast way to purge files from Git? Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-01  7:12 UTC
  To: Lars Schneider; +Cc: Git Mailing List, Jeff King, Taylor Blau, brian m. carlson

On Wed, Oct 31, 2018 at 12:16 PM Lars Schneider
<larsxschneider@gmail.com> wrote:
> That library sounds like a very interesting idea. Unfortunately, the
> referenced repo seems not to be available anymore:
>     git://gitorious.org/git_fast_filter/mainline.git

Yeah, gitorious went down at a time when I was busy with enough other
things that I never bothered moving my repos to a new hosting site.
Sorry about that.

I've got a copy locally, but I've been editing it heavily, without the
testing I should have in place, so I hesitate to point you at it right
now.  (Also, the old version failed to handle things like --no-data
output, which is important.)  I'll post an updated copy soon; feel
free to ping me in a week if you haven't heard anything yet.

> I very much like Python. However, more recently I started to
> write Git tools in Perl as they work out of the box on every
> machine with Git installed ... and I think Perl can be quite
> readable if no shortcuts are used :-).

Yeah, when portability matters, perl makes sense.  I thought about
switching it over, but I'm not sure I want to rewrite 1-2k lines of
code.  Especially since repo-filtering tools are kind of one-shot by
nature, and only need to be done by one person of a team, on one
specific machine, and won't affect daily development thereafter.
(Also, since I don't depend on any libraries and use only stuff from
the default python library, it ought to be relatively portable
anyway.)


* [PATCH 00/10] fast export and import fixes and features
  2018-11-01  7:12     ` Elijah Newren
@ 2018-11-11  6:23       ` Elijah Newren
  2018-11-11  6:23         ` [PATCH 01/10] git-fast-import.txt: fix documentation for --quiet option Elijah Newren
                           ` (11 more replies)
  2018-11-12  9:17       ` Import/Export as a fast way to purge files from Git? Ævar Arnfjörð Bjarmason
  1 sibling, 12 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-11  6:23 UTC
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, Elijah Newren

This is a series of ten patches representing two doc corrections, one
pedantic fix, three real bug fixes, one micro code refactor, and three
new features.  Each of these ten changes is relatively small in size.
These changes predominantly affect fast-export, but there's a couple
small changes for fast-import as well.

I could potentially split these patches up, but I'd just end up
chaining them sequentially since otherwise there'd be lots of
conflicts; having 10 different single patch series with lots of
dependencies sounded like a bigger pain to me, but let me know if you
would prefer I split them up and how you suggest doing so.

These patches were driven by the needs of git-repo-filter[1], but most
if not all of them should be independently useful.

Elijah Newren (10):
  git-fast-import.txt: fix documentation for --quiet option
  git-fast-export.txt: clarify misleading documentation about rev-list
    args
  fast-export: use value from correct enum
  fast-export: avoid dying when filtering by paths and old tags exist
  fast-export: move commit rewriting logic into a function for reuse
  fast-export: when using paths, avoid corrupt stream with non-existent
    mark
  fast-export: ensure we export requested refs
  fast-export: add --reference-excluded-parents option
  fast-export: add a --show-original-ids option to show original names
  fast-export: add --always-show-modify-after-rename

 Documentation/git-fast-export.txt |  33 ++++++-
 Documentation/git-fast-import.txt |   7 +-
 builtin/fast-export.c             | 156 +++++++++++++++++++++++-------
 fast-import.c                     |  17 ++++
 t/t9350-fast-export.sh            | 124 +++++++++++++++++++++++-
 5 files changed, 293 insertions(+), 44 deletions(-)

[1] https://github.com/newren/git-repo-filter if you're really
curious, but ***** IT HAS SEVERAL SHARP EDGES *****.  It isn't really
ready for review/testing/usage/announcing/etc; in fact, it's not quite
WIP/RFC ready.  (Further, it's not clear if it should somehow become
part of core git, should go into contrib, or just remain separate
indefinitely.)  Anyway, please do not attempt to use it for anything
real yet.  I'll send out an email when I think it's closer to ready.

-- 
2.19.1.866.g82735bcbde

^ permalink raw reply	[flat|nested] 90+ messages in thread

* [PATCH 01/10] git-fast-import.txt: fix documentation for --quiet option
  2018-11-11  6:23       ` [PATCH 00/10] fast export and import fixes and features Elijah Newren
@ 2018-11-11  6:23         ` Elijah Newren
  2018-11-11  6:33           ` Jeff King
  2018-11-11  6:23         ` [PATCH 02/10] git-fast-export.txt: clarify misleading documentation about rev-list args Elijah Newren
                           ` (10 subsequent siblings)
  11 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-11  6:23 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, Elijah Newren

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-import.txt | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index e81117d27f..7ab97745a6 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -40,9 +40,10 @@ OPTIONS
 	not contain the old commit).
 
 --quiet::
-	Disable all non-fatal output, making fast-import silent when it
-	is successful.  This option disables the output shown by
-	--stats.
+	Disable the output shown by --stats, making fast-import usually
+	be silent when it is successful.  However, if the import stream
+	has directives intended to show user output (e.g. `progress`
+	directives), the corresponding messages will still be shown.
 
 --stats::
 	Display some basic statistics about the objects fast-import has
-- 
2.19.1.866.g82735bcbde


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 02/10] git-fast-export.txt: clarify misleading documentation about rev-list args
  2018-11-11  6:23       ` [PATCH 00/10] fast export and import fixes and features Elijah Newren
  2018-11-11  6:23         ` [PATCH 01/10] git-fast-import.txt: fix documentation for --quiet option Elijah Newren
@ 2018-11-11  6:23         ` Elijah Newren
  2018-11-11  6:36           ` Jeff King
  2018-11-11  6:23         ` [PATCH 03/10] fast-export: use value from correct enum Elijah Newren
                           ` (9 subsequent siblings)
  11 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-11  6:23 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, Elijah Newren

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-export.txt | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index ce954be532..677510b7f7 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -119,7 +119,8 @@ marks the same across runs.
 	'git rev-list', that specifies the specific objects and references
 	to export.  For example, `master~10..master` causes the
 	current master reference to be exported along with all objects
-	added since its 10th ancestor commit.
+	added since its 10th ancestor commit and all files common to
+	master\~9 and master~10.
 
 EXAMPLES
 --------
-- 
2.19.1.866.g82735bcbde


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 03/10] fast-export: use value from correct enum
  2018-11-11  6:23       ` [PATCH 00/10] fast export and import fixes and features Elijah Newren
  2018-11-11  6:23         ` [PATCH 01/10] git-fast-import.txt: fix documentation for --quiet option Elijah Newren
  2018-11-11  6:23         ` [PATCH 02/10] git-fast-export.txt: clarify misleading documentation about rev-list args Elijah Newren
@ 2018-11-11  6:23         ` Elijah Newren
  2018-11-11  6:36           ` Jeff King
  2018-11-11  6:23         ` [PATCH 04/10] fast-export: avoid dying when filtering by paths and old tags exist Elijah Newren
                           ` (8 subsequent siblings)
  11 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-11  6:23 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, Elijah Newren

ABORT and ERROR happen to have the same value, but come from different
enums.  Use the one from the correct enum.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 456797c12a..1a299c2a21 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -752,7 +752,7 @@ static void handle_tag(const char *name, struct tag *tag)
 	tagged_mark = get_object_mark(tagged);
 	if (!tagged_mark) {
 		switch(tag_of_filtered_mode) {
-		case ABORT:
+		case ERROR:
 			die("tag %s tags unexported object; use "
 			    "--tag-of-filtered-object=<mode> to handle it",
 			    oid_to_hex(&tag->object.oid));
-- 
2.19.1.866.g82735bcbde


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 04/10] fast-export: avoid dying when filtering by paths and old tags exist
  2018-11-11  6:23       ` [PATCH 00/10] fast export and import fixes and features Elijah Newren
                           ` (2 preceding siblings ...)
  2018-11-11  6:23         ` [PATCH 03/10] fast-export: use value from correct enum Elijah Newren
@ 2018-11-11  6:23         ` Elijah Newren
  2018-11-11  6:44           ` Jeff King
  2018-11-11  6:23         ` [PATCH 05/10] fast-export: move commit rewriting logic into a function for reuse Elijah Newren
                           ` (7 subsequent siblings)
  11 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-11  6:23 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, Elijah Newren

If --tag-of-filtered-object=rewrite is specified along with a set of
paths to limit what is exported, then any tags pointing to old commits
that do not contain any of those specified paths cause problems.  Since
the old tagged commit is not exported, fast-export attempts to rewrite
such tags to an ancestor commit which was exported.  If no such commit
exists, then fast-export currently die()s.  Five years after the tag
rewriting logic was added to fast-export (see commit 2d8ad4691921,
"fast-export: Add a --tag-of-filtered-object option for newly dangling
tags", 2009-06-25), fast-import gained the ability to delete refs (see
commit 4ee1b225b99f, "fast-import: add support to delete refs",
2014-04-20), so now we do have a valid option to rewrite the tag to.
Delete these tags instead of dying.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c  |  9 ++++++---
 t/t9350-fast-export.sh | 20 ++++++++++++++++++++
 2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 1a299c2a21..89de9d6400 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -774,9 +774,12 @@ static void handle_tag(const char *name, struct tag *tag)
 					break;
 				if (!(p->object.flags & TREESAME))
 					break;
-				if (!p->parents)
-					die("can't find replacement commit for tag %s",
-					     oid_to_hex(&tag->object.oid));
+				if (!p->parents) {
+					printf("reset %s\nfrom %s\n\n",
+					       name, sha1_to_hex(null_sha1));
+					free(buf);
+					return;
+				}
 				p = p->parents->item;
 			}
 			tagged_mark = get_object_mark(&p->object);
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 6a392e87bc..5bf21b4908 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -325,6 +325,26 @@ test_expect_success 'rewriting tag of filtered out object' '
 )
 '
 
+test_expect_success 'rewrite tag predating pathspecs to nothing' '
+	test_create_repo rewrite_tag_predating_pathspecs &&
+	(
+		cd rewrite_tag_predating_pathspecs &&
+
+		touch ignored &&
+		git add ignored &&
+		test_commit initial &&
+
+		git tag -a -m "Some old tag" v0.0.0.0.0.0.1 &&
+
+		echo foo >bar &&
+		git add bar &&
+		test_commit add-bar &&
+
+		git fast-export --tag-of-filtered-object=rewrite --all -- bar >output &&
+		grep -A 1 refs/tags/v0.0.0.0.0.0.1 output | grep -E ^from.0{40}
+	)
+'
+
 cat > limit-by-paths/expected << EOF
 blob
 mark :1
-- 
2.19.1.866.g82735bcbde


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 05/10] fast-export: move commit rewriting logic into a function for reuse
  2018-11-11  6:23       ` [PATCH 00/10] fast export and import fixes and features Elijah Newren
                           ` (3 preceding siblings ...)
  2018-11-11  6:23         ` [PATCH 04/10] fast-export: avoid dying when filtering by paths and old tags exist Elijah Newren
@ 2018-11-11  6:23         ` Elijah Newren
  2018-11-11  6:47           ` Jeff King
  2018-11-11  6:23         ` [PATCH 06/10] fast-export: when using paths, avoid corrupt stream with non-existent mark Elijah Newren
                           ` (6 subsequent siblings)
  11 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-11  6:23 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, Elijah Newren

Logic to replace a filtered commit with an unfiltered ancestor is useful
elsewhere; put it into a function we can call.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c | 37 ++++++++++++++++++++++---------------
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 89de9d6400..a3c044b0af 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -187,6 +187,22 @@ static int get_object_mark(struct object *object)
 	return ptr_to_mark(decoration);
 }
 
+static struct commit *rewrite_commit(struct commit *p)
+{
+	for (;;) {
+		if (p->parents && p->parents->next)
+			break;
+		if (p->object.flags & UNINTERESTING)
+			break;
+		if (!(p->object.flags & TREESAME))
+			break;
+		if (!p->parents)
+			return NULL;
+		p = p->parents->item;
+	}
+	return p;
+}
+
 static void show_progress(void)
 {
 	static int counter = 0;
@@ -766,21 +782,12 @@ static void handle_tag(const char *name, struct tag *tag)
 				    oid_to_hex(&tag->object.oid),
 				    type_name(tagged->type));
 			}
-			p = (struct commit *)tagged;
-			for (;;) {
-				if (p->parents && p->parents->next)
-					break;
-				if (p->object.flags & UNINTERESTING)
-					break;
-				if (!(p->object.flags & TREESAME))
-					break;
-				if (!p->parents) {
-					printf("reset %s\nfrom %s\n\n",
-					       name, sha1_to_hex(null_sha1));
-					free(buf);
-					return;
-				}
-				p = p->parents->item;
+			p = rewrite_commit((struct commit *)tagged);
+			if (!p) {
+				printf("reset %s\nfrom %s\n\n",
+				       name, sha1_to_hex(null_sha1));
+				free(buf);
+				return;
 			}
 			tagged_mark = get_object_mark(&p->object);
 		}
-- 
2.19.1.866.g82735bcbde


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 06/10] fast-export: when using paths, avoid corrupt stream with non-existent mark
  2018-11-11  6:23       ` [PATCH 00/10] fast export and import fixes and features Elijah Newren
                           ` (4 preceding siblings ...)
  2018-11-11  6:23         ` [PATCH 05/10] fast-export: move commit rewriting logic into a function for reuse Elijah Newren
@ 2018-11-11  6:23         ` Elijah Newren
  2018-11-11  6:53           ` Jeff King
  2018-11-11  6:23         ` [PATCH 07/10] fast-export: ensure we export requested refs Elijah Newren
                           ` (5 subsequent siblings)
  11 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-11  6:23 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, Elijah Newren

If file paths are specified to fast-export and multiple refs point to a
commit that does not touch any of the relevant file paths, then
fast-export can hit problems.  fast-export has a list of additional refs
that it needs to explicitly set after exporting all blobs and commits,
and when it tries to get_object_mark() on the relevant commit, it can
get a mark of 0, i.e. "not found", because the commit in question did
not touch the relevant paths and thus was not exported.  Trying to
import a stream with a mark corresponding to an unexported object will
cause fast-import to crash.

Avoid this problem by taking the commit the ref points to and finding an
ancestor of it that was exported, and make the ref point to that commit
instead.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
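NOTE: for readers who want the ref-fixup idea without reading the diff,
here is a minimal sketch.  It is not the code in this patch: the
toy_commit struct and the toy_* functions are hypothetical stand-ins, and
the model assumes a single-parent history in which every commit that
touched a requested path was exported and has a nonzero mark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define TOY_NULL_HEX "0000000000000000000000000000000000000000"

/* Hypothetical, simplified commit: linear history only. */
struct toy_commit {
	struct toy_commit *parent; /* NULL for a root commit */
	int treesame;              /* 1 if it touched none of the paths */
	int mark;                  /* 0 means "was not exported" */
};

/*
 * Walk back from tip to the nearest commit that touched the requested
 * paths and return its mark, or 0 if no such ancestor exists.
 */
int toy_reset_target(struct toy_commit *tip)
{
	struct toy_commit *c = tip;

	assert(tip);
	while (c && c->treesame)
		c = c->parent;
	return c ? c->mark : 0;
}

/* Emit a reset to the exported ancestor, or a ref deletion if none. */
void toy_print_reset(FILE *out, const char *name, struct toy_commit *tip)
{
	int mark = toy_reset_target(tip);

	if (mark)
		fprintf(out, "reset %s\nfrom :%d\n\n", name, mark);
	else
		fprintf(out, "reset %s\nfrom %s\n\n", name, TOY_NULL_HEX);
}
```

The point of the sketch is only that a ref is never written with a mark
of 0; it either moves to an exported ancestor or is deleted.
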
 builtin/fast-export.c  | 13 ++++++++++++-
 t/t9350-fast-export.sh | 24 ++++++++++++++++++++++++
 2 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index a3c044b0af..5648a8ce9c 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -900,7 +900,18 @@ static void handle_tags_and_duplicates(void)
 			if (anonymize)
 				name = anonymize_refname(name);
 			/* create refs pointing to already seen commits */
-			commit = (struct commit *)object;
+			commit = rewrite_commit((struct commit *)object);
+			if (!commit) {
+				/*
+				 * Neither this object nor any of its
+				 * ancestors touch any relevant paths, so
+				 * it has been filtered to nothing.  Delete
+				 * it.
+				 */
+				printf("reset %s\nfrom %s\n\n",
+				       name, sha1_to_hex(null_sha1));
+				continue;
+			}
 			printf("reset %s\nfrom :%d\n\n", name,
 			       get_object_mark(&commit->object));
 			show_progress();
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 5bf21b4908..dbb560c110 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -386,6 +386,30 @@ test_expect_success 'path limiting with import-marks does not lose unmodified fi
 	grep file0 actual
 '
 
+test_expect_success 'avoid corrupt stream with non-existent mark' '
+	test_create_repo avoid_non_existent_mark &&
+	(
+		cd avoid_non_existent_mark &&
+
+		touch important-path &&
+		git add important-path &&
+		test_commit initial &&
+
+		touch ignored &&
+		git add ignored &&
+		test_commit whatever &&
+
+		git branch A &&
+		git branch B &&
+
+		echo foo >>important-path &&
+		git add important-path &&
+		test_commit more changes &&
+
+		git fast-export --all -- important-path | git fast-import --force
+	)
+'
+
 test_expect_success 'full-tree re-shows unmodified files'        '
 	git checkout -f simple &&
 	git fast-export --full-tree simple >actual &&
-- 
2.19.1.866.g82735bcbde


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 07/10] fast-export: ensure we export requested refs
  2018-11-11  6:23       ` [PATCH 00/10] fast export and import fixes and features Elijah Newren
                           ` (5 preceding siblings ...)
  2018-11-11  6:23         ` [PATCH 06/10] fast-export: when using paths, avoid corrupt stream with non-existent mark Elijah Newren
@ 2018-11-11  6:23         ` Elijah Newren
  2018-11-11  7:02           ` Jeff King
  2018-11-11  6:23         ` [PATCH 08/10] fast-export: add --reference-excluded-parents option Elijah Newren
                           ` (4 subsequent siblings)
  11 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-11  6:23 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, Elijah Newren

If file paths are specified to fast-export and a ref points to a commit
that does not touch any of the relevant paths, then that ref would
sometimes fail to be exported.  (This depends on whether any ancestors
of the commit which do touch the relevant paths would be exported with
that same ref name or a different ref name.)  To avoid this problem,
put *all* specified refs into extra_refs to start, and then as we export
each commit, remove the refname used in the 'commit $REFNAME' directive
from extra_refs.  Then, in handle_tags_and_duplicates() we know which
refs actually do need a manual reset directive in order to be included.

This means that we do need some special handling for excluded refs; e.g.
if someone runs
   git fast-export ^master master
then they've asked for master to be exported, but they have also asked
for the commit which master points to and all of its history to be
excluded.  That logically means ref deletion.  Previously, such refs
were just silently omitted from being exported despite having been
explicitly requested for export.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
NOTE: I was hoping the strmap API proposal would materialize, but I either
missed it or it hasn't shown up.  The usage of string_list in this patch
would be better replaced by what Peff suggested.
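
The bookkeeping described in the commit message can be sketched roughly
as below.  This is only an illustration under assumptions: toy_ref_list
and its helpers are hypothetical, use a fixed-size array instead of
string_list, and do not preserve ordering on removal.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_REFS 16

/* Hypothetical stand-in for the extra_refs string_list. */
struct toy_ref_list {
	const char *names[TOY_MAX_REFS];
	int nr;
};

/* Record a requested ref; duplicates are ignored. */
void toy_ref_add(struct toy_ref_list *l, const char *name)
{
	int i;

	for (i = 0; i < l->nr; i++)
		if (!strcmp(l->names[i], name))
			return;
	assert(l->nr < TOY_MAX_REFS);
	l->names[l->nr++] = name;
}

/*
 * Called when a 'commit <name>' directive has been written: the ref is
 * already covered, so it no longer needs a manual reset at the end.
 * The last entry is moved into the freed slot, so order is not kept.
 */
void toy_ref_remove(struct toy_ref_list *l, const char *name)
{
	int i;

	for (i = 0; i < l->nr; i++) {
		if (strcmp(l->names[i], name))
			continue;
		l->names[i] = l->names[--l->nr];
		return;
	}
}
```

Whatever is left in the list once all commits have been written is the
set of refs that still need an explicit reset directive.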

 builtin/fast-export.c  | 48 +++++++++++++++++++++++++++++++-----------
 t/t9350-fast-export.sh | 16 +++++++++++---
 2 files changed, 49 insertions(+), 15 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 5648a8ce9c..0d0bbd9445 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -38,6 +38,7 @@ static int use_done_feature;
 static int no_data;
 static int full_tree;
 static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
+static struct string_list tag_refs = STRING_LIST_INIT_NODUP;
 static struct refspec refspecs = REFSPEC_INIT_FETCH;
 static int anonymize;
 static struct revision_sources revision_sources;
@@ -611,6 +612,7 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 			export_blob(&diff_queued_diff.queue[i]->two->oid);
 
 	refname = *revision_sources_at(&revision_sources, commit);
+	string_list_remove(&extra_refs, refname, 0);
 	if (anonymize) {
 		refname = anonymize_refname(refname);
 		anonymize_ident_line(&committer, &committer_end);
@@ -814,7 +816,7 @@ static struct commit *get_commit(struct rev_cmdline_entry *e, char *full_name)
 		/* handle nested tags */
 		while (tag && tag->object.type == OBJ_TAG) {
 			parse_object(the_repository, &tag->object.oid);
-			string_list_append(&extra_refs, full_name)->util = tag;
+			string_list_append(&tag_refs, full_name)->util = tag;
 			tag = (struct tag *)tag->tagged;
 		}
 		if (!tag)
@@ -873,25 +875,30 @@ static void get_tags_and_duplicates(struct rev_cmdline_info *info)
 		}
 
 		/*
-		 * This ref will not be updated through a commit, lets make
-		 * sure it gets properly updated eventually.
+		 * Make sure this ref gets properly updated eventually, whether
+		 * through a commit or manually at the end.
 		 */
-		if (*revision_sources_at(&revision_sources, commit) ||
-		    commit->object.flags & SHOWN)
+		if (e->item->type != OBJ_TAG)
 			string_list_append(&extra_refs, full_name)->util = commit;
+
 		if (!*revision_sources_at(&revision_sources, commit))
 			*revision_sources_at(&revision_sources, commit) = full_name;
 	}
+
+	string_list_sort(&extra_refs);
+	string_list_remove_duplicates(&extra_refs, 0);
 }
 
-static void handle_tags_and_duplicates(void)
+static void handle_tags_and_duplicates(struct string_list *extras)
 {
 	struct commit *commit;
 	int i;
 
-	for (i = extra_refs.nr - 1; i >= 0; i--) {
-		const char *name = extra_refs.items[i].string;
-		struct object *object = extra_refs.items[i].util;
+	for (i = extras->nr - 1; i >= 0; i--) {
+		const char *name = extras->items[i].string;
+		struct object *object = extras->items[i].util;
+		int mark;
+
 		switch (object->type) {
 		case OBJ_TAG:
 			handle_tag(name, (struct tag *)object);
@@ -912,8 +919,24 @@ static void handle_tags_and_duplicates(void)
 				       name, sha1_to_hex(null_sha1));
 				continue;
 			}
-			printf("reset %s\nfrom :%d\n\n", name,
-			       get_object_mark(&commit->object));
+
+			mark = get_object_mark(&commit->object);
+			if (!mark) {
+				/*
+				 * Getting here means we have a commit which
+				 * was excluded by a negative refspec (e.g.
+				 * fast-export ^master master).  If the user
+				 * wants the branch exported but every commit
+				 * in its history to be deleted, that sounds
+				 * like a ref deletion to me.
+				 */
+				printf("reset %s\nfrom %s\n\n",
+				       name, sha1_to_hex(null_sha1));
+				continue;
+			}
+
+			printf("reset %s\nfrom :%d\n\n", name, mark
+			       );
 			show_progress();
 			break;
 		}
@@ -1101,7 +1124,8 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		}
 	}
 
-	handle_tags_and_duplicates();
+	handle_tags_and_duplicates(&extra_refs);
+	handle_tags_and_duplicates(&tag_refs);
 	handle_deletes();
 
 	if (export_filename && lastimportid != last_idnum)
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index dbb560c110..a0c93f2212 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -552,10 +552,20 @@ test_expect_success 'use refspec' '
 	test_cmp expected actual
 '
 
-test_expect_success 'delete refspec' '
+test_expect_success 'delete ref because entire history excluded' '
 	git branch to-delete &&
-	git fast-export --refspec :refs/heads/to-delete to-delete ^to-delete > actual &&
-	cat > expected <<-EOF &&
+	git fast-export to-delete ^to-delete >actual &&
+	cat >expected <<-EOF &&
+	reset refs/heads/to-delete
+	from 0000000000000000000000000000000000000000
+
+	EOF
+	test_cmp expected actual
+'
+
+test_expect_success 'delete refspec' '
+	git fast-export --refspec :refs/heads/to-delete >actual &&
+	cat >expected <<-EOF &&
 	reset refs/heads/to-delete
 	from 0000000000000000000000000000000000000000
 
-- 
2.19.1.866.g82735bcbde


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 08/10] fast-export: add --reference-excluded-parents option
  2018-11-11  6:23       ` [PATCH 00/10] fast export and import fixes and features Elijah Newren
                           ` (6 preceding siblings ...)
  2018-11-11  6:23         ` [PATCH 07/10] fast-export: ensure we export requested refs Elijah Newren
@ 2018-11-11  6:23         ` Elijah Newren
  2018-11-11  7:11           ` Jeff King
  2018-11-11  6:23         ` [PATCH 09/10] fast-export: add a --show-original-ids option to show original names Elijah Newren
                           ` (3 subsequent siblings)
  11 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-11  6:23 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, Elijah Newren

git filter-branch has a nifty feature allowing you to rewrite, e.g. just
the last 8 commits of a linear history
  git filter-branch $OPTIONS HEAD~8..HEAD

If you try the same with git fast-export, you instead get a history of
only 8 commits, with HEAD~7 being rewritten into a root commit.  There
are two alternatives:

  1) Don't use the negative revision specification, and when you're
     filtering the output to make modifications to the last 8 commits,
     just be careful to not modify any earlier commits somehow.

  2) First run 'git fast-export --export-marks=somefile HEAD~8', then
     run 'git fast-export --import-marks=somefile HEAD~8..HEAD'.

Both are more error prone than I'd like (the first for obvious reasons;
with the second option I have sometimes accidentally included too many
revisions in the first command and then found that the corresponding
extra revisions were not exported by the second command and thus were
not modified as I expected).  Also, both are poor from a performance
perspective.

Add a new --reference-excluded-parents option which will cause
fast-export to refer to commits outside the specified rev-list-args
range by their sha1sum.  Such a stream will only be useful in a
repository which already contains the necessary commits (much like the
restriction imposed when using --no-data).

Signed-off-by: Elijah Newren <newren@gmail.com>
---
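NOTE: the effect on the parent lines of an exported commit can be
sketched as follows.  This is a simplified model rather than the code in
the diff: toy_parent and toy_format_parents are hypothetical names, the
output goes to a buffer instead of stdout, and anonymization is ignored.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for a parent commit. */
struct toy_parent {
	int mark;        /* 0 if the parent is outside the exported range */
	const char *hex; /* full object id as a hex string */
};

/*
 * Write the "from"/"merge" lines for nr parents into buf.  Returns the
 * number of parent lines written, or -1 if buf is too small.
 */
int toy_format_parents(char *buf, size_t len, const struct toy_parent *p,
		       int nr, int reference_excluded)
{
	int i, written = 0;
	size_t used = 0;

	assert(buf && len > 0);
	buf[0] = '\0';
	for (i = 0; i < nr; i++) {
		const char *verb = written ? "merge" : "from";
		int n;

		if (!p[i].mark && !reference_excluded)
			continue; /* default: the parent is dropped */
		if (p[i].mark)
			n = snprintf(buf + used, len - used, "%s :%d\n",
				     verb, p[i].mark);
		else
			n = snprintf(buf + used, len - used, "%s %s\n",
				     verb, p[i].hex);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
		written++;
	}
	return written;
}
```

With the flag off, a commit whose only parent is excluded gets no "from"
line at all, which is how it ends up as a root commit.
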
 Documentation/git-fast-export.txt | 16 ++++++++++--
 builtin/fast-export.c             | 42 +++++++++++++++++++++++--------
 t/t9350-fast-export.sh            | 11 ++++++++
 3 files changed, 57 insertions(+), 12 deletions(-)

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index 677510b7f7..2916096bdd 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -110,6 +110,17 @@ marks the same across runs.
 	the shape of the history and stored tree.  See the section on
 	`ANONYMIZING` below.
 
+--reference-excluded-parents::
+	By default, running a command such as `git fast-export
+	master~5..master` will not include the commit master\~5 and
+	will make master\~4 no longer have master\~5 as a parent (though
+	both the old master\~4 and new master~4 will have all the same
+	files).  Use --reference-excluded-parents to instead have the
+	stream refer to commits in the excluded range of history
+	by their sha1sum.  Note that the resulting stream can only be
+	used by a repository which already contains the necessary
+	parent commits.
+
 --refspec::
 	Apply the specified refspec to each ref exported. Multiple of them can
 	be specified.
@@ -119,8 +130,9 @@ marks the same across runs.
 	'git rev-list', that specifies the specific objects and references
 	to export.  For example, `master~10..master` causes the
 	current master reference to be exported along with all objects
-	added since its 10th ancestor commit and all files common to
-	master\~9 and master~10.
+	added since its 10th ancestor commit and (unless the
+	--reference-excluded-parents option is specified) all files
+	common to master\~9 and master~10.
 
 EXAMPLES
 --------
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 0d0bbd9445..ea9c5b1c00 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -37,6 +37,7 @@ static int fake_missing_tagger;
 static int use_done_feature;
 static int no_data;
 static int full_tree;
+static int reference_excluded_commits;
 static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
 static struct string_list tag_refs = STRING_LIST_INIT_NODUP;
 static struct refspec refspecs = REFSPEC_INIT_FETCH;
@@ -596,7 +597,8 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 		message += 2;
 
 	if (commit->parents &&
-	    get_object_mark(&commit->parents->item->object) != 0 &&
+	    (get_object_mark(&commit->parents->item->object) != 0 ||
+	     reference_excluded_commits) &&
 	    !full_tree) {
 		parse_commit_or_die(commit->parents->item);
 		diff_tree_oid(get_commit_tree_oid(commit->parents->item),
@@ -638,13 +640,21 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 	unuse_commit_buffer(commit, commit_buffer);
 
 	for (i = 0, p = commit->parents; p; p = p->next) {
-		int mark = get_object_mark(&p->item->object);
-		if (!mark)
+		struct object *obj = &p->item->object;
+		int mark = get_object_mark(obj);
+
+		if (!mark && !reference_excluded_commits)
 			continue;
 		if (i == 0)
-			printf("from :%d\n", mark);
+			printf("from ");
+		else
+			printf("merge ");
+		if (mark)
+			printf(":%d\n", mark);
 		else
-			printf("merge :%d\n", mark);
+			printf("%s\n", sha1_to_hex(anonymize ?
+						   anonymize_sha1(&obj->oid) :
+						   obj->oid.hash));
 		i++;
 	}
 
@@ -925,13 +935,22 @@ static void handle_tags_and_duplicates(struct string_list *extras)
 				/*
 				 * Getting here means we have a commit which
 				 * was excluded by a negative refspec (e.g.
-				 * fast-export ^master master).  If the user
+				 * fast-export ^master master).  If we are
+				 * referencing excluded commits, set the ref
+				 * to the exact commit.  Otherwise, the user
 				 * wants the branch exported but every commit
-				 * in its history to be deleted, that sounds
-				 * like a ref deletion to me.
+				 * in its history to be deleted, which basically
+				 * just means deletion of the ref.
 				 */
-				printf("reset %s\nfrom %s\n\n",
-				       name, sha1_to_hex(null_sha1));
+				if (!reference_excluded_commits) {
+					/* delete the ref */
+					printf("reset %s\nfrom %s\n\n",
+					       name, sha1_to_hex(null_sha1));
+					continue;
+				}
+				/* set ref to commit using oid, not mark */
+				printf("reset %s\nfrom %s\n\n", name,
+				       sha1_to_hex(commit->object.oid.hash));
 				continue;
 			}
 
@@ -1068,6 +1087,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		OPT_STRING_LIST(0, "refspec", &refspecs_list, N_("refspec"),
 			     N_("Apply refspec to exported refs")),
 		OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")),
+		OPT_BOOL(0, "reference-excluded-parents",
+			 &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by sha1sum")),
+
 		OPT_END()
 	};
 
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index a0c93f2212..c2f40d6a40 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -66,6 +66,17 @@ test_expect_success 'fast-export master~2..master' '
 
 '
 
+test_expect_success 'fast-export --reference-excluded-parents master~2..master' '
+
+	git fast-export --reference-excluded-parents master~2..master >actual &&
+	grep commit.refs/heads/master actual >commit-count &&
+	test_line_count = 2 commit-count &&
+	sed "s/master/rewrite/" actual |
+		(cd new &&
+		 git fast-import &&
+		 test $MASTER = $(git rev-parse --verify refs/heads/rewrite))
+'
+
 test_expect_success 'iso-8859-1' '
 
 	git config i18n.commitencoding ISO8859-1 &&
-- 
2.19.1.866.g82735bcbde


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 09/10] fast-export: add a --show-original-ids option to show original names
  2018-11-11  6:23       ` [PATCH 00/10] fast export and import fixes and features Elijah Newren
                           ` (7 preceding siblings ...)
  2018-11-11  6:23         ` [PATCH 08/10] fast-export: add --reference-excluded-parents option Elijah Newren
@ 2018-11-11  6:23         ` Elijah Newren
  2018-11-11  7:20           ` Jeff King
  2018-11-11  6:23         ` [PATCH 10/10] fast-export: add --always-show-modify-after-rename Elijah Newren
                           ` (2 subsequent siblings)
  11 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-11  6:23 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, Elijah Newren

Knowing the original names (hashes) of commits, blobs, and tags can
sometimes enable post-filtering that would otherwise be difficult or
impossible.  In particular, the desire to rewrite commit messages which
refer to other prior commits (on top of whatever other filtering is
being done) is very difficult without knowing the original names of each
commit.

This commit teaches a new --show-original-ids option to fast-export
which will make it add an 'originally <hash>' line to blobs, commits, and
tags.  It also teaches fast-import to parse (and ignore) such lines.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
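NOTE: an intermediary filter might pick up the new directive roughly as
sketched below.  This is an assumption-laden illustration, not part of
the patch: toy_parse_originally is a hypothetical helper, it assumes
40-character hex ids, and the caller is assumed to skip the contents of
'data' blocks so that blob or message bodies are never passed to it.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define TOY_HEX_LEN 40

/*
 * If line is "originally <40 hex digits>" (optionally newline
 * terminated), copy the id into out and return 1; otherwise return 0.
 */
int toy_parse_originally(const char *line, char out[TOY_HEX_LEN + 1])
{
	static const char prefix[] = "originally ";
	size_t plen = sizeof(prefix) - 1;
	size_t i;

	assert(line && out);
	if (strncmp(line, prefix, plen))
		return 0;
	line += plen;
	/* isxdigit() is false for '\0', so short lines stop here. */
	for (i = 0; i < TOY_HEX_LEN; i++)
		if (!isxdigit((unsigned char)line[i]))
			return 0;
	if (line[TOY_HEX_LEN] != '\n' && line[TOY_HEX_LEN] != '\0')
		return 0;
	memcpy(out, line, TOY_HEX_LEN);
	out[TOY_HEX_LEN] = '\0';
	return 1;
}
```

A filter could pair each id found this way with the preceding 'mark'
line to build an old-name to new-mark table.
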
 Documentation/git-fast-export.txt |  7 +++++++
 builtin/fast-export.c             | 20 +++++++++++++++-----
 fast-import.c                     | 17 +++++++++++++++++
 t/t9350-fast-export.sh            | 17 +++++++++++++++++
 4 files changed, 56 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index 2916096bdd..4e40f0b99a 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -121,6 +121,13 @@ marks the same across runs.
 	used by a repository which already contains the necessary
 	parent commits.
 
+--show-original-ids::
+	Add an extra directive to the output for commits and blobs,
+	`originally <SHA1SUM>`.  While such directives will likely be
+	ignored by importers such as git-fast-import, it may be useful
+	for intermediary filters (e.g. for rewriting commit messages
+	which refer to older commits, or for stripping blobs by id).
+
 --refspec::
 	Apply the specified refspec to each ref exported. Multiple of them can
 	be specified.
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index ea9c5b1c00..cc01dcc90c 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -38,6 +38,7 @@ static int use_done_feature;
 static int no_data;
 static int full_tree;
 static int reference_excluded_commits;
+static int show_original_ids;
 static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
 static struct string_list tag_refs = STRING_LIST_INIT_NODUP;
 static struct refspec refspecs = REFSPEC_INIT_FETCH;
@@ -271,7 +272,10 @@ static void export_blob(const struct object_id *oid)
 
 	mark_next_object(object);
 
-	printf("blob\nmark :%"PRIu32"\ndata %lu\n", last_idnum, size);
+	printf("blob\nmark :%"PRIu32"\n", last_idnum);
+	if (show_original_ids)
+		printf("originally %s\n", oid_to_hex(oid));
+	printf("data %lu\n", size);
 	if (size && fwrite(buf, size, 1, stdout) != 1)
 		die_errno("could not write blob '%s'", oid_to_hex(oid));
 	printf("\n");
@@ -628,8 +632,10 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 		reencoded = reencode_string(message, "UTF-8", encoding);
 	if (!commit->parents)
 		printf("reset %s\n", refname);
-	printf("commit %s\nmark :%"PRIu32"\n%.*s\n%.*s\ndata %u\n%s",
-	       refname, last_idnum,
+	printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
+	if (show_original_ids)
+		printf("originally %s\n", oid_to_hex(&commit->object.oid));
+	printf("%.*s\n%.*s\ndata %u\n%s",
 	       (int)(author_end - author), author,
 	       (int)(committer_end - committer), committer,
 	       (unsigned)(reencoded
@@ -807,8 +813,10 @@ static void handle_tag(const char *name, struct tag *tag)
 
 	if (starts_with(name, "refs/tags/"))
 		name += 10;
-	printf("tag %s\nfrom :%d\n%.*s%sdata %d\n%.*s\n",
-	       name, tagged_mark,
+	printf("tag %s\nfrom :%d\n", name, tagged_mark);
+	if (show_original_ids)
+		printf("originally %s\n", oid_to_hex(&tag->object.oid));
+	printf("%.*s%sdata %d\n%.*s\n",
 	       (int)(tagger_end - tagger), tagger,
 	       tagger == tagger_end ? "" : "\n",
 	       (int)message_size, (int)message_size, message ? message : "");
@@ -1089,6 +1097,8 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")),
 		OPT_BOOL(0, "reference-excluded-parents",
 			 &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by sha1sum")),
+		OPT_BOOL(0, "show-original-ids", &show_original_ids,
+			    N_("Show original sha1sums of blobs/commits")),
 
 		OPT_END()
 	};
diff --git a/fast-import.c b/fast-import.c
index 95600c78e0..232b6a8b8d 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -14,11 +14,13 @@ Format of STDIN stream:
 
   new_blob ::= 'blob' lf
     mark?
+    originally?
     file_content;
   file_content ::= data;
 
   new_commit ::= 'commit' sp ref_str lf
     mark?
+    originally?
     ('author' (sp name)? sp '<' email '>' sp when lf)?
     'committer' (sp name)? sp '<' email '>' sp when lf
     commit_msg
@@ -49,6 +51,7 @@ Format of STDIN stream:
 
   new_tag ::= 'tag' sp tag_str lf
     'from' sp commit-ish lf
+    originally?
     ('tagger' (sp name)? sp '<' email '>' sp when lf)?
     tag_msg;
   tag_msg ::= data;
@@ -73,6 +76,8 @@ Format of STDIN stream:
   data ::= (delimited_data | exact_data)
     lf?;
 
+  originally ::= 'originally' sp not_lf+ lf
+
     # note: delim may be any string but must not contain lf.
     # data_line may contain any data but must not be exactly
     # delim.
@@ -1968,6 +1973,13 @@ static void parse_mark(void)
 		next_mark = 0;
 }
 
+static void parse_original_identifier(void)
+{
+	const char *v;
+	if (skip_prefix(command_buf.buf, "originally ", &v))
+		read_next_command();
+}
+
 static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res)
 {
 	const char *data;
@@ -2110,6 +2122,7 @@ static void parse_new_blob(void)
 {
 	read_next_command();
 	parse_mark();
+	parse_original_identifier();
 	parse_and_store_blob(&last_blob, NULL, next_mark);
 }
 
@@ -2733,6 +2746,7 @@ static void parse_new_commit(const char *arg)
 
 	read_next_command();
 	parse_mark();
+	parse_original_identifier();
 	if (skip_prefix(command_buf.buf, "author ", &v)) {
 		author = parse_ident(v);
 		read_next_command();
@@ -2865,6 +2879,9 @@ static void parse_new_tag(const char *arg)
 		die("Invalid ref name or SHA1 expression: %s", from);
 	read_next_command();
 
+	/* originally ... */
+	parse_original_identifier();
+
 	/* tagger ... */
 	if (skip_prefix(command_buf.buf, "tagger ", &v)) {
 		tagger = parse_ident(v);
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index c2f40d6a40..5ad6669910 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -77,6 +77,23 @@ test_expect_success 'fast-export --reference-excluded-parents master~2..master'
 		 test $MASTER = $(git rev-parse --verify refs/heads/rewrite))
 '
 
+test_expect_success 'fast-export --show-original-ids' '
+
+	git fast-export --show-original-ids master >output &&
+	grep ^originally output| sed -e s/^originally.// | sort >actual &&
+	git rev-list --objects master muss >objects-and-names &&
+	awk "{print \$1}" objects-and-names | sort >commits-trees-blobs &&
+	comm -23 actual commits-trees-blobs >unfound &&
+	test_must_be_empty unfound
+'
+
+test_expect_success 'fast-export --show-original-ids | git fast-import' '
+
+	git fast-export --show-original-ids master muss | git fast-import --quiet &&
+	test $MASTER = $(git rev-parse --verify refs/heads/master) &&
+	test $MUSS = $(git rev-parse --verify refs/tags/muss)
+'
+
 test_expect_success 'iso-8859-1' '
 
 	git config i18n.commitencoding ISO8859-1 &&
-- 
2.19.1.866.g82735bcbde


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 10/10] fast-export: add --always-show-modify-after-rename
  2018-11-11  6:23       ` [PATCH 00/10] fast export and import fixes and features Elijah Newren
                           ` (8 preceding siblings ...)
  2018-11-11  6:23         ` [PATCH 09/10] fast-export: add a --show-original-ids option to show original names Elijah Newren
@ 2018-11-11  6:23         ` Elijah Newren
  2018-11-11  7:23           ` Jeff King
  2018-11-11  7:27         ` [PATCH 00/10] fast export and import fixes and features Jeff King
  2018-11-14  0:25         ` [PATCH v2 00/11] " Elijah Newren
  11 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-11  6:23 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, Elijah Newren

fast-export output is traditionally used as an input to a fast-import
program, but it is also useful to help gather statistics about the
history of a repository (particularly when --no-data is also passed).
For example, two of the types of information we may want to collect
could include:
  1) general information about renames that have occurred
  2) what the biggest objects in a repository are and what names
     they appear under.

The first bit of information can be gathered by just passing -M to
fast-export.  The second piece of information can partially be gotten
from running
    git cat-file --batch-check --batch-all-objects
However, that only shows what the biggest objects in the repository are
and their sizes, not what names those objects appear as or what commits
they were introduced in.  We can get that information from fast-export,
but when we only see
    R oldname newname
instead of
    R oldname newname
    M 100644 $SHA1 newname
then it makes the job more difficult.  Add an option which allows us to
force the latter output even when commits have exact renames of files.
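
To make the statistics use case concrete, here is a minimal sketch of a
consumer that lists which blob ids appear under which paths.  The name
list_blob_paths is hypothetical; the sketch assumes the stream was made
with --no-data (so 'M' lines carry object ids rather than marks) and that
every data payload ends in a newline.

```shell
# Hypothetical helper, not part of git: print unique "<blob-id> <path>"
# pairs from the 'M' directives of a --no-data fast-export stream.
list_blob_paths () {
	LC_ALL=C awk '
		# Skip commit and tag messages so their text is not parsed.
		skip > 0        { skip -= length($0) + 1; next }
		/^data [0-9]+$/ { skip = $2 + 0; next }

		/^M [0-7]+ [0-9a-f]+ / {
			id = $3
			# Drop "M <mode> <id> " and keep the rest, so that
			# paths containing spaces survive intact.
			sub(/^M [0-7]+ [0-9a-f]+ /, "")
			print id, $0
		}
	' | sort -u
}
```

Fed with
    git fast-export -M --no-data --always-show-modify-after-rename --all
the output could then be matched against the sizes reported by
'git cat-file --batch-check --batch-all-objects'.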

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-export.txt | 11 ++++++++++
 builtin/fast-export.c             |  7 +++++-
 t/t9350-fast-export.sh            | 36 +++++++++++++++++++++++++++++++
 3 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index 4e40f0b99a..946a5aee1f 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -128,6 +128,17 @@ marks the same across runs.
 	for intermediary filters (e.g. for rewriting commit messages
 	which refer to older commits, or for stripping blobs by id).
 
+--always-show-modify-after-rename::
+	When a rename is detected, fast-export normally issues both a
+	'R' (rename) and a 'M' (modify) directive.  However, if the
+	contents of the old and new filename match exactly, it will
+	only issue the rename directive.  Use this flag to have it
+	always issue the modify directive after the rename, which may
+	be useful for tools which are using the fast-export stream as
+	a mechanism for gathering statistics about a repository.  Note
+	that this option only has effect when rename detection is
+	active (see the -M option).
+
 --refspec::
 	Apply the specified refspec to each ref exported. Multiple of them can
 	be specified.
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index cc01dcc90c..db606d1fd0 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -38,6 +38,7 @@ static int use_done_feature;
 static int no_data;
 static int full_tree;
 static int reference_excluded_commits;
+static int always_show_modify_after_rename;
 static int show_original_ids;
 static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
 static struct string_list tag_refs = STRING_LIST_INIT_NODUP;
@@ -407,7 +408,8 @@ static void show_filemodify(struct diff_queue_struct *q,
 				putchar('\n');
 
 				if (oideq(&ospec->oid, &spec->oid) &&
-				    ospec->mode == spec->mode)
+				    ospec->mode == spec->mode &&
+				    !always_show_modify_after_rename)
 					break;
 			}
 			/* fallthrough */
@@ -1099,6 +1101,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 			 &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by sha1sum")),
 		OPT_BOOL(0, "show-original-ids", &show_original_ids,
 			    N_("Show original sha1sums of blobs/commits")),
+		OPT_BOOL(0, "always-show-modify-after-rename",
+			    &always_show_modify_after_rename,
+			 N_("Always provide 'M' directive after 'R'")),
 
 		OPT_END()
 	};
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 5ad6669910..d0c30672ac 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -638,4 +638,40 @@ test_expect_success 'merge commit gets exported with --import-marks' '
 	)
 '
 
+test_expect_success 'rename detection and --always-show-modify-after-rename' '
+	test_create_repo renames &&
+	(
+		cd renames &&
+		test_seq 0  9  >single_digit &&
+		test_seq 10 98 >double_digit &&
+		git add . &&
+		git commit -m initial &&
+
+		echo 99 >>double_digit &&
+		git mv single_digit single-digit &&
+		git mv double_digit double-digit &&
+		git add double-digit &&
+		git commit -m renames &&
+
+		# First, check normal fast-export -M output
+		git fast-export -M --no-data master >out &&
+
+		grep double-digit out >out2 &&
+		test_line_count = 2 out2 &&
+
+		grep single-digit out >out2 &&
+		test_line_count = 1 out2 &&
+
+		# Now, test with --always-show-modify-after-rename; should
+		# have an extra "M" directive for "single-digit".
+		git fast-export -M --no-data --always-show-modify-after-rename master >out &&
+
+		grep double-digit out >out2 &&
+		test_line_count = 2 out2 &&
+
+		grep single-digit out >out2 &&
+		test_line_count = 2 out2
+	)
+'
+
 test_done
-- 
2.19.1.866.g82735bcbde


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: [PATCH 01/10] git-fast-import.txt: fix documentation for --quiet option
  2018-11-11  6:23         ` [PATCH 01/10] git-fast-import.txt: fix documentation for --quiet option Elijah Newren
@ 2018-11-11  6:33           ` Jeff King
  0 siblings, 0 replies; 90+ messages in thread
From: Jeff King @ 2018-11-11  6:33 UTC (permalink / raw)
  To: Elijah Newren; +Cc: git, larsxschneider, sandals, me, jrnieder

On Sat, Nov 10, 2018 at 10:23:03PM -0800, Elijah Newren wrote:

> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>  Documentation/git-fast-import.txt | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
> index e81117d27f..7ab97745a6 100644
> --- a/Documentation/git-fast-import.txt
> +++ b/Documentation/git-fast-import.txt
> @@ -40,9 +40,10 @@ OPTIONS
>  	not contain the old commit).
>  
>  --quiet::
> -	Disable all non-fatal output, making fast-import silent when it
> -	is successful.  This option disables the output shown by
> -	--stats.
> +	Disable the output shown by --stats, making fast-import usually
> +	be silent when it is successful.  However, if the import stream
> +	has directives intended to show user output (e.g. `progress`
> +	directives), the corresponding messages will still be shown.

Makes sense. I think one could argue that it should disable those
messages, too, but probably the right answer is that the export side
should be told to be `--quiet` as well.

-Peff

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 02/10] git-fast-export.txt: clarify misleading documentation about rev-list args
  2018-11-11  6:23         ` [PATCH 02/10] git-fast-export.txt: clarify misleading documentation about rev-list args Elijah Newren
@ 2018-11-11  6:36           ` Jeff King
  2018-11-11  7:17             ` Elijah Newren
  0 siblings, 1 reply; 90+ messages in thread
From: Jeff King @ 2018-11-11  6:36 UTC (permalink / raw)
  To: Elijah Newren; +Cc: git, larsxschneider, sandals, me, jrnieder

On Sat, Nov 10, 2018 at 10:23:04PM -0800, Elijah Newren wrote:

> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>  Documentation/git-fast-export.txt | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
> index ce954be532..677510b7f7 100644
> --- a/Documentation/git-fast-export.txt
> +++ b/Documentation/git-fast-export.txt
> @@ -119,7 +119,8 @@ marks the same across runs.
>  	'git rev-list', that specifies the specific objects and references
>  	to export.  For example, `master~10..master` causes the
>  	current master reference to be exported along with all objects
> -	added since its 10th ancestor commit.
> +	added since its 10th ancestor commit and all files common to
> +	master\~9 and master~10.

Do you need to backslash the second tilde?  Maybe `master~9` and
`master~10` instead of escaping?

I'm not sure what this is trying to say. I guess that we'd always show
all of the blobs necessary to reconstruct the first non-negative commit
(i.e., `master~9` here)?

-Peff

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 03/10] fast-export: use value from correct enum
  2018-11-11  6:23         ` [PATCH 03/10] fast-export: use value from correct enum Elijah Newren
@ 2018-11-11  6:36           ` Jeff King
  2018-11-11 20:10             ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 90+ messages in thread
From: Jeff King @ 2018-11-11  6:36 UTC (permalink / raw)
  To: Elijah Newren; +Cc: git, larsxschneider, sandals, me, jrnieder

On Sat, Nov 10, 2018 at 10:23:05PM -0800, Elijah Newren wrote:

> ABORT and ERROR happen to have the same value, but come from different
> enums.  Use the one from the correct enum.

Yikes. :)

This is a good argument for naming these SIGNED_TAG_ABORT, etc. But this
is obviously an improvement in the meantime.

-Peff

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 04/10] fast-export: avoid dying when filtering by paths and old tags exist
  2018-11-11  6:23         ` [PATCH 04/10] fast-export: avoid dying when filtering by paths and old tags exist Elijah Newren
@ 2018-11-11  6:44           ` Jeff King
  2018-11-11  7:38             ` Elijah Newren
  2018-11-12 22:50             ` brian m. carlson
  0 siblings, 2 replies; 90+ messages in thread
From: Jeff King @ 2018-11-11  6:44 UTC (permalink / raw)
  To: Elijah Newren; +Cc: git, larsxschneider, sandals, me, jrnieder

On Sat, Nov 10, 2018 at 10:23:06PM -0800, Elijah Newren wrote:

> If --tag-of-filtered-object=rewrite is specified along with a set of
> paths to limit what is exported, then any tags pointing to old commits
> that do not contain any of those specified paths cause problems.  Since
> the old tagged commit is not exported, fast-export attempts to rewrite
> such tags to an ancestor commit which was exported.  If no such commit
> exists, then fast-export currently die()s.  Five years after the tag
> rewriting logic was added to fast-export (see commit 2d8ad4691921,
> "fast-export: Add a --tag-of-filtered-object  option for newly dangling
> tags", 2009-06-25), fast-import gained the ability to delete refs (see
> commit 4ee1b225b99f, "fast-import: add support to delete refs",
> 2014-04-20), so now we do have a valid option to rewrite the tag to.
> Delete these tags instead of dying.

Hmm. That's the right thing to do if we're considering the export to be
an independent unit. But what if I'm just rewriting a portion of history
like:

  git fast-export HEAD~5..HEAD | some_filter | git fast-import

? If I have a tag pointing to HEAD~10, will this delete that? Ideally I
think it would be left alone.

> +test_expect_success 'rewrite tag predating pathspecs to nothing' '
> +	test_create_repo rewrite_tag_predating_pathspecs &&
> +	(
> +		cd rewrite_tag_predating_pathspecs &&
> +
> +		touch ignored &&

We usually prefer ">ignored" to create an empty file rather than
"touch".

> +		git add ignored &&
> +		test_commit initial &&

What do we need this "ignored" for? test_commit should create a file
"initial.t".

> +		echo foo >bar &&
> +		git add bar &&
> +		test_commit add-bar &&

Likewise, "test_commit bar" should work by itself (though note the
filename is "bar.t" in your fast-export command).

> +		git fast-export --tag-of-filtered-object=rewrite --all -- bar >output &&
> +		grep -A 1 refs/tags/v0.0.0.0.0.0.1 output | grep -E ^from.0{40}

I don't think "grep -A" is portable (and we don't seem to otherwise use
it). You can probably do something similar with sed.

Use $ZERO_OID instead of hard-coding 40, which future-proofs for the
hash transition (though I suppose the hash is not likely to get
_shorter_ ;) ).
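
A minimal sketch of the sed idea, runnable outside the test suite: the
stream fragment is made up for illustration, ZERO_OID is defined locally
as a stand-in for the variable the test library provides, and the helper
names sample_stream and line_after_tag are hypothetical.

```shell
# Assumption: an all-zero object id, standing in for the test suite's $ZERO_OID.
ZERO_OID=$(printf '%040d' 0)

# Hypothetical stream fragment: the tag is reset to the zero id (a deletion).
sample_stream () {
	printf 'reset refs/tags/v0.0.0.0.0.0.1\nfrom %s\n\n' "$ZERO_OID"
}

# Print only the line that follows the line naming the tag, which is
# what "grep -A 1 ... | grep ..." was approximating.
line_after_tag () {
	sed -n '\,refs/tags/v0.0.0.0.0.0.1,{
		n
		p
	}'
}

actual=$(sample_stream | line_after_tag)
test "$actual" = "from $ZERO_OID" && echo "tag was reset to the zero id"
```

In the real test the sed command would read the "output" file, and the
result would be compared against "from $ZERO_OID" with test_cmp.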

-Peff

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 05/10] fast-export: move commit rewriting logic into a function for reuse
  2018-11-11  6:23         ` [PATCH 05/10] fast-export: move commit rewriting logic into a function for reuse Elijah Newren
@ 2018-11-11  6:47           ` Jeff King
  0 siblings, 0 replies; 90+ messages in thread
From: Jeff King @ 2018-11-11  6:47 UTC (permalink / raw)
  To: Elijah Newren; +Cc: git, larsxschneider, sandals, me, jrnieder

On Sat, Nov 10, 2018 at 10:23:07PM -0800, Elijah Newren wrote:

> Logic to replace a filtered commit with an unfiltered ancestor is useful
> elsewhere; put it into a function we can call.

OK. I had to stare at it for a minute to make sure there was not an
edge case with looking at "p" versus "p->parents", but I think it is a
faithful conversion.

-Peff

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 06/10] fast-export: when using paths, avoid corrupt stream with non-existent mark
  2018-11-11  6:23         ` [PATCH 06/10] fast-export: when using paths, avoid corrupt stream with non-existent mark Elijah Newren
@ 2018-11-11  6:53           ` Jeff King
  2018-11-11  8:01             ` Elijah Newren
  0 siblings, 1 reply; 90+ messages in thread
From: Jeff King @ 2018-11-11  6:53 UTC (permalink / raw)
  To: Elijah Newren; +Cc: git, larsxschneider, sandals, me, jrnieder

On Sat, Nov 10, 2018 at 10:23:08PM -0800, Elijah Newren wrote:

> If file paths are specified to fast-export and multiple refs point to a
> commit that does not touch any of the relevant file paths, then
> fast-export can hit problems.  fast-export has a list of additional refs
> that it needs to explicitly set after exporting all blobs and commits,
> and when it tries to get_object_mark() on the relevant commit, it can
> get a mark of 0, i.e. "not found", because the commit in question did
> not touch the relevant paths and thus was not exported.  Trying to
> import a stream with a mark corresponding to an unexported object will
> cause fast-import to crash.
> 
> Avoid this problem by taking the commit the ref points to and finding an
> ancestor of it that was exported, and make the ref point to that commit
> instead.

As with the earlier tag commit, I wonder if this might depend on the
context in which you're using fast-export. I suppose that if you did not
feed the ref on the command line that we would not be dealing with it at
all (and maybe that is the answer to my question about the tag thing,
too).

It does seem funny that the behavior for the earlier case (bounded
commits) and this case (skipping some commits) are different. Would you
ever want to keep walking backwards to find an ancestor in the earlier
case? Or vice versa, would you ever want to simply delete a tag in a
case like this one?

I'm not sure, but I suspect you may have thought about it a lot
harder than I have. :)

> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> index a3c044b0af..5648a8ce9c 100644
> --- a/builtin/fast-export.c
> +++ b/builtin/fast-export.c
> @@ -900,7 +900,18 @@ static void handle_tags_and_duplicates(void)
>  			if (anonymize)
>  				name = anonymize_refname(name);
>  			/* create refs pointing to already seen commits */
> -			commit = (struct commit *)object;
> +			commit = rewrite_commit((struct commit *)object);
> +			if (!commit) {
> +				/*
> +				 * Neither this object nor any of its
> +				 * ancestors touch any relevant paths, so
> +				 * it has been filtered to nothing.  Delete
> +				 * it.
> +				 */
> +				printf("reset %s\nfrom %s\n\n",
> +				       name, sha1_to_hex(null_sha1));
> +				continue;
> +			}

This hunk makes sense.

> --- a/t/t9350-fast-export.sh
> +++ b/t/t9350-fast-export.sh
> @@ -386,6 +386,30 @@ test_expect_success 'path limiting with import-marks does not lose unmodified fi
>  	grep file0 actual
>  '
>  
> +test_expect_success 'avoid corrupt stream with non-existent mark' '
> +	test_create_repo avoid_non_existent_mark &&
> +	(
> +		cd avoid_non_existent_mark &&
> +
> +		touch important-path &&
> +		git add important-path &&
> +		test_commit initial &&
> +
> +		touch ignored &&
> +		git add ignored &&
> +		test_commit whatever &&
> +
> +		git branch A &&
> +		git branch B &&
> +
> +		echo foo >>important-path &&
> +		git add important-path &&
> +		test_commit more changes &&
> +
> +		git fast-export --all -- important-path | git fast-import --force
> +	)
> +'

Similar comments apply about "touch" and "test_commit" to what I wrote
for the earlier patch.

-Peff

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 07/10] fast-export: ensure we export requested refs
  2018-11-11  6:23         ` [PATCH 07/10] fast-export: ensure we export requested refs Elijah Newren
@ 2018-11-11  7:02           ` Jeff King
  2018-11-11  8:20             ` Elijah Newren
  0 siblings, 1 reply; 90+ messages in thread
From: Jeff King @ 2018-11-11  7:02 UTC (permalink / raw)
  To: Elijah Newren; +Cc: git, larsxschneider, sandals, me, jrnieder

On Sat, Nov 10, 2018 at 10:23:09PM -0800, Elijah Newren wrote:

> If file paths are specified to fast-export and a ref points to a commit
> that does not touch any of the relevant paths, then that ref would
> sometimes fail to be exported.  (This depends on whether any ancestors
> of the commit which do touch the relevant paths would be exported with
> that same ref name or a different ref name.)  To avoid this problem,
> put *all* specified refs into extra_refs to start, and then as we export
> each commit, remove the refname used in the 'commit $REFNAME' directive
> from extra_refs.  Then, in handle_tags_and_duplicates() we know which
> refs actually do need a manual reset directive in order to be included.
> 
> This means that we do need some special handling for excluded refs; e.g.
> if someone runs
>    git fast-export ^master master
> then they've asked for master to be exported, but they have also asked
> for the commit which master points to and all of its history to be
> excluded.  That logically means ref deletion.  Previously, such refs
> were just silently omitted from being exported despite having been
> explicitly requested for export.

Hmm. Reading this it makes sense to me, but I remember from discussion
long ago that there were a lot of funny corner cases around "which refs
to include" and possibly even some ambiguous cases. Maybe that is all
sorted these days, with --refspec.

> ---
> NOTE: I was hoping the strmap API proposal would materialize, but I either
> missed it or it hasn't shown up.  The usage of string_list in this patch
> would be better replaced by what Peff suggested.

You didn't miss it. Junio did some manual conversions using hashmap,
which weren't too bad.  It's not entirely clear to me how often we'd be
able to use strmap instead of a full-on hashmap, so I haven't really
pursued it.

It looks like you generate the list here via append, and then sort at
the end. That's at least not quadratic. I think the string_list_remove()
is, though.

-Peff

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 08/10] fast-export: add --reference-excluded-parents option
  2018-11-11  6:23         ` [PATCH 08/10] fast-export: add --reference-excluded-parents option Elijah Newren
@ 2018-11-11  7:11           ` Jeff King
  0 siblings, 0 replies; 90+ messages in thread
From: Jeff King @ 2018-11-11  7:11 UTC (permalink / raw)
  To: Elijah Newren; +Cc: git, larsxschneider, sandals, me, jrnieder

On Sat, Nov 10, 2018 at 10:23:10PM -0800, Elijah Newren wrote:

> git filter-branch has a nifty feature allowing you to rewrite, e.g. just
> the last 8 commits of a linear history
>   git filter-branch $OPTIONS HEAD~8..HEAD
> 
> If you try the same with git fast-export, you instead get a history of
> only 8 commits, with HEAD~7 being rewritten into a root commit.  There
> are two alternatives:

Ah, I think this maybe answers some of my earlier questions, too. You
cannot use fast-import as it stands to do a partial rewrite.

>   1) Don't use the negative revision specification, and when you're
>      filtering the output to make modifications to the last 8 commits,
>      just be careful to not modify any earlier commits somehow.
> 
>   2) First run 'git fast-export --export-marks=somefile HEAD~8', then
>      run 'git fast-export --import-marks=somefile HEAD~8..HEAD'.
> 
> Both are more error prone than I'd like (the first for obvious reasons;
> with the second option I have sometimes accidentally included too many
> revisions in the first command and then found that the corresponding
> extra revisions were not exported by the second command and thus were
> not modified as I expected).  Also, both are poor from a performance
> perspective.

Yeah, this should be O(commits you're touching), and the current code
does not allow that at all. So I think this feature makes a lot of sense
(it probably _should_ have been the default, but it's a bit late for
that now).

> @@ -638,13 +640,21 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
>  	unuse_commit_buffer(commit, commit_buffer);
>  
>  	for (i = 0, p = commit->parents; p; p = p->next) {
> -		int mark = get_object_mark(&p->item->object);
> -		if (!mark)
> +		struct object *obj = &p->item->object;
> +		int mark = get_object_mark(obj);
> +
> +		if (!mark && !reference_excluded_commits)
>  			continue;
>  		if (i == 0)
> -			printf("from :%d\n", mark);
> +			printf("from ");
> +		else
> +			printf("merge ");
> +		if (mark)
> +			printf(":%d\n", mark);
>  		else
> -			printf("merge :%d\n", mark);
> +			printf("%s\n", sha1_to_hex(anonymize ?
> +						   anonymize_sha1(&obj->oid) :
> +						   obj->oid.hash));
>  		i++;
>  	}

OK, so this just teaches us to start with the sensible "from" directive.
I think we might be able to do a little more optimization here. If we're
exporting HEAD^..HEAD and there's an object in HEAD^ which is unchanged
in HEAD, I think we'd still print it (because it would not be marked
SHOWN), but we could omit it (by walking the tree of the boundary
commits and marking them shown).

I don't think it's a blocker for what you're doing here, but just a
possible future optimization.

> @@ -925,13 +935,22 @@ static void handle_tags_and_duplicates(struct string_list *extras)
>  				/*
>  				 * Getting here means we have a commit which
>  				 * was excluded by a negative refspec (e.g.
> -				 * fast-export ^master master).  If the user
> +				 * fast-export ^master master).  If we are
> +				 * referencing excluded commits, set the ref
> +				 * to the exact commit.  Otherwise, the user
>  				 * wants the branch exported but every commit
> -				 * in its history to be deleted, that sounds
> -				 * like a ref deletion to me.
> +				 * in its history to be deleted, which basically
> +				 * just means deletion of the ref.
>  				 */
> -				printf("reset %s\nfrom %s\n\n",
> -				       name, sha1_to_hex(null_sha1));
> +				if (!reference_excluded_commits) {
> +					/* delete the ref */
> +					printf("reset %s\nfrom %s\n\n",
> +					       name, sha1_to_hex(null_sha1));
> +					continue;
> +				}
> +				/* set ref to commit using oid, not mark */
> +				printf("reset %s\nfrom %s\n\n", name,
> +				       sha1_to_hex(commit->object.oid.hash));

OK, and this is basically answering my earlier questions again: yes, you
_would_ want to keep old tags pointing at their commits. But only in
this much more sensible mode.

-Peff

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 02/10] git-fast-export.txt: clarify misleading documentation about rev-list args
  2018-11-11  6:36           ` Jeff King
@ 2018-11-11  7:17             ` Elijah Newren
  2018-11-13 23:25               ` Elijah Newren
  0 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-11  7:17 UTC (permalink / raw)
  To: Jeff King
  Cc: Git Mailing List, Lars Schneider, brian m. carlson, Taylor Blau,
	Jonathan Nieder

On Sat, Nov 10, 2018 at 10:36 PM Jeff King <peff@peff.net> wrote:
>
> On Sat, Nov 10, 2018 at 10:23:04PM -0800, Elijah Newren wrote:
>
> > Signed-off-by: Elijah Newren <newren@gmail.com>
> > ---
> >  Documentation/git-fast-export.txt | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
> > index ce954be532..677510b7f7 100644
> > --- a/Documentation/git-fast-export.txt
> > +++ b/Documentation/git-fast-export.txt
> > @@ -119,7 +119,8 @@ marks the same across runs.
> >       'git rev-list', that specifies the specific objects and references
> >       to export.  For example, `master~10..master` causes the
> >       current master reference to be exported along with all objects
> > -     added since its 10th ancestor commit.
> > +     added since its 10th ancestor commit and all files common to
> > +     master\~9 and master~10.
>
> Do you need to backslash the second tilde?  Maybe `master~9` and
> `master~10` instead of escaping?

Oops, yeah, that needs to be consistent.

> I'm not sure what this is trying to say. I guess that we'd always show
> all of the blobs necessary to reconstruct the first non-negative commit
> (i.e., `master~9` here)?

For someone familiar with fast-export or fast-import, sure, you'd
guess that it'd show all the blobs necessary to reconstruct the first
non-negative commit.  But it's not clear to first-time users and
readers of the docs that the first non-negative commit becomes a root
commit; by comparison, filter-branch suggests using a very similar
construction and yet behaves quite differently -- it does not turn the
first non-negative commit into a root but retains the original
parent(s) of the first non-negative commit without rewriting those
earlier commits.  The text as previously written, "along with all
objects added since its 10th ancestor commit", seems to suggest
behavior similar to how filter-branch behaves (particularly the
"Acked-by example"), i.e. it implies that files not touched in the
last 10 commits are not included.  My wording in this patch was an
attempt to fix that.  Was my attempt perhaps too clumsy, or was it
just the case that you had sufficient knowledge of fast-export that
the previous text didn't mislead you?
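
To illustrate the point about the first non-negative commit becoming a
root, here is a toy model (not fast-export's code; the function name and
the in-memory history are made up for illustration) of which file
changes end up in the stream for a bounded range:

```python
def exported_changes(trees, first):
    """Model the file changes an exporter emits for a bounded range.

    trees: list of dicts (path -> blob id), oldest commit first.
    first: index of the first commit inside the exported range.
    Returns one dict (path -> blob id, None for a deletion) per
    exported commit.
    """
    emitted = []
    for i in range(first, len(trees)):
        # The first exported commit has no exported parent, so it is
        # written as a root commit and must carry its whole tree.
        parent = {} if i == first else trees[i - 1]
        current = trees[i]
        changes = {p: b for p, b in current.items() if parent.get(p) != b}
        changes.update({p: None for p in parent if p not in current})
        emitted.append(changes)
    return emitted


history = [
    {"README": "b1"},                  # old commit, outside the range
    {"README": "b1", "main.c": "b2"},  # first commit in the range
    {"README": "b1", "main.c": "b3"},  # tip
]
print(exported_changes(history, 1))
```

In this model README is never touched inside the range, yet it is still
emitted with the first exported commit.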

* Re: [PATCH 09/10] fast-export: add a --show-original-ids option to show original names
  2018-11-11  6:23         ` [PATCH 09/10] fast-export: add a --show-original-ids option to show original names Elijah Newren
@ 2018-11-11  7:20           ` Jeff King
  2018-11-11  8:32             ` Elijah Newren
  0 siblings, 1 reply; 90+ messages in thread
From: Jeff King @ 2018-11-11  7:20 UTC (permalink / raw)
  To: Elijah Newren; +Cc: git, larsxschneider, sandals, me, jrnieder

On Sat, Nov 10, 2018 at 10:23:11PM -0800, Elijah Newren wrote:

> Knowing the original names (hashes) of commits, blobs, and tags can
> sometimes enable post-filtering that would otherwise be difficult or
> impossible.  In particular, the desire to rewrite commit messages which
> refer to other prior commits (on top of whatever other filtering is
> being done) is very difficult without knowing the original names of each
> commit.
> 
> This commit teaches a new --show-original-ids option to fast-export
> which will make it add an 'originally <hash>' line to blobs, commits, and
> tags.  It also teaches fast-import to parse (and ignore) such lines.

Makes sense as a feature; I think filter-branch can make its mappings
available, too.

Do we need to worry about compatibility with other fast-import programs?
I think no, because this is not enabled by default (so if sending the
extra lines to another importer hurts, the answer is "don't do that").

I have a vague feeling that there might be some way to combine this with
--export-marks or --no-data, but I can't really think of a way. They
seem related, but not quite.

> ---
>  Documentation/git-fast-export.txt |  7 +++++++
>  builtin/fast-export.c             | 20 +++++++++++++++-----
>  fast-import.c                     | 17 +++++++++++++++++
>  t/t9350-fast-export.sh            | 17 +++++++++++++++++
>  4 files changed, 56 insertions(+), 5 deletions(-)

The fast-import format is documented in Documentation/git-fast-import.txt.
It might need an update to cover the new format.

> --- a/Documentation/git-fast-export.txt
> +++ b/Documentation/git-fast-export.txt
> @@ -121,6 +121,13 @@ marks the same across runs.
>  	used by a repository which already contains the necessary
>  	parent commits.
>  
> +--show-original-ids::
> +	Add an extra directive to the output for commits and blobs,
> +	`originally <SHA1SUM>`.  While such directives will likely be
> +	ignored by importers such as git-fast-import, it may be useful
> +	for intermediary filters (e.g. for rewriting commit messages
> +	which refer to older commits, or for stripping blobs by id).

I'm not quite sure how a blob ends up being rewritten by fast-export (I
get that commits may change due to dropping parents).

The name "originally" doesn't seem great to me. Probably because I would
continually wonder if it has one "l" or two. ;) Perhaps something like
"original-oid" might be better. That's well into bikeshed territory,
though.

-Peff

* Re: [PATCH 10/10] fast-export: add --always-show-modify-after-rename
  2018-11-11  6:23         ` [PATCH 10/10] fast-export: add --always-show-modify-after-rename Elijah Newren
@ 2018-11-11  7:23           ` Jeff King
  2018-11-11  8:42             ` Elijah Newren
  0 siblings, 1 reply; 90+ messages in thread
From: Jeff King @ 2018-11-11  7:23 UTC (permalink / raw)
  To: Elijah Newren; +Cc: git, larsxschneider, sandals, me, jrnieder

On Sat, Nov 10, 2018 at 10:23:12PM -0800, Elijah Newren wrote:

> fast-export output is traditionally used as an input to a fast-import
> program, but it is also useful to help gather statistics about the
> history of a repository (particularly when --no-data is also passed).
> For example, two of the types of information we may want to collect
> could include:
>   1) general information about renames that have occurred
>   2) what the biggest objects in a repository are and what names
>      they appear under.
> 
> The first bit of information can be gathered by just passing -M to
> fast-export.  The second piece of information can partially be gotten
> from running
>     git cat-file --batch-check --batch-all-objects
> However, that only shows what the biggest objects in the repository are
> and their sizes, not what names those objects appear as or what commits
> they were introduced in.  We can get that information from fast-export,
> but when we only see
>     R oldname newname
> instead of
>     R oldname newname
>     M 100644 $SHA1 newname
> then it makes the job more difficult.  Add an option which allows us to
> force the latter output even when commits have exact renames of files.

fast-export seems like a funny tool to look up paths. What about "git
log --find-object=$SHA1" ?

-Peff

* Re: [PATCH 00/10] fast export and import fixes and features
  2018-11-11  6:23       ` [PATCH 00/10] fast export and import fixes and features Elijah Newren
                           ` (9 preceding siblings ...)
  2018-11-11  6:23         ` [PATCH 10/10] fast-export: add --always-show-modify-after-rename Elijah Newren
@ 2018-11-11  7:27         ` Jeff King
  2018-11-11  8:44           ` Elijah Newren
  2018-11-14  0:25         ` [PATCH v2 00/11] " Elijah Newren
  11 siblings, 1 reply; 90+ messages in thread
From: Jeff King @ 2018-11-11  7:27 UTC (permalink / raw)
  To: Elijah Newren; +Cc: git, larsxschneider, sandals, me, jrnieder

On Sat, Nov 10, 2018 at 10:23:02PM -0800, Elijah Newren wrote:

> This is a series of ten patches representing two doc corrections, one
> pedantic fix, three real bug fixes, one micro code refactor, and three
> new features.  Each of these ten changes is relatively small in size.
> These changes predominantly affect fast-export, but there's a couple
> small changes for fast-import as well.
> 
> I could potentially split these patches up, but I'd just end up
> chaining them sequentially since otherwise there'd be lots of
> conflicts; having 10 different single patch series with lots of
> dependencies sounded like a bigger pain to me, but let me know if you
> would prefer I split them up and how you suggest doing so.

I think it's fine to put them in sequence when there's a textual
dependency.  If it turns out that one of them needs more discussion and
we don't want it to hold later patches hostage, we can always re-roll at
that point.

(I also think it's fine to lump together thematically similar patches
even when they aren't strictly dependent, even textually. It's less work
for the maintainer to consider 1 group of 10 than 10 groups of 1).

> These patches were driven by the needs of git-repo-filter[1], but most
> if not all of them should be independently useful.

I left lots of comments. Some of the earlier ones may just be showing my
confusion about how fast-export works (some of which was cleared up by your
later patches). But I like the overall direction for sure.

-Peff

* Re: [PATCH 04/10] fast-export: avoid dying when filtering by paths and old tags exist
  2018-11-11  6:44           ` Jeff King
@ 2018-11-11  7:38             ` Elijah Newren
  2018-11-12 12:32               ` Jeff King
  2018-11-12 22:50             ` brian m. carlson
  1 sibling, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-11  7:38 UTC (permalink / raw)
  To: Jeff King
  Cc: Git Mailing List, Lars Schneider, brian m. carlson, Taylor Blau,
	Jonathan Nieder

On Sat, Nov 10, 2018 at 10:44 PM Jeff King <peff@peff.net> wrote:
>
> On Sat, Nov 10, 2018 at 10:23:06PM -0800, Elijah Newren wrote:
>
> > If --tag-of-filtered-object=rewrite is specified along with a set of
> > paths to limit what is exported, then any tags pointing to old commits
> > that do not contain any of those specified paths cause problems.  Since
> > the old tagged commit is not exported, fast-export attempts to rewrite
> > such tags to an ancestor commit which was exported.  If no such commit
> > exists, then fast-export currently die()s.  Five years after the tag
> > rewriting logic was added to fast-export (see commit 2d8ad4691921,
> > "fast-export: Add a --tag-of-filtered-object  option for newly dangling
> > tags", 2009-06-25), fast-import gained the ability to delete refs (see
> > commit 4ee1b225b99f, "fast-import: add support to delete refs",
> > 2014-04-20), so now we do have a valid option to rewrite the tag to.
> > Delete these tags instead of dying.
>
> Hmm. That's the right thing to do if we're considering the export to be
> an independent unit. But what if I'm just rewriting a portion of history
> like:
>
>   git fast-export HEAD~5..HEAD | some_filter | git fast-import
>
> ? If I have a tag pointing to HEAD~10, will this delete that? Ideally I
> think it would be left alone.

A couple things:
  * This code path only triggers in a very specific case: If a tag is
    requested for export but points to a commit which is filtered out by
    something else (e.g. path limiters and the commit in question didn't
    modify any of the relevant paths), AND the user explicitly specified
    --tag-of-filtered-object=rewrite (so that the tag in question can be
    rewritten to the nearest non-filtered ancestor).
  * You didn't specify to export any tags, only HEAD, so this
    situation isn't relevant (the tag wouldn't be exported or deleted).
  * You didn't specify --tag-of-filtered-object=rewrite, so this
    situation isn't relevant (even if you had specified a tag to filter,
    you'd get an abort instead).

But let's say you do modify the example some:
   git fast-export --tag-of-filtered-object=rewrite \
       --signed-tags=strip --tags master -- relatively_recent_subdirectory/ |
       some_filter | git fast-import

The user asked that all tags and master be exported but only for the
history that touched relatively_recent_subdirectory/, and if any tags
point at commits that are pruned by only asking for commits touching
relatively_recent_subdirectory/, then rewrite what those tags point to
so that they instead point to the nearest non-filtered ancestor.  What
about a commit like v0.1.0 that likely pre-dated the introduction of
relatively_recent_subdirectory/?  It has no nearest ancestor to
rewrite to.  The previous answer was to abort, which is really bad,
especially since the user was clearly asking us to do whatever smart
rewriting we can (--signed-tags=strip and
--tag-of-filtered-object=rewrite).

Perhaps there's a different answer that's workable as well, but this
one, in these circumstances, seemed the most reasonable to me.
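
To make the rewrite-or-delete choice concrete, here is a rough sketch
(not fast-export's actual code; the function name and the toy commit
graph are made up, and the breadth-first walk is a simplifying
assumption):

```python
from collections import deque


def rewrite_target(commit, parents, exported):
    """Return the nearest exported commit reachable from `commit`.

    parents: dict mapping a commit to the list of its parents.
    exported: set of commits that survived the path filtering.
    Returns None when neither the commit nor any ancestor was
    exported, which is the case where the tag would be deleted.
    """
    queue = deque([commit])
    seen = {commit}
    while queue:
        current = queue.popleft()
        if current in exported:
            return current
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return None


# "a" predates the subdirectory, "b" introduces it, "c" does not touch it.
parents = {"a": [], "b": ["a"], "c": ["b"]}
exported = {"b"}
print(rewrite_target("c", parents, exported))
print(rewrite_target("a", parents, exported))
```

A tag on "c" can move to "b", while a tag on "a" (the v0.1.0 case
above) has nowhere to go.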

> > +test_expect_success 'rewrite tag predating pathspecs to nothing' '
> > +     test_create_repo rewrite_tag_predating_pathspecs &&
> > +     (
> > +             cd rewrite_tag_predating_pathspecs &&
> > +
> > +             touch ignored &&
>
> We usually prefer ">ignored" to create an empty file rather than
> "touch".

Will fix.

>
> > +             git add ignored &&
> > +             test_commit initial &&
>
> What do we need this "ignored" for? test_commit should create a file
> "initial.t".

I think I originally had plain "git commit", then switched to
test_commit, then didn't recheck.  Thanks, will fix.

> > +             echo foo >bar &&
> > +             git add bar &&
> > +             test_commit add-bar &&
>
> Likewise, "test_commit bar" should work by itself (though note the
> filename is "bar.t" in your fast-export command).
>
> > +             git fast-export --tag-of-filtered-object=rewrite --all -- bar >output &&
> > +             grep -A 1 refs/tags/v0.0.0.0.0.0.1 output | grep -E ^from.0{40}
>
> I don't think "grep -A" is portable (and we don't seem to otherwise use
> it). You can probably do something similar with sed.
>
> Use $ZERO_OID instead of hard-coding 40, which future-proofs for the
> hash transition (though I suppose the hash is not likely to get
> _shorter_ ;) ).

Will fix these up as well...after waiting for more feedback on
possible alternate suggestions.

* Re: [PATCH 06/10] fast-export: when using paths, avoid corrupt stream with non-existent mark
  2018-11-11  6:53           ` Jeff King
@ 2018-11-11  8:01             ` Elijah Newren
  2018-11-12 12:45               ` Jeff King
  0 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-11  8:01 UTC (permalink / raw)
  To: Jeff King
  Cc: Git Mailing List, Lars Schneider, brian m. carlson, Taylor Blau,
	Jonathan Nieder

On Sat, Nov 10, 2018 at 10:53 PM Jeff King <peff@peff.net> wrote:
>
> On Sat, Nov 10, 2018 at 10:23:08PM -0800, Elijah Newren wrote:
>
> > If file paths are specified to fast-export and multiple refs point to a
> > commit that does not touch any of the relevant file paths, then
> > fast-export can hit problems.  fast-export has a list of additional refs
> > that it needs to explicitly set after exporting all blobs and commits,
> > and when it tries to get_object_mark() on the relevant commit, it can
> > get a mark of 0, i.e. "not found", because the commit in question did
> > not touch the relevant paths and thus was not exported.  Trying to
> > import a stream with a mark corresponding to an unexported object will
> > cause fast-import to crash.
> >
> > Avoid this problem by taking the commit the ref points to and finding an
> > ancestor of it that was exported, and make the ref point to that commit
> > instead.
>
> As with the earlier tag commit, I wonder if this might depend on the
> context in which you're using fast-export. I suppose that if you did not
> feed the ref on the command line that we would not be dealing with it at
> all (and maybe that is the answer to my question about the tag thing,
> too).

Right, if you didn't feed the ref on the command line, we're not
dealing with the ref at all, so the code here doesn't affect any such
ref.

> It does seem funny that the behavior for the earlier case (bounded
> commits) and this case (skipping some commits) are different. Would you
> ever want to keep walking backwards to find an ancestor in the earlier
> case? Or vice versa, would you ever want to simply delete a tag in a
> case like this one?
>
> I'm not sure, but I suspect you may have thought about it a lot
> harder than I have. :)

I'm not sure why you thought the behavior for the two cases was
different?  For both patches, my testcases used path limiting; it was
you who suggested employing a negative revision to bound the commits.

Anyway, for both patches assuming you haven't bounded the commits, you
can attempt to keep walking backwards to find an earlier ancestor, but
the fundamental fact is you aren't guaranteed that you can find one
(i.e. some tag or branch points to a commit that didn't modify any of
the specified paths, and nor did any of its ancestors back to any root
commits).  I hit that case lots of times.  If the user explicitly
requested a tag or branch for export (and requested tag rewriting),
and limited to certain paths that had never existed in the repository
as of the time of the tag or branch, then you hit the cases these
patches worry about.  Patch 4 was about (annotated and signed) tags,
this patch is about unannotated tags and branches and other refs.

If you think about using negative revisions, for both cases, then
again you can keep walking back history to try to find a commit that
your tag or branch or ref can point to, but if you get back to the
negative revisions, then you are in the range the user requested to be
omitted from the resulting repository.  Sounds like tag/ref deletion
to me.

>
> > diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> > index a3c044b0af..5648a8ce9c 100644
> > --- a/builtin/fast-export.c
> > +++ b/builtin/fast-export.c
> > @@ -900,7 +900,18 @@ static void handle_tags_and_duplicates(void)
> >                       if (anonymize)
> >                               name = anonymize_refname(name);
> >                       /* create refs pointing to already seen commits */
> > -                     commit = (struct commit *)object;
> > +                     commit = rewrite_commit((struct commit *)object);
> > +                     if (!commit) {
> > +                             /*
> > +                              * Neither this object nor any of its
> > +                              * ancestors touch any relevant paths, so
> > +                              * it has been filtered to nothing.  Delete
> > +                              * it.
> > +                              */
> > +                             printf("reset %s\nfrom %s\n\n",
> > +                                    name, sha1_to_hex(null_sha1));
> > +                             continue;
> > +                     }
>
> This hunk makes sense.

Cool, this was the entirety of the code...so does this mean that the
code makes more sense than my commit message summary did?  ...and
perhaps that my attempts to answer your questions in this email
weren't necessary anymore?
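
For readers who don't have the stream format in their head, the two
shapes of directive this hunk chooses between look roughly like the
following sketch (the helper name is hypothetical, and the 40-zero id
assumes SHA-1):

```python
NULL_OID = "0" * 40  # all-zero SHA-1; a longer hash would need more zeros


def ref_directive(refname, target_mark):
    """Build the stream text that points `refname` at a mark.

    target_mark is the integer mark of an already exported commit, or
    None when nothing was exported for this ref, in which case the
    all-zero id asks the importer to delete the ref.
    """
    if target_mark is None:
        return f"reset {refname}\nfrom {NULL_OID}\n\n"
    return f"reset {refname}\nfrom :{target_mark}\n\n"


print(ref_directive("refs/heads/A", 3), end="")
print(ref_directive("refs/tags/v0.1.0", None), end="")
```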

> > --- a/t/t9350-fast-export.sh
> > +++ b/t/t9350-fast-export.sh
> > @@ -386,6 +386,30 @@ test_expect_success 'path limiting with import-marks does not lose unmodified fi
> >       grep file0 actual
> >  '
> >
> > +test_expect_success 'avoid corrupt stream with non-existent mark' '
> > +     test_create_repo avoid_non_existent_mark &&
> > +     (
> > +             cd avoid_non_existent_mark &&
> > +
> > +             touch important-path &&
> > +             git add important-path &&
> > +             test_commit initial &&
> > +
> > +             touch ignored &&
> > +             git add ignored &&
> > +             test_commit whatever &&
> > +
> > +             git branch A &&
> > +             git branch B &&
> > +
> > +             echo foo >>important-path &&
> > +             git add important-path &&
> > +             test_commit more changes &&
> > +
> > +             git fast-export --all -- important-path | git fast-import --force
> > +     )
> > +'
>
> Similar comments apply about "touch" and "test_commit" to what I wrote
> for the earlier patch.

Thanks; will fix.

* Re: [PATCH 07/10] fast-export: ensure we export requested refs
  2018-11-11  7:02           ` Jeff King
@ 2018-11-11  8:20             ` Elijah Newren
  0 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-11  8:20 UTC (permalink / raw)
  To: Jeff King
  Cc: Git Mailing List, Lars Schneider, brian m. carlson, Taylor Blau,
	Jonathan Nieder

On Sat, Nov 10, 2018 at 11:02 PM Jeff King <peff@peff.net> wrote:
>
> On Sat, Nov 10, 2018 at 10:23:09PM -0800, Elijah Newren wrote:
>
> > If file paths are specified to fast-export and a ref points to a commit
> > that does not touch any of the relevant paths, then that ref would
> > sometimes fail to be exported.  (This depends on whether any ancestors
> > of the commit which do touch the relevant paths would be exported with
> > that same ref name or a different ref name.)  To avoid this problem,
> > put *all* specified refs into extra_refs to start, and then as we export
> > each commit, remove the refname used in the 'commit $REFNAME' directive
> > from extra_refs.  Then, in handle_tags_and_duplicates() we know which
> > refs actually do need a manual reset directive in order to be included.
> >
> > This means that we do need some special handling for excluded refs; e.g.
> > if someone runs
> >    git fast-export ^master master
> > then they've asked for master to be exported, but they have also asked
> > for the commit which master points to and all of its history to be
> > excluded.  That logically means ref deletion.  Previously, such refs
> > were just silently omitted from being exported despite having been
> > explicitly requested for export.
>
> Hmm. Reading this it makes sense to me, but I remember from discussion
> long ago that there were a lot of funny corner cases around "which refs
> to include" and possibly even some ambiguous cases. Maybe that is all
> sorted these days, with --refspec.

Oh yeah, there definitely were some funny corner cases around "which
refs to include" (though I don't think --refspec affects this, either
before or after my patch.)  Before this commit, fast-export would
often emit unnecessary reset directives at the end, AND fail to export
some other refs that had been explicitly requested for export.  It had
some simple logic to attempt to cover the cases, but it was just
wrong.  As far as I can tell, this patch fixes all of those.

...well, almost all.  We still fail on tags of tags of commits (or
higher level nestings), but that's a multi-pronged issue that feels
like a different beast. (We rewrite tags of tags of commits to just be
tags of commits, even without any special request from the user,
somewhat contrary to otherwise requiring --signed-tags and
--tag-of-filtered-object options.  As far as I can tell, this isn't
documented for fast-export but I saw somewhere in the filter-branch
docs where it said it does this kind of thing on purpose.  However, to
make it even weirder, if the user requests
--tag-of-filtered-object=rewrite instead of the default of "abort"
then we actually abort on tags-of-tags-of-commits instead of
rewriting.  I don't think it was intentional, but
tags-of-tags-of-commits inverts the meaning of the
--tag-of-filtered-object={rewrite vs. abort} flag -- it's very weird).
I put more time into attempting to fix the nested tags issue than I
feel it was worth.  git.git is the only repo I know of that seems
to have such tags, so I just gave up on them for now.

> > ---
> > NOTE: I was hoping the strmap API proposal would materialize, but I either
> > missed it or it hasn't shown up.  The usage of string_list in this patch
> > would be better replaced by what Peff suggested.
>
> You didn't miss it. Junio did some manual conversions using hashmap,
> which weren't too bad.  It's not entirely clear to me how often we'd be
> able to use strmap instead of a full-on hashmap, so I haven't really
> pursued it.
>
> It looks like you generate the list here via append, and then sort at
> the end. That's at least not quadratic. I think the string_list_remove()
> is, though.

I think it would have been useful in multiple places in
merge-recursive.c, in addition to here.  Maybe that just means I need
to add strmap to my list of things to do.
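
As a rough sketch of the extra_refs bookkeeping from the commit message,
with a hash-based container standing in for string_list (names are
illustrative; this is not the patch's C code):

```python
def refs_needing_reset(requested_refs, commit_directive_refs):
    """Return the requested refs that still need a manual reset.

    requested_refs: refs named on the command line, in order.
    commit_directive_refs: the ref name used by each emitted
    'commit <ref>' directive, in stream order.
    """
    # dict.fromkeys keeps insertion order and gives constant-time
    # removal, unlike removing from a sorted list.
    extra_refs = dict.fromkeys(requested_refs)
    for ref in commit_directive_refs:
        extra_refs.pop(ref, None)
    return list(extra_refs)


requested = ["refs/heads/master", "refs/heads/A", "refs/heads/B"]
used = ["refs/heads/master", "refs/heads/master", "refs/heads/master"]
print(refs_needing_reset(requested, used))
```

Here every commit was written under master, so A and B are the refs
left over for handle_tags_and_duplicates() to reset.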

* Re: [PATCH 09/10] fast-export: add a --show-original-ids option to show original names
  2018-11-11  7:20           ` Jeff King
@ 2018-11-11  8:32             ` Elijah Newren
  2018-11-12 12:53               ` Jeff King
  0 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-11  8:32 UTC (permalink / raw)
  To: Jeff King
  Cc: Git Mailing List, Lars Schneider, brian m. carlson, Taylor Blau,
	Jonathan Nieder

On Sat, Nov 10, 2018 at 11:20 PM Jeff King <peff@peff.net> wrote:
>
> On Sat, Nov 10, 2018 at 10:23:11PM -0800, Elijah Newren wrote:
>
> > Knowing the original names (hashes) of commits, blobs, and tags can
> > sometimes enable post-filtering that would otherwise be difficult or
> > impossible.  In particular, the desire to rewrite commit messages which
> > refer to other prior commits (on top of whatever other filtering is
> > being done) is very difficult without knowing the original names of each
> > commit.
> >
> > This commit teaches a new --show-original-ids option to fast-export
> > which will make it add an 'originally <hash>' line to blobs, commits, and
> > tags.  It also teaches fast-import to parse (and ignore) such lines.
>
> Makes sense as a feature; I think filter-branch can make its mappings
> available, too.
>
> Do we need to worry about compatibility with other fast-import programs?
> I think no, because this is not enabled by default (so if sending the
> extra lines to another importer hurts, the answer is "don't do that").
>
> I have a vague feeling that there might be some way to combine this with
> --export-marks or --no-data, but I can't really think of a way. They
> seem related, but not quite.
>
> > ---
> >  Documentation/git-fast-export.txt |  7 +++++++
> >  builtin/fast-export.c             | 20 +++++++++++++++-----
> >  fast-import.c                     | 17 +++++++++++++++++
> >  t/t9350-fast-export.sh            | 17 +++++++++++++++++
> >  4 files changed, 56 insertions(+), 5 deletions(-)
>
> The fast-import format is documented in Documentation/git-fast-import.txt.
> It might need an update to cover the new format.

We document the format in both fast-import.c and
Documentation/git-fast-import.txt?  Maybe we should delete the long
comments in fast-import.c so this isn't duplicated?

> > --- a/Documentation/git-fast-export.txt
> > +++ b/Documentation/git-fast-export.txt
> > @@ -121,6 +121,13 @@ marks the same across runs.
> >       used by a repository which already contains the necessary
> >       parent commits.
> >
> > +--show-original-ids::
> > +     Add an extra directive to the output for commits and blobs,
> > +     `originally <SHA1SUM>`.  While such directives will likely be
> > +     ignored by importers such as git-fast-import, it may be useful
> > +     for intermediary filters (e.g. for rewriting commit messages
> > +     which refer to older commits, or for stripping blobs by id).
>
> I'm not quite sure how a blob ends up being rewritten by fast-export (I
> get that commits may change due to dropping parents).

It doesn't get rewritten by fast-export; it gets rewritten by other
intermediary filters, e.g. in something like this:

   git fast-export --show-original-ids --all | intermediary_filter |
       git fast-import

The intermediary_filter program may want to strip out blobs by id, or
remove filemodify and filedelete directives unless they touch certain
paths, etc.
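
A minimal sketch of what such an intermediary_filter might look like for
the "strip blobs by id" case follows.  Everything here is an assumption
made for illustration: the function name, the directive spelling taken
from this version of the patch, blobs carrying a mark and appearing
before the commits that use them, and data always using the
'data <byte count>' form.

```python
import io


def strip_blobs(src, dst, unwanted_ids, directive=b"originally"):
    """Copy a fast-export stream, dropping blobs with unwanted ids.

    Filemodify lines that refer to a dropped blob's mark are dropped
    as well.  src and dst are binary file objects; unwanted_ids is a
    set of hex ids as bytes.
    """
    dropped_marks = set()
    line = src.readline()
    while line:
        if line == b"blob\n":
            block, mark, drop = [line], None, False
            while not line.startswith(b"data "):
                line = src.readline()
                if not line:
                    raise ValueError("stream ended inside a blob")
                block.append(line)
                if line.startswith(b"mark "):
                    mark = line.split()[1]
                elif line.startswith(directive + b" "):
                    drop = line.split()[1] in unwanted_ids
            block.append(src.read(int(line.split()[1])))
            line = src.readline()
            if line == b"\n":  # optional LF that closes the data
                block.append(line)
                line = src.readline()
            if drop:
                dropped_marks.add(mark)
            else:
                dst.write(b"".join(block))
            continue
        if line.startswith(b"data "):
            # Copy commit and tag messages verbatim; never parse them.
            dst.write(line)
            dst.write(src.read(int(line.split()[1])))
        elif line.startswith(b"M ") and line.split(b" ", 3)[2] in dropped_marks:
            pass  # filemodify pointing at a dropped blob
        else:
            dst.write(line)
        line = src.readline()


stream = (
    b"blob\nmark :1\noriginally aaaa\ndata 4\nbig\n\n"
    b"blob\nmark :2\noriginally bbbb\ndata 3\nok\n\n"
    b"commit refs/heads/master\nmark :3\n"
    b"committer A <a@example.com> 0 +0000\ndata 4\nmsg\n"
    b"M 100644 :1 big.bin\nM 100644 :2 ok.txt\n\n"
)
result = io.BytesIO()
strip_blobs(io.BytesIO(stream), result, {b"aaaa"})
out = result.getvalue()
print(out.decode())
```

Without the original ids in the stream, the filter would have no way to
tell which mark corresponds to the blob it wants to drop.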

> The name "originally" doesn't seem great to me. Probably because I would
> continually wonder if it has one "l" or two. ;) Perhaps something like
> "original-oid" might be better. That's well into bikeshed territory,
> though.

I wasn't a huge fan of "originally" either, but I just couldn't come
up with anything else that wasn't really long.  I'd be happy to switch
to original-oid.

* Re: [PATCH 10/10] fast-export: add --always-show-modify-after-rename
  2018-11-11  7:23           ` Jeff King
@ 2018-11-11  8:42             ` Elijah Newren
  2018-11-12 12:58               ` Jeff King
  0 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-11  8:42 UTC (permalink / raw)
  To: Jeff King
  Cc: Git Mailing List, Lars Schneider, brian m. carlson, Taylor Blau,
	Jonathan Nieder

On Sat, Nov 10, 2018 at 11:23 PM Jeff King <peff@peff.net> wrote:
>
> On Sat, Nov 10, 2018 at 10:23:12PM -0800, Elijah Newren wrote:
>
> > fast-export output is traditionally used as an input to a fast-import
> > program, but it is also useful to help gather statistics about the
> > history of a repository (particularly when --no-data is also passed).
> > For example, two of the types of information we may want to collect
> > could include:
> >   1) general information about renames that have occurred
> >   2) what the biggest objects in a repository are and what names
> >      they appear under.
> >
> > The first bit of information can be gathered by just passing -M to
> > fast-export.  The second piece of information can partially be gotten
> > from running
> >     git cat-file --batch-check --batch-all-objects
> > However, that only shows what the biggest objects in the repository are
> > and their sizes, not what names those objects appear as or what commits
> > they were introduced in.  We can get that information from fast-export,
> > but when we only see
> >     R oldname newname
> > instead of
> >     R oldname newname
> >     M 100644 $SHA1 newname
> > then it makes the job more difficult.  Add an option which allows us to
> > force the latter output even when commits have exact renames of files.
>
> fast-export seems like a funny tool to look up paths. What about "git
> log --find-object=$SHA1" ?

Eek, and give me O(N*M) behavior, where N is the number of commits in
the repository and M is the number of renames that occur in its
history?  Also, that's the inverse of the lookup I need anyway (I have
the commit and filename, but am missing the SHA).

One of the problems with filter-branch that people often run into is
they know what they want at a high-level (e.g. extract the history of
this directory for a new repository, or rewrite the history of this
repo to appear at a subdirectory so it can be merged into a bigger
repo and people passing filenames to log will still get the history of
those files, or I want to remove some of the big stuff in my history),
but often times that's not quite enough.  They need help finding big
objects, or may be unaware that the subset of files they want used to
be known by alternative names.

I want a simple --analyze mode that can report on all files that have
been renamed (so users don't just say "all I care about is these N
files, give me a rewritten history just including those" -- we can
point out to them whether those N files used to be known by other
names), as well as reporting on all big files and if they've been
deleted, and aggregations of the "big files" information across
directories and file extensions.
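
The rename part of that report could start from something like the
sketch below.  It is only an illustration: the function names are made
up, it assumes the command lines of a `fast-export --no-data -M` stream
in oldest-first order, and it ignores quoted paths and paths containing
spaces.

```python
def collect_renames(lines):
    """Collect (old, new) pairs from 'R old new' filerename lines.

    Feed it only command lines, not commit message text, since a
    message line could also happen to start with 'R '.
    """
    renames = []
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) == 3 and parts[0] == "R":
            renames.append((parts[1], parts[2]))
    return renames


def names_ever_used(path, renames):
    """Follow renames backwards to list every earlier name of `path`."""
    names = [path]
    for old, new in reversed(renames):
        if new == names[-1]:
            names.append(old)
    return names


lines = [
    "commit refs/heads/master\n",
    "R a.c b.c\n",
    "M 100644 deadbeef b.c\n",
    "R b.c c.c\n",
]
renames = collect_renames(lines)
print(names_ever_used("c.c", renames))
```

A user who asks to keep only c.c could then be told that the same file
was previously known as b.c and a.c.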

* Re: [PATCH 00/10] fast export and import fixes and features
  2018-11-11  7:27         ` [PATCH 00/10] fast export and import fixes and features Jeff King
@ 2018-11-11  8:44           ` Elijah Newren
  2018-11-12 13:00             ` Jeff King
  0 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-11  8:44 UTC (permalink / raw)
  To: Jeff King
  Cc: Git Mailing List, Lars Schneider, brian m. carlson, Taylor Blau,
	Jonathan Nieder

On Sat, Nov 10, 2018 at 11:27 PM Jeff King <peff@peff.net> wrote:
>
> On Sat, Nov 10, 2018 at 10:23:02PM -0800, Elijah Newren wrote:
>
> > This is a series of ten patches representing two doc corrections, one
> > pedantic fix, three real bug fixes, one micro code refactor, and three
> > new features.  Each of these ten changes is relatively small in size.
> > These changes predominantly affect fast-export, but there's a couple
> > small changes for fast-import as well.
> >
> > I could potentially split these patches up, but I'd just end up
> > chaining them sequentially since otherwise there'd be lots of
> > conflicts; having 10 different single patch series with lots of
> > dependencies sounded like a bigger pain to me, but let me know if you
> > would prefer I split them up and how you suggest doing so.
>
> I think it's fine to put them in sequence when there's a textual
> dependency.  If it turns out that one of them needs more discussion and
> we don't want it to hold later patches hostage, we can always re-roll at
> that point.
>
> (I also think it's fine to lump together thematically similar patches
> even when they aren't strictly dependent, even textually. It's less work
> for the maintainer to consider 1 group of 10 than 10 groups of 1).
>
> > These patches were driven by the needs of git-repo-filter[1], but most
> > if not all of them should be independently useful.
>
> I left lots of comments. Some of the earlier ones may just be showing my
> confusion about how fast-export works (some of which was cleared up by your
> later patches). But I like the overall direction for sure.

Thanks for taking the time to read over the series and providing lots
of feedback!  And, whoops, looks like it's gotten kinda late, so I'll
check any further feedback on Monday.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 03/10] fast-export: use value from correct enum
  2018-11-11  6:36           ` Jeff King
@ 2018-11-11 20:10             ` Ævar Arnfjörð Bjarmason
  2018-11-12  9:12               ` Ævar Arnfjörð Bjarmason
  2018-11-12 11:31               ` Jeff King
  0 siblings, 2 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-11-11 20:10 UTC (permalink / raw)
  To: Jeff King; +Cc: Elijah Newren, git, larsxschneider, sandals, me, jrnieder


On Sun, Nov 11 2018, Jeff King wrote:

> On Sat, Nov 10, 2018 at 10:23:05PM -0800, Elijah Newren wrote:
>
>> ABORT and ERROR happen to have the same value, but come from different
>> enums.  Use the one from the correct enum.
>
> Yikes. :)
>
> This is a good argument for naming these SIGNED_TAG_ABORT, etc. But this
> is obviously an improvement in the meantime.

In C enum values aren't the types of the enum, but I'd thought someone
would have added a warning for this:

    #include <stdio.h>

    enum { A, B } foo = A;
    enum { C, D } bar = C;

    int main(void)
    {
        switch (foo) {
          case C:
            puts("A");
            break;
          case B:
            puts("B");
            break;
        }
    }

But none of the 4 C compilers (gcc, clang, suncc & xlc) I have warn
about it. Good to know.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 03/10] fast-export: use value from correct enum
  2018-11-11 20:10             ` Ævar Arnfjörð Bjarmason
@ 2018-11-12  9:12               ` Ævar Arnfjörð Bjarmason
  2018-11-12 11:31               ` Jeff King
  1 sibling, 0 replies; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-11-12  9:12 UTC (permalink / raw)
  To: Jeff King
  Cc: Elijah Newren, Git Mailing List, Lars Schneider, brian m. carlson,
	Taylor Blau, Jonathan Nieder

On Sun, Nov 11, 2018 at 9:10 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Sun, Nov 11 2018, Jeff King wrote:
>
> > On Sat, Nov 10, 2018 at 10:23:05PM -0800, Elijah Newren wrote:
> >
> >> ABORT and ERROR happen to have the same value, but come from different
> >> enums.  Use the one from the correct enum.
> >
> > Yikes. :)
> >
> > This is a good argument for naming these SIGNED_TAG_ABORT, etc. But this
> > is obviously an improvement in the meantime.
>
> In C enum values aren't the types of the enum, but I'd thought someone
> would have added a warning for this:
>
>     #include <stdio.h>
>
>     enum { A, B } foo = A;
>     enum { C, D } bar = C;
>
>     int main(void)
>     {
>         switch (foo) {
>           case C:
>             puts("A");
>             break;
>           case B:
>             puts("B");
>             break;
>         }
>     }
>
> But none of the 4 C compilers (gcc, clang, suncc & xlc) I have warn
> about it. Good to know.

Asked GCC to implement it: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87983

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: Import/Export as a fast way to purge files from Git?
  2018-11-01  7:12     ` Elijah Newren
  2018-11-11  6:23       ` [PATCH 00/10] fast export and import fixes and features Elijah Newren
@ 2018-11-12  9:17       ` Ævar Arnfjörð Bjarmason
  2018-11-12 15:34         ` Elijah Newren
  1 sibling, 1 reply; 90+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-11-12  9:17 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Lars Schneider, Git Mailing List, Jeff King, Taylor Blau,
	brian m. carlson


On Thu, Nov 01 2018, Elijah Newren wrote:

> On Wed, Oct 31, 2018 at 12:16 PM Lars Schneider
> <larsxschneider@gmail.com> wrote:
>> > On Sep 24, 2018, at 7:24 PM, Elijah Newren <newren@gmail.com> wrote:
>> > On Sun, Sep 23, 2018 at 6:08 AM Lars Schneider <larsxschneider@gmail.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I recently had to purge files from large Git repos (many files, many commits).
>> >> The usual recommendation is to use `git filter-branch --index-filter` to purge
>> >> files. However, this is *very* slow for large repos (e.g. it takes 45min to
>> >> remove the `builtin` directory from git core). I realized that I can remove
>> >> files *way* faster by exporting the repo, removing the file references,
>> >> and then importing the repo (see Perl script below, it takes ~30sec to remove
>> >> the `builtin` directory from git core). Do you see any problem with this
>> >> approach?
>> >
>> > It looks like others have pointed you at other tools, and you're
>> > already shifting to that route.  But I think it's a useful question to
>> > answer more generally, so for those that are really curious...
>> >
>> >
>> > The basic approach is fine, though if you try to extend it much you
>> > can run into a few possible edge/corner cases (more on that below).
>> > I've been using this basic approach for years and even created a
>> > mini-python library[1] designed specifically to allow people to create
>> > "fast-filters", used as
>> >   git fast-export <options> | your-fast-filter | git fast-import <options>
>> >
>> > But that library didn't really take off; even I have rarely used it,
>> > often opting for filter-branch despite its horrible performance or a
>> > simple fast-export | long-sed-command | fast-import (with some extra
>> > pre-checking to make sure the sed wouldn't unintentionally munge other
>> > data).  BFG is great, as long as you're only interested in removing a
>> > few big items, but otherwise doesn't seem very useful (to be fair,
>> > it's very upfront about only wanting to solve that problem).
>> > Recently, due to continuing questions on filter-branch and folks still
>> > getting confused with it, I looked at existing tools, decided I didn't
>> > think any quite fit, and started looking into converting
>> > git_fast_filter into a filter-branch-like tool instead of just a
>> > library.  Found some bugs and missing features in fast-export along the
>> > way (and have some patches I still need to send in).  But I kind of
>> > got stuck -- if the tool is in python, will that limit adoption too
>> > much?  It'd be kind of nice to have this tool in core git.  But I kind
>> > of like leaving open the possibility of using it as a tool _or_ as a
>> > library, the latter for the special cases where case-specific
>> > programmatic filtering is needed.  But a developer-convenience library
>> > makes almost no sense unless in a higher level language, such as
>> > python.  I'm still trying to make up my mind about what I want (and
>> > what others might want), and have been kind of blocking on that.  (If
>> > others have opinions, I'm all ears.)
>>
>> That library sounds like a very interesting idea. Unfortunately, the
>> referenced repo seems not to be available anymore:
>>     git://gitorious.org/git_fast_filter/mainline.git
>
> Yeah, gitorious went down at a time when I was busy with enough other
> things that I never bothered moving my repos to a new hosting site.
> Sorry about that.
>
> I've got a copy locally, but I've been editing it heavily, without the
> testing I should have in place, so I hesitate to point you at it right
> now.  (Also, the old version failed to handle things like --no-data
> output, which is important.)  I'll post an updated copy soon; feel
> free to ping me in a week if you haven't heard anything yet.
>
>> I very much like Python. However, more recently I started to
>> write Git tools in Perl as they work out of the box on every
>> machine with Git installed ... and I think Perl can be quite
>> readable if no shortcuts are used :-).
>
> Yeah, when portability matters, perl makes sense.  I thought about
> switching it over, but I'm not sure I want to rewrite 1-2k lines of
> code.  Especially since repo-filtering tools are kind of one-shot by
> nature, and only need to be done by one person of a team, on one
> specific machine, and won't affect daily development thereafter.
> (Also, since I don't depend on any libraries and use only stuff from
> the default python library, it ought to be relatively portable
> anyway.)

FWIW I'd be very happy to have this tool itself included in git.git
if/when it's stable / useful enough, and as you point out the language
doesn't really matter as much as what features it exposes.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 03/10] fast-export: use value from correct enum
  2018-11-11 20:10             ` Ævar Arnfjörð Bjarmason
  2018-11-12  9:12               ` Ævar Arnfjörð Bjarmason
@ 2018-11-12 11:31               ` Jeff King
  1 sibling, 0 replies; 90+ messages in thread
From: Jeff King @ 2018-11-12 11:31 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Elijah Newren, git, larsxschneider, sandals, me, jrnieder

On Sun, Nov 11, 2018 at 09:10:17PM +0100, Ævar Arnfjörð Bjarmason wrote:

> > This is a good argument for naming these SIGNED_TAG_ABORT, etc. But this
> > is obviously an improvement in the meantime.
> 
> In C enum values aren't the types of the enum, but I'd thought someone
> would have added a warning for this:
> 
>     #include <stdio.h>
> 
>     enum { A, B } foo = A;
>     enum { C, D } bar = C;
> 
>     int main(void)
>     {
>         switch (foo) {
>           case C:
>             puts("A");
>             break;
>           case B:
>             puts("B");
>             break;
>         }
>     }
> 
> But none of the 4 C compilers (gcc, clang, suncc & xlc) I have warn
> about it. Good to know.

There is -Wenum-compare, but it does not seem to catch this (and is
enabled by -Wall). It (gcc, at least) does catch:

	enum foo { A, B };
	enum bar { C, D };

	int f(enum foo x)
	{
		return x == C;
	}

but converting that equality check to:

	switch (x) {
	case C:
		return 1;
	default:
		return 0;
	}

is not caught (which is essentially the same as your snippet). So I think the
bug / feature request is to have -Wenum-compare apply to switch
statements.

Clang has -Wenum-compare-switch, but I cannot seem to get it to complain
about even the "==" version using -Wenum-compare. Not sure if it's
buggy, or if I'm holding it wrong. This patch seems to be what we want:

  https://reviews.llvm.org/D36407

-Peff

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 04/10] fast-export: avoid dying when filtering by paths and old tags exist
  2018-11-11  7:38             ` Elijah Newren
@ 2018-11-12 12:32               ` Jeff King
  0 siblings, 0 replies; 90+ messages in thread
From: Jeff King @ 2018-11-12 12:32 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Lars Schneider, brian m. carlson, Taylor Blau,
	Jonathan Nieder

On Sat, Nov 10, 2018 at 11:38:45PM -0800, Elijah Newren wrote:

> > Hmm. That's the right thing to do if we're considering the export to be
> > an independent unit. But what if I'm just rewriting a portion of history
> > like:
> >
> >   git fast-export HEAD~5..HEAD | some_filter | git fast-import
> >
> > ? If I have a tag pointing to HEAD~10, will this delete that? Ideally I
> > think it would be left alone.
> 
> A couple things:
>   * This code path only triggers in a very specific case: If a tag is
> requested for export but points to a commit which is filtered out by
> something else (e.g. path limiters and the commit in question didn't
> modify any of the relevant paths), AND the user explicitly specified
> --tag-of-filtered-object=rewrite (so that the tag in question can be
> rewritten to the nearest non-filtered ancestor).

Right, I think this is the bit I was missing: somebody has to have
explicitly asked to export the tag. At which point the only sensible
thing to do is drop it.

-Peff

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 06/10] fast-export: when using paths, avoid corrupt stream with non-existent mark
  2018-11-11  8:01             ` Elijah Newren
@ 2018-11-12 12:45               ` Jeff King
  2018-11-12 15:36                 ` Elijah Newren
  0 siblings, 1 reply; 90+ messages in thread
From: Jeff King @ 2018-11-12 12:45 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Lars Schneider, brian m. carlson, Taylor Blau,
	Jonathan Nieder

On Sun, Nov 11, 2018 at 12:01:43AM -0800, Elijah Newren wrote:

> > It does seem funny that the behavior for the earlier case (bounded
> > commits) and this case (skipping some commits) are different. Would you
> > ever want to keep walking backwards to find an ancestor in the earlier
> > case? Or vice versa, would you ever want to simply delete a tag in a
> > case like this one?
> >
> > I'm not sure, but I suspect you may have thought about it a lot
> > harder than I have. :)
> 
> I'm not sure why you thought the behavior for the two cases was
> different?  For both patches, my testcases used path limiting; it was
> you who suggested employing a negative revision to bound the commits.

Sorry, I think I just got confused. I was thinking about the
documentation fixup you started with, which did regard bounded commits.
But that's not relevant here.

> Anyway, for both patches assuming you haven't bounded the commits, you
> can attempt to keep walking backwards to find an earlier ancestor, but
> the fundamental fact is you aren't guaranteed that you can find one
> (i.e. some tag or branch points to a commit that didn't modify any of
> the specified paths, and nor did any of its ancestors back to any root
> commits).  I hit that case lots of times.  If the user explicitly
> requested a tag or branch for export (and requested tag rewriting),
> and limited to certain paths that had never existed in the repository
> as of the time of the tag or branch, then you hit the cases these
> patches worry about.  Patch 4 was about (annotated and signed) tags,
> this patch is about unannotated tags and branches and other refs.

OK, that makes more sense.

So I guess my question is: in patch 4, why do we not walk back to find
an appropriate ancestor pointed to by the signed tag object, as we do
here for the unannotated case?

And I think the answer is: we already do that. It's just that the
unannotated case never learned the same trick. So basically it's:

  1. rewriting annotated tags to ancestors is already known on "master"

  2. patch 4 further teaches it to drop a tag when that fails

  3. patch 6 teaches both (1) and (2) to the unannotated code path,
     which knew neither

Is that right?
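
The three steps above can be modelled in a few lines. This is only a toy
sketch in Python, not fast-export's actual traversal: `rewrite_ref`,
`parents` and `kept` are hypothetical names, and "nearest" is taken to
mean a breadth-first walk over parents.

```python
import collections


def rewrite_ref(target, parents, kept):
    """Return the nearest kept ancestor of `target` (possibly itself).

    `parents` maps a commit id to a list of its parent ids and `kept` is
    the set of commits that survived path limiting.  Returning None means
    no ancestor survived, so the ref should be dropped.
    """
    queue = collections.deque([target])
    seen = set()
    while queue:
        commit = queue.popleft()
        if commit in seen:
            continue
        seen.add(commit)
        if commit in kept:
            return commit
        # Not kept: keep walking backwards through its parents.
        queue.extend(parents.get(commit, []))
    return None
```

A ref whose whole ancestry was filtered out yields None, which corresponds
to the "drop it" case rather than emitting a reference to a mark that was
never written.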

> > This hunk makes sense.
> 
> Cool, this was the entirety of the code...so does this mean that the
> code makes more sense than my commit message summary did?  ...and
> perhaps that my attempts to answer your questions in this email
> weren't necessary anymore?

No, it only made sense that the hunk implemented what you claimed in the
commit message. ;)

I think your responses did help me understand that what the commit
message is claiming is a good thing.

-Peff

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 09/10] fast-export: add a --show-original-ids option to show original names
  2018-11-11  8:32             ` Elijah Newren
@ 2018-11-12 12:53               ` Jeff King
  2018-11-12 15:46                 ` Elijah Newren
  0 siblings, 1 reply; 90+ messages in thread
From: Jeff King @ 2018-11-12 12:53 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Lars Schneider, brian m. carlson, Taylor Blau,
	Jonathan Nieder

On Sun, Nov 11, 2018 at 12:32:22AM -0800, Elijah Newren wrote:

> > >  Documentation/git-fast-export.txt |  7 +++++++
> > >  builtin/fast-export.c             | 20 +++++++++++++++-----
> > >  fast-import.c                     | 17 +++++++++++++++++
> > >  t/t9350-fast-export.sh            | 17 +++++++++++++++++
> > >  4 files changed, 56 insertions(+), 5 deletions(-)
> >
> > The fast-import format is documented in Documentation/git-fast-import.txt.
> > It might need an update to cover the new format.
> 
> We document the format in both fast-import.c and
> Documentation/git-fast-import.txt?  Maybe we should delete the long
> comments in fast-import.c so this isn't duplicated?

Yes, that is probably worth doing (see the comment at the top of
fast-import.c). Some information might need to be migrated.

If we're going to have just one spot, I think it needs to be the
user-facing documentation. This is a public interface that other people
are building compatible implementations for (including your new tool).

> > > +--show-original-ids::
> > > +     Add an extra directive to the output for commits and blobs,
> > > +     `originally <SHA1SUM>`.  While such directives will likely be
> > > +     ignored by importers such as git-fast-import, it may be useful
> > > +     for intermediary filters (e.g. for rewriting commit messages
> > > +     which refer to older commits, or for stripping blobs by id).
> >
> > I'm not quite sure how a blob ends up being rewritten by fast-export (I
> > get that commits may change due to dropping parents).
> 
> It doesn't get rewritten by fast-export; it gets rewritten by other
> intermediary filters, e.g. in something like this:
> 
>    git fast-export --show-original-ids --all | intermediary_filter |
> git fast-import
> 
> The intermediary_filter program may want to strip out blobs by id, or
> remove filemodify and filedelete directives unless they touch certain
> paths, etc.

OK, that matches my understanding. So why does fast-export need to print
the blob ids? If the intermediary is rewriting blobs, it can then
produce the "originally" line itself, can't it?

The more interesting case I guess is your "strip out blobs by id"
example. There the intermediary _could_ do so itself, but it would
require recomputing the object id of each blob.

If you use "--no-data", then this just works (we specify tree entries by
object id, rather than by mark). But I can see how it would be useful to
have the information even without "--no-data" (i.e., if you are doing
multiple kinds of rewrites on a single stream).
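
The "--no-data" case can be sketched as a small intermediary filter. This
is a minimal illustration in Python, not code from the series:
`strip_blobs` and `unwanted_ids` are hypothetical names, and it assumes
the exact-byte-count "data <n>" form and 40-hex object ids in "M" lines.

```python
import re

# "M <mode> <object id> <path>" as it appears when blobs are referenced
# by id instead of by mark (assumed here: 40 hex digits).
FILEMODIFY = re.compile(rb"^M [0-7]+ ([0-9a-f]{40}) ")
# Exact byte count form of the data command.
DATA = re.compile(rb"^data ([0-9]+)\n$")


def strip_blobs(stream_in, stream_out, unwanted_ids):
    """Copy a fast-export stream, dropping filemodify lines for unwanted blobs."""
    while True:
        line = stream_in.readline()
        if not line:
            break
        match = DATA.match(line)
        if match:
            # Copy the payload (e.g. a commit message) verbatim so that
            # text which merely looks like a command is never touched.
            stream_out.write(line)
            stream_out.write(stream_in.read(int(match.group(1))))
            continue
        match = FILEMODIFY.match(line)
        if match and match.group(1) in unwanted_ids:
            continue  # drop the reference to the unwanted blob
        stream_out.write(line)
```

Both streams are expected to be binary file objects, and `unwanted_ids`
a set of bytes object ids. Quoted paths and the delimited "data <<EOF"
form are deliberately not handled in this sketch.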

I think the thing that confused me is that this "originally" is doing
two things:

  - mentioning blob ids as an optimization / convenience for the reader

  - mentioning rewritten commit (and presumably tag?) ids that were
    rewritten as part of a partial history export. I suppose even trees
    could be rewritten that way, too, but fast-import doesn't generally
    consider trees to be a first-class item.

So I'm OK with it, but I wonder if there is an easier way to explain it.

-Peff

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 10/10] fast-export: add --always-show-modify-after-rename
  2018-11-11  8:42             ` Elijah Newren
@ 2018-11-12 12:58               ` Jeff King
  2018-11-12 18:08                 ` Elijah Newren
  0 siblings, 1 reply; 90+ messages in thread
From: Jeff King @ 2018-11-12 12:58 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Lars Schneider, brian m. carlson, Taylor Blau,
	Jonathan Nieder

On Sun, Nov 11, 2018 at 12:42:58AM -0800, Elijah Newren wrote:

> > > fast-export output is traditionally used as an input to a fast-import
> > > program, but it is also useful to help gather statistics about the
> > > history of a repository (particularly when --no-data is also passed).
> > > For example, two of the types of information we may want to collect
> > > could include:
> > >   1) general information about renames that have occurred
> > >   2) what the biggest objects in a repository are and what names
> > >      they appear under.
> > >
> > > The first bit of information can be gathered by just passing -M to
> > > fast-export.  The second piece of information can partially be gotten
> > > from running
> > >     git cat-file --batch-check --batch-all-objects
> > > However, that only shows what the biggest objects in the repository are
> > > and their sizes, not what names those objects appear as or what commits
> > > they were introduced in.  We can get that information from fast-export,
> > > but when we only see
> > >     R oldname newname
> > > instead of
> > >     R oldname newname
> > >     M 100644 $SHA1 newname
> > > then it makes the job more difficult.  Add an option which allows us to
> > > force the latter output even when commits have exact renames of files.
> >
> > fast-export seems like a funny tool to look up paths. What about "git
> > log --find-object=$SHA1" ?
> 
> Eek, and give me O(N*M) behavior, where N is the number of commits in
> the repository and M is the number of renames that occur in its
> history?  Also, that's the inverse of the lookup I need anyway (I have
> the commit and filename, but am missing the SHA).

Maybe I don't understand what you're trying to accomplish. I was
thinking specifically of your "cat-file can tell you the large objects,
but you don't know their names/commits" from above.

I would do:

   git log --raw $(
     git cat-file --batch-check='%(objectsize:disk) %(objectname)' --batch-all-objects |
     sort -rn | head -3 |
     awk '{print "--find-object=" $2 }'
   )

I'm not sure how renames enter into it at all.
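
For anyone who would rather do the sort/head/awk part of that pipeline
from a script, here is a rough Python equivalent. It is a sketch under
assumptions: `find_object_args` is a made-up helper name, and its input
is the text printed by the cat-file command above ("<size> <id>" per
line).

```python
def find_object_args(batch_check_output, count=3):
    """Build --find-object arguments for the `count` largest objects."""
    entries = []
    for line in batch_check_output.splitlines():
        size, object_id = line.split()
        entries.append((int(size), object_id))
    # Largest first, like `sort -rn | head -<count>`.
    entries.sort(reverse=True)
    return ["--find-object=" + object_id for _, object_id in entries[:count]]
```

The returned list can be appended to ["git", "log", "--raw"] and handed
to subprocess.run().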

> One of the problems with filter-branch that people often run into is
> they know what they want at a high-level (e.g. extract the history of
> this directory for a new repository, or rewrite the history of this
> repo to appear at a subdirectory so it can be merged into a bigger
> repo and people passing filenames to log will still get the history of
> those files, or I want to remove some of the big stuff in my history),
> but often times that's not quite enough.  They need help finding big
> objects, or may be unaware that the subset of files they want used to
> be known by alternative names.
> 
> I want a simple --analyze mode that can report on all files that have
> been renamed (so users don't just say "all I care about is these N
> files, give me a rewritten history just including those" -- we can
> point out to them whether those N files used to be known by other
> names), as well as reporting on all big files and if they've been
> deleted, and aggregations of the "big files" information across
> directories and file extensions.

So this seems like a separate problem than what the commit message talks
about.

There I think you'd want to assemble the list with something like "git
log --follow --name-only paths-of-interest" except that --follow sucks
too much to handle more than one path at a time.

But if you wanted to do it manually, then:

  git log --diff-filter=R --name-only

would be enough to let you track it down, wouldn't it?
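
A sketch of how that output could be turned into a rename table, in
Python. Two assumptions to flag: it swaps --name-only for --name-status
(with an empty --format=) so that both the old and the new name are
visible, and `collect_renames` is a hypothetical helper name. Paths that
git would quote are not handled.

```python
def collect_renames(name_status_output):
    """Return (old_path, new_path) pairs from `git log --name-status` text.

    Rename entries look like "R<score>\t<old>\t<new>"; every other line
    (blank separators, other statuses) is ignored.
    """
    renames = []
    for line in name_status_output.splitlines():
        fields = line.split("\t")
        if len(fields) == 3 and fields[0].startswith("R"):
            renames.append((fields[1], fields[2]))
    return renames
```

The input would come from something like
"git log --diff-filter=R --name-status --format=".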

-Peff

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 00/10] fast export and import fixes and features
  2018-11-11  8:44           ` Elijah Newren
@ 2018-11-12 13:00             ` Jeff King
  0 siblings, 0 replies; 90+ messages in thread
From: Jeff King @ 2018-11-12 13:00 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Lars Schneider, brian m. carlson, Taylor Blau,
	Jonathan Nieder

On Sun, Nov 11, 2018 at 12:44:47AM -0800, Elijah Newren wrote:

> > > These patches were driven by the needs of git-repo-filter[1], but most
> > > if not all of them should be independently useful.
> >
> > I left lots of comments. Some of the earlier ones may just be showing my
> > > confusion about how fast-export works (some of which was cleared up by your
> > later patches). But I like the overall direction for sure.
> 
> Thanks for taking the time to read over the series and providing lots
> of feedback!  And, whoops, looks like it's gotten kinda late, so I'll
> check any further feedback on Monday.

Thank you for your patience with my sometimes-confused responses. :)

Overall it makes more sense to me now (and everything seems like a good
direction), with the exception that I'm still a bit confused about patch
10.

-Peff

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: Import/Export as a fast way to purge files from Git?
  2018-11-12  9:17       ` Import/Export as a fast way to purge files from Git? Ævar Arnfjörð Bjarmason
@ 2018-11-12 15:34         ` Elijah Newren
  0 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-12 15:34 UTC (permalink / raw)
  To: Ævar Arnfjörð
  Cc: Lars Schneider, Git Mailing List, Jeff King, Taylor Blau,
	brian m. carlson

On Mon, Nov 12, 2018 at 1:17 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Thu, Nov 01 2018, Elijah Newren wrote:
>
> > On Wed, Oct 31, 2018 at 12:16 PM Lars Schneider
> > <larsxschneider@gmail.com> wrote:
> >> > On Sep 24, 2018, at 7:24 PM, Elijah Newren <newren@gmail.com> wrote:
> >> > On Sun, Sep 23, 2018 at 6:08 AM Lars Schneider <larsxschneider@gmail.com> wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> I recently had to purge files from large Git repos (many files, many commits).
> >> >> The usual recommendation is to use `git filter-branch --index-filter` to purge
> >> >> files. However, this is *very* slow for large repos (e.g. it takes 45min to
> >> >> remove the `builtin` directory from git core). I realized that I can remove
> >> >> files *way* faster by exporting the repo, removing the file references,
> >> >> and then importing the repo (see Perl script below, it takes ~30sec to remove
> >> >> the `builtin` directory from git core). Do you see any problem with this
> >> >> approach?
> >> >
> >> > It looks like others have pointed you at other tools, and you're
> >> > already shifting to that route.  But I think it's a useful question to
> >> > answer more generally, so for those that are really curious...
> >> >
> >> >
> >> > The basic approach is fine, though if you try to extend it much you
> >> > can run into a few possible edge/corner cases (more on that below).
> >> > I've been using this basic approach for years and even created a
> >> > mini-python library[1] designed specifically to allow people to create
> >> > "fast-filters", used as
> >> >   git fast-export <options> | your-fast-filter | git fast-import <options>
> >> >
> >> > But that library didn't really take off; even I have rarely used it,
> >> > often opting for filter-branch despite its horrible performance or a
> >> > simple fast-export | long-sed-command | fast-import (with some extra
> >> > pre-checking to make sure the sed wouldn't unintentionally munge other
> >> > data).  BFG is great, as long as you're only interested in removing a
> >> > few big items, but otherwise doesn't seem very useful (to be fair,
> >> > it's very upfront about only wanting to solve that problem).
> >> > Recently, due to continuing questions on filter-branch and folks still
> >> > getting confused with it, I looked at existing tools, decided I didn't
> >> > think any quite fit, and started looking into converting
> >> > git_fast_filter into a filter-branch-like tool instead of just a
> >> > library.  Found some bugs and missing features in fast-export along the
> >> > way (and have some patches I still need to send in).  But I kind of
> >> > got stuck -- if the tool is in python, will that limit adoption too
> >> > much?  It'd be kind of nice to have this tool in core git.  But I kind
> >> > of like leaving open the possibility of using it as a tool _or_ as a
> >> > library, the latter for the special cases where case-specific
> >> > programmatic filtering is needed.  But a developer-convenience library
> >> > makes almost no sense unless in a higher level language, such as
> >> > python.  I'm still trying to make up my mind about what I want (and
> >> > what others might want), and have been kind of blocking on that.  (If
> >> > others have opinions, I'm all ears.)
> >>
> >> That library sounds like a very interesting idea. Unfortunately, the
> >> referenced repo seems not to be available anymore:
> >>     git://gitorious.org/git_fast_filter/mainline.git
> >
> > Yeah, gitorious went down at a time when I was busy with enough other
> > things that I never bothered moving my repos to a new hosting site.
> > Sorry about that.
> >
> > I've got a copy locally, but I've been editing it heavily, without the
> > testing I should have in place, so I hesitate to point you at it right
> > now.  (Also, the old version failed to handle things like --no-data
> > output, which is important.)  I'll post an updated copy soon; feel
> > free to ping me in a week if you haven't heard anything yet.
> >
> >> I very much like Python. However, more recently I started to
> >> write Git tools in Perl as they work out of the box on every
> >> machine with Git installed ... and I think Perl can be quite
> >> readable if no shortcuts are used :-).
> >
> > Yeah, when portability matters, perl makes sense.  I thought about
> > switching it over, but I'm not sure I want to rewrite 1-2k lines of
> > code.  Especially since repo-filtering tools are kind of one-shot by
> > nature, and only need to be done by one person of a team, on one
> > specific machine, and won't affect daily development thereafter.
> > (Also, since I don't depend on any libraries and use only stuff from
> > the default python library, it ought to be relatively portable
> > anyway.)
>
> FWIW I'd be very happy to have this tool itself included in git.git
> if/when it's stable / useful enough, and as you point out the language
> doesn't really matter as much as what features it exposes.

Well, I'm happy to propose it for inclusion once it gets to that
point.  I'll bring it up on the list to get wider feedback once I've
removed at least some of the sharp edges.  I suspect it'll be at least
a few weeks.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 06/10] fast-export: when using paths, avoid corrupt stream with non-existent mark
  2018-11-12 12:45               ` Jeff King
@ 2018-11-12 15:36                 ` Elijah Newren
  0 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-12 15:36 UTC (permalink / raw)
  To: Jeff King
  Cc: Git Mailing List, Lars Schneider, brian m. carlson, Taylor Blau,
	Jonathan Nieder

On Mon, Nov 12, 2018 at 4:45 AM Jeff King <peff@peff.net> wrote:
> On Sun, Nov 11, 2018 at 12:01:43AM -0800, Elijah Newren wrote:
>
> > > It does seem funny that the behavior for the earlier case (bounded
> > > commits) and this case (skipping some commits) are different. Would you
> > > ever want to keep walking backwards to find an ancestor in the earlier
> > > case? Or vice versa, would you ever want to simply delete a tag in a
> > > case like this one?
> > >
> > > I'm not sure, but I suspect you may have thought about it a lot
> > > harder than I have. :)
> >
> > I'm not sure why you thought the behavior for the two cases was
> > different?  For both patches, my testcases used path limiting; it was
> > you who suggested employing a negative revision to bound the commits.
>
> Sorry, I think I just got confused. I was thinking about the
> documentation fixup you started with, which did regard bounded commits.
> But that's not relevant here.
>
> > Anyway, for both patches assuming you haven't bounded the commits, you
> > can attempt to keep walking backwards to find an earlier ancestor, but
> > the fundamental fact is you aren't guaranteed that you can find one
> > (i.e. some tag or branch points to a commit that didn't modify any of
> > the specified paths, and nor did any of its ancestors back to any root
> > commits).  I hit that case lots of times.  If the user explicitly
> > requested a tag or branch for export (and requested tag rewriting),
> > and limited to certain paths that had never existed in the repository
> > as of the time of the tag or branch, then you hit the cases these
> > patches worry about.  Patch 4 was about (annotated and signed) tags,
> > this patch is about unannotated tags and branches and other refs.
>
> OK, that makes more sense.
>
> So I guess my question is: in patch 4, why do we not walk back to find
> an appropriate ancestor pointed to by the signed tag object, as we do
> here for the unannotated case?
>
> And I think the answer is: we already do that. It's just that the
> unannotated case never learned the same trick. So basically it's:
>
>   1. rewriting annotated tags to ancestors is already known on "master"
>
>   2. patch 4 further teaches it to drop a tag when that fails
>
>   3. patch 6 teaches both (1) and (2) to the unannotated code path,
>      which knew neither
>
> Is that right?

Ah, now I see where the slight disconnect was.  And yes, you are correct.

> > > This hunk makes sense.
> >
> > Cool, this was the entirety of the code...so does this mean that the
> > code makes more sense than my commit message summary did?  ...and
> > perhaps that my attempts to answer your questions in this email
> > weren't necessary anymore?
>
> No, it only made sense that the hunk implemented what you claimed in the
> commit message. ;)
>
> I think your responses did help me understand that what the commit
> message is claiming is a good thing.


* Re: [PATCH 09/10] fast-export: add a --show-original-ids option to show original names
  2018-11-12 12:53               ` Jeff King
@ 2018-11-12 15:46                 ` Elijah Newren
  2018-11-12 16:31                   ` Jeff King
  0 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-12 15:46 UTC (permalink / raw)
  To: Jeff King
  Cc: Git Mailing List, Lars Schneider, brian m. carlson, Taylor Blau,
	Jonathan Nieder

On Mon, Nov 12, 2018 at 4:53 AM Jeff King <peff@peff.net> wrote:
> On Sun, Nov 11, 2018 at 12:32:22AM -0800, Elijah Newren wrote:
>
> > > >  Documentation/git-fast-export.txt |  7 +++++++
> > > >  builtin/fast-export.c             | 20 +++++++++++++++-----
> > > >  fast-import.c                     | 17 +++++++++++++++++
> > > >  t/t9350-fast-export.sh            | 17 +++++++++++++++++
> > > >  4 files changed, 56 insertions(+), 5 deletions(-)
> > >
> > > The fast-import format is documented in Documentation/git-fast-import.txt.
> > > It might need an update to cover the new format.
> >
> > We document the format in both fast-import.c and
> > Documentation/git-fast-import.txt?  Maybe we should delete the long
> > comments in fast-import.c so this isn't duplicated?
>
> Yes, that is probably worth doing (see the comment at the top of
> fast-import.c). Some information might need to be migrated.
>
> If we're going to have just one spot, I think it needs to be the
> user-facing documentation. This is a public interface that other people
> are building compatible implementations for (including your new tool).

Okay, I'll work on that.

> OK, that matches my understanding. So why does fast-export need to print
> the blob ids? If the intermediary is rewriting blobs, it can then
> produce the "originally" line itself, can't it?
>
> The more interesting case I guess is your "strip out blobs by id"
> example. There the intermediary _could_ do so itself, but it would
> require recomputing the object id of each blob.
>
> If you use "--no-data", then this just works (we specify tree entries by
> object id, rather than by mark). But I can see how it would be useful to
> have the information even without "--no-data" (i.e., if you are doing
> multiple kinds of rewrites on a single stream).
>
> I think the thing that confused me is that this "originally" is doing
> two things:
>
>   - mentioning blob ids as an optimization / convenience for the reader
>
>   - mentioning rewritten commit (and presumably tag?) ids that were
>     rewritten as part of a partial history export. I suppose even trees
>     could be rewritten that way, too, but fast-import doesn't generally
>     consider trees to be a first-class item.
>
> So I'm OK with it, but I wonder if there is an easier way to explain it.

Yeah, I started out just needing to add the original oids for commits.
Once I added them there, I wondered whether someone would need them
for tags and blobs too (not trees since fast-import doesn't work with
those).  For blobs, it made sense as a small performance optimization
(when running without --no-data), as you pointed out.  I can't think
of a use for them in tags, but once I've included them in blobs and
commits it felt like I might as well include them there for
completeness.  So maybe my commit message should have been something
more like:

"""
Knowing the original names (hashes) of commits can sometimes enable
post-filtering that would otherwise be difficult or impossible.  In
particular, the desire to rewrite commit messages which refer to other
prior commits (on top of whatever other filtering is being done) is
very difficult without knowing the original names of each commit.

In addition, knowing the original names (hashes) of blobs can allow
filtering by blob-id without requiring re-hashing the content of the
blob, and is thus useful as a small optimization.

Once we add original ids for both commits and blobs, we may as well
add them for tags too for completeness.  Perhaps someone will have a
use for them.

This commit teaches a new --show-original-ids option to fast-export
which will make it add a 'original-oid <hash>' line to blob, commits,
and tags.  It also teaches fast-import to parse (and ignore) such
lines.
"""

?
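
For illustration only (this is not part of the patch): a minimal Python
sketch of how a stream filter might consume the proposed 'original-oid'
lines.  The function name, the sample stream, and the fake hashes are
hypothetical; only the 'mark', 'data', and 'original-oid' directives come
from the discussion above.

```python
import io


def collect_original_oids(stream):
    """Map each mark (e.g. b":1") to the hash on its 'original-oid' line."""
    mapping = {}
    current_mark = None
    while True:
        line = stream.readline()
        if not line:
            break
        line = line.rstrip(b"\n")
        if line.startswith(b"data "):
            # Skip the payload so its bytes are never mistaken for commands.
            stream.read(int(line[5:]))
        elif line.startswith(b"mark "):
            current_mark = line[5:]
        elif line.startswith(b"original-oid ") and current_mark is not None:
            mapping[current_mark] = line[len(b"original-oid "):]
            current_mark = None
    return mapping


# Hypothetical stream; the hashes are placeholders, not real object names.
FAKE_BLOB_OID = b"a" * 40
FAKE_COMMIT_OID = b"b" * 40
SAMPLE = b"".join([
    b"blob\n",
    b"mark :1\n",
    b"original-oid " + FAKE_BLOB_OID + b"\n",
    b"data 8\n",
    b"mark :9\n",          # blob content that merely looks like a command
    b"\n",
    b"commit refs/heads/master\n",
    b"mark :2\n",
    b"original-oid " + FAKE_COMMIT_OID + b"\n",
    b"author A <a@example.com> 0 +0000\n",
    b"committer A <a@example.com> 0 +0000\n",
    b"data 4\n",
    b"msg\n",
    b"M 100644 :1 file.txt\n",
    b"\n",
])

oids = collect_original_oids(io.BytesIO(SAMPLE))
```

With such a map, a filter could rewrite references to old commit hashes in
commit messages, or drop blobs by id, without re-hashing any content.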


* Re: [PATCH 09/10] fast-export: add a --show-original-ids option to show original names
  2018-11-12 15:46                 ` Elijah Newren
@ 2018-11-12 16:31                   ` Jeff King
  0 siblings, 0 replies; 90+ messages in thread
From: Jeff King @ 2018-11-12 16:31 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Lars Schneider, brian m. carlson, Taylor Blau,
	Jonathan Nieder

On Mon, Nov 12, 2018 at 07:46:14AM -0800, Elijah Newren wrote:

> So maybe my commit message should have been something
> more like:
> 
> """
> Knowing the original names (hashes) of commits can sometimes enable
> post-filtering that would otherwise be difficult or impossible.  In
> particular, the desire to rewrite commit messages which refer to other
> prior commits (on top of whatever other filtering is being done) is
> very difficult without knowing the original names of each commit.
> 
> In addition, knowing the original names (hashes) of blobs can allow
> filtering by blob-id without requiring re-hashing the content of the
> blob, and is thus useful as a small optimization.
> 
> Once we add original ids for both commits and blobs, we may as well
> add them for tags too for completeness.  Perhaps someone will have a
> use for them.
> 
> This commit teaches a new --show-original-ids option to fast-export
> which will make it add a 'original-oid <hash>' line to blob, commits,
> and tags.  It also teaches fast-import to parse (and ignore) such
> lines.
> """
> 
> ?

Yes, that makes much more sense to me (though of course I've also been
discussing it with you, so just about anything would at this point ;) ).

It's possible that somebody would want to filter on tree id's, too. A
fast-import stream just has trees incidentally as part of commit state,
but we could say something like "by the way, this tree is X". You can
even do "fast-export -t" to see subtrees, though I am not sure if that
is intentional or just an artifact of being based on the diff code.

I guess that is not all that useful, though. I was mostly thinking about
it because of your "we may as well add them for tags too for
completeness" above. But the issues around trees are sufficiently subtle
that we're probably better off not trying to handle them here. There's a
good chance we'd get it wrong, making our "let's just add this for
completeness while we're here" totally backfire.

-Peff


* Re: [PATCH 10/10] fast-export: add --always-show-modify-after-rename
  2018-11-12 12:58               ` Jeff King
@ 2018-11-12 18:08                 ` Elijah Newren
  2018-11-13 14:45                   ` Jeff King
  0 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-12 18:08 UTC (permalink / raw)
  To: Jeff King
  Cc: Git Mailing List, Lars Schneider, brian m. carlson, Taylor Blau,
	Jonathan Nieder

On Mon, Nov 12, 2018 at 4:58 AM Jeff King <peff@peff.net> wrote:
> On Sun, Nov 11, 2018 at 12:42:58AM -0800, Elijah Newren wrote:
>
> Maybe I don't understand what you're trying to accomplish. I was
> thinking specifically of your "cat-file can tell you the large objects,
> but you don't know their names/commits" from above.

Fair enough.  And just to be clear, the first 9 patches were fixes and
features around trying to rewrite history; patch 10 is orthogonal and
was used for a separate run to just gather data.  It is entirely
possible I could gather that data other ways.

> I would do:
>
>    git log --raw $(
>      git cat-file --batch-check='%(objectsize:disk) %(objectname)' --batch-all-objects |
>      sort -rn | head -3 |
>      awk '{print "--find-object=" $2 }'
>    )
>
> I'm not sure how renames enter into it at all.

How did I miss objectsize:disk??  Especially since it is right next to
objectsize in the manpage to boot?  That's awesome, thanks for that
pointer.

I do have a separate cat-file --batch-check --batch-all-objects
process already, since I can't get sizes out of either log or
fast-export.  However, I wouldn't use your 'head -3' since I'm not
looking for the N biggest, but reporting on _all_ objects (in reverse
size order) and letting the user look over the report and decide
where to stop reading.  So, this is a big and expensive log command.
Granted, we will need a big and expensive log command, but let's keep
in mind that we have this one.
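
As a rough sketch of the "report on all objects in reverse size order"
idea, assuming the output of the cat-file invocation quoted above has
already been captured as text (the function names are hypothetical, and
the subprocess wrapper is only one possible way to obtain that text):

```python
import subprocess


def read_object_sizes(repo="."):
    """Run the cat-file command from the thread and return its stdout."""
    return subprocess.run(
        ["git", "cat-file",
         "--batch-check=%(objectsize:disk) %(objectname)",
         "--batch-all-objects"],
        cwd=repo, capture_output=True, text=True, check=True,
    ).stdout


def rank_objects_by_disk_size(batch_check_output):
    """Return (size, object id) pairs for every object, largest first."""
    rows = []
    for line in batch_check_output.splitlines():
        if not line.strip():
            continue
        size, oid = line.split()
        rows.append((int(size), oid))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows


# Hypothetical captured output; the object ids are placeholders.
example = "120 " + "a" * 40 + "\n9000 " + "b" * 40 + "\n45 " + "c" * 40 + "\n"
ranked = rank_objects_by_disk_size(example)
```

Unlike 'head -3', nothing is truncated here; the reader of the report
decides where to stop.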

> > One of the problems with filter-branch that people often run into is
> > they know what they want at a high-level (e.g. extract the history of
> > this directory for a new repository, or rewrite the history of this
> > repo to appear at a subdirectory so it can be merged into a bigger
> > repo and people passing filenames to log will still get the history of
> > those files, or I want to remove some of the big stuff in my history),
> > but often times that's not quite enough.  They need help finding big
> > objects, or may be unaware that the subset of files they want used to
> > be known by alternative names.
> >
> > I want a simple --analyze mode that can report on all files that have
> > been renamed (so users don't just say "all I care about is these N
> > files, give me a rewritten history just including those" -- we can
> > point out to them whether those N files used to be known by other
> > names), as well as reporting on all big files and if they've been
> > deleted, and aggregations of the "big files" information across
> > directories and file extensions.
>
> So this seems like a separate problem than what the commit message talks
> about.
>
> There I think you'd want to assemble the list with something like "git
> log --follow --name-only paths-of-interest" except that --follow sucks
> too much to handle more than one path at a time.
>
> But if you wanted to do it manually, then:
>
>   git log --diff-filter=R --name-only
>
> would be enough to let you track it down, wouldn't it?

Without a -M you'd only catch 100% renames, right?  Those aren't the
only ones I'd want to catch, so I'd need to add -M.  You are right
that we could get basic renames this way, but it doesn't cover
everything I need.  Let's use this as a starting point, though, and
build up to what I need...

I also want to know when files were deleted.  I've generally found
that people are more okay with purging parts of history [corresponding
to large objects] that were deleted longer ago than more recent stuff,
for a variety of reasons.  So we could either run yet another log, or
modify the command to:

  git log -M --diff-filter=RD --name-status

However, I don't just want to know when files were deleted, I'd like
to know when directories are deleted.  I only knew how to derive that
from knowing what files existed within those directories, so that
would take me to:

  git log -M --diff-filter=RAD --name-status

[Edit: I just saw your other email and for the first time learned
about the -t rev-list option which might simplify this a little,
although "need to worry about deleted files being reinstated" below
might require the 'A' anyway.]

At this point, let's remember that we had another full git-log
invocation for mapping object sizes to filenames.  We might as well
coalesce the two log commands into one, by extending this latest one
to:

  git log -M --diff-filter=RAMD --no-abbrev --raw

Also, I wanted commit date rather than author date, so we need to
extend the headers a bit.  Also, for reasons I won't bother detailing,
I think I want to traverse commits in reverse topological order.  So
our command is:

  git log --pretty=fuller --topo-order --reverse -M --diff-filter=RAMD
--no-abbrev --raw

But that still leaves us with four problems, three of which we can
solve with further extensions to this command:

1) There are some weird edge cases with deletions and renames.  Lots
of them in fact.  At a simple level, branching and merging and
multiple refs means that "is-this-deleted" isn't a binary flag for a
given filename (but rather a binary flag per-ref).  Also, it makes
"the set of names associated with a single 'file' as perceived by the
user" possibly rather ill-defined as well.  This can get really hairy,
but I'd at least like to handle the very basic cases of (a) "user
re-instates filename that used to be deleted" (i.e. the file isn't
deleted anymore) and (b) "user re-instates a filename that used to
exist but was renamed to something else" (in such cases, we can't just
treat the two filenames as being different names of the same content).
Handling the (b) usecase sanely requires some topology information, so
we need parents as well.  So our command extends to:

   git log --parents --pretty=fuller --topo-order --reverse -M
--diff-filter=RAMD --no-abbrev --raw

2) log is not plumbing, so parsing the stuff before the file
modifications is not a good idea. This could be fixed by using
--format:

  git log --format='%H%n%P%n%cd' --date=short --topo-order --reverse
-M --diff-filter=RAMD --no-abbrev --raw

3) log won't show changes for merge commits by default; we'd need to add -c:

  git log --format='%H%n%P%n%cd' --date=short --topo-order --reverse
-M --diff-filter=RAMD --no-abbrev --raw -c

4) log is not plumbing, revisited: although at this point I've
specified the log output explicitly enough that it ought to be safe to
parse, there are a few things that make me slightly worried.  I can
depend on fast-export to be stable; it only gives 'M' and 'D' unless
you explicitly ask for more types (e.g. -M to detect renames will add
'R').  With log, I'm not so sure; do I need to worry about new types
appearing in the future?  Also, should I just drop --diff-filter=RAMD
since it covers just about everything anyway?  Also, while --raw is
stable, is the combination of -c and --raw stable?  Is --date=short
stable (most likely, but still seems more likely to change than
fast-export would be)?  Is there something else I need to be worried
about?  Granted, each of those is only a small worry with log, but
they add up and give me pause about whether I should be parsing its
output in another tool.

So we've come up with an alternate way to get the data I need, though
with some worries.
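
To make the parsing concern concrete, here is a minimal Python sketch of
reading a single --raw record, including the multi-colon form that -c
produces for merges.  The names RawChange and parse_raw_line are
hypothetical, the object ids are placeholders, and the sketch assumes
paths need no unquoting:

```python
from typing import NamedTuple, Optional, Tuple


class RawChange(NamedTuple):
    old_ids: Tuple[str, ...]   # one blob id per parent
    new_id: str                # blob id in the commit itself
    status: str                # e.g. "M", "D", "R100", or "MM" for a merge
    path: str                  # for renames this is the old name
    new_path: Optional[str]    # set only when a second path is present


def parse_raw_line(line):
    """Parse one --raw record; return None for lines that are not records."""
    if not line.startswith(":"):
        return None
    # A normal record has one leading colon; -c emits one per parent.
    nparents = len(line) - len(line.lstrip(":"))
    meta, *paths = line.rstrip("\n").split("\t")
    fields = meta[nparents:].split()
    # Layout: nparents+1 modes, nparents+1 object ids, then the status.
    ids = fields[nparents + 1: 2 * nparents + 2]
    status = fields[2 * nparents + 2]
    new_path = paths[1] if len(paths) > 1 else None
    return RawChange(tuple(ids[:-1]), ids[-1], status, paths[0], new_path)


OLD, NEW = "a" * 40, "b" * 40
rename = parse_raw_line(f":100644 100644 {OLD} {NEW} R100\told.c\tnew.c")
merge = parse_raw_line(f"::100644 100644 100644 {OLD} {OLD} {NEW} MM\tfile.c")
```

Note that the record for an exact rename still carries the blob id of the
new name, which is the piece of information discussed below.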

I could potentially switch to using this and drop patch 10/10.  Maybe
there's even a good reason to prefer using log.  But at the time I was
thinking in terms of "I already have a tool that parses fast-export
output and I know it's stable...and it has access to all the
information I need so why not just get the information from it?"  So I
did that, and then realized towards the end that although it had all
the needed info, it stripped one piece from me.  Namely, when it had a
100% rename, I'd only get
   R oldname newname
and wouldn't know the sha1sum of newname (for mapping object sizes to
all their names).  If I cached the information about all file shas for
all trees I could pull it from that cache (which could be expensive
memory-wise for large repos), or I could use the original-oid
directive and keep another long running "git cat-file
--batch-check='%(objectname)'" process and just pass it
"$ORIGINAL_OID:$NEWNAME" lines as I come across them.  However,
fast-export had the information and did special work to try to avoid
showing it when it thought it wouldn't be needed, so why not just add
a flag to tell it to just give me the filemodify?

At this point, if folks don't like this patch, I'm more likely to use
the supplementary cat-file process than switching to log, unless
someone can ameliorate my concerns with it and suggest a good reason
why it's actually better.

Anyway, I hope it makes a little more sense why I created this patch.
Does it, or have I just made things even more confusing?

...and if you've read this far, I'm impressed.  Thanks for reading.


* Re: [PATCH 04/10] fast-export: avoid dying when filtering by paths and old tags exist
  2018-11-11  6:44           ` Jeff King
  2018-11-11  7:38             ` Elijah Newren
@ 2018-11-12 22:50             ` brian m. carlson
  2018-11-13 14:38               ` Jeff King
  1 sibling, 1 reply; 90+ messages in thread
From: brian m. carlson @ 2018-11-12 22:50 UTC (permalink / raw)
  To: Jeff King; +Cc: Elijah Newren, git, larsxschneider, me, jrnieder


On Sun, Nov 11, 2018 at 01:44:43AM -0500, Jeff King wrote:
> > +		git fast-export --tag-of-filtered-object=rewrite --all -- bar >output &&
> > +		grep -A 1 refs/tags/v0.0.0.0.0.0.1 output | grep -E ^from.0{40}
> 
> I don't think "grep -A" is portable (and we don't seem to otherwise use
> it). You can probably do something similar with sed.
> 
> Use $ZERO_OID instead of hard-coding 40, which future-proofs for the
> hash transition (though I suppose the hash is not likely to get
> _shorter_ ;) ).

It would indeed be nice if we used $ZERO_OID.  Also, we prefer to write
"egrep", since some less capable systems don't have a grep with -E.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204



* Re: [PATCH 04/10] fast-export: avoid dying when filtering by paths and old tags exist
  2018-11-12 22:50             ` brian m. carlson
@ 2018-11-13 14:38               ` Jeff King
  0 siblings, 0 replies; 90+ messages in thread
From: Jeff King @ 2018-11-13 14:38 UTC (permalink / raw)
  To: brian m. carlson, Elijah Newren, git, larsxschneider, me,
	jrnieder

On Mon, Nov 12, 2018 at 10:50:43PM +0000, brian m. carlson wrote:

> On Sun, Nov 11, 2018 at 01:44:43AM -0500, Jeff King wrote:
> > > +		git fast-export --tag-of-filtered-object=rewrite --all -- bar >output &&
> > > +		grep -A 1 refs/tags/v0.0.0.0.0.0.1 output | grep -E ^from.0{40}
> > 
> > I don't think "grep -A" is portable (and we don't seem to otherwise use
> > it). You can probably do something similar with sed.
> > 
> > Use $ZERO_OID instead of hard-coding 40, which future-proofs for the
> > hash transition (though I suppose the hash is not likely to get
> > _shorter_ ;) ).
> 
> It would indeed be nice if we used $ZERO_OID.  Also, we prefer to write
> "egrep", since some less capable systems don't have a grep with -E.

I thought that, too, but it is only "grep -F" that has been a problem
for us in the past, and we have many "grep -E" calls already. c.f.
https://public-inbox.org/git/20180910154453.GA15270@sigill.intra.peff.net/

-Peff


* Re: [PATCH 10/10] fast-export: add --always-show-modify-after-rename
  2018-11-12 18:08                 ` Elijah Newren
@ 2018-11-13 14:45                   ` Jeff King
  2018-11-13 17:10                     ` Elijah Newren
  0 siblings, 1 reply; 90+ messages in thread
From: Jeff King @ 2018-11-13 14:45 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Lars Schneider, brian m. carlson, Taylor Blau,
	Jonathan Nieder

On Mon, Nov 12, 2018 at 10:08:10AM -0800, Elijah Newren wrote:

> > I would do:
> >
> >    git log --raw $(
> >      git cat-file --batch-check='%(objectsize:disk) %(objectname)' --batch-all-objects |
> >      sort -rn | head -3 |
> >      awk '{print "--find-object=" $2 }'
> >    )
> >
> > I'm not sure how renames enter into it at all.
> 
> How did I miss objectsize:disk??  Especially since it is right next to
> objectsize in the manpage to boot?  That's awesome, thanks for that
> pointer.
> 
> I do have a separate cat-file --batch-check --batch-all-objects
> process already, since I can't get sizes out of either log or
> fast-export.  However, I wouldn't use your 'head -3' since I'm not
> looking for the N biggest, but reporting on _all_ objects (in reverse
> size order) and letting the user look over the report and decide
> where to stop reading.  So, this is a big and expensive log command.
> Granted, we will need a big and expensive log command, but let's keep
> in mind that we have this one.

It is an expensive log command, but it's the same expense as running
fast-export, no? And I think maybe that is the disconnect.

I am looking at this problem as "how do you answer question X in a
repository". And I think you are looking at it as "I am receiving a
fast-export stream, and I need to answer question X on the fly".

And that would explain why you want to get extra annotations into the
fast-export stream. Is that right?

> > There I think you'd want to assemble the list with something like "git
> > log --follow --name-only paths-of-interest" except that --follow sucks
> > too much to handle more than one path at a time.
> >
> > But if you wanted to do it manually, then:
> >
> >   git log --diff-filter=R --name-only
> >
> > would be enough to let you track it down, wouldn't it?
> 
> Without a -M you'd only catch 100% renames, right?  Those aren't the
> only ones I'd want to catch, so I'd need to add -M.  You are right
> that we could get basic renames this way, but it doesn't cover
> everything I need.  Let's use this as a starting point, though, and
> build up to what I need...

No, renames are on by default these days, and that includes inexact
renames. That said, if you're scripting you probably ought to be doing:

  git rev-list HEAD | git diff-tree --stdin

and there yes, you'd have to enable "-M" yourself (you touched on
scripting and formatting below; diff-tree can accept the format options
you'd want).

> I also want to know when files were deleted.  I've generally found
> that people are more okay with purging parts of history [corresponding
> to large objects] that were deleted longer ago than more recent stuff,
> for a variety of reasons.  So we could either run yet another log, or
> modify the command to:
> 
>   git log -M --diff-filter=RD --name-status
> 
> However, I don't just want to know when files were deleted, I'd like
> to know when directories are deleted.  I only knew how to derive that
> from knowing what files existed within those directories, so that
> would take me to:
> 
>   git log -M --diff-filter=RAD --name-status
> 
> [Edit: I just saw your other email and for the first time learned
> about the -t rev-list option which might simplify this a little,
> although "need to worry about deleted files being reinstated" below
> might require the 'A' anyway.]

Yeah, I think "-t" would help your tree deletion problem.

> At this point, let's remember that we had another full git-log
> invocation for mapping object sizes to filenames.  We might as well
> coalesce the two log commands into one, by extending this latest one
> to:
> 
>   git log -M --diff-filter=RAMD --no-abbrev --raw

What is there besides RAMD? :)

> I could potentially switch to using this and drop patch 10/10.

So I'm still not _entirely_ clear on what you're trying to do with
10/10. I think maybe the "disconnect" part I wrote above explains it. If
that's correct, then I think framing it in terms of the operations that
you'd be able to perform _without running a separate traverse_ would
make it more obvious.

> Anyway, I hope it makes a little more sense why I created this patch.
> Does it, or have I just made things even more confusing?

Some of both, I think.

> ...and if you've read this far, I'm impressed.  Thanks for reading.

I'll admit I skimmed near the end. ;)

-Peff


* Re: [PATCH 10/10] fast-export: add --always-show-modify-after-rename
  2018-11-13 14:45                   ` Jeff King
@ 2018-11-13 17:10                     ` Elijah Newren
  2018-11-14  7:14                       ` Jeff King
  0 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-13 17:10 UTC (permalink / raw)
  To: Jeff King
  Cc: Git Mailing List, Lars Schneider, brian m. carlson, Taylor Blau,
	Jonathan Nieder

On Tue, Nov 13, 2018 at 6:45 AM Jeff King <peff@peff.net> wrote:
> It is an expensive log command, but it's the same expense as running
> fast-export, no? And I think maybe that is the disconnect.

I would expect an expensive log command to generally be the same
expense as running fast-export, yes.  But I would expect two expensive
log commands to be twice the expense of a single fast-export (and you
suggested two log commands: both the --find-object= one and the
--diff-filter one).

> I am looking at this problem as "how do you answer question X in a
> repository". And I think you are looking at it as "I am receiving a
> fast-export stream, and I need to answer question X on the fly".
>
> And that would explain why you want to get extra annotations into the
> fast-export stream. Is that right?

I'm not trying to get information on the fly during a rewrite or
anything like that.  This is an optional pre-rewrite step (from a
separate invocation of the tool) where I have multiple questions I
want to answer.  I'd like to answer them all relatively quickly, if
possible, and I think all of them should be answerable with a single
history traversal (plus a cat-file --batch-all-objects call to get
object sizes, since I don't know of another way to get those).  I'd be
fine with switching from fast-export to log or something else if it
met the needs better.

As far as I can tell, you're trying to split each question apart and
do a history traversal for each, and I don't see why that's better.
Simpler, perhaps, but it seems worse for performance.  Am I missing
something?

> > > There I think you'd want to assemble the list with something like "git
> > > log --follow --name-only paths-of-interest" except that --follow sucks
> > > too much to handle more than one path at a time.
> > >
> > > But if you wanted to do it manually, then:
> > >
> > >   git log --diff-filter=R --name-only
> > >
> > > would be enough to let you track it down, wouldn't it?
> >
> > Without a -M you'd only catch 100% renames, right?  Those aren't the
> > only ones I'd want to catch, so I'd need to add -M.  You are right
> > that we could get basic renames this way, but it doesn't cover
> > everything I need.  Let's use this as a starting point, though, and
> > build up to what I need...
>
> No, renames are on by default these days, and that includes inexact
> renames. That said, if you're scripting you probably ought to be doing:
>
>   git rev-list HEAD | git diff-tree --stdin
>
> and there yes, you'd have to enable "-M" yourself (you touched on
> scripting and formatting below; diff-tree can accept the format options
> you'd want).

Ah, I didn't know renames were on by default; I somehow missed that.
Also, the rev-list to diff-tree pipe is nice, but I also need parent
and commit timestamp information.

....
> Yeah, I think "-t" would help your tree deletion problem.

Absolutely, thanks for the hint.  Much appreciated.  :-)

> > At this point, let's remember that we had another full git-log
> > invocation for mapping object sizes to filenames.  We might as well
> > coalesce the two log commands into one, by extending this latest one
> > to:
> >
> >   git log -M --diff-filter=RAMD --no-abbrev --raw
>
> What is there besides RAMD? :)

Well, as you pointed out above, log detects renames by default,
whereas it didn't used to.  So, if someone had written some similar-ish
history walking/parsing tool years ago that didn't need renames and was
based on log output, there's a good chance their tool might start
failing when rename detection was turned on by default, because instead
of getting both a 'D' and an 'M' change, they'd get an unexpected 'R'.

For my case, do I have to worry about similar future changes?  Will
copy detection ('C') or break detection ('B') become the default in
the future?  Do I have to worry about typechanges ('T')?  Will new
change types be added?  I mean, the fast-export output could maybe
change too, but it seems much less likely than with log.

> > I could potentially switch to using this and drop patch 10/10.
>
> So I'm still not _entirely_ clear on what you're trying to do with
> 10/10. I think maybe the "disconnect" part I wrote above explains it. If
> that's correct, then I think framing it in terms of the operations that
> you'd be able to perform _without running a separate traverse_ would
> make it more obvious.

Let me try to put it as briefly as I can.  With as few traversals as
possible, I want to:
  * Get all blob sizes
  * Map blob shas to filename(s) they appeared under in the history
  * Find when files and directories were deleted (and whether they
were later reinstated, since that means they aren't actually gone)
  * Find sets of filenames referring to the same logical 'file'. (e.g.
foo->bar in commit A and bar->baz in commit B mean that {foo,bar,baz}
refer to the same 'file' so that a user has an easy report to look at
to find out that if they just want to "keep baz and its history" then
they need foo & bar & baz.  I need to know about things like another
foo or bar being introduced after the rename though, since that breaks
the connection between filenames)
  * Do a few aggregations on the above data as well (e.g. all copies
of postgres.exe add up to 20M -- why were those checked in anyway?,
*.webm files in aggregate are .5G, your long-deleted src/video-server/
directory from that aborted experimental project years ago takes up 2G
of your history, etc.)
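
The "sets of filenames referring to the same logical file" item can be
sketched in Python as follows.  This is only an illustration under a
strong simplifying assumption (a linear history, with events already in
commit order); the function name and the ("A", path) / ("R", old, new)
event tuples are hypothetical and do not correspond to any existing tool:

```python
def group_renamed_names(events):
    """Group filenames linked by renames, in commit order.

    Re-adding a name that was renamed away earlier starts a new group,
    which is how the broken connection between filenames is modelled.
    """
    groups = []      # list of sets of names, one set per logical file
    group_of = {}    # name -> index of the group that currently owns it
    for event in events:
        kind = event[0]
        if kind == "A":
            path = event[1]
            groups.append({path})
            group_of[path] = len(groups) - 1
        elif kind == "R":
            old, new = event[1], event[2]
            if old not in group_of:
                # Rename of a file whose addition was not seen.
                groups.append({old})
                group_of[old] = len(groups) - 1
            index = group_of[old]
            groups[index].add(new)
            group_of[new] = index
    return groups


history = [
    ("A", "foo"),
    ("R", "foo", "bar"),   # commit A: foo -> bar
    ("R", "bar", "baz"),   # commit B: bar -> baz
    ("A", "foo"),          # an unrelated foo is introduced later
    ("R", "foo", "qux"),
]
name_sets = group_renamed_names(history)
```

Here "foo" ends up in two groups, which is exactly the ambiguity a report
would need to flag; branches and merges make the real problem harder than
this sketch suggests.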

Right now, my best solution for this combination of questions is
'cat-file --batch-all-objects' plus fast-export, if I get patch 10/10
in place.  I'm totally open to better solutions, including ones that
don't use fast-export.


* Re: [PATCH 02/10] git-fast-export.txt: clarify misleading documentation about rev-list args
  2018-11-11  7:17             ` Elijah Newren
@ 2018-11-13 23:25               ` Elijah Newren
  2018-11-13 23:39                 ` Jonathan Nieder
  0 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-13 23:25 UTC (permalink / raw)
  To: Jeff King
  Cc: Git Mailing List, Lars Schneider, brian m. carlson, Taylor Blau,
	Jonathan Nieder

On Sat, Nov 10, 2018 at 11:17 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Sat, Nov 10, 2018 at 10:36 PM Jeff King <peff@peff.net> wrote:
> >
> > On Sat, Nov 10, 2018 at 10:23:04PM -0800, Elijah Newren wrote:
> >
> > > Signed-off-by: Elijah Newren <newren@gmail.com>
> > > ---
> > >  Documentation/git-fast-export.txt | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
> > > index ce954be532..677510b7f7 100644
> > > --- a/Documentation/git-fast-export.txt
> > > +++ b/Documentation/git-fast-export.txt
> > > @@ -119,7 +119,8 @@ marks the same across runs.
> > >       'git rev-list', that specifies the specific objects and references
> > >       to export.  For example, `master~10..master` causes the
> > >       current master reference to be exported along with all objects
> > > -     added since its 10th ancestor commit.
> > > +     added since its 10th ancestor commit and all files common to
> > > +     master\~9 and master~10.
> >
> > Do you need to backslash the second tilde?  Maybe `master~9` and
> > `master~10` instead of escaping?
>
> Oops, yeah, that needs to be consistent.

Actually, no, it actually needs to be inconsistent.

Different Input Choices (neither backslashed, both backslashed, then just one):
  master~9 and master~10
  master\~9 and master\~10
  master\~9 and master~10

What the outputs look like:
  master9 and master10
  master~9 and master\~10
  master~9 and master~10

I have no idea why asciidoc behaves this way, but it appears my
backslash escaping of just one of the two was necessary.


* Re: [PATCH 02/10] git-fast-export.txt: clarify misleading documentation about rev-list args
  2018-11-13 23:25               ` Elijah Newren
@ 2018-11-13 23:39                 ` Jonathan Nieder
  2018-11-14  0:02                   ` Elijah Newren
  0 siblings, 1 reply; 90+ messages in thread
From: Jonathan Nieder @ 2018-11-13 23:39 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Jeff King, Git Mailing List, Lars Schneider, brian m. carlson,
	Taylor Blau

Elijah Newren wrote:

> Actually, no, it actually needs to be inconsistent.
>
> Different Input Choices (neither backslashed, both backslashed, then just one):
>   master~9 and master~10
>   master\~9 and master\~10
>   master\~9 and master~10
>
> What the outputs look like:
>   master9 and master10
>   master~9 and master\~10
>   master~9 and master~10
>
> I have no idea why asciidoc behaves this way, but it appears my
> backslash escaping of just one of the two was necessary.

{tilde} should work consistently.

Thanks,
Jonathan


* Re: [PATCH 02/10] git-fast-export.txt: clarify misleading documentation about rev-list args
  2018-11-13 23:39                 ` Jonathan Nieder
@ 2018-11-14  0:02                   ` Elijah Newren
  0 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-14  0:02 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Jeff King, Git Mailing List, Lars Schneider, brian m. carlson,
	Taylor Blau

On Tue, Nov 13, 2018 at 3:39 PM Jonathan Nieder <jrnieder@gmail.com> wrote:
> Elijah Newren wrote:
> > Actually, no, it actually needs to be inconsistent.
> >
> > Different Input Choices (neither backslashed, both backslashed, then just one):
> >   master~9 and master~10
> >   master\~9 and master\~10
> >   master\~9 and master~10
> >
> > What the outputs look like:
> >   master9 and master10
> >   master~9 and master\~10
> >   master~9 and master~10
> >
> > I have no idea why asciidoc behaves this way, but it appears my
> > backslash escaping of just one of the two was necessary.
>
> {tilde} should work consistently.

Indeed it does (well, outside of `backtick blocks`); thanks for the tip.


* [PATCH v2 00/11] fast export and import fixes and features
  2018-11-11  6:23       ` [PATCH 00/10] fast export and import fixes and features Elijah Newren
                           ` (10 preceding siblings ...)
  2018-11-11  7:27         ` [PATCH 00/10] fast export and import fixes and features Jeff King
@ 2018-11-14  0:25         ` Elijah Newren
  2018-11-14  0:25           ` [PATCH v2 01/11] git-fast-import.txt: fix documentation for --quiet option Elijah Newren
                             ` (12 more replies)
  11 siblings, 13 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-14  0:25 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, gitster,
	Elijah Newren

This is a series of small fixes and features for fast-export and
fast-import, mostly on the fast-export side.

Changes since v1 (full range-diff below):
  - used {tilde} in asciidoc documentation to avoid subscripting and
    escaping problems
  - renamed ABORT/ERROR enum values to help avoid further misuse
  - multiple small testcase cleanups (use $ZERO_OID, remove grep -A, etc.)
  - add FIXME comment to code about string_list usage
  - record Peff's idea for a future optimization in patch 8 commit message
    (is there a better place to put that??)
  - New patch (9/11): remove the unmaintained copy of fast-import stream
    format documentation at the beginning of fast-import.c
  - Rewrite commit message for 10/11 to match the wording Peff liked
    better, s/originally/original-oid/, and add documentation to
    git-fast-import.txt
  - Rewrite commit message for 11/11; the last one didn't make sense to
    Peff.  I hope this one does.

Elijah Newren (11):
  git-fast-import.txt: fix documentation for --quiet option
  git-fast-export.txt: clarify misleading documentation about rev-list
    args
  fast-export: use value from correct enum
  fast-export: avoid dying when filtering by paths and old tags exist
  fast-export: move commit rewriting logic into a function for reuse
  fast-export: when using paths, avoid corrupt stream with non-existent
    mark
  fast-export: ensure we export requested refs
  fast-export: add --reference-excluded-parents option
  fast-import: remove unmaintained duplicate documentation
  fast-export: add a --show-original-ids option to show original names
  fast-export: add --always-show-modify-after-rename

 Documentation/git-fast-export.txt |  34 +++++-
 Documentation/git-fast-import.txt |  23 +++-
 builtin/fast-export.c             | 172 ++++++++++++++++++++++--------
 fast-import.c                     | 166 +++-------------------------
 t/t9350-fast-export.sh            | 116 +++++++++++++++++++-
 5 files changed, 308 insertions(+), 203 deletions(-)

 1:  0744f65b0d =  1:  8870fb1340 git-fast-import.txt: fix documentation for --quiet option
 2:  aba1e22fdd !  2:  16d1c3e22d git-fast-export.txt: clarify misleading documentation about rev-list args
    @@ -13,7 +13,7 @@
      	current master reference to be exported along with all objects
     -	added since its 10th ancestor commit.
     +	added since its 10th ancestor commit and all files common to
    -+	master\~9 and master~10.
    ++	master{tilde}9 and master{tilde}10.
      
      EXAMPLES
      --------
 3:  6983e845b2 <  -:  ---------- fast-export: use value from correct enum
 -:  ---------- >  3:  e19f6b36f9 fast-export: use value from correct enum
 4:  761ba324d5 !  4:  2b305561d5 fast-export: avoid dying when filtering by paths and old tags exist
    @@ -49,18 +49,14 @@
     +	(
     +		cd rewrite_tag_predating_pathspecs &&
     +
    -+		touch ignored &&
    -+		git add ignored &&
     +		test_commit initial &&
     +
     +		git tag -a -m "Some old tag" v0.0.0.0.0.0.1 &&
     +
    -+		echo foo >bar &&
    -+		git add bar &&
    -+		test_commit add-bar &&
    ++		test_commit bar &&
     +
    -+		git fast-export --tag-of-filtered-object=rewrite --all -- bar >output &&
    -+		grep -A 1 refs/tags/v0.0.0.0.0.0.1 output | grep -E ^from.0{40}
    ++		git fast-export --tag-of-filtered-object=rewrite --all -- bar.t >output &&
    ++		grep from.$ZERO_OID output
     +	)
     +'
     +
 5:  64e9f0d360 =  5:  607b1dc2b2 fast-export: move commit rewriting logic into a function for reuse
 6:  fd14d9749a !  6:  ec1862e858 fast-export: when using paths, avoid corrupt stream with non-existent mark
    @@ -54,22 +54,18 @@
     +	(
     +		cd avoid_non_existent_mark &&
     +
    -+		touch important-path &&
    -+		git add important-path &&
    -+		test_commit initial &&
    ++		test_commit important-path &&
     +
    -+		touch ignored &&
    -+		git add ignored &&
    -+		test_commit whatever &&
    ++		test_commit ignored &&
     +
     +		git branch A &&
     +		git branch B &&
     +
    -+		echo foo >>important-path &&
    -+		git add important-path &&
    ++		echo foo >>important-path.t &&
    ++		git add important-path.t &&
     +		test_commit more changes &&
     +
    -+		git fast-export --all -- important-path | git fast-import --force
    ++		git fast-export --all -- important-path.t | git fast-import --force
     +	)
     +'
     +
 7:  4e67a2bc7f !  7:  9da26e3ccb fast-export: ensure we export requested refs
    @@ -21,9 +21,6 @@
         were just silently omitted from being exported despite having been
         explicitly requested for export.
     
    -    NOTE: The usage of string_list should really be replaced with the
    -    strmap proposal, once it materializes.
    -
         Signed-off-by: Elijah Newren <newren@gmail.com>
     
      diff --git a/builtin/fast-export.c b/builtin/fast-export.c
    @@ -41,6 +38,12 @@
      			export_blob(&diff_queued_diff.queue[i]->two->oid);
      
      	refname = *revision_sources_at(&revision_sources, commit);
    ++	/*
    ++	 * FIXME: string_list_remove() below for each ref is overall
    ++	 * O(N^2).  Compared to a history walk and diffing trees, this is
    ++	 * just lost in the noise in practice.  However, theoretically a
    ++	 * repo may have enough refs for this to become slow.
    ++	 */
     +	string_list_remove(&extra_refs, refname, 0);
      	if (anonymize) {
      		refname = anonymize_refname(refname);
 8:  be02337f29 !  8:  7e5fe2f02e fast-export: add --reference-excluded-parents option
    @@ -30,6 +30,15 @@
         repository which already contains the necessary commits (much like the
         restriction imposed when using --no-data).
     
    +    Note from Peff:
    +      I think we might be able to do a little more optimization here. If
    +      we're exporting HEAD^..HEAD and there's an object in HEAD^ which is
    +      unchanged in HEAD, I think we'd still print it (because it would not
    +      be marked SHOWN), but we could omit it (by walking the tree of the
    +      boundary commits and marking them shown).  I don't think it's a
    +      blocker for what you're doing here, but just a possible future
    +      optimization.
    +
         Signed-off-by: Elijah Newren <newren@gmail.com>
     
      diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
    @@ -41,14 +50,15 @@
      
     +--reference-excluded-parents::
     +	By default, running a command such as `git fast-export
    -+	master~5..master` will not include the commit master\~5 and
    -+	will make master\~4 no longer have master\~5 as a parent (though
    -+	both the old master\~4 and new master~4 will have all the same
    -+	files).  Use --reference-excluded-parents to instead have the
    -+	the stream refer to commits in the excluded range of history
    -+	by their sha1sum.  Note that the resulting stream can only be
    -+	used by a repository which already contains the necessary
    -+	parent commits.
    ++	master~5..master` will not include the commit master{tilde}5
    ++	and will make master{tilde}4 no longer have master{tilde}5 as
    ++	a parent (though both the old master{tilde}4 and new
    ++	master{tilde}4 will have all the same files).  Use
    ++	--reference-excluded-parents to instead have the the stream
    ++	refer to commits in the excluded range of history by their
    ++	sha1sum.  Note that the resulting stream can only be used by a
    ++	repository which already contains the necessary parent
    ++	commits.
     +
      --refspec::
      	Apply the specified refspec to each ref exported. Multiple of them can
    @@ -58,10 +68,10 @@
      	to export.  For example, `master~10..master` causes the
      	current master reference to be exported along with all objects
     -	added since its 10th ancestor commit and all files common to
    --	master\~9 and master~10.
    +-	master{tilde}9 and master{tilde}10.
     +	added since its 10th ancestor commit and (unless the
     +	--reference-excluded-parents option is specified) all files
    -+	common to master\~9 and master~10.
    ++	common to master{tilde}9 and master{tilde}10.
      
      EXAMPLES
      --------
 -:  ---------- >  9:  14306a8436 fast-import: remove unmaintained duplicate documentation
 9:  7ab314849d ! 10:  72487a61e4 fast-export: add a --show-original-ids option to show original names
    @@ -2,16 +2,24 @@
     
         fast-export: add a --show-original-ids option to show original names
     
    -    Knowing the original names (hashes) of commits, blobs, and tags can
    -    sometimes enable post-filtering that would otherwise be difficult or
    -    impossible.  In particular, the desire to rewrite commit messages which
    -    refer to other prior commits (on top of whatever other filtering is
    -    being done) is very difficult without knowing the original names of each
    -    commit.
    +    Knowing the original names (hashes) of commits can sometimes enable
    +    post-filtering that would otherwise be difficult or impossible.  In
    +    particular, the desire to rewrite commit messages which refer to other
    +    prior commits (on top of whatever other filtering is being done) is
    +    very difficult without knowing the original names of each commit.
    +
    +    In addition, knowing the original names (hashes) of blobs can allow
    +    filtering by blob-id without requiring re-hashing the content of the
    +    blob, and is thus useful as a small optimization.
    +
    +    Once we add original ids for both commits and blobs, we may as well
    +    add them for tags too for completeness.  Perhaps someone will have a
    +    use for them.
     
         This commit teaches a new --show-original-ids option to fast-export
    -    which will make it add a 'originally <hash>' line to blob, commits, and
    -    tags.  It also teaches fast-import to parse (and ignore) such lines.
    +    which will make it add a 'original-oid <hash>' line to blob, commits,
    +    and tags.  It also teaches fast-import to parse (and ignore) such
    +    lines.
     
         Signed-off-by: Elijah Newren <newren@gmail.com>
     
    @@ -19,12 +27,12 @@
      --- a/Documentation/git-fast-export.txt
      +++ b/Documentation/git-fast-export.txt
     @@
    - 	used by a repository which already contains the necessary
    - 	parent commits.
    + 	repository which already contains the necessary parent
    + 	commits.
      
     +--show-original-ids::
     +	Add an extra directive to the output for commits and blobs,
    -+	`originally <SHA1SUM>`.  While such directives will likely be
    ++	`original-oid <SHA1SUM>`.  While such directives will likely be
     +	ignored by importers such as git-fast-import, it may be useful
     +	for intermediary filters (e.g. for rewriting commit messages
     +	which refer to older commits, or for stripping blobs by id).
    @@ -33,6 +41,54 @@
      	Apply the specified refspec to each ref exported. Multiple of them can
      	be specified.
     
    + diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
    + --- a/Documentation/git-fast-import.txt
    + +++ b/Documentation/git-fast-import.txt
    +@@
    + ....
    + 	'commit' SP <ref> LF
    + 	mark?
    ++	original-oid?
    + 	('author' (SP <name>)? SP LT <email> GT SP <when> LF)?
    + 	'committer' (SP <name>)? SP LT <email> GT SP <when> LF
    + 	data
    +@@
    + to another object simply by reusing the same `<idnum>` in another
    + `mark` command.
    + 
    ++`original-oid`
    ++~~~~~~~~~~~~~~
    ++Provides the name of the object in the original source control system.
    ++fast-import will simply ignore this directive, but filter processes
    ++which operate on and modify the stream before feeding to fast-import
    ++may have uses for this information
    ++
    ++....
    ++	'original-oid' SP <object-identifier> LF
    ++....
    ++
    ++where `<object-identifier>` is any string not containing LF.
    ++
    + `tag`
    + ~~~~~
    + Creates an annotated tag referring to a specific commit.  To create
    +@@
    + ....
    + 	'tag' SP <name> LF
    + 	'from' SP <commit-ish> LF
    ++	original-oid?
    + 	'tagger' (SP <name>)? SP LT <email> GT SP <when> LF
    + 	data
    + ....
    +@@
    + ....
    + 	'blob' LF
    + 	mark?
    ++	original-oid?
    + 	data
    + ....
    + 
    +
      diff --git a/builtin/fast-export.c b/builtin/fast-export.c
      --- a/builtin/fast-export.c
      +++ b/builtin/fast-export.c
    @@ -51,7 +107,7 @@
     -	printf("blob\nmark :%"PRIu32"\ndata %lu\n", last_idnum, size);
     +	printf("blob\nmark :%"PRIu32"\n", last_idnum);
     +	if (show_original_ids)
    -+		printf("originally %s\n", oid_to_hex(oid));
    ++		printf("original-oid %s\n", oid_to_hex(oid));
     +	printf("data %lu\n", size);
      	if (size && fwrite(buf, size, 1, stdout) != 1)
      		die_errno("could not write blob '%s'", oid_to_hex(oid));
    @@ -64,7 +120,7 @@
     -	       refname, last_idnum,
     +	printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
     +	if (show_original_ids)
    -+		printf("originally %s\n", oid_to_hex(&commit->object.oid));
    ++		printf("original-oid %s\n", oid_to_hex(&commit->object.oid));
     +	printf("%.*s\n%.*s\ndata %u\n%s",
      	       (int)(author_end - author), author,
      	       (int)(committer_end - committer), committer,
    @@ -77,7 +133,7 @@
     -	       name, tagged_mark,
     +	printf("tag %s\nfrom :%d\n", name, tagged_mark);
     +	if (show_original_ids)
    -+		printf("originally %s\n", oid_to_hex(&tag->object.oid));
    ++		printf("original-oid %s\n", oid_to_hex(&tag->object.oid));
     +	printf("%.*s%sdata %d\n%.*s\n",
      	       (int)(tagger_end - tagger), tagger,
      	       tagger == tagger_end ? "" : "\n",
    @@ -96,44 +152,13 @@
      --- a/fast-import.c
      +++ b/fast-import.c
     @@
    - 
    -   new_blob ::= 'blob' lf
    -     mark?
    -+    originally?
    -     file_content;
    -   file_content ::= data;
    - 
    -   new_commit ::= 'commit' sp ref_str lf
    -     mark?
    -+    originally?
    -     ('author' (sp name)? sp '<' email '>' sp when lf)?
    -     'committer' (sp name)? sp '<' email '>' sp when lf
    -     commit_msg
    -@@
    - 
    -   new_tag ::= 'tag' sp tag_str lf
    -     'from' sp commit-ish lf
    -+    originally?
    -     ('tagger' (sp name)? sp '<' email '>' sp when lf)?
    -     tag_msg;
    -   tag_msg ::= data;
    -@@
    -   data ::= (delimited_data | exact_data)
    -     lf?;
    - 
    -+  originally ::= 'originally' sp not_lf+ lf
    -+
    -     # note: delim may be any string but must not contain lf.
    -     # data_line may contain any data but must not be exactly
    -     # delim.
    -@@
      		next_mark = 0;
      }
      
     +static void parse_original_identifier(void)
     +{
     +	const char *v;
    -+	if (skip_prefix(command_buf.buf, "originally ", &v))
    ++	if (skip_prefix(command_buf.buf, "original-oid ", &v))
     +		read_next_command();
     +}
     +
    @@ -160,7 +185,7 @@
      		die("Invalid ref name or SHA1 expression: %s", from);
      	read_next_command();
      
    -+	/* originally ... */
    ++	/* original-oid ... */
     +	parse_original_identifier();
     +
      	/* tagger ... */
    @@ -177,7 +202,7 @@
     +test_expect_success 'fast-export --show-original-ids' '
     +
     +	git fast-export --show-original-ids master >output &&
    -+	grep ^originally output| sed -e s/^originally.// | sort >actual &&
    ++	grep ^original-oid output| sed -e s/^original-oid.// | sort >actual &&
     +	git rev-list --objects master muss >objects-and-names &&
     +	awk "{print \$1}" objects-and-names | sort >commits-trees-blobs &&
     +	comm -23 actual commits-trees-blobs >unfound &&
10:  82735bcbde ! 11:  1796373474 fast-export: add --always-show-modify-after-rename
    @@ -2,29 +2,53 @@
     
         fast-export: add --always-show-modify-after-rename
     
    -    fast-export output is traditionally used as an input to a fast-import
    -    program, but it is also useful to help gather statistics about the
    -    history of a repository (particularly when --no-data is also passed).
    -    For example, two of the types of information we may want to collect
    -    could include:
    -      1) general information about renames that have occurred
    -      2) what the biggest objects in a repository are and what names
    -         they appear under.
    +    I wanted a way to gather all the following information efficiently
    +    (with as few history traversals as possible):
    +      * Get all blob sizes
    +      * Map blob shas to filename(s) they appeared under in the history
    +      * Find when files and directories were deleted (and whether they
    +        were later reinstated, since that means they aren't actually gone)
    +      * Find sets of filenames referring to the same logical 'file'. (e.g.
    +        foo->bar in commit A and bar->baz in commit B mean that
    +        {foo,bar,baz} refer to the same 'file', so someone wanting to just
    +        "keep baz and its history" need all versions of those three
    +        filenames).  I need to know about things like another foo or bar
    +        being introduced after the rename though, since that breaks the
    +        connection between filenames)
    +    and then I would generate various aggregations on the data and display
    +    some type of report for the user.
     
    -    The first bit of information can be gathered by just passing -M to
    -    fast-export.  The second piece of information can partially be gotten
    -    from running
    -        git cat-file --batch-check --batch-all-objects
    -    However, that only shows what the biggest objects in the repository are
    -    and their sizes, not what names those objects appear as or what commits
    -    they were introduced in.  We can get that information from fast-export,
    -    but when we only see
    +    The only way I know of to get blob sizes is via
    +      cat-file --batch-all-objects --batch-check
    +
    +    The rest of the data would traditionally be gathered from a log command,
    +    e.g.
    +
    +      git log --format='%H%n%P%n%cd' --date=short --topo-order --reverse \
    +          -M --diff-filter=RAMD --no-abbrev --raw -c
    +
    +    however, parsing log output seems slightly dangerous given that it is a
    +    porcelain command.  While we have specified --format and --raw to try
    +    to avoid the most obvious problems, I'm still slightly concerned about
    +    --date=short, the combinations of --raw and -c, options that might
    +    colorize the output, and also the --diff-filter (there is no current
    +    option named --no-find-copies or --no-break-rewrites, but what if those
    +    turn on by default in the future much as we changed the default with
    +    detecting renames?).  Each of those is a small worry, but they add up.
    +
    +    A command meant for data serialization, such as fast-export, seems like
    +    a better candidate for this job.  There's just one missing item: in
    +    order to connect blob sizes to filenames, I need fast-export to tell me
    +    the blob sha1sum of any file changes.  It does this for modifies, but
    +    not always for renames.  In particular, if a file is a 100% rename, it
    +    only prints
             R oldname newname
         instead of
             R oldname newname
             M 100644 $SHA1 newname
    -    then it makes the job more difficult.  Add an option which allows us to
    -    force the latter output even when commits have exact renames of files.
    +    as occurs when there is a rename+modify.  Add an option which allows us
    +    to force the latter output even when commits have exact renames of
    +    files.
     
         Signed-off-by: Elijah Newren <newren@gmail.com>

-- 
2.19.1.1063.g2b8e4a4f82.dirty
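
To illustrate how an intermediary filter might consume the `original-oid` lines from patch 10/11, here is a minimal Python sketch. It assumes the stream only uses the exact-byte-count `data <n>` form (the delimited `data <<EOF` form is not handled), and the function name `original_oids` and the sample stream are made up for the example.

```python
import io


def original_oids(stream):
    """Yield (object_kind, original_oid) pairs from a fast-export stream.

    `stream` is a binary file object.  Payloads of `data <n>` blocks are
    skipped so that their contents are never mistaken for directives.
    """
    kind = None
    while True:
        line = stream.readline()
        if not line:
            break
        if line.startswith(b"data "):
            stream.read(int(line[5:].strip()))  # skip the payload bytes
            continue
        word = line.split(b" ", 1)[0].rstrip(b"\n")
        if word in (b"blob", b"commit", b"tag"):
            kind = word.decode()
        elif word == b"original-oid" and kind is not None:
            yield kind, line[len(b"original-oid "):].rstrip(b"\n").decode()


SAMPLE = (
    b"blob\nmark :1\noriginal-oid " + b"a" * 40 + b"\ndata 6\nhello\n\n"
    b"commit refs/heads/master\nmark :2\noriginal-oid " + b"b" * 40 + b"\n"
    b"committer A U Thor <author@example.com> 0 +0000\ndata 4\nmsg\n"
    b"M 100644 :1 file\n\n"
    b"tag v1\nfrom :2\noriginal-oid " + b"c" * 40 + b"\n"
    b"tagger A U Thor <author@example.com> 0 +0000\ndata 0\n"
)

for kind, oid in original_oids(io.BytesIO(SAMPLE)):
    print(kind, oid)
```

A real filter would more likely read from a pipe attached to `git fast-export --show-original-ids` and pass the (possibly modified) lines on to fast-import, which ignores the directive.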


* [PATCH v2 01/11] git-fast-import.txt: fix documentation for --quiet option
  2018-11-14  0:25         ` [PATCH v2 00/11] " Elijah Newren
@ 2018-11-14  0:25           ` Elijah Newren
  2018-11-14  0:25           ` [PATCH v2 02/11] git-fast-export.txt: clarify misleading documentation about rev-list args Elijah Newren
                             ` (11 subsequent siblings)
  12 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-14  0:25 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, gitster,
	Elijah Newren

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-import.txt | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index e81117d27f..7ab97745a6 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -40,9 +40,10 @@ OPTIONS
 	not contain the old commit).
 
 --quiet::
-	Disable all non-fatal output, making fast-import silent when it
-	is successful.  This option disables the output shown by
-	--stats.
+	Disable the output shown by --stats, making fast-import usually
+	be silent when it is successful.  However, if the import stream
+	has directives intended to show user output (e.g. `progress`
+	directives), the corresponding messages will still be shown.
 
 --stats::
 	Display some basic statistics about the objects fast-import has
-- 
2.19.1.1063.g2b8e4a4f82.dirty



* [PATCH v2 02/11] git-fast-export.txt: clarify misleading documentation about rev-list args
  2018-11-14  0:25         ` [PATCH v2 00/11] " Elijah Newren
  2018-11-14  0:25           ` [PATCH v2 01/11] git-fast-import.txt: fix documentation for --quiet option Elijah Newren
@ 2018-11-14  0:25           ` Elijah Newren
  2018-11-14  0:25           ` [PATCH v2 03/11] fast-export: use value from correct enum Elijah Newren
                             ` (10 subsequent siblings)
  12 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-14  0:25 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, gitster,
	Elijah Newren

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-export.txt | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index ce954be532..fda55b3284 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -119,7 +119,8 @@ marks the same across runs.
 	'git rev-list', that specifies the specific objects and references
 	to export.  For example, `master~10..master` causes the
 	current master reference to be exported along with all objects
-	added since its 10th ancestor commit.
+	added since its 10th ancestor commit and all files common to
+	master{tilde}9 and master{tilde}10.
 
 EXAMPLES
 --------
-- 
2.19.1.1063.g2b8e4a4f82.dirty



* [PATCH v2 03/11] fast-export: use value from correct enum
  2018-11-14  0:25         ` [PATCH v2 00/11] " Elijah Newren
  2018-11-14  0:25           ` [PATCH v2 01/11] git-fast-import.txt: fix documentation for --quiet option Elijah Newren
  2018-11-14  0:25           ` [PATCH v2 02/11] git-fast-export.txt: clarify misleading documentation about rev-list args Elijah Newren
@ 2018-11-14  0:25           ` Elijah Newren
  2018-11-14  0:25           ` [PATCH v2 04/11] fast-export: avoid dying when filtering by paths and old tags exist Elijah Newren
                             ` (9 subsequent siblings)
  12 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-14  0:25 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, gitster,
	Elijah Newren

ABORT and ERROR happen to have the same value, but come from different
enums.  Use the one from the correct enum, and while at it, rename the
values to avoid such problems.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 456797c12a..af724e9937 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -31,8 +31,8 @@ static const char *fast_export_usage[] = {
 };
 
 static int progress;
-static enum { ABORT, VERBATIM, WARN, WARN_STRIP, STRIP } signed_tag_mode = ABORT;
-static enum { ERROR, DROP, REWRITE } tag_of_filtered_mode = ERROR;
+static enum { SIGNED_TAG_ABORT, VERBATIM, WARN, WARN_STRIP, STRIP } signed_tag_mode = SIGNED_TAG_ABORT;
+static enum { TAG_FILTERING_ABORT, DROP, REWRITE } tag_of_filtered_mode = TAG_FILTERING_ABORT;
 static int fake_missing_tagger;
 static int use_done_feature;
 static int no_data;
@@ -46,7 +46,7 @@ static int parse_opt_signed_tag_mode(const struct option *opt,
 				     const char *arg, int unset)
 {
 	if (unset || !strcmp(arg, "abort"))
-		signed_tag_mode = ABORT;
+		signed_tag_mode = SIGNED_TAG_ABORT;
 	else if (!strcmp(arg, "verbatim") || !strcmp(arg, "ignore"))
 		signed_tag_mode = VERBATIM;
 	else if (!strcmp(arg, "warn"))
@@ -64,7 +64,7 @@ static int parse_opt_tag_of_filtered_mode(const struct option *opt,
 					  const char *arg, int unset)
 {
 	if (unset || !strcmp(arg, "abort"))
-		tag_of_filtered_mode = ERROR;
+		tag_of_filtered_mode = TAG_FILTERING_ABORT;
 	else if (!strcmp(arg, "drop"))
 		tag_of_filtered_mode = DROP;
 	else if (!strcmp(arg, "rewrite"))
@@ -727,7 +727,7 @@ static void handle_tag(const char *name, struct tag *tag)
 					       "\n-----BEGIN PGP SIGNATURE-----\n");
 		if (signature)
 			switch(signed_tag_mode) {
-			case ABORT:
+			case SIGNED_TAG_ABORT:
 				die("encountered signed tag %s; use "
 				    "--signed-tags=<mode> to handle it",
 				    oid_to_hex(&tag->object.oid));
@@ -752,7 +752,7 @@ static void handle_tag(const char *name, struct tag *tag)
 	tagged_mark = get_object_mark(tagged);
 	if (!tagged_mark) {
 		switch(tag_of_filtered_mode) {
-		case ABORT:
+		case TAG_FILTERING_ABORT:
 			die("tag %s tags unexported object; use "
 			    "--tag-of-filtered-object=<mode> to handle it",
 			    oid_to_hex(&tag->object.oid));
-- 
2.19.1.1063.g2b8e4a4f82.dirty



* [PATCH v2 04/11] fast-export: avoid dying when filtering by paths and old tags exist
  2018-11-14  0:25         ` [PATCH v2 00/11] " Elijah Newren
                             ` (2 preceding siblings ...)
  2018-11-14  0:25           ` [PATCH v2 03/11] fast-export: use value from correct enum Elijah Newren
@ 2018-11-14  0:25           ` Elijah Newren
  2018-11-14 19:17             ` SZEDER Gábor
  2018-11-14  0:25           ` [PATCH v2 05/11] fast-export: move commit rewriting logic into a function for reuse Elijah Newren
                             ` (8 subsequent siblings)
  12 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-14  0:25 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, gitster,
	Elijah Newren

If --tag-of-filtered-object=rewrite is specified along with a set of
paths to limit what is exported, then any tags pointing to old commits
that do not contain any of those specified paths cause problems.  Since
the old tagged commit is not exported, fast-export attempts to rewrite
such tags to an ancestor commit which was exported.  If no such commit
exists, then fast-export currently die()s.  Five years after the tag
rewriting logic was added to fast-export (see commit 2d8ad4691921,
"fast-export: Add a --tag-of-filtered-object option for newly dangling
tags", 2009-06-25), fast-import gained the ability to delete refs (see
commit 4ee1b225b99f, "fast-import: add support to delete refs",
2014-04-20), so now we do have a valid option to rewrite the tag to.
Delete these tags instead of dying.
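
For anyone post-processing the stream, the deletion appears as a `reset`
directive whose `from` line carries the all-zero object ID. Below is a
minimal Python sketch of spotting such deletions. The helper name
`deleted_refs` and the sample stream are illustrative assumptions, not
part of this patch, and the scan is line-based: it does not skip `data`
payloads, so it only suits streams like the small one shown.

```python
ZERO_OID = "0" * 40  # the null SHA-1, as printed by sha1_to_hex(null_sha1)


def deleted_refs(stream: str) -> list[str]:
    """Return the refs that the stream deletes via 'reset' + 'from <zero oid>'."""
    deleted = []
    lines = stream.split("\n")
    for i, line in enumerate(lines[:-1]):
        # A deletion is a reset directive immediately followed by the zero OID.
        if line.startswith("reset ") and lines[i + 1] == f"from {ZERO_OID}":
            deleted.append(line[len("reset "):])
    return deleted


# Hypothetical stream: one tag filtered to nothing, one branch kept.
sample = (
    "reset refs/tags/v0.0.0.0.0.0.1\n"
    f"from {ZERO_OID}\n"
    "\n"
    "reset refs/heads/master\n"
    "from :2\n"
    "\n"
)

print(deleted_refs(sample))
```

Only the tag is reported; the branch reset points at a mark and is left
alone.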

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c  |  9 ++++++---
 t/t9350-fast-export.sh | 16 ++++++++++++++++
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index af724e9937..b984a44224 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -774,9 +774,12 @@ static void handle_tag(const char *name, struct tag *tag)
 					break;
 				if (!(p->object.flags & TREESAME))
 					break;
-				if (!p->parents)
-					die("can't find replacement commit for tag %s",
-					     oid_to_hex(&tag->object.oid));
+				if (!p->parents) {
+					printf("reset %s\nfrom %s\n\n",
+					       name, sha1_to_hex(null_sha1));
+					free(buf);
+					return;
+				}
 				p = p->parents->item;
 			}
 			tagged_mark = get_object_mark(&p->object);
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 6a392e87bc..3400ebeb51 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -325,6 +325,22 @@ test_expect_success 'rewriting tag of filtered out object' '
 )
 '
 
+test_expect_success 'rewrite tag predating pathspecs to nothing' '
+	test_create_repo rewrite_tag_predating_pathspecs &&
+	(
+		cd rewrite_tag_predating_pathspecs &&
+
+		test_commit initial &&
+
+		git tag -a -m "Some old tag" v0.0.0.0.0.0.1 &&
+
+		test_commit bar &&
+
+		git fast-export --tag-of-filtered-object=rewrite --all -- bar.t >output &&
+		grep from.$ZERO_OID output
+	)
+'
+
 cat > limit-by-paths/expected << EOF
 blob
 mark :1
-- 
2.19.1.1063.g2b8e4a4f82.dirty



* [PATCH v2 05/11] fast-export: move commit rewriting logic into a function for reuse
  2018-11-14  0:25         ` [PATCH v2 00/11] " Elijah Newren
                             ` (3 preceding siblings ...)
  2018-11-14  0:25           ` [PATCH v2 04/11] fast-export: avoid dying when filtering by paths and old tags exist Elijah Newren
@ 2018-11-14  0:25           ` Elijah Newren
  2018-11-14  0:25           ` [PATCH v2 06/11] fast-export: when using paths, avoid corrupt stream with non-existent mark Elijah Newren
                             ` (7 subsequent siblings)
  12 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-14  0:25 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, gitster,
	Elijah Newren

Logic to replace a filtered commit with an unfiltered ancestor is useful
elsewhere; put it into a function we can call.
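
The walk that is being factored out can be modelled outside of git in a
few lines. The following Python sketch mirrors the loop in
rewrite_commit(); the `Commit` class and its boolean fields are
stand-ins (assumptions) for `struct commit` and the UNINTERESTING and
TREESAME object flags, nothing more.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Commit:
    name: str
    parents: list["Commit"] = field(default_factory=list)
    uninteresting: bool = False  # stand-in for the UNINTERESTING flag
    treesame: bool = False       # stand-in for the TREESAME flag


def rewrite_commit(p: Commit) -> Optional[Commit]:
    """Follow first parents until a commit that would be kept; None if none is."""
    while True:
        if len(p.parents) > 1:   # merge commit: stop here
            return p
        if p.uninteresting:      # excluded from the walk: stop here
            return p
        if not p.treesame:       # touches a relevant path: stop here
            return p
        if not p.parents:        # filtered root: nothing left to point at
            return None
        p = p.parents[0]


# Hypothetical history: root touches the path, the two later commits do not.
root = Commit("root")
mid = Commit("mid", [root], treesame=True)
tip = Commit("tip", [mid], treesame=True)
lonely = Commit("lonely", treesame=True)

print(rewrite_commit(tip).name)
print(rewrite_commit(lonely))
```

A caller that gets `None` back has no replacement commit, which is the
case handle_tag() turns into a ref deletion.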

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c | 37 ++++++++++++++++++++++---------------
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index b984a44224..7888fc98b5 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -187,6 +187,22 @@ static int get_object_mark(struct object *object)
 	return ptr_to_mark(decoration);
 }
 
+static struct commit *rewrite_commit(struct commit *p)
+{
+	for (;;) {
+		if (p->parents && p->parents->next)
+			break;
+		if (p->object.flags & UNINTERESTING)
+			break;
+		if (!(p->object.flags & TREESAME))
+			break;
+		if (!p->parents)
+			return NULL;
+		p = p->parents->item;
+	}
+	return p;
+}
+
 static void show_progress(void)
 {
 	static int counter = 0;
@@ -766,21 +782,12 @@ static void handle_tag(const char *name, struct tag *tag)
 				    oid_to_hex(&tag->object.oid),
 				    type_name(tagged->type));
 			}
-			p = (struct commit *)tagged;
-			for (;;) {
-				if (p->parents && p->parents->next)
-					break;
-				if (p->object.flags & UNINTERESTING)
-					break;
-				if (!(p->object.flags & TREESAME))
-					break;
-				if (!p->parents) {
-					printf("reset %s\nfrom %s\n\n",
-					       name, sha1_to_hex(null_sha1));
-					free(buf);
-					return;
-				}
-				p = p->parents->item;
+			p = rewrite_commit((struct commit *)tagged);
+			if (!p) {
+				printf("reset %s\nfrom %s\n\n",
+				       name, sha1_to_hex(null_sha1));
+				free(buf);
+				return;
 			}
 			tagged_mark = get_object_mark(&p->object);
 		}
-- 
2.19.1.1063.g2b8e4a4f82.dirty



* [PATCH v2 06/11] fast-export: when using paths, avoid corrupt stream with non-existent mark
  2018-11-14  0:25         ` [PATCH v2 00/11] " Elijah Newren
                             ` (4 preceding siblings ...)
  2018-11-14  0:25           ` [PATCH v2 05/11] fast-export: move commit rewriting logic into a function for reuse Elijah Newren
@ 2018-11-14  0:25           ` Elijah Newren
  2018-11-14  0:25           ` [PATCH v2 07/11] fast-export: ensure we export requested refs Elijah Newren
                             ` (6 subsequent siblings)
  12 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-14  0:25 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, gitster,
	Elijah Newren

If file paths are specified to fast-export and multiple refs point to a
commit that does not touch any of the relevant file paths, then
fast-export can hit problems.  fast-export has a list of additional refs
that it needs to explicitly set after exporting all blobs and commits,
and when it tries to get_object_mark() on the relevant commit, it can
get a mark of 0, i.e. "not found", because the commit in question did
not touch the relevant paths and thus was not exported.  Trying to
import a stream with a mark corresponding to an unexported object will
cause fast-import to crash.

Avoid this problem by taking the commit the ref points to and finding an
ancestor of it that was exported, and make the ref point to that commit
instead.
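
To check a stream for this kind of corruption before handing it to
fast-import, one can look for `from`/`merge` lines that name a mark no
earlier `mark` directive defined. The Python sketch below does that.
The function name is hypothetical, it assumes the exact-length
`data <n>` form that fast-export emits (not the delimited form), and it
does not check blob marks on `M` lines.

```python
import io
import re


def undefined_mark_refs(stream: bytes) -> list[tuple[str, int]]:
    """Return (directive, mark) pairs that reference a mark not yet defined."""
    defined = set()
    problems = []
    src = io.BytesIO(stream)
    for raw in iter(src.readline, b""):
        line = raw.rstrip(b"\n")
        m = re.fullmatch(rb"data (\d+)", line)
        if m:
            # Skip the payload so its bytes are never parsed as directives.
            src.read(int(m.group(1)))
            continue
        m = re.fullmatch(rb"mark :(\d+)", line)
        if m:
            defined.add(int(m.group(1)))
            continue
        m = re.fullmatch(rb"(from|merge) :(\d+)", line)
        if m and int(m.group(2)) not in defined:
            problems.append((m.group(1).decode(), int(m.group(2))))
    return problems


# Hypothetical streams: 'bad' resets a branch to mark 0, i.e. "not found".
good = (
    b"blob\nmark :1\ndata 4\nfoo\n\n"
    b"commit refs/heads/master\nmark :2\n"
    b"committer A <a@example.com> 0 +0000\ndata 3\nmsg\n"
    b"M 100644 :1 important-path.t\n\n"
    b"reset refs/heads/A\nfrom :2\n\n"
)
bad = good + b"reset refs/heads/B\nfrom :0\n\n"

print(undefined_mark_refs(good))
print(undefined_mark_refs(bad))
```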

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c  | 13 ++++++++++++-
 t/t9350-fast-export.sh | 20 ++++++++++++++++++++
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 7888fc98b5..2eafe351ea 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -900,7 +900,18 @@ static void handle_tags_and_duplicates(void)
 			if (anonymize)
 				name = anonymize_refname(name);
 			/* create refs pointing to already seen commits */
-			commit = (struct commit *)object;
+			commit = rewrite_commit((struct commit *)object);
+			if (!commit) {
+				/*
+				 * Neither this object nor any of its
+				 * ancestors touch any relevant paths, so
+				 * it has been filtered to nothing.  Delete
+				 * it.
+				 */
+				printf("reset %s\nfrom %s\n\n",
+				       name, sha1_to_hex(null_sha1));
+				continue;
+			}
 			printf("reset %s\nfrom :%d\n\n", name,
 			       get_object_mark(&commit->object));
 			show_progress();
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 3400ebeb51..299120ba70 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -382,6 +382,26 @@ test_expect_success 'path limiting with import-marks does not lose unmodified fi
 	grep file0 actual
 '
 
+test_expect_success 'avoid corrupt stream with non-existent mark' '
+	test_create_repo avoid_non_existent_mark &&
+	(
+		cd avoid_non_existent_mark &&
+
+		test_commit important-path &&
+
+		test_commit ignored &&
+
+		git branch A &&
+		git branch B &&
+
+		echo foo >>important-path.t &&
+		git add important-path.t &&
+		test_commit more changes &&
+
+		git fast-export --all -- important-path.t | git fast-import --force
+	)
+'
+
 test_expect_success 'full-tree re-shows unmodified files'        '
 	git checkout -f simple &&
 	git fast-export --full-tree simple >actual &&
-- 
2.19.1.1063.g2b8e4a4f82.dirty



* [PATCH v2 07/11] fast-export: ensure we export requested refs
  2018-11-14  0:25         ` [PATCH v2 00/11] " Elijah Newren
                             ` (5 preceding siblings ...)
  2018-11-14  0:25           ` [PATCH v2 06/11] fast-export: when using paths, avoid corrupt stream with non-existent mark Elijah Newren
@ 2018-11-14  0:25           ` Elijah Newren
  2018-11-14  0:25           ` [PATCH v2 08/11] fast-export: add --reference-excluded-parents option Elijah Newren
                             ` (5 subsequent siblings)
  12 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-14  0:25 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, gitster,
	Elijah Newren

If file paths are specified to fast-export and a ref points to a commit
that does not touch any of the relevant paths, then that ref would
sometimes fail to be exported.  (This depends on whether any ancestors
of the commit which do touch the relevant paths would be exported with
that same ref name or a different ref name.)  To avoid this problem,
put *all* specified refs into extra_refs to start, and then as we export
each commit, remove the refname used in the 'commit $REFNAME' directive
from extra_refs.  Then, in handle_tags_and_duplicates() we know which
refs actually do need a manual reset directive in order to be included.

This means that we do need some special handling for excluded refs; e.g.
if someone runs
   git fast-export ^master master
then they've asked for master to be exported, but they have also asked
for the commit which master points to and all of its history to be
excluded.  That logically means ref deletion.  Previously, such refs
were just silently omitted from being exported despite having been
explicitly requested for export.
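
The bookkeeping described above amounts to a set difference: start with
every requested ref, strike each one that gets used in a
'commit $REFNAME' directive, and whatever remains needs a manual reset.
A rough Python model, where the function and argument names are
illustrative assumptions rather than anything in fast-export.c:

```python
def refs_needing_reset(requested_refs, commit_refnames):
    """Return the requested refs that no 'commit <ref>' directive covered.

    requested_refs:  refs named on the command line (duplicates allowed).
    commit_refnames: the ref used by each emitted commit directive, in order.
    """
    # Mirrors string_list_sort() + string_list_remove_duplicates().
    extra = sorted(set(requested_refs))
    for refname in commit_refnames:
        # Mirrors string_list_remove() in handle_commit().
        if refname in extra:
            extra.remove(refname)
    return extra


# Two branches point at a commit that was exported under 'master'.
print(refs_needing_reset(
    ["refs/heads/master", "refs/heads/A", "refs/heads/B"],
    ["refs/heads/master", "refs/heads/master"],
))

# 'git fast-export ^master master': no commit is emitted at all.
print(refs_needing_reset(["refs/heads/master"], []))
```

In the second case the leftover ref has no exported commit to point at,
which is what ends up as a deletion in the stream.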

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c  | 54 ++++++++++++++++++++++++++++++++----------
 t/t9350-fast-export.sh | 16 ++++++++++---
 2 files changed, 55 insertions(+), 15 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 2eafe351ea..2fef00436b 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -38,6 +38,7 @@ static int use_done_feature;
 static int no_data;
 static int full_tree;
 static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
+static struct string_list tag_refs = STRING_LIST_INIT_NODUP;
 static struct refspec refspecs = REFSPEC_INIT_FETCH;
 static int anonymize;
 static struct revision_sources revision_sources;
@@ -611,6 +612,13 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 			export_blob(&diff_queued_diff.queue[i]->two->oid);
 
 	refname = *revision_sources_at(&revision_sources, commit);
+	/*
+	 * FIXME: string_list_remove() below for each ref is overall
+	 * O(N^2).  Compared to a history walk and diffing trees, this is
+	 * just lost in the noise in practice.  However, theoretically a
+	 * repo may have enough refs for this to become slow.
+	 */
+	string_list_remove(&extra_refs, refname, 0);
 	if (anonymize) {
 		refname = anonymize_refname(refname);
 		anonymize_ident_line(&committer, &committer_end);
@@ -814,7 +822,7 @@ static struct commit *get_commit(struct rev_cmdline_entry *e, char *full_name)
 		/* handle nested tags */
 		while (tag && tag->object.type == OBJ_TAG) {
 			parse_object(the_repository, &tag->object.oid);
-			string_list_append(&extra_refs, full_name)->util = tag;
+			string_list_append(&tag_refs, full_name)->util = tag;
 			tag = (struct tag *)tag->tagged;
 		}
 		if (!tag)
@@ -873,25 +881,30 @@ static void get_tags_and_duplicates(struct rev_cmdline_info *info)
 		}
 
 		/*
-		 * This ref will not be updated through a commit, lets make
-		 * sure it gets properly updated eventually.
+		 * Make sure this ref gets properly updated eventually, whether
+		 * through a commit or manually at the end.
 		 */
-		if (*revision_sources_at(&revision_sources, commit) ||
-		    commit->object.flags & SHOWN)
+		if (e->item->type != OBJ_TAG)
 			string_list_append(&extra_refs, full_name)->util = commit;
+
 		if (!*revision_sources_at(&revision_sources, commit))
 			*revision_sources_at(&revision_sources, commit) = full_name;
 	}
+
+	string_list_sort(&extra_refs);
+	string_list_remove_duplicates(&extra_refs, 0);
 }
 
-static void handle_tags_and_duplicates(void)
+static void handle_tags_and_duplicates(struct string_list *extras)
 {
 	struct commit *commit;
 	int i;
 
-	for (i = extra_refs.nr - 1; i >= 0; i--) {
-		const char *name = extra_refs.items[i].string;
-		struct object *object = extra_refs.items[i].util;
+	for (i = extras->nr - 1; i >= 0; i--) {
+		const char *name = extras->items[i].string;
+		struct object *object = extras->items[i].util;
+		int mark;
+
 		switch (object->type) {
 		case OBJ_TAG:
 			handle_tag(name, (struct tag *)object);
@@ -912,8 +925,24 @@ static void handle_tags_and_duplicates(void)
 				       name, sha1_to_hex(null_sha1));
 				continue;
 			}
-			printf("reset %s\nfrom :%d\n\n", name,
-			       get_object_mark(&commit->object));
+
+			mark = get_object_mark(&commit->object);
+			if (!mark) {
+				/*
+				 * Getting here means we have a commit which
+				 * was excluded by a negative refspec (e.g.
+				 * fast-export ^master master).  If the user
+				 * wants the branch exported but every commit
+				 * in its history to be deleted, that sounds
+				 * like a ref deletion to me.
+				 */
+				printf("reset %s\nfrom %s\n\n",
+				       name, sha1_to_hex(null_sha1));
+				continue;
+			}
+
+			printf("reset %s\nfrom :%d\n\n", name, mark
+			       );
 			show_progress();
 			break;
 		}
@@ -1101,7 +1130,8 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		}
 	}
 
-	handle_tags_and_duplicates();
+	handle_tags_and_duplicates(&extra_refs);
+	handle_tags_and_duplicates(&tag_refs);
 	handle_deletes();
 
 	if (export_filename && lastimportid != last_idnum)
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 299120ba70..50c2fceef4 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -544,10 +544,20 @@ test_expect_success 'use refspec' '
 	test_cmp expected actual
 '
 
-test_expect_success 'delete refspec' '
+test_expect_success 'delete ref because entire history excluded' '
 	git branch to-delete &&
-	git fast-export --refspec :refs/heads/to-delete to-delete ^to-delete > actual &&
-	cat > expected <<-EOF &&
+	git fast-export to-delete ^to-delete >actual &&
+	cat >expected <<-EOF &&
+	reset refs/heads/to-delete
+	from 0000000000000000000000000000000000000000
+
+	EOF
+	test_cmp expected actual
+'
+
+test_expect_success 'delete refspec' '
+	git fast-export --refspec :refs/heads/to-delete >actual &&
+	cat >expected <<-EOF &&
 	reset refs/heads/to-delete
 	from 0000000000000000000000000000000000000000
 
-- 
2.19.1.1063.g2b8e4a4f82.dirty



* [PATCH v2 08/11] fast-export: add --reference-excluded-parents option
  2018-11-14  0:25         ` [PATCH v2 00/11] " Elijah Newren
                             ` (6 preceding siblings ...)
  2018-11-14  0:25           ` [PATCH v2 07/11] fast-export: ensure we export requested refs Elijah Newren
@ 2018-11-14  0:25           ` Elijah Newren
  2018-11-14 19:27             ` SZEDER Gábor
  2018-11-14  0:25           ` [PATCH v2 09/11] fast-import: remove unmaintained duplicate documentation Elijah Newren
                             ` (4 subsequent siblings)
  12 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-14  0:25 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, gitster,
	Elijah Newren

git filter-branch has a nifty feature allowing you to rewrite, e.g., just
the last 8 commits of a linear history:
  git filter-branch $OPTIONS HEAD~8..HEAD

If you try the same with git fast-export, you instead get a history of
only 8 commits, with HEAD~7 being rewritten into a root commit.  There
are two alternatives:

  1) Don't use the negative revision specification, and when you're
     filtering the output to make modifications to the last 8 commits,
     just be careful to not modify any earlier commits somehow.

  2) First run 'git fast-export --export-marks=somefile HEAD~8', then
     run 'git fast-export --import-marks=somefile HEAD~8..HEAD'.

Both are more error-prone than I'd like (the first for obvious reasons;
with the second option I have sometimes accidentally included too many
revisions in the first command and then found that the corresponding
extra revisions were not exported by the second command and thus were
not modified as I expected).  Also, both are poor from a performance
perspective.

Add a new --reference-excluded-parents option which will cause
fast-export to refer to commits outside the specified rev-list-args
range by their sha1sum.  Such a stream will only be useful in a
repository which already contains the necessary commits (much like the
restriction imposed when using --no-data).
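
The effect on the parent lines of each commit can be modelled as
follows. This Python sketch mirrors the parent loop in handle_commit()
under simplifying assumptions: `marks` is a plain dict standing in for
get_object_mark(), object IDs are hex strings, and --anonymize is
ignored. The function name is hypothetical.

```python
def parent_lines(parents, marks, reference_excluded_parents=False):
    """Return the 'from'/'merge' lines emitted for one commit.

    parents: parent object IDs (hex strings), first parent first.
    marks:   maps object ID -> mark number for commits already exported.
    """
    lines = []
    for oid in parents:
        mark = marks.get(oid, 0)  # 0 means "not exported", as in fast-export
        if not mark and not reference_excluded_parents:
            continue              # default: the excluded parent is dropped
        keyword = "from" if not lines else "merge"
        target = f":{mark}" if mark else oid
        lines.append(f"{keyword} {target}")
    return lines


# Hypothetical IDs: p1 was exported as mark 5, p2 lies outside the range.
p1, p2 = "a" * 40, "b" * 40
marks = {p1: 5}

print(parent_lines([p2], marks))
print(parent_lines([p2], marks, reference_excluded_parents=True))
print(parent_lines([p1, p2], marks, reference_excluded_parents=True))
```

Without the option the first call yields no parent line at all, so the
commit becomes a root; with it, the excluded parent is named by hash.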

Note from Peff:
  I think we might be able to do a little more optimization here. If
  we're exporting HEAD^..HEAD and there's an object in HEAD^ which is
  unchanged in HEAD, I think we'd still print it (because it would not
  be marked SHOWN), but we could omit it (by walking the tree of the
  boundary commits and marking them shown).  I don't think it's a
  blocker for what you're doing here, but just a possible future
  optimization.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-export.txt | 17 +++++++++++--
 builtin/fast-export.c             | 42 +++++++++++++++++++++++--------
 t/t9350-fast-export.sh            | 11 ++++++++
 3 files changed, 58 insertions(+), 12 deletions(-)

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index fda55b3284..f65026662a 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -110,6 +110,18 @@ marks the same across runs.
 	the shape of the history and stored tree.  See the section on
 	`ANONYMIZING` below.
 
+--reference-excluded-parents::
+	By default, running a command such as `git fast-export
+	master~5..master` will not include the commit master{tilde}5
+	and will make master{tilde}4 no longer have master{tilde}5 as
+	a parent (though both the old master{tilde}4 and new
+	master{tilde}4 will have all the same files).  Use
+	--reference-excluded-parents to instead have the stream
+	refer to commits in the excluded range of history by their
+	sha1sum.  Note that the resulting stream can only be used by a
+	repository which already contains the necessary parent
+	commits.
+
 --refspec::
 	Apply the specified refspec to each ref exported. Multiple of them can
 	be specified.
@@ -119,8 +131,9 @@ marks the same across runs.
 	'git rev-list', that specifies the specific objects and references
 	to export.  For example, `master~10..master` causes the
 	current master reference to be exported along with all objects
-	added since its 10th ancestor commit and all files common to
-	master{tilde}9 and master{tilde}10.
+	added since its 10th ancestor commit and (unless the
+	--reference-excluded-parents option is specified) all files
+	common to master{tilde}9 and master{tilde}10.
 
 EXAMPLES
 --------
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 2fef00436b..3cc98c31ad 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -37,6 +37,7 @@ static int fake_missing_tagger;
 static int use_done_feature;
 static int no_data;
 static int full_tree;
+static int reference_excluded_commits;
 static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
 static struct string_list tag_refs = STRING_LIST_INIT_NODUP;
 static struct refspec refspecs = REFSPEC_INIT_FETCH;
@@ -596,7 +597,8 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 		message += 2;
 
 	if (commit->parents &&
-	    get_object_mark(&commit->parents->item->object) != 0 &&
+	    (get_object_mark(&commit->parents->item->object) != 0 ||
+	     reference_excluded_commits) &&
 	    !full_tree) {
 		parse_commit_or_die(commit->parents->item);
 		diff_tree_oid(get_commit_tree_oid(commit->parents->item),
@@ -644,13 +646,21 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 	unuse_commit_buffer(commit, commit_buffer);
 
 	for (i = 0, p = commit->parents; p; p = p->next) {
-		int mark = get_object_mark(&p->item->object);
-		if (!mark)
+		struct object *obj = &p->item->object;
+		int mark = get_object_mark(obj);
+
+		if (!mark && !reference_excluded_commits)
 			continue;
 		if (i == 0)
-			printf("from :%d\n", mark);
+			printf("from ");
+		else
+			printf("merge ");
+		if (mark)
+			printf(":%d\n", mark);
 		else
-			printf("merge :%d\n", mark);
+			printf("%s\n", sha1_to_hex(anonymize ?
+						   anonymize_sha1(&obj->oid) :
+						   obj->oid.hash));
 		i++;
 	}
 
@@ -931,13 +941,22 @@ static void handle_tags_and_duplicates(struct string_list *extras)
 				/*
 				 * Getting here means we have a commit which
 				 * was excluded by a negative refspec (e.g.
-				 * fast-export ^master master).  If the user
+				 * fast-export ^master master).  If we are
+				 * referencing excluded commits, set the ref
+				 * to the exact commit.  Otherwise, the user
 				 * wants the branch exported but every commit
-				 * in its history to be deleted, that sounds
-				 * like a ref deletion to me.
+				 * in its history to be deleted, which basically
+				 * just means deletion of the ref.
 				 */
-				printf("reset %s\nfrom %s\n\n",
-				       name, sha1_to_hex(null_sha1));
+				if (!reference_excluded_commits) {
+					/* delete the ref */
+					printf("reset %s\nfrom %s\n\n",
+					       name, sha1_to_hex(null_sha1));
+					continue;
+				}
+				/* set ref to commit using oid, not mark */
+				printf("reset %s\nfrom %s\n\n", name,
+				       sha1_to_hex(commit->object.oid.hash));
 				continue;
 			}
 
@@ -1074,6 +1093,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		OPT_STRING_LIST(0, "refspec", &refspecs_list, N_("refspec"),
 			     N_("Apply refspec to exported refs")),
 		OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")),
+		OPT_BOOL(0, "reference-excluded-parents",
+			 &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by sha1sum")),
+
 		OPT_END()
 	};
 
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 50c2fceef4..d7d73061d0 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -66,6 +66,17 @@ test_expect_success 'fast-export master~2..master' '
 
 '
 
+test_expect_success 'fast-export --reference-excluded-parents master~2..master' '
+
+	git fast-export --reference-excluded-parents master~2..master >actual &&
+	grep commit.refs/heads/master actual >commit-count &&
+	test_line_count = 2 commit-count &&
+	sed "s/master/rewrite/" actual |
+		(cd new &&
+		 git fast-import &&
+		 test $MASTER = $(git rev-parse --verify refs/heads/rewrite))
+'
+
 test_expect_success 'iso-8859-1' '
 
 	git config i18n.commitencoding ISO8859-1 &&
-- 
2.19.1.1063.g2b8e4a4f82.dirty



* [PATCH v2 09/11] fast-import: remove unmaintained duplicate documentation
  2018-11-14  0:25         ` [PATCH v2 00/11] " Elijah Newren
                             ` (7 preceding siblings ...)
  2018-11-14  0:25           ` [PATCH v2 08/11] fast-export: add --reference-excluded-parents option Elijah Newren
@ 2018-11-14  0:25           ` Elijah Newren
  2018-11-14  0:25           ` [PATCH v2 10/11] fast-export: add a --show-original-ids option to show original names Elijah Newren
                             ` (3 subsequent siblings)
  12 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-14  0:25 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, gitster,
	Elijah Newren

fast-import.c has started with a comment for nine and a half years
re-directing the reader to Documentation/git-fast-import.txt for
maintained documentation.  Instead of leaving the unmaintained
documentation in place, just excise it.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 fast-import.c | 154 --------------------------------------------------
 1 file changed, 154 deletions(-)

diff --git a/fast-import.c b/fast-import.c
index 95600c78e0..555d49ad23 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1,157 +1,3 @@
-/*
-(See Documentation/git-fast-import.txt for maintained documentation.)
-Format of STDIN stream:
-
-  stream ::= cmd*;
-
-  cmd ::= new_blob
-        | new_commit
-        | new_tag
-        | reset_branch
-        | checkpoint
-        | progress
-        ;
-
-  new_blob ::= 'blob' lf
-    mark?
-    file_content;
-  file_content ::= data;
-
-  new_commit ::= 'commit' sp ref_str lf
-    mark?
-    ('author' (sp name)? sp '<' email '>' sp when lf)?
-    'committer' (sp name)? sp '<' email '>' sp when lf
-    commit_msg
-    ('from' sp commit-ish lf)?
-    ('merge' sp commit-ish lf)*
-    (file_change | ls)*
-    lf?;
-  commit_msg ::= data;
-
-  ls ::= 'ls' sp '"' quoted(path) '"' lf;
-
-  file_change ::= file_clr
-    | file_del
-    | file_rnm
-    | file_cpy
-    | file_obm
-    | file_inm;
-  file_clr ::= 'deleteall' lf;
-  file_del ::= 'D' sp path_str lf;
-  file_rnm ::= 'R' sp path_str sp path_str lf;
-  file_cpy ::= 'C' sp path_str sp path_str lf;
-  file_obm ::= 'M' sp mode sp (hexsha1 | idnum) sp path_str lf;
-  file_inm ::= 'M' sp mode sp 'inline' sp path_str lf
-    data;
-  note_obm ::= 'N' sp (hexsha1 | idnum) sp commit-ish lf;
-  note_inm ::= 'N' sp 'inline' sp commit-ish lf
-    data;
-
-  new_tag ::= 'tag' sp tag_str lf
-    'from' sp commit-ish lf
-    ('tagger' (sp name)? sp '<' email '>' sp when lf)?
-    tag_msg;
-  tag_msg ::= data;
-
-  reset_branch ::= 'reset' sp ref_str lf
-    ('from' sp commit-ish lf)?
-    lf?;
-
-  checkpoint ::= 'checkpoint' lf
-    lf?;
-
-  progress ::= 'progress' sp not_lf* lf
-    lf?;
-
-     # note: the first idnum in a stream should be 1 and subsequent
-     # idnums should not have gaps between values as this will cause
-     # the stream parser to reserve space for the gapped values.  An
-     # idnum can be updated in the future to a new object by issuing
-     # a new mark directive with the old idnum.
-     #
-  mark ::= 'mark' sp idnum lf;
-  data ::= (delimited_data | exact_data)
-    lf?;
-
-    # note: delim may be any string but must not contain lf.
-    # data_line may contain any data but must not be exactly
-    # delim.
-  delimited_data ::= 'data' sp '<<' delim lf
-    (data_line lf)*
-    delim lf;
-
-     # note: declen indicates the length of binary_data in bytes.
-     # declen does not include the lf preceding the binary data.
-     #
-  exact_data ::= 'data' sp declen lf
-    binary_data;
-
-     # note: quoted strings are C-style quoting supporting \c for
-     # common escapes of 'c' (e..g \n, \t, \\, \") or \nnn where nnn
-     # is the signed byte value in octal.  Note that the only
-     # characters which must actually be escaped to protect the
-     # stream formatting is: \, " and LF.  Otherwise these values
-     # are UTF8.
-     #
-  commit-ish  ::= (ref_str | hexsha1 | sha1exp_str | idnum);
-  ref_str     ::= ref;
-  sha1exp_str ::= sha1exp;
-  tag_str     ::= tag;
-  path_str    ::= path    | '"' quoted(path)    '"' ;
-  mode        ::= '100644' | '644'
-                | '100755' | '755'
-                | '120000'
-                ;
-
-  declen ::= # unsigned 32 bit value, ascii base10 notation;
-  bigint ::= # unsigned integer value, ascii base10 notation;
-  binary_data ::= # file content, not interpreted;
-
-  when         ::= raw_when | rfc2822_when;
-  raw_when     ::= ts sp tz;
-  rfc2822_when ::= # Valid RFC 2822 date and time;
-
-  sp ::= # ASCII space character;
-  lf ::= # ASCII newline (LF) character;
-
-     # note: a colon (':') must precede the numerical value assigned to
-     # an idnum.  This is to distinguish it from a ref or tag name as
-     # GIT does not permit ':' in ref or tag strings.
-     #
-  idnum   ::= ':' bigint;
-  path    ::= # GIT style file path, e.g. "a/b/c";
-  ref     ::= # GIT ref name, e.g. "refs/heads/MOZ_GECKO_EXPERIMENT";
-  tag     ::= # GIT tag name, e.g. "FIREFOX_1_5";
-  sha1exp ::= # Any valid GIT SHA1 expression;
-  hexsha1 ::= # SHA1 in hexadecimal format;
-
-     # note: name and email are UTF8 strings, however name must not
-     # contain '<' or lf and email must not contain any of the
-     # following: '<', '>', lf.
-     #
-  name  ::= # valid GIT author/committer name;
-  email ::= # valid GIT author/committer email;
-  ts    ::= # time since the epoch in seconds, ascii base10 notation;
-  tz    ::= # GIT style timezone;
-
-     # note: comments, get-mark, ls-tree, and cat-blob requests may
-     # appear anywhere in the input, except within a data command. Any
-     # form of the data command always escapes the related input from
-     # comment processing.
-     #
-     # In case it is not clear, the '#' that starts the comment
-     # must be the first character on that line (an lf
-     # preceded it).
-     #
-
-  get_mark ::= 'get-mark' sp idnum lf;
-  cat_blob ::= 'cat-blob' sp (hexsha1 | idnum) lf;
-  ls_tree  ::= 'ls' sp (hexsha1 | idnum) sp path_str lf;
-
-  comment ::= '#' not_lf* lf;
-  not_lf  ::= # Any byte that is not ASCII newline (LF);
-*/
-
 #include "builtin.h"
 #include "cache.h"
 #include "repository.h"
-- 
2.19.1.1063.g2b8e4a4f82.dirty



* [PATCH v2 10/11] fast-export: add a --show-original-ids option to show original names
  2018-11-14  0:25         ` [PATCH v2 00/11] " Elijah Newren
                             ` (8 preceding siblings ...)
  2018-11-14  0:25           ` [PATCH v2 09/11] fast-import: remove unmaintained duplicate documentation Elijah Newren
@ 2018-11-14  0:25           ` Elijah Newren
  2018-11-14  0:26           ` [PATCH v2 11/11] fast-export: add --always-show-modify-after-rename Elijah Newren
                             ` (2 subsequent siblings)
  12 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-14  0:25 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, gitster,
	Elijah Newren

Knowing the original names (hashes) of commits can sometimes enable
post-filtering that would otherwise be difficult or impossible.  In
particular, the desire to rewrite commit messages which refer to other
prior commits (on top of whatever other filtering is being done) is
very difficult without knowing the original names of each commit.

In addition, knowing the original names (hashes) of blobs can allow
filtering by blob-id without requiring re-hashing the content of the
blob, and is thus useful as a small optimization.

Once we add original ids for both commits and blobs, we may as well
add them for tags too for completeness.  Perhaps someone will have a
use for them.

This commit teaches a new --show-original-ids option to fast-export
which will make it add an 'original-oid <hash>' line to blobs, commits,
and tags.  It also teaches fast-import to parse (and ignore) such
lines.
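
As an illustration of what an intermediary filter might do with these
lines, the Python sketch below builds a table from mark to original
hash, which is the lookup needed to rewrite commit messages that mention
older commits. It assumes the `original-oid` line directly follows the
`mark` line, as in the grammar added by this patch, and that the stream
uses the exact-length `data <n>` form. The function name and the sample
stream are made up for the example.

```python
import io
import re


def original_ids(stream: bytes) -> dict[int, str]:
    """Map each mark to the original-oid recorded right after it."""
    mapping = {}
    last_mark = None
    src = io.BytesIO(stream)
    for raw in iter(src.readline, b""):
        line = raw.rstrip(b"\n")
        m = re.fullmatch(rb"data (\d+)", line)
        if m:
            # Skip the payload; the header of the current object is over.
            src.read(int(m.group(1)))
            last_mark = None
            continue
        m = re.fullmatch(rb"mark :(\d+)", line)
        if m:
            last_mark = int(m.group(1))
            continue
        m = re.fullmatch(rb"original-oid (.+)", line)
        if m and last_mark is not None:
            mapping[last_mark] = m.group(1).decode()
    return mapping


# Hypothetical stream with placeholder hashes for one blob and one commit.
sample = (
    b"blob\nmark :1\noriginal-oid " + b"a" * 40 + b"\ndata 4\nfoo\n\n"
    b"commit refs/heads/master\nmark :2\noriginal-oid " + b"b" * 40 + b"\n"
    b"committer A <a@example.com> 0 +0000\ndata 3\nmsg\n"
    b"M 100644 :1 file\n\n"
)

print(original_ids(sample))
```

Tags carry no mark in this grammar, so their `original-oid` lines are
skipped by this particular table.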

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-export.txt |  7 +++++++
 Documentation/git-fast-import.txt | 16 ++++++++++++++++
 builtin/fast-export.c             | 20 +++++++++++++++-----
 fast-import.c                     | 12 ++++++++++++
 t/t9350-fast-export.sh            | 17 +++++++++++++++++
 5 files changed, 67 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index f65026662a..64c01ba918 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -122,6 +122,13 @@ marks the same across runs.
 	repository which already contains the necessary parent
 	commits.
 
+--show-original-ids::
+	Add an extra directive to the output for commits and blobs,
+	`original-oid <SHA1SUM>`.  While such directives will likely be
+	ignored by importers such as git-fast-import, it may be useful
+	for intermediary filters (e.g. for rewriting commit messages
+	which refer to older commits, or for stripping blobs by id).
+
 --refspec::
 	Apply the specified refspec to each ref exported. Multiple of them can
 	be specified.
diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index 7ab97745a6..43ab3b1637 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -385,6 +385,7 @@ change to the project.
 ....
 	'commit' SP <ref> LF
 	mark?
+	original-oid?
 	('author' (SP <name>)? SP LT <email> GT SP <when> LF)?
 	'committer' (SP <name>)? SP LT <email> GT SP <when> LF
 	data
@@ -741,6 +742,19 @@ New marks are created automatically.  Existing marks can be moved
 to another object simply by reusing the same `<idnum>` in another
 `mark` command.
 
+`original-oid`
+~~~~~~~~~~~~~~
+Provides the name of the object in the original source control system.
+fast-import will simply ignore this directive, but filter processes
+which operate on and modify the stream before feeding to fast-import
+may have uses for this information.
+
+....
+	'original-oid' SP <object-identifier> LF
+....
+
+where `<object-identifier>` is any string not containing LF.
+
 `tag`
 ~~~~~
 Creates an annotated tag referring to a specific commit.  To create
@@ -749,6 +763,7 @@ lightweight (non-annotated) tags see the `reset` command below.
 ....
 	'tag' SP <name> LF
 	'from' SP <commit-ish> LF
+	original-oid?
 	'tagger' (SP <name>)? SP LT <email> GT SP <when> LF
 	data
 ....
@@ -823,6 +838,7 @@ assigned mark.
 ....
 	'blob' LF
 	mark?
+	original-oid?
 	data
 ....
 
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 3cc98c31ad..e0f794811e 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -38,6 +38,7 @@ static int use_done_feature;
 static int no_data;
 static int full_tree;
 static int reference_excluded_commits;
+static int show_original_ids;
 static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
 static struct string_list tag_refs = STRING_LIST_INIT_NODUP;
 static struct refspec refspecs = REFSPEC_INIT_FETCH;
@@ -271,7 +272,10 @@ static void export_blob(const struct object_id *oid)
 
 	mark_next_object(object);
 
-	printf("blob\nmark :%"PRIu32"\ndata %lu\n", last_idnum, size);
+	printf("blob\nmark :%"PRIu32"\n", last_idnum);
+	if (show_original_ids)
+		printf("original-oid %s\n", oid_to_hex(oid));
+	printf("data %lu\n", size);
 	if (size && fwrite(buf, size, 1, stdout) != 1)
 		die_errno("could not write blob '%s'", oid_to_hex(oid));
 	printf("\n");
@@ -634,8 +638,10 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 		reencoded = reencode_string(message, "UTF-8", encoding);
 	if (!commit->parents)
 		printf("reset %s\n", refname);
-	printf("commit %s\nmark :%"PRIu32"\n%.*s\n%.*s\ndata %u\n%s",
-	       refname, last_idnum,
+	printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
+	if (show_original_ids)
+		printf("original-oid %s\n", oid_to_hex(&commit->object.oid));
+	printf("%.*s\n%.*s\ndata %u\n%s",
 	       (int)(author_end - author), author,
 	       (int)(committer_end - committer), committer,
 	       (unsigned)(reencoded
@@ -813,8 +819,10 @@ static void handle_tag(const char *name, struct tag *tag)
 
 	if (starts_with(name, "refs/tags/"))
 		name += 10;
-	printf("tag %s\nfrom :%d\n%.*s%sdata %d\n%.*s\n",
-	       name, tagged_mark,
+	printf("tag %s\nfrom :%d\n", name, tagged_mark);
+	if (show_original_ids)
+		printf("original-oid %s\n", oid_to_hex(&tag->object.oid));
+	printf("%.*s%sdata %d\n%.*s\n",
 	       (int)(tagger_end - tagger), tagger,
 	       tagger == tagger_end ? "" : "\n",
 	       (int)message_size, (int)message_size, message ? message : "");
@@ -1095,6 +1103,8 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")),
 		OPT_BOOL(0, "reference-excluded-parents",
 			 &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by sha1sum")),
+		OPT_BOOL(0, "show-original-ids", &show_original_ids,
+			    N_("Show original sha1sums of blobs/commits")),
 
 		OPT_END()
 	};
diff --git a/fast-import.c b/fast-import.c
index 555d49ad23..71b6cba00f 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1814,6 +1814,13 @@ static void parse_mark(void)
 		next_mark = 0;
 }
 
+static void parse_original_identifier(void)
+{
+	const char *v;
+	if (skip_prefix(command_buf.buf, "original-oid ", &v))
+		read_next_command();
+}
+
 static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res)
 {
 	const char *data;
@@ -1956,6 +1963,7 @@ static void parse_new_blob(void)
 {
 	read_next_command();
 	parse_mark();
+	parse_original_identifier();
 	parse_and_store_blob(&last_blob, NULL, next_mark);
 }
 
@@ -2579,6 +2587,7 @@ static void parse_new_commit(const char *arg)
 
 	read_next_command();
 	parse_mark();
+	parse_original_identifier();
 	if (skip_prefix(command_buf.buf, "author ", &v)) {
 		author = parse_ident(v);
 		read_next_command();
@@ -2711,6 +2720,9 @@ static void parse_new_tag(const char *arg)
 		die("Invalid ref name or SHA1 expression: %s", from);
 	read_next_command();
 
+	/* original-oid ... */
+	parse_original_identifier();
+
 	/* tagger ... */
 	if (skip_prefix(command_buf.buf, "tagger ", &v)) {
 		tagger = parse_ident(v);
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index d7d73061d0..5690fe2810 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -77,6 +77,23 @@ test_expect_success 'fast-export --reference-excluded-parents master~2..master'
 		 test $MASTER = $(git rev-parse --verify refs/heads/rewrite))
 '
 
+test_expect_success 'fast-export --show-original-ids' '
+
+	git fast-export --show-original-ids master >output &&
+	grep ^original-oid output| sed -e s/^original-oid.// | sort >actual &&
+	git rev-list --objects master muss >objects-and-names &&
+	awk "{print \$1}" objects-and-names | sort >commits-trees-blobs &&
+	comm -23 actual commits-trees-blobs >unfound &&
+	test_must_be_empty unfound
+'
+
+test_expect_success 'fast-export --show-original-ids | git fast-import' '
+
+	git fast-export --show-original-ids master muss | git fast-import --quiet &&
+	test $MASTER = $(git rev-parse --verify refs/heads/master) &&
+	test $MUSS = $(git rev-parse --verify refs/tags/muss)
+'
+
 test_expect_success 'iso-8859-1' '
 
 	git config i18n.commitencoding ISO8859-1 &&
-- 
2.19.1.1063.g2b8e4a4f82.dirty
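
To illustrate how an intermediary filter might consume the `original-oid`
lines described in the patch above, here is a minimal Python sketch. It is
not part of the patch. The function name `collect_original_ids` and the
sample stream are made up for illustration, and the sketch assumes the
exact-byte-count form of `data` (the form fast-export emits), not the
delimited `data <<EOF` form.

```python
import io


def collect_original_ids(stream):
    """Map mark -> (object type, original id) for blobs and commits.

    `stream` is a binary file object carrying the output of
    `git fast-export --show-original-ids`.
    """
    ids = {}
    kind = mark = None
    for raw in iter(stream.readline, b""):
        line = raw.rstrip(b"\n")
        if line == b"blob":
            kind, mark = "blob", None
        elif line.startswith(b"commit "):
            kind, mark = "commit", None
        elif line.startswith((b"tag ", b"reset ")):
            kind, mark = None, None
        elif line.startswith(b"mark :") and kind:
            mark = int(line[len(b"mark :"):])
        elif line.startswith(b"original-oid ") and kind and mark is not None:
            ids[mark] = (kind, line.split(b" ", 1)[1].decode("ascii"))
        elif line.startswith(b"data "):
            # Skip the payload by byte count, so blob contents and commit
            # messages are never mistaken for directives.
            stream.read(int(line[len(b"data "):]))
    return ids


# Hand-written sample stream; the 40-character ids are invented.
BLOB_OID = "a" * 40
COMMIT_OID = "b" * 40
MESSAGE = b"original-oid cafe\n"  # looks like a directive, but is payload
SAMPLE = b"".join([
    b"blob\n",
    b"mark :1\n",
    b"original-oid " + BLOB_OID.encode() + b"\n",
    b"data 6\n",
    b"hello\n",
    b"\n",
    b"reset refs/heads/master\n",
    b"commit refs/heads/master\n",
    b"mark :2\n",
    b"original-oid " + COMMIT_OID.encode() + b"\n",
    b"committer A <a@example.com> 0 +0000\n",
    b"data " + str(len(MESSAGE)).encode() + b"\n",
    MESSAGE,
    b"M 100644 :1 file.txt\n",
    b"\n",
])

if __name__ == "__main__":
    for mark, (kind, oid) in sorted(collect_original_ids(io.BytesIO(SAMPLE)).items()):
        print(mark, kind, oid)
```

A filter that rewrites commit messages could build this map first and then
translate old hashes mentioned in messages to the new names once the import
has assigned them.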



* [PATCH v2 11/11] fast-export: add --always-show-modify-after-rename
  2018-11-14  0:25         ` [PATCH v2 00/11] " Elijah Newren
                             ` (9 preceding siblings ...)
  2018-11-14  0:25           ` [PATCH v2 10/11] fast-export: add a --show-original-ids option to show original names Elijah Newren
@ 2018-11-14  0:26           ` Elijah Newren
  2018-11-14  7:25           ` [PATCH v2 00/11] fast export and import fixes and features Jeff King
  2018-11-16  7:59           ` [PATCH v3 " Elijah Newren
  12 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-14  0:26 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, sandals, peff, me, jrnieder, gitster,
	Elijah Newren

I wanted a way to gather all the following information efficiently
(with as few history traversals as possible):
  * Get all blob sizes
  * Map blob shas to filename(s) they appeared under in the history
  * Find when files and directories were deleted (and whether they
    were later reinstated, since that means they aren't actually gone)
  * Find sets of filenames referring to the same logical 'file' (e.g.
    foo->bar in commit A and bar->baz in commit B mean that
    {foo,bar,baz} refer to the same 'file', so someone wanting to just
    "keep baz and its history" needs all versions of those three
    filenames).  I need to know about things like another foo or bar
    being introduced after the rename though, since that breaks the
    connection between filenames.
and then I would generate various aggregations on the data and display
some type of report for the user.

The only way I know of to get blob sizes is via
  cat-file --batch-all-objects --batch-check

The rest of the data would traditionally be gathered from a log command,
e.g.

  git log --format='%H%n%P%n%cd' --date=short --topo-order --reverse \
      -M --diff-filter=RAMD --no-abbrev --raw -c

however, parsing log output seems slightly dangerous given that it is a
porcelain command.  While we have specified --format and --raw to try
to avoid the most obvious problems, I'm still slightly concerned about
--date=short, the combination of --raw and -c, options that might
colorize the output, and also the --diff-filter (there is no current
option named --no-find-copies or --no-break-rewrites, but what if those
turn on by default in the future much as we changed the default with
detecting renames?).  Each of those is a small worry, but they add up.

A command meant for data serialization, such as fast-export, seems like
a better candidate for this job.  There's just one missing item: in
order to connect blob sizes to filenames, I need fast-export to tell me
the blob sha1sum of any file changes.  It does this for modifies, but
not always for renames.  In particular, if a file is a 100% rename, it
only prints
    R oldname newname
instead of
    R oldname newname
    M 100644 $SHA1 newname
as occurs when there is a rename+modify.  Add an option which allows us
to force the latter output even when commits have exact renames of
files.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-export.txt | 11 ++++++++++
 builtin/fast-export.c             |  7 +++++-
 t/t9350-fast-export.sh            | 36 +++++++++++++++++++++++++++++++
 3 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index 64c01ba918..b663b6f8af 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -129,6 +129,17 @@ marks the same across runs.
 	for intermediary filters (e.g. for rewriting commit messages
 	which refer to older commits, or for stripping blobs by id).
 
+--always-show-modify-after-rename::
+	When a rename is detected, fast-export normally issues both an
+	'R' (rename) and an 'M' (modify) directive.  However, if the
+	contents of the old and new filename match exactly, it will
+	only issue the rename directive.  Use this flag to have it
+	always issue the modify directive after the rename, which may
+	be useful for tools which are using the fast-export stream as
+	a mechanism for gathering statistics about a repository.  Note
+	that this option only has effect when rename detection is
+	active (see the -M option).
+
 --refspec::
 	Apply the specified refspec to each ref exported. Multiple of them can
 	be specified.
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index e0f794811e..31ad43077a 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -38,6 +38,7 @@ static int use_done_feature;
 static int no_data;
 static int full_tree;
 static int reference_excluded_commits;
+static int always_show_modify_after_rename;
 static int show_original_ids;
 static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
 static struct string_list tag_refs = STRING_LIST_INIT_NODUP;
@@ -407,7 +408,8 @@ static void show_filemodify(struct diff_queue_struct *q,
 				putchar('\n');
 
 				if (oideq(&ospec->oid, &spec->oid) &&
-				    ospec->mode == spec->mode)
+				    ospec->mode == spec->mode &&
+				    !always_show_modify_after_rename)
 					break;
 			}
 			/* fallthrough */
@@ -1105,6 +1107,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 			 &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by sha1sum")),
 		OPT_BOOL(0, "show-original-ids", &show_original_ids,
 			    N_("Show original sha1sums of blobs/commits")),
+		OPT_BOOL(0, "always-show-modify-after-rename",
+			    &always_show_modify_after_rename,
+			 N_("Always provide 'M' directive after 'R'")),
 
 		OPT_END()
 	};
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 5690fe2810..5c20065e39 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -630,4 +630,40 @@ test_expect_success 'merge commit gets exported with --import-marks' '
 	)
 '
 
+test_expect_success 'rename detection and --always-show-modify-after-rename' '
+	test_create_repo renames &&
+	(
+		cd renames &&
+		test_seq 0  9  >single_digit &&
+		test_seq 10 98 >double_digit &&
+		git add . &&
+		git commit -m initial &&
+
+		echo 99 >>double_digit &&
+		git mv single_digit single-digit &&
+		git mv double_digit double-digit &&
+		git add double-digit &&
+		git commit -m renames &&
+
+		# First, check normal fast-export -M output
+		git fast-export -M --no-data master >out &&
+
+		grep double-digit out >out2 &&
+		test_line_count = 2 out2 &&
+
+		grep single-digit out >out2 &&
+		test_line_count = 1 out2 &&
+
+		# Now, test with --always-show-modify-after-rename; should
+		# have an extra "M" directive for "single-digit".
+		git fast-export -M --no-data --always-show-modify-after-rename master >out &&
+
+		grep double-digit out >out2 &&
+		test_line_count = 2 out2 &&
+
+		grep single-digit out >out2 &&
+		test_line_count = 2 out2
+	)
+'
+
 test_done
-- 
2.19.1.1063.g2b8e4a4f82.dirty
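
As a rough illustration of the statistics use case described in the commit
message above (connecting blob sizes to filenames), here is a small Python
sketch. It is not part of the patch. The helper names and the sample data
are invented, the batch-check lines assume the default
`<sha> <type> <size>` layout, and paths are assumed to be simple enough
that fast-export does not quote them.

```python
import io
from collections import defaultdict


def parse_batch_check(lines):
    """Return {sha: size} for blobs from `cat-file --batch-check` lines."""
    sizes = {}
    for line in lines:
        sha, objtype, size = line.split()
        if objtype == "blob":
            sizes[sha] = int(size)
    return sizes


def blobs_by_path(stream):
    """Return {path: set of blob shas} from a `fast-export --no-data` stream."""
    paths = defaultdict(set)
    for raw in iter(stream.readline, b""):
        line = raw.rstrip(b"\n")
        if line.startswith(b"data "):
            # Skip commit messages by byte count.
            stream.read(int(line[len(b"data "):]))
        elif line.startswith(b"M "):
            _, mode, sha, path = line.split(b" ", 3)
            if mode != b"160000":  # gitlinks name commits, not blobs
                paths[path.decode("utf-8", "replace")].add(sha.decode("ascii"))
    return paths


# Invented ids and sizes, shaped like the test case in the patch.
S, D1, D2 = "1" * 40, "2" * 40, "3" * 40
STREAM = (
    "commit refs/heads/master\nmark :1\n"
    "committer A <a@example.com> 0 +0000\ndata 8\ninitial\n"
    f"M 100644 {S} single_digit\nM 100644 {D1} double_digit\n\n"
    "commit refs/heads/master\nmark :2\n"
    "committer A <a@example.com> 0 +0000\ndata 8\nrenames\n"
    "from :1\n"
    "R single_digit single-digit\n"
    f"M 100644 {S} single-digit\n"  # only present with the new option
    "R double_digit double-digit\n"
    f"M 100644 {D2} double-digit\n\n"
).encode()
BATCH_CHECK = [
    f"{S} blob 20",
    f"{D1} blob 267",
    f"{D2} blob 270",
    "4" * 40 + " tree 74",
]

sizes = parse_batch_check(BATCH_CHECK)
size_by_path = {
    path: sum(sizes[sha] for sha in shas)
    for path, shas in blobs_by_path(io.BytesIO(STREAM)).items()
}

if __name__ == "__main__":
    for path, total in sorted(size_by_path.items()):
        print(path, total)
```

Without the extra 'M' line after the exact rename, `single-digit` would not
appear in the map at all, because the bare 'R' line carries no blob id.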



* Re: [PATCH 10/10] fast-export: add --always-show-modify-after-rename
  2018-11-13 17:10                     ` Elijah Newren
@ 2018-11-14  7:14                       ` Jeff King
  0 siblings, 0 replies; 90+ messages in thread
From: Jeff King @ 2018-11-14  7:14 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Lars Schneider, brian m. carlson, Taylor Blau,
	Jonathan Nieder

On Tue, Nov 13, 2018 at 09:10:36AM -0800, Elijah Newren wrote:

> > I am looking at this problem as "how do you answer question X in a
> > repository". And I think you are looking at it as "I am receiving a
> > fast-export stream, and I need to answer question X on the fly".
> >
> > And that would explain why you want to get extra annotations into the
> > fast-export stream. Is that right?
> 
> I'm not trying to get information on the fly during a rewrite or
> anything like that.  This is an optional pre-rewrite step (from a
> separate invocation of the tool) where I have multiple questions I
> want to answer.  I'd like to answer them all relatively quickly, if
> possible, and I think all of them should be answerable with a single
> history traversal (plus a cat-file --batch-all-objects call to get
> object sizes, since I don't know of another way to get those).  I'd be
> fine with switching from fast-export to log or something else if it
> met the needs better.

Ah, OK. Yes, if we're just trying to query, then I think you should be
able to do what you want with the existing traversal and diff tools. And
if not, we should think about a new feature there, and not try to
shoe-horn it into fast-export.

> As far as I can tell, you're trying to split each question apart and
> do a history traversal for each, and I don't see why that's better.
> Simpler, perhaps, but it seems worse for performance.  Am I missing
> something?

I was only trying to address each possible query individually. I agree
that if you are querying both things, you should be able to do it in a
single traversal (and that is strictly better). It may require a little
more parsing of the output (e.g., `--find-object` is easy to implement
yourself looking at --raw output).

> Ah, I didn't know renames were on by default; I somehow missed that.
> Also, the rev-list to diff-tree pipe is nice, but I also need parent
> and commit timestamp information.

diff-tree will format the commit info as well (before git-log was a C
builtin, it was just a rev-list/diff-tree pipeline in a shell script).
So you can do:

  git rev-list ... |
  git diff-tree --stdin --format='%h %ct %p' --raw -r -M

and get a dump very similar to what fast-export would give you.

> > >   git log -M --diff-filter=RAMD --no-abbrev --raw
> >
> > What is there besides RAMD? :)
> 
> Well, as you pointed out above, log detects renames by default,
> whereas it didn't used to.
> So, if someone had written some similar-ish history walking/parsing
> tool years ago that didn't need renames and was based on log
> output, there's a good chance their tool might start failing when
> rename detection was turned on by default, because instead of getting
> both a 'D' and an 'M' change, they'd get an unexpected 'R'.

Mostly I just meant: your diff-filter includes basically everything, so
why bother filtering? You're going to have to parse the result anyway,
and you can throw away uninteresting bits there.

> For my case, do I have to worry about similar future changes?  Will
> copy detection ('C') or break detection ('B') become the default in
> the future?  Do I have to worry about typechanges ('T')?  Will new
> change types be added?  I mean, the fast-export output could maybe
> change too, but it seems much less likely than with log.

If you use diff-tree, then it won't ever enable copy or break detection
without you explicitly asking for it.

> Let me try to put it as briefly as I can.  With as few traversals as
> possible, I want to:
>   * Get all blob sizes
>   * Map blob shas to filename(s) they appeared under in the history
>   * Find when files and directories were deleted (and whether they
> were later reinstated, since that means they aren't actually gone)
>   * Find sets of filenames referring to the same logical 'file'. (e.g.
> foo->bar in commit A and bar->baz in commit B mean that {foo,bar,baz}
> refer to the same 'file' so that a user has an easy report to look at
> to find out that if they just want to "keep baz and its history" then
> they need foo & bar & baz.  I need to know about things like another
> foo or bar being introduced after the rename though, since that breaks
> the connection between filenames)
>   * Do a few aggregations on the above data as well (e.g. all copies
> of postgres.exe add up to 20M -- why were those checked in anyway?,
> *.webm files in aggregate are .5G, your long-deleted src/video-server/
> directory from that aborted experimental project years ago takes up 2G
> of your history, etc.)
> 
> Right now, my best solution for this combination of questions is
> 'cat-file --batch-all-objects' plus fast-export, if I get patch 10/10
> in place.  I'm totally open to better solutions, including ones that
> don't use fast-export.

OK, I think I understand your problem better now. I don't think there's
anything fast-export can show that log/diff-tree could not, aside from
actual blob contents. But I don't think you want them (and if you did,
you can use "cat-file --batch" to selectively request them).

I think there's a general problem with any serialized output (log or
fast-export) that things like rename tracking depend on the topology. If
I rename "foo" to "bar" on one branch, and "bar" to "baz" on another
branch, without reconstructing the parent graph you don't realize that
those two things were on parallel branches, and not a sequence.  But
with the parent ids, you can delve as deep as you like in your analysis
script.

-Peff
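
The topology point above can be sketched in Python. This is only an
illustration under stated assumptions: the sample text is hand-written to
mirror the foo/bar/baz example and is not real git output, the layout
(a `%h %ct %p` header line followed by `--raw` lines) is what the
rev-list/diff-tree pipeline shown earlier is assumed to produce, and all
function names are invented.

```python
def parse_diff_tree(text):
    """Return (parents, renames).

    parents maps commit -> list of parent commits;
    renames is a list of (commit, old_name, new_name).
    """
    parents, renames, current = {}, [], None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith(":"):
            # Raw line: ":<mode> <mode> <sha> <sha> <status>\t<path>[\t<path>]"
            meta, *names = line.split("\t")
            status = meta.split()[4]
            if status.startswith("R"):
                renames.append((current, names[0], names[1]))
        else:
            commit, _timestamp, *rest = line.split()
            parents[commit] = rest
            current = commit
    return parents, renames


def is_ancestor(older, newer, parents):
    """True if `older` is reachable from `newer` by following parent links."""
    todo, seen = [newer], set()
    while todo:
        commit = todo.pop()
        if commit == older:
            return True
        if commit not in seen:
            seen.add(commit)
            todo.extend(parents.get(commit, []))
    return False


def chained_renames(parents, renames):
    """Pairs (r1, r2) where r2 renames r1's new name in a descendant commit."""
    return [
        (r1, r2)
        for r1 in renames
        for r2 in renames
        if r1[0] != r2[0]
        and r1[2] == r2[1]
        and is_ancestor(r1[0], r2[0], parents)
    ]


# a1 and b1 are siblings on parallel branches; d1 sits on top of a1.
SAMPLE = "\n".join([
    "c0 1542000000 ",
    "",
    ":000000 100644 0000000 1111111 A\tfoo",
    "a1 1542000100 c0",
    "",
    ":100644 100644 1111111 1111111 R100\tfoo\tbar",
    "b1 1542000200 c0",
    "",
    ":100644 100644 2222222 2222222 R100\tbar\tbaz",
    "d1 1542000300 a1",
    "",
    ":100644 100644 1111111 1111111 R100\tbar\tqux",
])

PARENTS, RENAMES = parse_diff_tree(SAMPLE)
CHAINS = chained_renames(PARENTS, RENAMES)

if __name__ == "__main__":
    for first, second in CHAINS:
        print(first, "->", second)
```

Reading the serialized renames in order would link foo->bar to bar->baz;
checking ancestry through the parent ids keeps only the pair that really
forms a sequence on one line of history.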


* Re: [PATCH v2 00/11] fast export and import fixes and features
  2018-11-14  0:25         ` [PATCH v2 00/11] " Elijah Newren
                             ` (10 preceding siblings ...)
  2018-11-14  0:26           ` [PATCH v2 11/11] fast-export: add --always-show-modify-after-rename Elijah Newren
@ 2018-11-14  7:25           ` Jeff King
  2018-11-16  7:59           ` [PATCH v3 " Elijah Newren
  12 siblings, 0 replies; 90+ messages in thread
From: Jeff King @ 2018-11-14  7:25 UTC (permalink / raw)
  To: Elijah Newren; +Cc: git, larsxschneider, sandals, me, jrnieder, gitster

On Tue, Nov 13, 2018 at 04:25:49PM -0800, Elijah Newren wrote:

> This is a series of small fixes and features for fast-export and
> fast-import, mostly on the fast-export side.

I looked over this, and I think you've addressed all of my questions.

A few quick comments:

> Changes since v1 (full range-diff below):
>   - used {tilde} in asciidoc documentation to avoid subscripting and
>     escaping problems

I think just using backticks would make the source more readable, as
well as make the output prettier. But that's pretty minor.

>   - renamed ABORT/ERROR enum values to help avoid further misusage

This is an improvement, I think. It's a little funny that we still have
bare names for the non-ABORT bits, though (there's less semantic
overlap, but if it's a good practice to use qualified enum names, we
should probably just do so consistently).

>   - multiple small testcase cleanups (use $ZERO_OID, remove grep -A, etc.)

Looks good.

>   - add FIXME comment to code about string_list usage

Makes sense.

>   - record Peff's idea for a future optimization in patch 8 commit message
>     (is there a better place to put that??)

Seems like a reasonable place (though you are welcome to restate it if
you like).

>   - New patch (9/11): remove the unmaintained copy of fast-import stream
>     format documentation at the beginning of fast-import.c

Looks good. I wondered if there might be bits that need to be migrated, but
given the length of time that comment has been there, it's unlikely. And
in the worst case, if somebody finds some information missing from
git-fast-import.txt, they can still consult the history.

>   - Rewrite commit message for 10/11 to match the wording Peff liked
>     better, s/originally/original-oid/, and add documentation to
>     git-fast-import.txt

Looks good.

>   - Rewrite commit message for 11/11; the last one didn't make sense to
>     Peff.  I hope this one does.

Thanks for your patience in getting me to understand what you're trying
to do. At this point I still think that using rev-list and diff-tree is
probably the right solution for your use case.

-Peff


* Re: [PATCH v2 04/11] fast-export: avoid dying when filtering by paths and old tags exist
  2018-11-14  0:25           ` [PATCH v2 04/11] fast-export: avoid dying when filtering by paths and old tags exist Elijah Newren
@ 2018-11-14 19:17             ` SZEDER Gábor
  2018-11-14 23:13               ` Elijah Newren
  0 siblings, 1 reply; 90+ messages in thread
From: SZEDER Gábor @ 2018-11-14 19:17 UTC (permalink / raw)
  To: Elijah Newren; +Cc: git, larsxschneider, sandals, peff, me, jrnieder, gitster

On Tue, Nov 13, 2018 at 04:25:53PM -0800, Elijah Newren wrote:
> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> index af724e9937..b984a44224 100644
> --- a/builtin/fast-export.c
> +++ b/builtin/fast-export.c
> @@ -774,9 +774,12 @@ static void handle_tag(const char *name, struct tag *tag)
>  					break;
>  				if (!(p->object.flags & TREESAME))
>  					break;
> -				if (!p->parents)
> -					die("can't find replacement commit for tag %s",
> -					     oid_to_hex(&tag->object.oid));
> +				if (!p->parents) {
> +					printf("reset %s\nfrom %s\n\n",
> +					       name, sha1_to_hex(null_sha1));

Please use oid_to_hex(&null_oid) instead.

> +					free(buf);
> +					return;
> +				}
>  				p = p->parents->item;
>  			}
>  			tagged_mark = get_object_mark(&p->object);


* Re: [PATCH v2 08/11] fast-export: add --reference-excluded-parents option
  2018-11-14  0:25           ` [PATCH v2 08/11] fast-export: add --reference-excluded-parents option Elijah Newren
@ 2018-11-14 19:27             ` SZEDER Gábor
  2018-11-14 23:16               ` Elijah Newren
  0 siblings, 1 reply; 90+ messages in thread
From: SZEDER Gábor @ 2018-11-14 19:27 UTC (permalink / raw)
  To: Elijah Newren; +Cc: git, larsxschneider, sandals, peff, me, jrnieder, gitster

On Tue, Nov 13, 2018 at 04:25:57PM -0800, Elijah Newren wrote:
> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> index 2fef00436b..3cc98c31ad 100644
> --- a/builtin/fast-export.c
> +++ b/builtin/fast-export.c
> @@ -37,6 +37,7 @@ static int fake_missing_tagger;
>  static int use_done_feature;
>  static int no_data;
>  static int full_tree;
> +static int reference_excluded_commits;
>  static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
>  static struct string_list tag_refs = STRING_LIST_INIT_NODUP;
>  static struct refspec refspecs = REFSPEC_INIT_FETCH;
> @@ -596,7 +597,8 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
>  		message += 2;
>  
>  	if (commit->parents &&
> -	    get_object_mark(&commit->parents->item->object) != 0 &&
> +	    (get_object_mark(&commit->parents->item->object) != 0 ||
> +	     reference_excluded_commits) &&
>  	    !full_tree) {
>  		parse_commit_or_die(commit->parents->item);
>  		diff_tree_oid(get_commit_tree_oid(commit->parents->item),
> @@ -644,13 +646,21 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
>  	unuse_commit_buffer(commit, commit_buffer);
>  
>  	for (i = 0, p = commit->parents; p; p = p->next) {
> -		int mark = get_object_mark(&p->item->object);
> -		if (!mark)
> +		struct object *obj = &p->item->object;
> +		int mark = get_object_mark(obj);
> +
> +		if (!mark && !reference_excluded_commits)
>  			continue;
>  		if (i == 0)
> -			printf("from :%d\n", mark);
> +			printf("from ");
> +		else
> +			printf("merge ");
> +		if (mark)
> +			printf(":%d\n", mark);
>  		else
> -			printf("merge :%d\n", mark);
> +			printf("%s\n", sha1_to_hex(anonymize ?
> +						   anonymize_sha1(&obj->oid) :
> +						   obj->oid.hash));

Since we intend to move away from SHA-1, would this be a good time to
add an anonymize_oid() function, "while at it"?

>  		i++;
>  	}
>  
> @@ -931,13 +941,22 @@ static void handle_tags_and_duplicates(struct string_list *extras)
>  				/*
>  				 * Getting here means we have a commit which
>  				 * was excluded by a negative refspec (e.g.
> -				 * fast-export ^master master).  If the user
> +				 * fast-export ^master master).  If we are
> +				 * referencing excluded commits, set the ref
> +				 * to the exact commit.  Otherwise, the user
>  				 * wants the branch exported but every commit
> -				 * in its history to be deleted, that sounds
> -				 * like a ref deletion to me.
> +				 * in its history to be deleted, which basically
> +				 * just means deletion of the ref.
>  				 */
> -				printf("reset %s\nfrom %s\n\n",
> -				       name, sha1_to_hex(null_sha1));
> +				if (!reference_excluded_commits) {
> +					/* delete the ref */
> +					printf("reset %s\nfrom %s\n\n",
> +					       name, sha1_to_hex(null_sha1));
> +					continue;
> +				}
> +				/* set ref to commit using oid, not mark */
> +				printf("reset %s\nfrom %s\n\n", name,
> +				       sha1_to_hex(commit->object.oid.hash));

Please use oid_to_hex(&commit->object.oid) instead.

>  				continue;
>  			}
>  


* Re: [PATCH v2 04/11] fast-export: avoid dying when filtering by paths and old tags exist
  2018-11-14 19:17             ` SZEDER Gábor
@ 2018-11-14 23:13               ` Elijah Newren
  0 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-14 23:13 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Git Mailing List, Lars Schneider, brian m. carlson, Jeff King,
	Taylor Blau, Jonathan Nieder, Junio C Hamano

On Wed, Nov 14, 2018 at 11:17 AM SZEDER Gábor <szeder.dev@gmail.com> wrote:
> On Tue, Nov 13, 2018 at 04:25:53PM -0800, Elijah Newren wrote:
> > diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> > index af724e9937..b984a44224 100644
> > --- a/builtin/fast-export.c
> > +++ b/builtin/fast-export.c
> > @@ -774,9 +774,12 @@ static void handle_tag(const char *name, struct tag *tag)
> >                                       break;
> >                               if (!(p->object.flags & TREESAME))
> >                                       break;
> > -                             if (!p->parents)
> > -                                     die("can't find replacement commit for tag %s",
> > -                                          oid_to_hex(&tag->object.oid));
> > +                             if (!p->parents) {
> > +                                     printf("reset %s\nfrom %s\n\n",
> > +                                            name, sha1_to_hex(null_sha1));
>
> Please use oid_to_hex(&null_oid) instead.

Will do.  Looks like origin/master:builtin/fast-export.c already had
two sha1_to_hex() calls, so I'll add a cleanup patch fixing those too.


* Re: [PATCH v2 08/11] fast-export: add --reference-excluded-parents option
  2018-11-14 19:27             ` SZEDER Gábor
@ 2018-11-14 23:16               ` Elijah Newren
  0 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-14 23:16 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Git Mailing List, Lars Schneider, brian m. carlson, Jeff King,
	Taylor Blau, Jonathan Nieder, Junio C Hamano

On Wed, Nov 14, 2018 at 11:28 AM SZEDER Gábor <szeder.dev@gmail.com> wrote:
>
> On Tue, Nov 13, 2018 at 04:25:57PM -0800, Elijah Newren wrote:
> > diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> > index 2fef00436b..3cc98c31ad 100644
> > --- a/builtin/fast-export.c
> > +++ b/builtin/fast-export.c
> > @@ -37,6 +37,7 @@ static int fake_missing_tagger;
> >  static int use_done_feature;
> >  static int no_data;
> >  static int full_tree;
> > +static int reference_excluded_commits;
> >  static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
> >  static struct string_list tag_refs = STRING_LIST_INIT_NODUP;
> >  static struct refspec refspecs = REFSPEC_INIT_FETCH;
> > @@ -596,7 +597,8 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
> >               message += 2;
> >
> >       if (commit->parents &&
> > -         get_object_mark(&commit->parents->item->object) != 0 &&
> > +         (get_object_mark(&commit->parents->item->object) != 0 ||
> > +          reference_excluded_commits) &&
> >           !full_tree) {
> >               parse_commit_or_die(commit->parents->item);
> >               diff_tree_oid(get_commit_tree_oid(commit->parents->item),
> > @@ -644,13 +646,21 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
> >       unuse_commit_buffer(commit, commit_buffer);
> >
> >       for (i = 0, p = commit->parents; p; p = p->next) {
> > -             int mark = get_object_mark(&p->item->object);
> > -             if (!mark)
> > +             struct object *obj = &p->item->object;
> > +             int mark = get_object_mark(obj);
> > +
> > +             if (!mark && !reference_excluded_commits)
> >                       continue;
> >               if (i == 0)
> > -                     printf("from :%d\n", mark);
> > +                     printf("from ");
> > +             else
> > +                     printf("merge ");
> > +             if (mark)
> > +                     printf(":%d\n", mark);
> >               else
> > -                     printf("merge :%d\n", mark);
> > +                     printf("%s\n", sha1_to_hex(anonymize ?
> > +                                                anonymize_sha1(&obj->oid) :
> > +                                                obj->oid.hash));
>
> Since we intend to move away from SHA-1, would this be a good time to
> add an anonymize_oid() function, "while at it"?

Since I already need to add a cleanup commit to remove the
pre-existing sha1_to_hex() calls, I'll just
s/anonymize_sha1/anonymize_oid/ while at it in the same commit; it's
not called from any other file.

> >               i++;
> >       }
> >
> > @@ -931,13 +941,22 @@ static void handle_tags_and_duplicates(struct string_list *extras)
> >                               /*
> >                                * Getting here means we have a commit which
> >                                * was excluded by a negative refspec (e.g.
> > -                              * fast-export ^master master).  If the user
> > +                              * fast-export ^master master).  If we are
> > +                              * referencing excluded commits, set the ref
> > +                              * to the exact commit.  Otherwise, the user
> >                                * wants the branch exported but every commit
> > -                              * in its history to be deleted, that sounds
> > -                              * like a ref deletion to me.
> > +                              * in its history to be deleted, which basically
> > +                              * just means deletion of the ref.
> >                                */
> > -                             printf("reset %s\nfrom %s\n\n",
> > -                                    name, sha1_to_hex(null_sha1));
> > +                             if (!reference_excluded_commits) {
> > +                                     /* delete the ref */
> > +                                     printf("reset %s\nfrom %s\n\n",
> > +                                            name, sha1_to_hex(null_sha1));
> > +                                     continue;
> > +                             }
> > +                             /* set ref to commit using oid, not mark */
> > +                             printf("reset %s\nfrom %s\n\n", name,
> > +                                    sha1_to_hex(commit->object.oid.hash));
>
> Please use oid_to_hex(&commit->object.oid) instead.

Yeah, there were a couple others I introduced too.  I'll fix them all up.


* [PATCH v3 00/11] fast export and import fixes and features
  2018-11-14  0:25         ` [PATCH v2 00/11] " Elijah Newren
                             ` (11 preceding siblings ...)
  2018-11-14  7:25           ` [PATCH v2 00/11] fast export and import fixes and features Jeff King
@ 2018-11-16  7:59           ` Elijah Newren
  2018-11-16  7:59             ` [PATCH v3 01/11] fast-export: convert sha1 to oid Elijah Newren
                               ` (11 more replies)
  12 siblings, 12 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-16  7:59 UTC (permalink / raw)
  To: gitster
  Cc: git, larsxschneider, sandals, peff, me, jrnieder, szeder.dev,
	Elijah Newren

This is a series of small fixes and features for fast-export and
fast-import, mostly on the fast-export side.

Changes since v2 (full range-diff below):
  * Dropped the final patch; going to try to use Peff's suggestion of
    rev-list and diff-tree to get what I need instead
  * Inserted a new patch at the beginning to convert pre-existing sha1
    stuff to oid (rename sha1_to_hex() -> oid_to_hex(), rename
    anonymize_sha1() to anonymize_oid(), etc.)
  * Modified other patches in the series to add calls to oid_to_hex() rather
    than sha1_to_hex()

Elijah Newren (11):
  fast-export: convert sha1 to oid
  git-fast-import.txt: fix documentation for --quiet option
  git-fast-export.txt: clarify misleading documentation about rev-list
    args
  fast-export: use value from correct enum
  fast-export: avoid dying when filtering by paths and old tags exist
  fast-export: move commit rewriting logic into a function for reuse
  fast-export: when using paths, avoid corrupt stream with non-existent
    mark
  fast-export: ensure we export requested refs
  fast-export: add --reference-excluded-parents option
  fast-import: remove unmaintained duplicate documentation
  fast-export: add a --show-original-ids option to show original names

 Documentation/git-fast-export.txt |  23 +++-
 Documentation/git-fast-import.txt |  23 +++-
 builtin/fast-export.c             | 190 +++++++++++++++++++++---------
 fast-import.c                     | 166 ++------------------------
 t/t9350-fast-export.sh            |  80 ++++++++++++-
 5 files changed, 268 insertions(+), 214 deletions(-)

 -:  ---------- >  1:  4c3370c85f fast-export: convert sha1 to oid
 1:  8870fb1340 =  2:  6ffa30e3c7 git-fast-import.txt: fix documentation for --quiet option
 2:  16d1c3e22d =  3:  1e278f009a git-fast-export.txt: clarify misleading documentation about rev-list args
 3:  e19f6b36f9 =  4:  9d7b2aef49 fast-export: use value from correct enum
 4:  2b305561d5 !  5:  b65a591d4d fast-export: avoid dying when filtering by paths and old tags exist
    @@ -29,7 +29,7 @@
     -					     oid_to_hex(&tag->object.oid));
     +				if (!p->parents) {
     +					printf("reset %s\nfrom %s\n\n",
    -+					       name, sha1_to_hex(null_sha1));
    ++					       name, oid_to_hex(&null_oid));
     +					free(buf);
     +					return;
     +				}
 5:  607b1dc2b2 !  6:  dde52c9cb6 fast-export: move commit rewriting logic into a function for reuse
    @@ -47,7 +47,7 @@
     -					break;
     -				if (!p->parents) {
     -					printf("reset %s\nfrom %s\n\n",
    --					       name, sha1_to_hex(null_sha1));
    +-					       name, oid_to_hex(&null_oid));
     -					free(buf);
     -					return;
     -				}
    @@ -55,7 +55,7 @@
     +			p = rewrite_commit((struct commit *)tagged);
     +			if (!p) {
     +				printf("reset %s\nfrom %s\n\n",
    -+				       name, sha1_to_hex(null_sha1));
    ++				       name, oid_to_hex(&null_oid));
     +				free(buf);
     +				return;
      			}
 6:  ec1862e858 !  7:  d9b2e326f0 fast-export: when using paths, avoid corrupt stream with non-existent mark
    @@ -35,7 +35,7 @@
     +				 * it.
     +				 */
     +				printf("reset %s\nfrom %s\n\n",
    -+				       name, sha1_to_hex(null_sha1));
    ++				       name, oid_to_hex(&null_oid));
     +				continue;
     +			}
      			printf("reset %s\nfrom :%d\n\n", name,
 7:  9da26e3ccb !  8:  9ddb155a70 fast-export: ensure we export requested refs
    @@ -97,7 +97,7 @@
      		case OBJ_TAG:
      			handle_tag(name, (struct tag *)object);
     @@
    - 				       name, sha1_to_hex(null_sha1));
    + 				       name, oid_to_hex(&null_oid));
      				continue;
      			}
     -			printf("reset %s\nfrom :%d\n\n", name,
    @@ -114,7 +114,7 @@
     +				 * like a ref deletion to me.
     +				 */
     +				printf("reset %s\nfrom %s\n\n",
    -+				       name, sha1_to_hex(null_sha1));
    ++				       name, oid_to_hex(&null_oid));
     +				continue;
     +			}
     +
 8:  7e5fe2f02e !  9:  595d2e5d30 fast-export: add --reference-excluded-parents option
    @@ -117,9 +117,9 @@
     +			printf(":%d\n", mark);
      		else
     -			printf("merge :%d\n", mark);
    -+			printf("%s\n", sha1_to_hex(anonymize ?
    -+						   anonymize_sha1(&obj->oid) :
    -+						   obj->oid.hash));
    ++			printf("%s\n", oid_to_hex(anonymize ?
    ++						  anonymize_oid(&obj->oid) :
    ++						  &obj->oid));
      		i++;
      	}
      
    @@ -138,16 +138,16 @@
     +				 * just means deletion of the ref.
      				 */
     -				printf("reset %s\nfrom %s\n\n",
    --				       name, sha1_to_hex(null_sha1));
    +-				       name, oid_to_hex(&null_oid));
     +				if (!reference_excluded_commits) {
     +					/* delete the ref */
     +					printf("reset %s\nfrom %s\n\n",
    -+					       name, sha1_to_hex(null_sha1));
    ++					       name, oid_to_hex(&null_oid));
     +					continue;
     +				}
     +				/* set ref to commit using oid, not mark */
     +				printf("reset %s\nfrom %s\n\n", name,
    -+				       sha1_to_hex(commit->object.oid.hash));
    ++				       oid_to_hex(&commit->object.oid));
      				continue;
      			}
      
    @@ -156,7 +156,7 @@
      			     N_("Apply refspec to exported refs")),
      		OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")),
     +		OPT_BOOL(0, "reference-excluded-parents",
    -+			 &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by sha1sum")),
    ++			 &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by object id")),
     +
      		OPT_END()
      	};
 9:  14306a8436 = 10:  2686246a89 fast-import: remove unmaintained duplicate documentation
10:  72487a61e4 ! 11:  b78d548e7d fast-export: add a --show-original-ids option to show original names
    @@ -141,9 +141,9 @@
     @@
      		OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")),
      		OPT_BOOL(0, "reference-excluded-parents",
    - 			 &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by sha1sum")),
    + 			 &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by object id")),
     +		OPT_BOOL(0, "show-original-ids", &show_original_ids,
    -+			    N_("Show original sha1sums of blobs/commits")),
    ++			    N_("Show original object ids of blobs/commits")),
      
      		OPT_END()
      	};

-- 
2.19.1.1063.g1796373474.dirty


* [PATCH v3 01/11] fast-export: convert sha1 to oid
  2018-11-16  7:59           ` [PATCH v3 " Elijah Newren
@ 2018-11-16  7:59             ` Elijah Newren
  2018-11-16  7:59             ` [PATCH v3 02/11] git-fast-import.txt: fix documentation for --quiet option Elijah Newren
                               ` (10 subsequent siblings)
  11 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-16  7:59 UTC (permalink / raw)
  To: gitster
  Cc: git, larsxschneider, sandals, peff, me, jrnieder, szeder.dev,
	Elijah Newren

Rename anonymize_sha1() to anonymize_oid() and change its signature,
and switch from sha1_to_hex() to oid_to_hex() and from GIT_SHA1_RAWSZ to
the_hash_algo->rawsz.  Also change a comment and a die string to mention
oid instead of sha1.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 456797c12a..f5166ac71e 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -243,7 +243,7 @@ static void export_blob(const struct object_id *oid)
 		if (!buf)
 			die("could not read blob %s", oid_to_hex(oid));
 		if (check_object_signature(oid, buf, size, type_name(type)) < 0)
-			die("sha1 mismatch in blob %s", oid_to_hex(oid));
+			die("oid mismatch in blob %s", oid_to_hex(oid));
 		object = parse_object_buffer(the_repository, oid, type,
 					     size, buf, &eaten);
 	}
@@ -330,17 +330,18 @@ static void print_path(const char *path)
 
 static void *generate_fake_oid(const void *old, size_t *len)
 {
-	static uint32_t counter = 1; /* avoid null sha1 */
-	unsigned char *out = xcalloc(GIT_SHA1_RAWSZ, 1);
-	put_be32(out + GIT_SHA1_RAWSZ - 4, counter++);
+	static uint32_t counter = 1; /* avoid null oid */
+	const unsigned hashsz = the_hash_algo->rawsz;
+	unsigned char *out = xcalloc(hashsz, 1);
+	put_be32(out + hashsz - 4, counter++);
 	return out;
 }
 
-static const unsigned char *anonymize_sha1(const struct object_id *oid)
+static const struct object_id *anonymize_oid(const struct object_id *oid)
 {
-	static struct hashmap sha1s;
-	size_t len = GIT_SHA1_RAWSZ;
-	return anonymize_mem(&sha1s, generate_fake_oid, oid, &len);
+	static struct hashmap objs;
+	size_t len = the_hash_algo->rawsz;
+	return anonymize_mem(&objs, generate_fake_oid, oid, &len);
 }
 
 static void show_filemodify(struct diff_queue_struct *q,
@@ -399,9 +400,9 @@ static void show_filemodify(struct diff_queue_struct *q,
 			 */
 			if (no_data || S_ISGITLINK(spec->mode))
 				printf("M %06o %s ", spec->mode,
-				       sha1_to_hex(anonymize ?
-						   anonymize_sha1(&spec->oid) :
-						   spec->oid.hash));
+				       oid_to_hex(anonymize ?
+						  anonymize_oid(&spec->oid) :
+						  &spec->oid));
 			else {
 				struct object *object = lookup_object(the_repository,
 								      spec->oid.hash);
@@ -988,7 +989,7 @@ static void handle_deletes(void)
 			continue;
 
 		printf("reset %s\nfrom %s\n\n",
-				refspec->dst, sha1_to_hex(null_sha1));
+				refspec->dst, oid_to_hex(&null_oid));
 	}
 }
 
-- 
2.19.1.1063.g1796373474.dirty
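
A minimal sketch of the idea behind the generate_fake_oid() change in the patch above, outside of git: the fake id is a zero-filled buffer whose last four bytes hold a big-endian counter, and the buffer length comes from a parameter rather than a SHA-1 constant. The names store_be32() and fake_oid() are hypothetical stand-ins (git itself uses put_be32() and the_hash_algo->rawsz), and passing the counter by pointer is an assumption made here for testability.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Store a 32-bit value in big-endian order (stand-in for git's put_be32()). */
void store_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/*
 * Return a zero-filled buffer of hashsz bytes whose last four bytes hold
 * the current counter value, then advance the counter.  Starting the
 * counter at 1 keeps the result distinct from the all-zero id.
 * The caller owns the buffer and must free() it.
 */
unsigned char *fake_oid(size_t hashsz, uint32_t *counter)
{
	unsigned char *out;

	assert(hashsz >= 4 && counter != NULL);
	out = calloc(hashsz, 1);
	if (!out)
		return NULL;
	store_be32(out + hashsz - 4, (*counter)++);
	return out;
}
```

Because the size is a parameter, the same routine works for a 20-byte and a 32-byte hash, which is the point of dropping GIT_SHA1_RAWSZ.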



* [PATCH v3 02/11] git-fast-import.txt: fix documentation for --quiet option
  2018-11-16  7:59           ` [PATCH v3 " Elijah Newren
  2018-11-16  7:59             ` [PATCH v3 01/11] fast-export: convert sha1 to oid Elijah Newren
@ 2018-11-16  7:59             ` Elijah Newren
  2018-11-16  7:59             ` [PATCH v3 03/11] git-fast-export.txt: clarify misleading documentation about rev-list args Elijah Newren
                               ` (9 subsequent siblings)
  11 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-16  7:59 UTC (permalink / raw)
  To: gitster
  Cc: git, larsxschneider, sandals, peff, me, jrnieder, szeder.dev,
	Elijah Newren

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-import.txt | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index e81117d27f..7ab97745a6 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -40,9 +40,10 @@ OPTIONS
 	not contain the old commit).
 
 --quiet::
-	Disable all non-fatal output, making fast-import silent when it
-	is successful.  This option disables the output shown by
-	--stats.
+	Disable the output shown by --stats, making fast-import usually
+	be silent when it is successful.  However, if the import stream
+	has directives intended to show user output (e.g. `progress`
+	directives), the corresponding messages will still be shown.
 
 --stats::
 	Display some basic statistics about the objects fast-import has
-- 
2.19.1.1063.g1796373474.dirty



* [PATCH v3 03/11] git-fast-export.txt: clarify misleading documentation about rev-list args
  2018-11-16  7:59           ` [PATCH v3 " Elijah Newren
  2018-11-16  7:59             ` [PATCH v3 01/11] fast-export: convert sha1 to oid Elijah Newren
  2018-11-16  7:59             ` [PATCH v3 02/11] git-fast-import.txt: fix documentation for --quiet option Elijah Newren
@ 2018-11-16  7:59             ` Elijah Newren
  2018-11-16  7:59             ` [PATCH v3 04/11] fast-export: use value from correct enum Elijah Newren
                               ` (8 subsequent siblings)
  11 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-16  7:59 UTC (permalink / raw)
  To: gitster
  Cc: git, larsxschneider, sandals, peff, me, jrnieder, szeder.dev,
	Elijah Newren

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-export.txt | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index ce954be532..fda55b3284 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -119,7 +119,8 @@ marks the same across runs.
 	'git rev-list', that specifies the specific objects and references
 	to export.  For example, `master~10..master` causes the
 	current master reference to be exported along with all objects
-	added since its 10th ancestor commit.
+	added since its 10th ancestor commit and all files common to
+	master{tilde}9 and master{tilde}10.
 
 EXAMPLES
 --------
-- 
2.19.1.1063.g1796373474.dirty



* [PATCH v3 04/11] fast-export: use value from correct enum
  2018-11-16  7:59           ` [PATCH v3 " Elijah Newren
                               ` (2 preceding siblings ...)
  2018-11-16  7:59             ` [PATCH v3 03/11] git-fast-export.txt: clarify misleading documentation about rev-list args Elijah Newren
@ 2018-11-16  7:59             ` Elijah Newren
  2018-11-16  7:59             ` [PATCH v3 05/11] fast-export: avoid dying when filtering by paths and old tags exist Elijah Newren
                               ` (7 subsequent siblings)
  11 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-16  7:59 UTC (permalink / raw)
  To: gitster
  Cc: git, larsxschneider, sandals, peff, me, jrnieder, szeder.dev,
	Elijah Newren

ABORT and ERROR happen to have the same value, but come from different
enums.  Use the one from the correct enum, and while at it, rename the
values to avoid such problems.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index f5166ac71e..e2be35f41e 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -31,8 +31,8 @@ static const char *fast_export_usage[] = {
 };
 
 static int progress;
-static enum { ABORT, VERBATIM, WARN, WARN_STRIP, STRIP } signed_tag_mode = ABORT;
-static enum { ERROR, DROP, REWRITE } tag_of_filtered_mode = ERROR;
+static enum { SIGNED_TAG_ABORT, VERBATIM, WARN, WARN_STRIP, STRIP } signed_tag_mode = SIGNED_TAG_ABORT;
+static enum { TAG_FILTERING_ABORT, DROP, REWRITE } tag_of_filtered_mode = TAG_FILTERING_ABORT;
 static int fake_missing_tagger;
 static int use_done_feature;
 static int no_data;
@@ -46,7 +46,7 @@ static int parse_opt_signed_tag_mode(const struct option *opt,
 				     const char *arg, int unset)
 {
 	if (unset || !strcmp(arg, "abort"))
-		signed_tag_mode = ABORT;
+		signed_tag_mode = SIGNED_TAG_ABORT;
 	else if (!strcmp(arg, "verbatim") || !strcmp(arg, "ignore"))
 		signed_tag_mode = VERBATIM;
 	else if (!strcmp(arg, "warn"))
@@ -64,7 +64,7 @@ static int parse_opt_tag_of_filtered_mode(const struct option *opt,
 					  const char *arg, int unset)
 {
 	if (unset || !strcmp(arg, "abort"))
-		tag_of_filtered_mode = ERROR;
+		tag_of_filtered_mode = TAG_FILTERING_ABORT;
 	else if (!strcmp(arg, "drop"))
 		tag_of_filtered_mode = DROP;
 	else if (!strcmp(arg, "rewrite"))
@@ -728,7 +728,7 @@ static void handle_tag(const char *name, struct tag *tag)
 					       "\n-----BEGIN PGP SIGNATURE-----\n");
 		if (signature)
 			switch(signed_tag_mode) {
-			case ABORT:
+			case SIGNED_TAG_ABORT:
 				die("encountered signed tag %s; use "
 				    "--signed-tags=<mode> to handle it",
 				    oid_to_hex(&tag->object.oid));
@@ -753,7 +753,7 @@ static void handle_tag(const char *name, struct tag *tag)
 	tagged_mark = get_object_mark(tagged);
 	if (!tagged_mark) {
 		switch(tag_of_filtered_mode) {
-		case ABORT:
+		case TAG_FILTERING_ABORT:
 			die("tag %s tags unexported object; use "
 			    "--tag-of-filtered-object=<mode> to handle it",
 			    oid_to_hex(&tag->object.oid));
-- 
2.19.1.1063.g1796373474.dirty
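
To illustrate why the old code compiled and behaved correctly despite mixing enums, here is a small standalone sketch. The enumerator names are taken from the patch above; the enum tag names and the parse_tag_of_filtered_mode() signature are assumptions made for this example and do not match git's option-parsing callback.

```c
#include <assert.h>
#include <string.h>

enum signed_tag_mode { SIGNED_TAG_ABORT, VERBATIM, WARN, WARN_STRIP, STRIP };
enum tag_of_filtered_mode { TAG_FILTERING_ABORT, DROP, REWRITE };

/*
 * The first enumerator of each enum is 0, so using the wrong one in a
 * switch or assignment happened to work.  Prefixed names make such a
 * mix-up visible when reading the code.
 */
_Static_assert((int)SIGNED_TAG_ABORT == (int)TAG_FILTERING_ABORT,
	       "first enumerators share the value 0");

/*
 * Simplified parser: store the parsed mode in *mode and return 0,
 * or return -1 for an unrecognized argument.
 */
int parse_tag_of_filtered_mode(const char *arg, enum tag_of_filtered_mode *mode)
{
	assert(arg != NULL && mode != NULL);
	if (!strcmp(arg, "abort"))
		*mode = TAG_FILTERING_ABORT;
	else if (!strcmp(arg, "drop"))
		*mode = DROP;
	else if (!strcmp(arg, "rewrite"))
		*mode = REWRITE;
	else
		return -1;
	return 0;
}
```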



* [PATCH v3 05/11] fast-export: avoid dying when filtering by paths and old tags exist
  2018-11-16  7:59           ` [PATCH v3 " Elijah Newren
                               ` (3 preceding siblings ...)
  2018-11-16  7:59             ` [PATCH v3 04/11] fast-export: use value from correct enum Elijah Newren
@ 2018-11-16  7:59             ` Elijah Newren
  2018-11-16  7:59             ` [PATCH v3 06/11] fast-export: move commit rewriting logic into a function for reuse Elijah Newren
                               ` (6 subsequent siblings)
  11 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-16  7:59 UTC (permalink / raw)
  To: gitster
  Cc: git, larsxschneider, sandals, peff, me, jrnieder, szeder.dev,
	Elijah Newren

If --tag-of-filtered-object=rewrite is specified along with a set of
paths to limit what is exported, then any tags pointing to old commits
that do not contain any of those specified paths cause problems.  Since
the old tagged commit is not exported, fast-export attempts to rewrite
such tags to an ancestor commit which was exported.  If no such commit
exists, then fast-export currently die()s.  Five years after the tag
rewriting logic was added to fast-export (see commit 2d8ad4691921,
"fast-export: Add a --tag-of-filtered-object option for newly dangling
tags", 2009-06-25), fast-import gained the ability to delete refs (see
commit 4ee1b225b99f, "fast-import: add support to delete refs",
2014-04-20), so now we do have a valid option to rewrite the tag to.
Delete these tags instead of dying.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c  |  9 ++++++---
 t/t9350-fast-export.sh | 16 ++++++++++++++++
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index e2be35f41e..7d50f5414e 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -775,9 +775,12 @@ static void handle_tag(const char *name, struct tag *tag)
 					break;
 				if (!(p->object.flags & TREESAME))
 					break;
-				if (!p->parents)
-					die("can't find replacement commit for tag %s",
-					     oid_to_hex(&tag->object.oid));
+				if (!p->parents) {
+					printf("reset %s\nfrom %s\n\n",
+					       name, oid_to_hex(&null_oid));
+					free(buf);
+					return;
+				}
 				p = p->parents->item;
 			}
 			tagged_mark = get_object_mark(&p->object);
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 6a392e87bc..3400ebeb51 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -325,6 +325,22 @@ test_expect_success 'rewriting tag of filtered out object' '
 )
 '
 
+test_expect_success 'rewrite tag predating pathspecs to nothing' '
+	test_create_repo rewrite_tag_predating_pathspecs &&
+	(
+		cd rewrite_tag_predating_pathspecs &&
+
+		test_commit initial &&
+
+		git tag -a -m "Some old tag" v0.0.0.0.0.0.1 &&
+
+		test_commit bar &&
+
+		git fast-export --tag-of-filtered-object=rewrite --all -- bar.t >output &&
+		grep from.$ZERO_OID output
+	)
+'
+
 cat > limit-by-paths/expected << EOF
 blob
 mark :1
-- 
2.19.1.1063.g1796373474.dirty
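
For readers unfamiliar with the stream format, the directive the patch above emits to delete a ref is a "reset" followed by a "from" line naming the all-zero object id. A rough standalone sketch of building that text follows; format_ref_deletion() and its hexsz parameter are hypothetical helpers for this example (git prints the directive directly using oid_to_hex(&null_oid)).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write a fast-import "delete this ref" directive into buf:
 *
 *     reset <refname>
 *     from <hexsz zeros>
 *     <blank line>
 *
 * hexsz is the length of a hex object id (40 for SHA-1).
 * Returns what snprintf() returns.
 */
int format_ref_deletion(char *buf, size_t buflen,
			const char *refname, size_t hexsz)
{
	char zeros[65];

	assert(buf != NULL && refname != NULL);
	assert(hexsz < sizeof(zeros));
	memset(zeros, '0', hexsz);
	zeros[hexsz] = '\0';
	return snprintf(buf, buflen, "reset %s\nfrom %s\n\n", refname, zeros);
}
```

The new test in the patch greps the export output for exactly this "from" line, using $ZERO_OID.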



* [PATCH v3 06/11] fast-export: move commit rewriting logic into a function for reuse
  2018-11-16  7:59           ` [PATCH v3 " Elijah Newren
                               ` (4 preceding siblings ...)
  2018-11-16  7:59             ` [PATCH v3 05/11] fast-export: avoid dying when filtering by paths and old tags exist Elijah Newren
@ 2018-11-16  7:59             ` Elijah Newren
  2018-11-16  7:59             ` [PATCH v3 07/11] fast-export: when using paths, avoid corrupt stream with non-existent mark Elijah Newren
                               ` (5 subsequent siblings)
  11 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-16  7:59 UTC (permalink / raw)
  To: gitster
  Cc: git, larsxschneider, sandals, peff, me, jrnieder, szeder.dev,
	Elijah Newren

Logic to replace a filtered commit with an unfiltered ancestor is useful
elsewhere; put it into a function we can call.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c | 37 ++++++++++++++++++++++---------------
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 7d50f5414e..43e98a38a8 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -187,6 +187,22 @@ static int get_object_mark(struct object *object)
 	return ptr_to_mark(decoration);
 }
 
+static struct commit *rewrite_commit(struct commit *p)
+{
+	for (;;) {
+		if (p->parents && p->parents->next)
+			break;
+		if (p->object.flags & UNINTERESTING)
+			break;
+		if (!(p->object.flags & TREESAME))
+			break;
+		if (!p->parents)
+			return NULL;
+		p = p->parents->item;
+	}
+	return p;
+}
+
 static void show_progress(void)
 {
 	static int counter = 0;
@@ -767,21 +783,12 @@ static void handle_tag(const char *name, struct tag *tag)
 				    oid_to_hex(&tag->object.oid),
 				    type_name(tagged->type));
 			}
-			p = (struct commit *)tagged;
-			for (;;) {
-				if (p->parents && p->parents->next)
-					break;
-				if (p->object.flags & UNINTERESTING)
-					break;
-				if (!(p->object.flags & TREESAME))
-					break;
-				if (!p->parents) {
-					printf("reset %s\nfrom %s\n\n",
-					       name, oid_to_hex(&null_oid));
-					free(buf);
-					return;
-				}
-				p = p->parents->item;
+			p = rewrite_commit((struct commit *)tagged);
+			if (!p) {
+				printf("reset %s\nfrom %s\n\n",
+				       name, oid_to_hex(&null_oid));
+				free(buf);
+				return;
 			}
 			tagged_mark = get_object_mark(&p->object);
 		}
-- 
2.19.1.1063.g1796373474.dirty
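
The ancestor walk factored out in the patch above can be tried in isolation with a mock commit type. This is only a sketch: struct mock_commit, the MOCK_* flags, and the is_merge field are assumptions standing in for git's struct commit, its parent list, and the UNINTERESTING and TREESAME object flags.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_UNINTERESTING (1u << 0)
#define MOCK_TREESAME      (1u << 1)

struct mock_commit {
	unsigned flags;
	struct mock_commit *first_parent; /* NULL for a root commit */
	int is_merge;                     /* nonzero if it has a second parent */
};

/*
 * Walk first parents until reaching a commit that is a merge, is
 * excluded from the walk, or changes the tree relative to its parent.
 * Return NULL if the history runs out first, meaning there is no
 * suitable replacement commit.
 */
struct mock_commit *mock_rewrite_commit(struct mock_commit *p)
{
	assert(p != NULL);
	for (;;) {
		if (p->is_merge)
			break;
		if (p->flags & MOCK_UNINTERESTING)
			break;
		if (!(p->flags & MOCK_TREESAME))
			break;
		if (!p->first_parent)
			return NULL;
		p = p->first_parent;
	}
	return p;
}
```

A NULL result corresponds to the case where the caller in the patch emits a ref deletion instead of pointing the tag at a replacement commit.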



* [PATCH v3 07/11] fast-export: when using paths, avoid corrupt stream with non-existent mark
  2018-11-16  7:59           ` [PATCH v3 " Elijah Newren
                               ` (5 preceding siblings ...)
  2018-11-16  7:59             ` [PATCH v3 06/11] fast-export: move commit rewriting logic into a function for reuse Elijah Newren
@ 2018-11-16  7:59             ` Elijah Newren
  2018-11-16  7:59             ` [PATCH v3 08/11] fast-export: ensure we export requested refs Elijah Newren
                               ` (4 subsequent siblings)
  11 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-16  7:59 UTC (permalink / raw)
  To: gitster
  Cc: git, larsxschneider, sandals, peff, me, jrnieder, szeder.dev,
	Elijah Newren

If file paths are specified to fast-export and multiple refs point to a
commit that does not touch any of the relevant file paths, then
fast-export can hit problems.  fast-export has a list of additional refs
that it needs to explicitly set after exporting all blobs and commits,
and when it tries to get_object_mark() on the relevant commit, it can
get a mark of 0, i.e. "not found", because the commit in question did
not touch the relevant paths and thus was not exported.  Trying to
import a stream with a mark corresponding to an unexported object will
cause fast-import to crash.

Avoid this problem by taking the commit the ref points to and finding an
ancestor of it that was exported, and make the ref point to that commit
instead.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c  | 13 ++++++++++++-
 t/t9350-fast-export.sh | 20 ++++++++++++++++++++
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 43e98a38a8..227488ae84 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -901,7 +901,18 @@ static void handle_tags_and_duplicates(void)
 			if (anonymize)
 				name = anonymize_refname(name);
 			/* create refs pointing to already seen commits */
-			commit = (struct commit *)object;
+			commit = rewrite_commit((struct commit *)object);
+			if (!commit) {
+				/*
+				 * Neither this object nor any of its
+				 * ancestors touch any relevant paths, so
+				 * it has been filtered to nothing.  Delete
+				 * it.
+				 */
+				printf("reset %s\nfrom %s\n\n",
+				       name, oid_to_hex(&null_oid));
+				continue;
+			}
 			printf("reset %s\nfrom :%d\n\n", name,
 			       get_object_mark(&commit->object));
 			show_progress();
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 3400ebeb51..299120ba70 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -382,6 +382,26 @@ test_expect_success 'path limiting with import-marks does not lose unmodified fi
 	grep file0 actual
 '
 
+test_expect_success 'avoid corrupt stream with non-existent mark' '
+	test_create_repo avoid_non_existent_mark &&
+	(
+		cd avoid_non_existent_mark &&
+
+		test_commit important-path &&
+
+		test_commit ignored &&
+
+		git branch A &&
+		git branch B &&
+
+		echo foo >>important-path.t &&
+		git add important-path.t &&
+		test_commit more changes &&
+
+		git fast-export --all -- important-path.t | git fast-import --force
+	)
+'
+
 test_expect_success 'full-tree re-shows unmodified files'        '
 	git checkout -f simple &&
 	git fast-export --full-tree simple >actual &&
-- 
2.19.1.1063.g1796373474.dirty



* [PATCH v3 08/11] fast-export: ensure we export requested refs
  2018-11-16  7:59           ` [PATCH v3 " Elijah Newren
                               ` (6 preceding siblings ...)
  2018-11-16  7:59             ` [PATCH v3 07/11] fast-export: when using paths, avoid corrupt stream with non-existent mark Elijah Newren
@ 2018-11-16  7:59             ` Elijah Newren
  2018-11-16  7:59             ` [PATCH v3 09/11] fast-export: add --reference-excluded-parents option Elijah Newren
                               ` (3 subsequent siblings)
  11 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-16  7:59 UTC (permalink / raw)
  To: gitster
  Cc: git, larsxschneider, sandals, peff, me, jrnieder, szeder.dev,
	Elijah Newren

If file paths are specified to fast-export and a ref points to a commit
that does not touch any of the relevant paths, then that ref would
sometimes fail to be exported.  (This depends on whether any ancestors
of the commit which do touch the relevant paths would be exported with
that same ref name or a different ref name.)  To avoid this problem,
put *all* specified refs into extra_refs to start, and then as we export
each commit, remove the refname used in the 'commit $REFNAME' directive
from extra_refs.  Then, in handle_tags_and_duplicates() we know which
refs actually do need a manual reset directive in order to be included.

This means that we do need some special handling for excluded refs; e.g.
if someone runs
   git fast-export ^master master
then they've asked for master to be exported, but they have also asked
for the commit which master points to and all of its history to be
excluded.  That logically means ref deletion.  Previously, such refs
were just silently omitted from being exported despite having been
explicitly requested for export.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c  | 54 ++++++++++++++++++++++++++++++++----------
 t/t9350-fast-export.sh | 16 ++++++++++---
 2 files changed, 55 insertions(+), 15 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 227488ae84..d71e0333d4 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -38,6 +38,7 @@ static int use_done_feature;
 static int no_data;
 static int full_tree;
 static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
+static struct string_list tag_refs = STRING_LIST_INIT_NODUP;
 static struct refspec refspecs = REFSPEC_INIT_FETCH;
 static int anonymize;
 static struct revision_sources revision_sources;
@@ -612,6 +613,13 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 			export_blob(&diff_queued_diff.queue[i]->two->oid);
 
 	refname = *revision_sources_at(&revision_sources, commit);
+	/*
+	 * FIXME: string_list_remove() below for each ref is overall
+	 * O(N^2).  Compared to a history walk and diffing trees, this is
+	 * just lost in the noise in practice.  However, theoretically a
+	 * repo may have enough refs for this to become slow.
+	 */
+	string_list_remove(&extra_refs, refname, 0);
 	if (anonymize) {
 		refname = anonymize_refname(refname);
 		anonymize_ident_line(&committer, &committer_end);
@@ -815,7 +823,7 @@ static struct commit *get_commit(struct rev_cmdline_entry *e, char *full_name)
 		/* handle nested tags */
 		while (tag && tag->object.type == OBJ_TAG) {
 			parse_object(the_repository, &tag->object.oid);
-			string_list_append(&extra_refs, full_name)->util = tag;
+			string_list_append(&tag_refs, full_name)->util = tag;
 			tag = (struct tag *)tag->tagged;
 		}
 		if (!tag)
@@ -874,25 +882,30 @@ static void get_tags_and_duplicates(struct rev_cmdline_info *info)
 		}
 
 		/*
-		 * This ref will not be updated through a commit, lets make
-		 * sure it gets properly updated eventually.
+		 * Make sure this ref gets properly updated eventually, whether
+		 * through a commit or manually at the end.
 		 */
-		if (*revision_sources_at(&revision_sources, commit) ||
-		    commit->object.flags & SHOWN)
+		if (e->item->type != OBJ_TAG)
 			string_list_append(&extra_refs, full_name)->util = commit;
+
 		if (!*revision_sources_at(&revision_sources, commit))
 			*revision_sources_at(&revision_sources, commit) = full_name;
 	}
+
+	string_list_sort(&extra_refs);
+	string_list_remove_duplicates(&extra_refs, 0);
 }
 
-static void handle_tags_and_duplicates(void)
+static void handle_tags_and_duplicates(struct string_list *extras)
 {
 	struct commit *commit;
 	int i;
 
-	for (i = extra_refs.nr - 1; i >= 0; i--) {
-		const char *name = extra_refs.items[i].string;
-		struct object *object = extra_refs.items[i].util;
+	for (i = extras->nr - 1; i >= 0; i--) {
+		const char *name = extras->items[i].string;
+		struct object *object = extras->items[i].util;
+		int mark;
+
 		switch (object->type) {
 		case OBJ_TAG:
 			handle_tag(name, (struct tag *)object);
@@ -913,8 +926,24 @@ static void handle_tags_and_duplicates(void)
 				       name, oid_to_hex(&null_oid));
 				continue;
 			}
-			printf("reset %s\nfrom :%d\n\n", name,
-			       get_object_mark(&commit->object));
+
+			mark = get_object_mark(&commit->object);
+			if (!mark) {
+				/*
+				 * Getting here means we have a commit which
+				 * was excluded by a negative refspec (e.g.
+				 * fast-export ^master master).  If the user
+				 * wants the branch exported but every commit
+				 * in its history to be deleted, that sounds
+				 * like a ref deletion to me.
+				 */
+				printf("reset %s\nfrom %s\n\n",
+				       name, oid_to_hex(&null_oid));
+				continue;
+			}
+
+			printf("reset %s\nfrom :%d\n\n", name, mark
+			       );
 			show_progress();
 			break;
 		}
@@ -1102,7 +1131,8 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		}
 	}
 
-	handle_tags_and_duplicates();
+	handle_tags_and_duplicates(&extra_refs);
+	handle_tags_and_duplicates(&tag_refs);
 	handle_deletes();
 
 	if (export_filename && lastimportid != last_idnum)
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 299120ba70..50c2fceef4 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -544,10 +544,20 @@ test_expect_success 'use refspec' '
 	test_cmp expected actual
 '
 
-test_expect_success 'delete refspec' '
+test_expect_success 'delete ref because entire history excluded' '
 	git branch to-delete &&
-	git fast-export --refspec :refs/heads/to-delete to-delete ^to-delete > actual &&
-	cat > expected <<-EOF &&
+	git fast-export to-delete ^to-delete >actual &&
+	cat >expected <<-EOF &&
+	reset refs/heads/to-delete
+	from 0000000000000000000000000000000000000000
+
+	EOF
+	test_cmp expected actual
+'
+
+test_expect_success 'delete refspec' '
+	git fast-export --refspec :refs/heads/to-delete >actual &&
+	cat >expected <<-EOF &&
 	reset refs/heads/to-delete
 	from 0000000000000000000000000000000000000000
 
-- 
2.19.1.1063.g1796373474.dirty
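
For readers following along, the ref-deletion form tested above ("reset <ref>" followed by "from <all-zero id>") can be spotted in a stream with a few lines of shell. This is only a sketch under stated assumptions: fake_oid and list_deleted_refs are hypothetical helper names, the sample stream is made up, and the awk script does not skip "data" payloads, so a commit message that happens to contain matching lines would confuse it.

```shell
# Hypothetical helper: a 40-character placeholder id built from one hex digit.
fake_oid() { printf '%040d' 0 | tr 0 "$1"; }

# Print every ref that the stream on stdin deletes, i.e. every
# "reset <ref>" line immediately followed by "from <all-zero id>".
list_deleted_refs() {
    awk '
        $1 == "reset" { ref = $2; next }
        $1 == "from" && ref != "" && length($2) == 40 && $2 ~ /^0+$/ { print ref }
        { ref = "" }
    '
}

# Made-up stream: one deleted ref, one ref reset to a mark.
sample=$(printf 'reset refs/heads/to-delete\nfrom %s\n\nreset refs/heads/kept\nfrom :7\n\n' \
    "$(fake_oid 0)")

deleted=$(printf '%s\n' "$sample" | list_deleted_refs)
printf '%s\n' "$deleted"
```

Only refs/heads/to-delete is reported; refs/heads/kept is reset to a mark and therefore survives.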



* [PATCH v3 09/11] fast-export: add --reference-excluded-parents option
  2018-11-16  7:59           ` [PATCH v3 " Elijah Newren
                               ` (7 preceding siblings ...)
  2018-11-16  7:59             ` [PATCH v3 08/11] fast-export: ensure we export requested refs Elijah Newren
@ 2018-11-16  7:59             ` Elijah Newren
  2018-11-16  7:59             ` [PATCH v3 10/11] fast-import: remove unmaintained duplicate documentation Elijah Newren
                               ` (2 subsequent siblings)
  11 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-16  7:59 UTC (permalink / raw)
  To: gitster
  Cc: git, larsxschneider, sandals, peff, me, jrnieder, szeder.dev,
	Elijah Newren

git filter-branch has a nifty feature allowing you to rewrite, e.g., just
the last 8 commits of a linear history:
  git filter-branch $OPTIONS HEAD~8..HEAD

If you try the same with git fast-export, you instead get a history of
only 8 commits, with HEAD~7 being rewritten into a root commit.  There
are two alternatives:

  1) Don't use the negative revision specification, and when you're
     filtering the output to make modifications to the last 8 commits,
     just be careful to not modify any earlier commits somehow.

  2) First run 'git fast-export --export-marks=somefile HEAD~8', then
     run 'git fast-export --import-marks=somefile HEAD~8..HEAD'.

Both are more error prone than I'd like (the first for obvious reasons;
with the second option I have sometimes accidentally included too many
revisions in the first command and then found that the corresponding
extra revisions were not exported by the second command and thus were
not modified as I expected).  Also, both are poor from a performance
perspective.

Add a new --reference-excluded-parents option which will cause
fast-export to refer to commits outside the specified rev-list-args
range by their sha1sum.  Such a stream will only be useful in a
repository which already contains the necessary commits (much like the
restriction imposed when using --no-data).

Note from Peff:
  I think we might be able to do a little more optimization here. If
  we're exporting HEAD^..HEAD and there's an object in HEAD^ which is
  unchanged in HEAD, I think we'd still print it (because it would not
  be marked SHOWN), but we could omit it (by walking the tree of the
  boundary commits and marking them shown).  I don't think it's a
  blocker for what you're doing here, but just a possible future
  optimization.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-export.txt | 17 +++++++++++--
 builtin/fast-export.c             | 42 +++++++++++++++++++++++--------
 t/t9350-fast-export.sh            | 11 ++++++++
 3 files changed, 58 insertions(+), 12 deletions(-)

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index fda55b3284..f65026662a 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -110,6 +110,18 @@ marks the same across runs.
 	the shape of the history and stored tree.  See the section on
 	`ANONYMIZING` below.
 
+--reference-excluded-parents::
+	By default, running a command such as `git fast-export
+	master~5..master` will not include the commit master{tilde}5
+	and will make master{tilde}4 no longer have master{tilde}5 as
+	a parent (though both the old master{tilde}4 and new
+	master{tilde}4 will have all the same files).  Use
+	--reference-excluded-parents to instead have the stream
+	refer to commits in the excluded range of history by their
+	sha1sum.  Note that the resulting stream can only be used by a
+	repository which already contains the necessary parent
+	commits.
+
 --refspec::
 	Apply the specified refspec to each ref exported. Multiple of them can
 	be specified.
@@ -119,8 +131,9 @@ marks the same across runs.
 	'git rev-list', that specifies the specific objects and references
 	to export.  For example, `master~10..master` causes the
 	current master reference to be exported along with all objects
-	added since its 10th ancestor commit and all files common to
-	master{tilde}9 and master{tilde}10.
+	added since its 10th ancestor commit and (unless the
+	--reference-excluded-parents option is specified) all files
+	common to master{tilde}9 and master{tilde}10.
 
 EXAMPLES
 --------
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index d71e0333d4..78fc67b03a 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -37,6 +37,7 @@ static int fake_missing_tagger;
 static int use_done_feature;
 static int no_data;
 static int full_tree;
+static int reference_excluded_commits;
 static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
 static struct string_list tag_refs = STRING_LIST_INIT_NODUP;
 static struct refspec refspecs = REFSPEC_INIT_FETCH;
@@ -597,7 +598,8 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 		message += 2;
 
 	if (commit->parents &&
-	    get_object_mark(&commit->parents->item->object) != 0 &&
+	    (get_object_mark(&commit->parents->item->object) != 0 ||
+	     reference_excluded_commits) &&
 	    !full_tree) {
 		parse_commit_or_die(commit->parents->item);
 		diff_tree_oid(get_commit_tree_oid(commit->parents->item),
@@ -645,13 +647,21 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 	unuse_commit_buffer(commit, commit_buffer);
 
 	for (i = 0, p = commit->parents; p; p = p->next) {
-		int mark = get_object_mark(&p->item->object);
-		if (!mark)
+		struct object *obj = &p->item->object;
+		int mark = get_object_mark(obj);
+
+		if (!mark && !reference_excluded_commits)
 			continue;
 		if (i == 0)
-			printf("from :%d\n", mark);
+			printf("from ");
+		else
+			printf("merge ");
+		if (mark)
+			printf(":%d\n", mark);
 		else
-			printf("merge :%d\n", mark);
+			printf("%s\n", oid_to_hex(anonymize ?
+						  anonymize_oid(&obj->oid) :
+						  &obj->oid));
 		i++;
 	}
 
@@ -932,13 +942,22 @@ static void handle_tags_and_duplicates(struct string_list *extras)
 				/*
 				 * Getting here means we have a commit which
 				 * was excluded by a negative refspec (e.g.
-				 * fast-export ^master master).  If the user
+				 * fast-export ^master master).  If we are
+				 * referencing excluded commits, set the ref
+				 * to the exact commit.  Otherwise, the user
 				 * wants the branch exported but every commit
-				 * in its history to be deleted, that sounds
-				 * like a ref deletion to me.
+				 * in its history to be deleted, which basically
+				 * just means deletion of the ref.
 				 */
-				printf("reset %s\nfrom %s\n\n",
-				       name, oid_to_hex(&null_oid));
+				if (!reference_excluded_commits) {
+					/* delete the ref */
+					printf("reset %s\nfrom %s\n\n",
+					       name, oid_to_hex(&null_oid));
+					continue;
+				}
+				/* set ref to commit using oid, not mark */
+				printf("reset %s\nfrom %s\n\n", name,
+				       oid_to_hex(&commit->object.oid));
 				continue;
 			}
 
@@ -1075,6 +1094,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		OPT_STRING_LIST(0, "refspec", &refspecs_list, N_("refspec"),
 			     N_("Apply refspec to exported refs")),
 		OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")),
+		OPT_BOOL(0, "reference-excluded-parents",
+			 &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by object id")),
+
 		OPT_END()
 	};
 
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 50c2fceef4..d7d73061d0 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -66,6 +66,17 @@ test_expect_success 'fast-export master~2..master' '
 
 '
 
+test_expect_success 'fast-export --reference-excluded-parents master~2..master' '
+
+	git fast-export --reference-excluded-parents master~2..master >actual &&
+	grep commit.refs/heads/master actual >commit-count &&
+	test_line_count = 2 commit-count &&
+	sed "s/master/rewrite/" actual |
+		(cd new &&
+		 git fast-import &&
+		 test $MASTER = $(git rev-parse --verify refs/heads/rewrite))
+'
+
 test_expect_success 'iso-8859-1' '
 
 	git config i18n.commitencoding ISO8859-1 &&
-- 
2.19.1.1063.g1796373474.dirty
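
Since a stream produced with --reference-excluded-parents is only usable in a repository that already has the referenced commits, it may help to list those commits before importing. The following is a rough sketch, not part of the patch: fake_oid and list_external_parents are hypothetical names, the sample stream is made up and trimmed (no committer or data lines), and "data" payloads are not skipped.

```shell
# Hypothetical helper: a 40-character placeholder id built from one hex digit.
fake_oid() { printf '%040d' 0 | tr 0 "$1"; }

# Print the object ids that "from"/"merge" lines name directly rather
# than through a :mark.  The all-zero id is skipped because it denotes
# a ref deletion, not a parent.
list_external_parents() {
    awk '
        ($1 == "from" || $1 == "merge") && NF == 2 &&
        length($2) == 40 && $2 ~ /^[0-9a-f]+$/ && $2 !~ /^0+$/ { print $2 }
    '
}

# Made-up stream: one excluded first parent, one marked parent, one
# excluded merge parent, and one ref deletion.
sample=$(printf 'commit refs/heads/master\nmark :1\nfrom %s\n\ncommit refs/heads/master\nmark :2\nfrom :1\nmerge %s\n\nreset refs/heads/gone\nfrom %s\n\n' \
    "$(fake_oid a)" "$(fake_oid b)" "$(fake_oid 0)")

external=$(printf '%s\n' "$sample" | list_external_parents)
printf '%s\n' "$external"
```

Each id printed this way could then be checked in the target repository, for example with 'git cat-file -e <id>^{commit}', before feeding the stream to fast-import.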



* [PATCH v3 10/11] fast-import: remove unmaintained duplicate documentation
  2018-11-16  7:59           ` [PATCH v3 " Elijah Newren
                               ` (8 preceding siblings ...)
  2018-11-16  7:59             ` [PATCH v3 09/11] fast-export: add --reference-excluded-parents option Elijah Newren
@ 2018-11-16  7:59             ` Elijah Newren
  2018-11-16  7:59             ` [PATCH v3 11/11] fast-export: add a --show-original-ids option to show original names Elijah Newren
  2018-11-16  8:50             ` [PATCH v3 00/11] fast export and import fixes and features Jeff King
  11 siblings, 0 replies; 90+ messages in thread
From: Elijah Newren @ 2018-11-16  7:59 UTC (permalink / raw)
  To: gitster
  Cc: git, larsxschneider, sandals, peff, me, jrnieder, szeder.dev,
	Elijah Newren

fast-import.c has started with a comment for nine and a half years
re-directing the reader to Documentation/git-fast-import.txt for
maintained documentation.  Instead of leaving the unmaintained
documentation in place, just excise it.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 fast-import.c | 154 --------------------------------------------------
 1 file changed, 154 deletions(-)

diff --git a/fast-import.c b/fast-import.c
index 95600c78e0..555d49ad23 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1,157 +1,3 @@
-/*
-(See Documentation/git-fast-import.txt for maintained documentation.)
-Format of STDIN stream:
-
-  stream ::= cmd*;
-
-  cmd ::= new_blob
-        | new_commit
-        | new_tag
-        | reset_branch
-        | checkpoint
-        | progress
-        ;
-
-  new_blob ::= 'blob' lf
-    mark?
-    file_content;
-  file_content ::= data;
-
-  new_commit ::= 'commit' sp ref_str lf
-    mark?
-    ('author' (sp name)? sp '<' email '>' sp when lf)?
-    'committer' (sp name)? sp '<' email '>' sp when lf
-    commit_msg
-    ('from' sp commit-ish lf)?
-    ('merge' sp commit-ish lf)*
-    (file_change | ls)*
-    lf?;
-  commit_msg ::= data;
-
-  ls ::= 'ls' sp '"' quoted(path) '"' lf;
-
-  file_change ::= file_clr
-    | file_del
-    | file_rnm
-    | file_cpy
-    | file_obm
-    | file_inm;
-  file_clr ::= 'deleteall' lf;
-  file_del ::= 'D' sp path_str lf;
-  file_rnm ::= 'R' sp path_str sp path_str lf;
-  file_cpy ::= 'C' sp path_str sp path_str lf;
-  file_obm ::= 'M' sp mode sp (hexsha1 | idnum) sp path_str lf;
-  file_inm ::= 'M' sp mode sp 'inline' sp path_str lf
-    data;
-  note_obm ::= 'N' sp (hexsha1 | idnum) sp commit-ish lf;
-  note_inm ::= 'N' sp 'inline' sp commit-ish lf
-    data;
-
-  new_tag ::= 'tag' sp tag_str lf
-    'from' sp commit-ish lf
-    ('tagger' (sp name)? sp '<' email '>' sp when lf)?
-    tag_msg;
-  tag_msg ::= data;
-
-  reset_branch ::= 'reset' sp ref_str lf
-    ('from' sp commit-ish lf)?
-    lf?;
-
-  checkpoint ::= 'checkpoint' lf
-    lf?;
-
-  progress ::= 'progress' sp not_lf* lf
-    lf?;
-
-     # note: the first idnum in a stream should be 1 and subsequent
-     # idnums should not have gaps between values as this will cause
-     # the stream parser to reserve space for the gapped values.  An
-     # idnum can be updated in the future to a new object by issuing
-     # a new mark directive with the old idnum.
-     #
-  mark ::= 'mark' sp idnum lf;
-  data ::= (delimited_data | exact_data)
-    lf?;
-
-    # note: delim may be any string but must not contain lf.
-    # data_line may contain any data but must not be exactly
-    # delim.
-  delimited_data ::= 'data' sp '<<' delim lf
-    (data_line lf)*
-    delim lf;
-
-     # note: declen indicates the length of binary_data in bytes.
-     # declen does not include the lf preceding the binary data.
-     #
-  exact_data ::= 'data' sp declen lf
-    binary_data;
-
-     # note: quoted strings are C-style quoting supporting \c for
-     # common escapes of 'c' (e..g \n, \t, \\, \") or \nnn where nnn
-     # is the signed byte value in octal.  Note that the only
-     # characters which must actually be escaped to protect the
-     # stream formatting is: \, " and LF.  Otherwise these values
-     # are UTF8.
-     #
-  commit-ish  ::= (ref_str | hexsha1 | sha1exp_str | idnum);
-  ref_str     ::= ref;
-  sha1exp_str ::= sha1exp;
-  tag_str     ::= tag;
-  path_str    ::= path    | '"' quoted(path)    '"' ;
-  mode        ::= '100644' | '644'
-                | '100755' | '755'
-                | '120000'
-                ;
-
-  declen ::= # unsigned 32 bit value, ascii base10 notation;
-  bigint ::= # unsigned integer value, ascii base10 notation;
-  binary_data ::= # file content, not interpreted;
-
-  when         ::= raw_when | rfc2822_when;
-  raw_when     ::= ts sp tz;
-  rfc2822_when ::= # Valid RFC 2822 date and time;
-
-  sp ::= # ASCII space character;
-  lf ::= # ASCII newline (LF) character;
-
-     # note: a colon (':') must precede the numerical value assigned to
-     # an idnum.  This is to distinguish it from a ref or tag name as
-     # GIT does not permit ':' in ref or tag strings.
-     #
-  idnum   ::= ':' bigint;
-  path    ::= # GIT style file path, e.g. "a/b/c";
-  ref     ::= # GIT ref name, e.g. "refs/heads/MOZ_GECKO_EXPERIMENT";
-  tag     ::= # GIT tag name, e.g. "FIREFOX_1_5";
-  sha1exp ::= # Any valid GIT SHA1 expression;
-  hexsha1 ::= # SHA1 in hexadecimal format;
-
-     # note: name and email are UTF8 strings, however name must not
-     # contain '<' or lf and email must not contain any of the
-     # following: '<', '>', lf.
-     #
-  name  ::= # valid GIT author/committer name;
-  email ::= # valid GIT author/committer email;
-  ts    ::= # time since the epoch in seconds, ascii base10 notation;
-  tz    ::= # GIT style timezone;
-
-     # note: comments, get-mark, ls-tree, and cat-blob requests may
-     # appear anywhere in the input, except within a data command. Any
-     # form of the data command always escapes the related input from
-     # comment processing.
-     #
-     # In case it is not clear, the '#' that starts the comment
-     # must be the first character on that line (an lf
-     # preceded it).
-     #
-
-  get_mark ::= 'get-mark' sp idnum lf;
-  cat_blob ::= 'cat-blob' sp (hexsha1 | idnum) lf;
-  ls_tree  ::= 'ls' sp (hexsha1 | idnum) sp path_str lf;
-
-  comment ::= '#' not_lf* lf;
-  not_lf  ::= # Any byte that is not ASCII newline (LF);
-*/
-
 #include "builtin.h"
 #include "cache.h"
 #include "repository.h"
-- 
2.19.1.1063.g1796373474.dirty



* [PATCH v3 11/11] fast-export: add a --show-original-ids option to show original names
  2018-11-16  7:59           ` [PATCH v3 " Elijah Newren
                               ` (9 preceding siblings ...)
  2018-11-16  7:59             ` [PATCH v3 10/11] fast-import: remove unmaintained duplicate documentation Elijah Newren
@ 2018-11-16  7:59             ` Elijah Newren
  2018-11-16 12:29               ` SZEDER Gábor
  2018-11-16  8:50             ` [PATCH v3 00/11] fast export and import fixes and features Jeff King
  11 siblings, 1 reply; 90+ messages in thread
From: Elijah Newren @ 2018-11-16  7:59 UTC (permalink / raw)
  To: gitster
  Cc: git, larsxschneider, sandals, peff, me, jrnieder, szeder.dev,
	Elijah Newren

Knowing the original names (hashes) of commits can sometimes enable
post-filtering that would otherwise be difficult or impossible.  In
particular, rewriting commit messages which refer to other prior
commits (on top of whatever other filtering is being done) is very
difficult without knowing the original names of each commit.

In addition, knowing the original names (hashes) of blobs can allow
filtering by blob-id without requiring re-hashing the content of the
blob, and is thus useful as a small optimization.

Once we add original ids for both commits and blobs, we may as well
add them for tags too for completeness.  Perhaps someone will have a
use for them.

This commit teaches a new --show-original-ids option to fast-export
which will make it add an 'original-oid <hash>' line to blobs, commits,
and tags.  It also teaches fast-import to parse (and ignore) such
lines.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-export.txt |  7 +++++++
 Documentation/git-fast-import.txt | 16 ++++++++++++++++
 builtin/fast-export.c             | 20 +++++++++++++++-----
 fast-import.c                     | 12 ++++++++++++
 t/t9350-fast-export.sh            | 17 +++++++++++++++++
 5 files changed, 67 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index f65026662a..64c01ba918 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -122,6 +122,13 @@ marks the same across runs.
 	repository which already contains the necessary parent
 	commits.
 
+--show-original-ids::
+	Add an extra directive to the output for commits, blobs, and tags,
+	`original-oid <SHA1SUM>`.  While such directives will likely be
+	ignored by importers such as git-fast-import, it may be useful
+	for intermediary filters (e.g. for rewriting commit messages
+	which refer to older commits, or for stripping blobs by id).
+
 --refspec::
 	Apply the specified refspec to each ref exported. Multiple of them can
 	be specified.
diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index 7ab97745a6..43ab3b1637 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -385,6 +385,7 @@ change to the project.
 ....
 	'commit' SP <ref> LF
 	mark?
+	original-oid?
 	('author' (SP <name>)? SP LT <email> GT SP <when> LF)?
 	'committer' (SP <name>)? SP LT <email> GT SP <when> LF
 	data
@@ -741,6 +742,19 @@ New marks are created automatically.  Existing marks can be moved
 to another object simply by reusing the same `<idnum>` in another
 `mark` command.
 
+`original-oid`
+~~~~~~~~~~~~~~
+Provides the name of the object in the original source control system.
+fast-import will simply ignore this directive, but filter processes
+which operate on and modify the stream before feeding to fast-import
+may have uses for this information.
+
+....
+	'original-oid' SP <object-identifier> LF
+....
+
+where `<object-identifier>` is any string not containing LF.
+
 `tag`
 ~~~~~
 Creates an annotated tag referring to a specific commit.  To create
@@ -749,6 +763,7 @@ lightweight (non-annotated) tags see the `reset` command below.
 ....
 	'tag' SP <name> LF
 	'from' SP <commit-ish> LF
+	original-oid?
 	'tagger' (SP <name>)? SP LT <email> GT SP <when> LF
 	data
 ....
@@ -823,6 +838,7 @@ assigned mark.
 ....
 	'blob' LF
 	mark?
+	original-oid?
 	data
 ....
 
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 78fc67b03a..36c2575de5 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -38,6 +38,7 @@ static int use_done_feature;
 static int no_data;
 static int full_tree;
 static int reference_excluded_commits;
+static int show_original_ids;
 static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
 static struct string_list tag_refs = STRING_LIST_INIT_NODUP;
 static struct refspec refspecs = REFSPEC_INIT_FETCH;
@@ -271,7 +272,10 @@ static void export_blob(const struct object_id *oid)
 
 	mark_next_object(object);
 
-	printf("blob\nmark :%"PRIu32"\ndata %lu\n", last_idnum, size);
+	printf("blob\nmark :%"PRIu32"\n", last_idnum);
+	if (show_original_ids)
+		printf("original-oid %s\n", oid_to_hex(oid));
+	printf("data %lu\n", size);
 	if (size && fwrite(buf, size, 1, stdout) != 1)
 		die_errno("could not write blob '%s'", oid_to_hex(oid));
 	printf("\n");
@@ -635,8 +639,10 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 		reencoded = reencode_string(message, "UTF-8", encoding);
 	if (!commit->parents)
 		printf("reset %s\n", refname);
-	printf("commit %s\nmark :%"PRIu32"\n%.*s\n%.*s\ndata %u\n%s",
-	       refname, last_idnum,
+	printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
+	if (show_original_ids)
+		printf("original-oid %s\n", oid_to_hex(&commit->object.oid));
+	printf("%.*s\n%.*s\ndata %u\n%s",
 	       (int)(author_end - author), author,
 	       (int)(committer_end - committer), committer,
 	       (unsigned)(reencoded
@@ -814,8 +820,10 @@ static void handle_tag(const char *name, struct tag *tag)
 
 	if (starts_with(name, "refs/tags/"))
 		name += 10;
-	printf("tag %s\nfrom :%d\n%.*s%sdata %d\n%.*s\n",
-	       name, tagged_mark,
+	printf("tag %s\nfrom :%d\n", name, tagged_mark);
+	if (show_original_ids)
+		printf("original-oid %s\n", oid_to_hex(&tag->object.oid));
+	printf("%.*s%sdata %d\n%.*s\n",
 	       (int)(tagger_end - tagger), tagger,
 	       tagger == tagger_end ? "" : "\n",
 	       (int)message_size, (int)message_size, message ? message : "");
@@ -1096,6 +1104,8 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")),
 		OPT_BOOL(0, "reference-excluded-parents",
 			 &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by object id")),
+		OPT_BOOL(0, "show-original-ids", &show_original_ids,
+			    N_("Show original object ids of blobs/commits")),
 
 		OPT_END()
 	};
diff --git a/fast-import.c b/fast-import.c
index 555d49ad23..71b6cba00f 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1814,6 +1814,13 @@ static void parse_mark(void)
 		next_mark = 0;
 }
 
+static void parse_original_identifier(void)
+{
+	const char *v;
+	if (skip_prefix(command_buf.buf, "original-oid ", &v))
+		read_next_command();
+}
+
 static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res)
 {
 	const char *data;
@@ -1956,6 +1963,7 @@ static void parse_new_blob(void)
 {
 	read_next_command();
 	parse_mark();
+	parse_original_identifier();
 	parse_and_store_blob(&last_blob, NULL, next_mark);
 }
 
@@ -2579,6 +2587,7 @@ static void parse_new_commit(const char *arg)
 
 	read_next_command();
 	parse_mark();
+	parse_original_identifier();
 	if (skip_prefix(command_buf.buf, "author ", &v)) {
 		author = parse_ident(v);
 		read_next_command();
@@ -2711,6 +2720,9 @@ static void parse_new_tag(const char *arg)
 		die("Invalid ref name or SHA1 expression: %s", from);
 	read_next_command();
 
+	/* original-oid ... */
+	parse_original_identifier();
+
 	/* tagger ... */
 	if (skip_prefix(command_buf.buf, "tagger ", &v)) {
 		tagger = parse_ident(v);
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index d7d73061d0..5690fe2810 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -77,6 +77,23 @@ test_expect_success 'fast-export --reference-excluded-parents master~2..master'
 		 test $MASTER = $(git rev-parse --verify refs/heads/rewrite))
 '
 
+test_expect_success 'fast-export --show-original-ids' '
+
+	git fast-export --show-original-ids master >output &&
+	grep ^original-oid output| sed -e s/^original-oid.// | sort >actual &&
+	git rev-list --objects master muss >objects-and-names &&
+	awk "{print \$1}" objects-and-names | sort >commits-trees-blobs &&
+	comm -23 actual commits-trees-blobs >unfound &&
+	test_must_be_empty unfound
+'
+
+test_expect_success 'fast-export --show-original-ids | git fast-import' '
+
+	git fast-export --show-original-ids master muss | git fast-import --quiet &&
+	test $MASTER = $(git rev-parse --verify refs/heads/master) &&
+	test $MUSS = $(git rev-parse --verify refs/tags/muss)
+'
+
 test_expect_success 'iso-8859-1' '
 
 	git config i18n.commitencoding ISO8859-1 &&
-- 
2.19.1.1063.g1796373474.dirty
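
As an illustration of what an intermediary filter might do with the new lines, the sketch below pairs each mark with the original id that follows it. It is not part of the patch: fake_oid and map_marks_to_original_ids are hypothetical names, the sample stream is made up and trimmed, and "data" payloads are not skipped by length the way a careful filter would do. Tags carry no mark in this stream layout, so their original-oid lines are ignored here.

```shell
# Hypothetical helper: a 40-character placeholder id built from one hex digit.
fake_oid() { printf '%040d' 0 | tr 0 "$1"; }

# For every "mark :N" line directly followed by "original-oid <id>",
# print ":N <id>".
map_marks_to_original_ids() {
    awk '
        $1 == "mark" { mark = $2; next }
        $1 == "original-oid" && mark != "" { print mark, $2 }
        { mark = "" }
    '
}

# Made-up stream: a blob, a commit, and a tag (the tag has no mark).
sample=$(printf 'blob\nmark :1\noriginal-oid %s\ndata 3\nhi\n\ncommit refs/heads/master\nmark :2\noriginal-oid %s\n\ntag v1\nfrom :2\noriginal-oid %s\n' \
    "$(fake_oid c)" "$(fake_oid d)" "$(fake_oid e)")

mapping=$(printf '%s\n' "$sample" | map_marks_to_original_ids)
printf '%s\n' "$mapping"
```

A filter could keep such a table around to rewrite references to old commit names in commit messages, or to decide which blobs to strip by id.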



* Re: [PATCH v3 00/11] fast export and import fixes and features
  2018-11-16  7:59           ` [PATCH v3 " Elijah Newren
                               ` (10 preceding siblings ...)
  2018-11-16  7:59             ` [PATCH v3 11/11] fast-export: add a --show-original-ids option to show original names Elijah Newren
@ 2018-11-16  8:50             ` Jeff King
  11 siblings, 0 replies; 90+ messages in thread
From: Jeff King @ 2018-11-16  8:50 UTC (permalink / raw)
  To: Elijah Newren
  Cc: gitster, git, larsxschneider, sandals, me, jrnieder, szeder.dev

On Thu, Nov 15, 2018 at 11:59:45PM -0800, Elijah Newren wrote:

> This is a series of small fixes and features for fast-export and
> fast-import, mostly on the fast-export side.
> 
> Changes since v2 (full range-diff below):
>   * Dropped the final patch; going to try to use Peff's suggestion of
>     rev-list and diff-tree to get what I need instead
>   * Inserted a new patch at the beginning to convert pre-existing sha1
>     stuff to oid (rename sha1_to_hex() -> oid_to_hex(), rename
>     anonymize_sha1() to anonymize_oid(), etc.)
>   * Modified other patches in the series to add calls to oid_to_hex() rather
>     than sha1_to_hex()

Thanks, these changes all look good to me. I have no more nits to pick.
:)

-Peff


* Re: [PATCH v3 11/11] fast-export: add a --show-original-ids option to show original names
  2018-11-16  7:59             ` [PATCH v3 11/11] fast-export: add a --show-original-ids option to show original names Elijah Newren
@ 2018-11-16 12:29               ` SZEDER Gábor
  0 siblings, 0 replies; 90+ messages in thread
From: SZEDER Gábor @ 2018-11-16 12:29 UTC (permalink / raw)
  To: Elijah Newren; +Cc: gitster, git, larsxschneider, sandals, peff, me, jrnieder

On Thu, Nov 15, 2018 at 11:59:56PM -0800, Elijah Newren wrote:

> diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
> index d7d73061d0..5690fe2810 100755
> --- a/t/t9350-fast-export.sh
> +++ b/t/t9350-fast-export.sh
> @@ -77,6 +77,23 @@ test_expect_success 'fast-export --reference-excluded-parents master~2..master'
>  		 test $MASTER = $(git rev-parse --verify refs/heads/rewrite))
>  '
>  
> +test_expect_success 'fast-export --show-original-ids' '
> +
> +	git fast-export --show-original-ids master >output &&
> +	grep ^original-oid output| sed -e s/^original-oid.// | sort >actual &&

Nit: 'sed' can do what this 'grep' does:

  sed -n -e s/^original-oid.//p output | sort >actual &&

thus sparing a process.
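
To see that the two forms select the same lines, here is a small self-contained check. The stream is made up and trimmed, and fake_oid is a hypothetical helper that only produces placeholder ids.

```shell
# Hypothetical helper: a 40-character placeholder id built from one hex digit.
fake_oid() { printf '%040d' 0 | tr 0 "$1"; }

# Trimmed, made-up stream containing two original-oid lines.
stream=$(printf 'blob\nmark :1\noriginal-oid %s\ndata 3\nhi\n\ncommit refs/heads/master\nmark :2\noriginal-oid %s\n' \
    "$(fake_oid a)" "$(fake_oid b)")

# Two processes, as in the patch: grep selects, sed strips the prefix.
with_grep=$(printf '%s\n' "$stream" | grep '^original-oid' | sed -e 's/^original-oid.//')

# One process, as suggested: -n silences the default output and the
# p flag prints only lines where the substitution succeeded.
with_sed=$(printf '%s\n' "$stream" | sed -n -e 's/^original-oid.//p')

printf '%s\n' "$with_sed"
```

Both variables end up holding the same two ids, one per line.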

> +	git rev-list --objects master muss >objects-and-names &&
> +	awk "{print \$1}" objects-and-names | sort >commits-trees-blobs &&
> +	comm -23 actual commits-trees-blobs >unfound &&
> +	test_must_be_empty unfound
> +'
> +
> +test_expect_success 'fast-export --show-original-ids | git fast-import' '
> +
> +	git fast-export --show-original-ids master muss | git fast-import --quiet &&
> +	test $MASTER = $(git rev-parse --verify refs/heads/master) &&
> +	test $MUSS = $(git rev-parse --verify refs/tags/muss)
> +'
> +
>  test_expect_success 'iso-8859-1' '
>  
>  	git config i18n.commitencoding ISO8859-1 &&
> -- 
> 2.19.1.1063.g1796373474.dirty
> 


Thread overview: 90+ messages
2018-09-23 13:04 Import/Export as a fast way to purge files from Git? Lars Schneider
2018-09-23 14:55 ` Eric Sunshine
2018-09-23 15:58   ` Lars Schneider
2018-09-23 15:53 ` brian m. carlson
2018-09-23 17:04   ` Jeff King
2018-09-24 17:24 ` Elijah Newren
2018-10-31 19:15   ` Lars Schneider
2018-11-01  7:12     ` Elijah Newren
2018-11-11  6:23       ` [PATCH 00/10] fast export and import fixes and features Elijah Newren
2018-11-11  6:23         ` [PATCH 01/10] git-fast-import.txt: fix documentation for --quiet option Elijah Newren
2018-11-11  6:33           ` Jeff King
2018-11-11  6:23         ` [PATCH 02/10] git-fast-export.txt: clarify misleading documentation about rev-list args Elijah Newren
2018-11-11  6:36           ` Jeff King
2018-11-11  7:17             ` Elijah Newren
2018-11-13 23:25               ` Elijah Newren
2018-11-13 23:39                 ` Jonathan Nieder
2018-11-14  0:02                   ` Elijah Newren
2018-11-11  6:23         ` [PATCH 03/10] fast-export: use value from correct enum Elijah Newren
2018-11-11  6:36           ` Jeff King
2018-11-11 20:10             ` Ævar Arnfjörð Bjarmason
2018-11-12  9:12               ` Ævar Arnfjörð Bjarmason
2018-11-12 11:31               ` Jeff King
2018-11-11  6:23         ` [PATCH 04/10] fast-export: avoid dying when filtering by paths and old tags exist Elijah Newren
2018-11-11  6:44           ` Jeff King
2018-11-11  7:38             ` Elijah Newren
2018-11-12 12:32               ` Jeff King
2018-11-12 22:50             ` brian m. carlson
2018-11-13 14:38               ` Jeff King
2018-11-11  6:23         ` [PATCH 05/10] fast-export: move commit rewriting logic into a function for reuse Elijah Newren
2018-11-11  6:47           ` Jeff King
2018-11-11  6:23         ` [PATCH 06/10] fast-export: when using paths, avoid corrupt stream with non-existent mark Elijah Newren
2018-11-11  6:53           ` Jeff King
2018-11-11  8:01             ` Elijah Newren
2018-11-12 12:45               ` Jeff King
2018-11-12 15:36                 ` Elijah Newren
2018-11-11  6:23         ` [PATCH 07/10] fast-export: ensure we export requested refs Elijah Newren
2018-11-11  7:02           ` Jeff King
2018-11-11  8:20             ` Elijah Newren
2018-11-11  6:23         ` [PATCH 08/10] fast-export: add --reference-excluded-parents option Elijah Newren
2018-11-11  7:11           ` Jeff King
2018-11-11  6:23         ` [PATCH 09/10] fast-export: add a --show-original-ids option to show original names Elijah Newren
2018-11-11  7:20           ` Jeff King
2018-11-11  8:32             ` Elijah Newren
2018-11-12 12:53               ` Jeff King
2018-11-12 15:46                 ` Elijah Newren
2018-11-12 16:31                   ` Jeff King
2018-11-11  6:23         ` [PATCH 10/10] fast-export: add --always-show-modify-after-rename Elijah Newren
2018-11-11  7:23           ` Jeff King
2018-11-11  8:42             ` Elijah Newren
2018-11-12 12:58               ` Jeff King
2018-11-12 18:08                 ` Elijah Newren
2018-11-13 14:45                   ` Jeff King
2018-11-13 17:10                     ` Elijah Newren
2018-11-14  7:14                       ` Jeff King
2018-11-11  7:27         ` [PATCH 00/10] fast export and import fixes and features Jeff King
2018-11-11  8:44           ` Elijah Newren
2018-11-12 13:00             ` Jeff King
2018-11-14  0:25         ` [PATCH v2 00/11] " Elijah Newren
2018-11-14  0:25           ` [PATCH v2 01/11] git-fast-import.txt: fix documentation for --quiet option Elijah Newren
2018-11-14  0:25           ` [PATCH v2 02/11] git-fast-export.txt: clarify misleading documentation about rev-list args Elijah Newren
2018-11-14  0:25           ` [PATCH v2 03/11] fast-export: use value from correct enum Elijah Newren
2018-11-14  0:25           ` [PATCH v2 04/11] fast-export: avoid dying when filtering by paths and old tags exist Elijah Newren
2018-11-14 19:17             ` SZEDER Gábor
2018-11-14 23:13               ` Elijah Newren
2018-11-14  0:25           ` [PATCH v2 05/11] fast-export: move commit rewriting logic into a function for reuse Elijah Newren
2018-11-14  0:25           ` [PATCH v2 06/11] fast-export: when using paths, avoid corrupt stream with non-existent mark Elijah Newren
2018-11-14  0:25           ` [PATCH v2 07/11] fast-export: ensure we export requested refs Elijah Newren
2018-11-14  0:25           ` [PATCH v2 08/11] fast-export: add --reference-excluded-parents option Elijah Newren
2018-11-14 19:27             ` SZEDER Gábor
2018-11-14 23:16               ` Elijah Newren
2018-11-14  0:25           ` [PATCH v2 09/11] fast-import: remove unmaintained duplicate documentation Elijah Newren
2018-11-14  0:25           ` [PATCH v2 10/11] fast-export: add a --show-original-ids option to show original names Elijah Newren
2018-11-14  0:26           ` [PATCH v2 11/11] fast-export: add --always-show-modify-after-rename Elijah Newren
2018-11-14  7:25           ` [PATCH v2 00/11] fast export and import fixes and features Jeff King
2018-11-16  7:59           ` [PATCH v3 " Elijah Newren
2018-11-16  7:59             ` [PATCH v3 01/11] fast-export: convert sha1 to oid Elijah Newren
2018-11-16  7:59             ` [PATCH v3 02/11] git-fast-import.txt: fix documentation for --quiet option Elijah Newren
2018-11-16  7:59             ` [PATCH v3 03/11] git-fast-export.txt: clarify misleading documentation about rev-list args Elijah Newren
2018-11-16  7:59             ` [PATCH v3 04/11] fast-export: use value from correct enum Elijah Newren
2018-11-16  7:59             ` [PATCH v3 05/11] fast-export: avoid dying when filtering by paths and old tags exist Elijah Newren
2018-11-16  7:59             ` [PATCH v3 06/11] fast-export: move commit rewriting logic into a function for reuse Elijah Newren
2018-11-16  7:59             ` [PATCH v3 07/11] fast-export: when using paths, avoid corrupt stream with non-existent mark Elijah Newren
2018-11-16  7:59             ` [PATCH v3 08/11] fast-export: ensure we export requested refs Elijah Newren
2018-11-16  7:59             ` [PATCH v3 09/11] fast-export: add --reference-excluded-parents option Elijah Newren
2018-11-16  7:59             ` [PATCH v3 10/11] fast-import: remove unmaintained duplicate documentation Elijah Newren
2018-11-16  7:59             ` [PATCH v3 11/11] fast-export: add a --show-original-ids option to show original names Elijah Newren
2018-11-16 12:29               ` SZEDER Gábor
2018-11-16  8:50             ` [PATCH v3 00/11] fast export and import fixes and features Jeff King
2018-11-12  9:17       ` Import/Export as a fast way to purge files from Git? Ævar Arnfjörð Bjarmason
2018-11-12 15:34         ` Elijah Newren
