git@vger.kernel.org list mirror (unofficial, one of many)
* Round-tripping fast-export/import changes commit hashes
@ 2021-02-27 12:31 anatoly techtonik
  2021-02-27 17:48 ` Elijah Newren
  0 siblings, 1 reply; 19+ messages in thread
From: anatoly techtonik @ 2021-02-27 12:31 UTC (permalink / raw)
  To: Git Mailing List

Hi.

I can't get the same commit hashes after a fast-export followed by a
fast-import of this repository, even without making any edits:
https://github.com/simons-public/protonfixes
I have no idea what causes this or how to prevent it. Are there any
workarounds?

What did you do before the bug happened? (Steps to reproduce your issue)

  #!/bin/bash

  git clone https://github.com/simons-public/protonfixes.git
  git -C protonfixes log --format=oneline | tail -n 4

  git init protoimported
  git -C protonfixes fast-export --all --reencode=no |
      (cd protoimported && git fast-import)
  git -C protoimported log --format=oneline | tail -n 4

What did you expect to happen? (Expected behavior)

  Expect imported repo to match exported.

What happened instead? (Actual behavior)

  All hashes are different; the imported repo diverged at the second commit.

What's different between what you expected and what actually happened?

  The log of hashes from initial repo:

    + git -C protonfixes log --format=oneline
    + tail -n 4
    1c0cf2c8e742e673dba9fd1a09afd12a25c25571 Update README.md
    367d61f9b2a799accbdaeed5d64f9be914ca0f7a Updated zip link
    d3d24b63446c7d06586eaa51764ff0c619113f09 Update README.md
    7a43ca89ff7a70127ac9ca0f10b6eaaa34f2f69c Initial commit

  The log from imported repo:

    + git -C protoimported log --format=oneline
    + tail -n 4
    a27ec5d2e4c562f40e693e0b4149959d2b69bf21 Update README.md
    e59cf92be79c47984e9f94bfad912e5a29dfa5e0 Updated zip link
    fb6498f62af783d2e943770f90bc642cf5c9ec9c Update README.md
    7a43ca89ff7a70127ac9ca0f10b6eaaa34f2f69c Initial commit

[System Info]
git version:
git version 2.31.0.rc0
cpu: x86_64
built from commit: 225365fb5195e804274ab569ac3cc4919451dc7f
sizeof-long: 8
sizeof-size_t: 8
shell-path: /bin/sh
uname: Linux 5.8.0-43-generic #49-Ubuntu SMP Fri Feb 5 03:01:28 UTC 2021 x86_64
compiler info: gnuc: 10.2
libc info: glibc: 2.32
$SHELL (typically, interactive shell): /usr/bin/zsh


[Enabled Hooks]
not run from a git repository - no hooks to show
-- 
anatoly t.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Round-tripping fast-export/import changes commit hashes
  2021-02-27 12:31 Round-tripping fast-export/import changes commit hashes anatoly techtonik
@ 2021-02-27 17:48 ` Elijah Newren
  2021-02-28 10:00   ` anatoly techtonik
  0 siblings, 1 reply; 19+ messages in thread
From: Elijah Newren @ 2021-02-27 17:48 UTC (permalink / raw)
  To: anatoly techtonik; +Cc: Git Mailing List

Hi,

On Sat, Feb 27, 2021 at 4:37 AM anatoly techtonik <techtonik@gmail.com> wrote:
>
> Hi.
>
> I can't get the same commit hashes after a fast-export followed by a
> fast-import of this repository, even without making any edits:
> https://github.com/simons-public/protonfixes
> I have no idea what causes this or how to prevent it. Are there any
> workarounds?

Your second commit is signed.  Fast-export strips any extended headers
on commits, such as GPG signatures, because there's no way to keep
them in general.  In the special case that you aren't making *any*
changes to the repository and will import it as-is, you could
theoretically keep the signatures, but you don't need fast-export in
such a case so no one ever bothered to implement commit signature
handling in fast-export and fast-import.  If you make any changes
whatsoever to the commits before the signature (including importing
them to a different system), then the signature would be invalid.
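For illustration, the header in question can be seen on the raw commit
object. A minimal sketch with a throwaway repo (the toy commit here is
unsigned, so no gpgsig line appears, but a signed commit would carry one
between the committer line and the message):

```shell
#!/bin/sh
# Sketch: inspect the raw header block of a commit object. A signed
# commit would carry an extra "gpgsig ..." extended header here, which
# fast-export drops. Throwaway repo; point this at a real one to see it.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
echo hello > f && git add f
git -c user.name=a -c user.email=a@example.com commit -q -m 'first'
# Prints "tree", "author", "committer" (and "gpgsig" if signed),
# then a blank line and the commit message.
git cat-file commit HEAD
```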

You probably don't want to hear this, but there are no workarounds.

There are also other things that will prevent a simple fast-export |
fast-import pipeline from preserving your history as-is besides signed
commits (most of these are noted in the "Inherited Limitations"
section over at
https://htmlpreview.github.io/?https://github.com/newren/git-filter-repo/blob/docs/html/git-filter-repo.html):

  * any other form of extended header; fast-export only looks for the
    headers it knows and exports those
  * grafts and replace objects will just get rewritten (and if they
    cause any cycles, those cycles and anything depending on them are
    dropped)
  * commits without an author will be given one matching the committer
    (hopefully you don't have these, but if you do...)
  * tags that are missing a tagger are also a problem (hopefully you
    don't have these, but if you do...)
  * annotated or signed tags outside the refs/tags/ namespace will get
    renamed weirdly
  * commits by default are re-encoded into UTF-8, though I notice you
    did pass --reencode=no to handle this

Hope that at least explains things for you, even if it doesn't give
you a workaround or a solution.


* Re: Round-tripping fast-export/import changes commit hashes
  2021-02-27 17:48 ` Elijah Newren
@ 2021-02-28 10:00   ` anatoly techtonik
  2021-02-28 10:34     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 19+ messages in thread
From: anatoly techtonik @ 2021-02-28 10:00 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List

On Sat, Feb 27, 2021 at 8:49 PM Elijah Newren <newren@gmail.com> wrote:
>
> Your second commit is signed.  Fast-export strips any extended headers
> on commits, such as GPG signatures, because there's no way to keep
> them in general.

Why is it not possible to encode them with base64 and insert into the
stream?

> There are also other things that will prevent a simple fast-export |
> fast-import pipeline from preserving your history as-is besides signed
> commits (most of these are noted in the "Inherited Limitations"
> section over at
> https://htmlpreview.github.io/?https://github.com/newren/git-filter-repo/blob/docs/html/git-filter-repo.html):

Is there any way to check what commits will be altered as a result of
`fast-export` and why? Right now I don't see that it is reported.
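Lacking a built-in report, one crude check is to round-trip into a
scratch repository and diff the rev lists; any hash present only in the
original belongs to an altered commit. A sketch with throwaway repos
(point `src` at a real clone to check it):

```shell
#!/bin/sh
# Sketch: compare rev lists before and after a fast-export |
# fast-import round trip; any hash present only in the original
# belongs to a commit the round trip altered.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q src
echo one > src/f
git -C src add f
git -C src -c user.name=a -c user.email=a@example.com commit -q -m 'c1'
echo two > src/f
git -C src -c user.name=a -c user.email=a@example.com commit -q -am 'c2'
git init -q dst
git -C src fast-export --all --reencode=no | git -C dst fast-import --quiet
git -C src rev-list --all | sort > src.list
git -C dst rev-list --all | sort > dst.list
# identical lists mean every commit survived byte-for-byte; this toy
# history is unsigned, so nothing is altered and diff prints nothing
diff src.list dst.list
```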

> Hope that at least explains things for you, even if it doesn't give
> you a workaround or a solution.

Thanks. That is very helpful to know.

The reason I am asking is that I tried to merge two repos with
`reposurgeon`, which operates on `fast-export` data. It is basically
merging a GitHub wiki into the main repo.

Even after merging them successfully I cannot send a PR, because the
stripped info produces a huge number of changes. It can be seen here:

https://github.com/simons-public/protonfixes/compare/master...techtonik:master

I tracked this behaviour in `reposurgeon` in this issue
https://gitlab.com/esr/reposurgeon/-/issues/344
-- 
anatoly t.


* Re: Round-tripping fast-export/import changes commit hashes
  2021-02-28 10:00   ` anatoly techtonik
@ 2021-02-28 10:34     ` Ævar Arnfjörð Bjarmason
  2021-03-01  7:44       ` anatoly techtonik
  0 siblings, 1 reply; 19+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-02-28 10:34 UTC (permalink / raw)
  To: anatoly techtonik; +Cc: Elijah Newren, Git Mailing List


On Sun, Feb 28 2021, anatoly techtonik wrote:

> On Sat, Feb 27, 2021 at 8:49 PM Elijah Newren <newren@gmail.com> wrote:
>>
>> Your second commit is signed.  Fast-export strips any extended headers
>> on commits, such as GPG signatures, because there's no way to keep
>> them in general.
>
> Why is it not possible to encode them with base64 and insert into the
> stream?

I think Elijah means that in the general case people are using fast
export/import to export/import between different systems or in
combination with a utility like git-filter-repo.

In those cases users are also changing the content of the repository, so
the hashes will change, invalidating signatures.

But there's also cases where e.g. you don't modify the history, or only
part of it, and could then preserve these headers. I think there's no
inherent reason not to do so, just that nobody's cared enough to submit
patches etc.

>> There are also other things that will prevent a simple fast-export |
>> fast-import pipeline from preserving your history as-is besides signed
>> commits (most of these are noted in the "Inherited Limitations"
>> section over at
>> https://htmlpreview.github.io/?https://github.com/newren/git-filter-repo/blob/docs/html/git-filter-repo.html):
>
> Is there any way to check what commits will be altered as a result of
> `fast-export` and why? Right now I don't see that it is reported.

I don't think so, but not being very familiar with fast export/import I
don't see why it shouldn't have some option to not munge data like that,
or to report it, if someone cared enough to track those issues & patch
it...

>> Hope that at least explains things for you, even if it doesn't give
>> you a workaround or a solution.
>
> Thanks. That is very helpful to know.
>
> The reason I am asking is that I tried to merge two repos with
> `reposurgeon`, which operates on `fast-export` data. It is basically
> merging a GitHub wiki into the main repo.
>
> Even after merging them successfully I cannot send a PR, because the
> stripped info produces a huge number of changes. It can be seen here:
>
> https://github.com/simons-public/protonfixes/compare/master...techtonik:master
>
> I tracked this behaviour in `reposurgeon` in this issue
> https://gitlab.com/esr/reposurgeon/-/issues/344



* Re: Round-tripping fast-export/import changes commit hashes
  2021-02-28 10:34     ` Ævar Arnfjörð Bjarmason
@ 2021-03-01  7:44       ` anatoly techtonik
  2021-03-01 17:34         ` Junio C Hamano
                           ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: anatoly techtonik @ 2021-03-01  7:44 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Elijah Newren, Git Mailing List

On Sun, Feb 28, 2021 at 1:34 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> I think Elijah means that in the general case people are using fast
> export/import to export/import between different systems or in
> combination with a utility like git-filter-repo.
>
> In those cases users are also changing the content of the repository, so
> the hashes will change, invalidating signatures.
>
> But there's also cases where e.g. you don't modify the history, or only
> part of it, and could then preserve these headers. I think there's no
> inherent reason not to do so, just that nobody's cared enough to submit
> patches etc.

Is fast-export/import the only way to filter information in `git`? Maybe
there is a slow json-export/import tool that gives a complete
representation of all events in a repository? Or an API that can be used
to serialize and import that stream?

If not, then I'd like to take a look at where header filtering and
serialization take place. My C skills are at the "hello world" level, so
I am not sure I can write a patch. But I can write the logic in Python
and ask somebody to port that.
-- 
anatoly t.


* Re: Round-tripping fast-export/import changes commit hashes
  2021-03-01  7:44       ` anatoly techtonik
@ 2021-03-01 17:34         ` Junio C Hamano
  2021-03-02 21:52           ` anatoly techtonik
  2021-03-01 18:06         ` Elijah Newren
  2021-03-01 20:02         ` Ævar Arnfjörð Bjarmason
  2 siblings, 1 reply; 19+ messages in thread
From: Junio C Hamano @ 2021-03-01 17:34 UTC (permalink / raw)
  To: anatoly techtonik
  Cc: Ævar Arnfjörð Bjarmason, Elijah Newren, Git Mailing List

anatoly techtonik <techtonik@gmail.com> writes:

> Is fast-export/import the only way to filter information in `git`? Maybe there
> is a slow json-export/import tool that gives a complete representation of all
> events in a repository? Or API that can be used to serialize and import that
> stream?

I do not think representation is a problem.

It is just that the output stream of fast-export is designed to be
"filtered" and the expected use case is to modify the stream somehow
before feeding it to fast-import.  And because every object name and
commit & tag signature depends on everything that they can reach,
even a single bit change in an earlier part of the history will
invalidate any and all signatures on objects that can reach it.  So
instead of originally-signed objects whose signatures are now
invalid, "fast-export | fast-import" pipeline would give you
originally-signed objects whose signatures are stripped.

Admittedly, there is a narrow use case where such a signature
invalidation is not an issue.  If you run fast-export and feed that
straight into fast-import without doing any modification to the
stream, then you are getting a bit-for-bit identical copy.

But "git clone --mirror" is a much better way to get such a
bit-for-bit identical history and objects.  And if you want to do so
with sneakernet, you can create a bundle file, sneakernet it to your
destination, and then clone from the bundle.
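A minimal sketch of the bundle route with a throwaway repo (paths are
illustrative):

```shell
#!/bin/sh
# Sketch of the bundle/sneakernet route: a bundle carries the objects
# bit-for-bit, so the clone's commit hashes match the original's.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q src
echo hello > src/f
git -C src add f
git -C src -c user.name=a -c user.email=a@example.com commit -q -m 'c1'
git -C src bundle create ../repo.bundle --all
# sneakernet repo.bundle to the destination, then clone from it
git clone -q repo.bundle dst
# the history is bit-for-bit identical, so the tips agree
test "$(git -C src rev-parse HEAD)" = "$(git -C dst rev-parse HEAD)"
```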

So...



* Re: Round-tripping fast-export/import changes commit hashes
  2021-03-01  7:44       ` anatoly techtonik
  2021-03-01 17:34         ` Junio C Hamano
@ 2021-03-01 18:06         ` Elijah Newren
  2021-03-01 20:04           ` Ævar Arnfjörð Bjarmason
  2021-03-02 22:12           ` anatoly techtonik
  2021-03-01 20:02         ` Ævar Arnfjörð Bjarmason
  2 siblings, 2 replies; 19+ messages in thread
From: Elijah Newren @ 2021-03-01 18:06 UTC (permalink / raw)
  To: anatoly techtonik
  Cc: Ævar Arnfjörð Bjarmason, Git Mailing List

On Sun, Feb 28, 2021 at 11:44 PM anatoly techtonik <techtonik@gmail.com> wrote:
>
> On Sun, Feb 28, 2021 at 1:34 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
> >
> > I think Elijah means that in the general case people are using fast
> > export/import to export/import between different systems or in
> > combination with a utility like git-filter-repo.
> >
> > In those cases users are also changing the content of the repository, so
> > the hashes will change, invalidating signatures.
> >
> > But there's also cases where e.g. you don't modify the history, or only
> > part of it, and could then preserve these headers. I think there's no
> > inherent reason not to do so, just that nobody's cared enough to submit
> > patches etc.
>
> Is fast-export/import the only way to filter information in `git`? Maybe there
> is a slow json-export/import tool that gives a complete representation of all
> events in a repository? Or API that can be used to serialize and import that
> stream?
>
> If no, then I'd like to take a look at where header filtering and serialization
> takes place. My C skills are at the "hello world" level, so I am not sure I can
> write a patch. But I can write the logic in Python and ask somebody to port
> that.

If you are intent on keeping signatures because you know they are
still valid, then you already know you aren't modifying any
blobs/trees/commits leading up to those signatures.  If that is the
case, perhaps you should just avoid exporting the signature or
anything it depends on, and just export the stuff after that point.
You can do this with fast-export's --reference-excluded-parents option
and pass it an exclusion range.  For example:

   git fast-export --reference-excluded-parents ^master~5 --all

and then pipe that through fast-import.
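A self-contained sketch of that workflow with throwaway repos, where
`master~2` stands in for the boundary before the signed commit: seed the
destination with the untouched early history, then export only the newer
commits, referencing the excluded parent by hash:

```shell
#!/bin/sh
# Sketch: seed the destination with the untouched early history, then
# export only the newer commits; --reference-excluded-parents makes the
# stream name the excluded parent by its hash, which fast-import
# resolves against objects already in the destination.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b master src
for n in 1 2 3; do
    echo "$n" > src/f
    git -C src add f
    git -C src -c user.name=a -c user.email=a@example.com commit -q -m "c$n"
done
git -C src tag base master~2
git init -q -b master dst
# 1) copy the early history as-is (in real life, e.g. a clone)
git -C src fast-export base | git -C dst fast-import --quiet
# 2) export everything after the boundary, referencing it by hash
git -C src fast-export --reference-excluded-parents ^master~2 --all |
    git -C dst fast-import --quiet
# the tips agree because each newer commit was recreated byte-for-byte
test "$(git -C src rev-parse master)" = "$(git -C dst rev-parse master)"
```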


In general, I think if fast-export or fast-import are lacking features
you want, we should add them there, but I don't see how adding
signature reading to fast-import and signature exporting to
fast-export makes sense in general.  Even if you assume fast-import
can process all the bits it is sent (e.g. you extend it to support
commits without an author, tags without a tagger, signed objects, any
other extended commit headers), and even if you add flags to
fast-export to die if there are any bits it doesn't recognize and to
export all pieces of blobs/trees/tags (e.g. don't add missing authors,
don't re-encode messages in UTF-8, don't use grafts or replace
objects, keep extended headers such as signatures, etc.), then it
still couldn't possibly work in all cases in general.  For example, if
you had a repository with unusual objects made by ancient or broken
git versions (such as tree entries in the wrong sort order, or tree
entries that recorded modes of 040000 instead of 40000 for trees or
something with perms other than 100644 or 100755 for files), then when
fast-import goes to recreate these objects using the canonical format
they will no longer have the same hash and your commit signatures will
get invalidated.  Other git commands will also refuse to create
objects with those oddities, even if git accepts ancient objects that
have them.

So, it's basically impossible to have a "complete representation of
all events in a repository" that does what you want, except for the
*original* binary format.  (But if you really want to see the original
binary format, maybe `git cat-file --batch` will be handy to you.)
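As a sketch of that original binary form (throwaway repo; the object
ids are whatever the toy commits hash to), `cat-file --batch` prints a
`<oid> <type> <size>` header line per requested object, followed by the
object's raw bytes:

```shell
#!/bin/sh
# Sketch: `git cat-file --batch` streams objects in their original
# binary form: a "<oid> <type> <size>" line, then <size> raw bytes.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
echo data > f && git add f
git -c user.name=a -c user.email=a@example.com commit -q -m 'first'
# request the commit and its tree; output interleaves headers and bodies
printf 'HEAD\nHEAD^{tree}\n' | git cat-file --batch
```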

But I think fast-export's --reference-excluded-parents might come in
handy for you and let you do what you want.


* Re: Round-tripping fast-export/import changes commit hashes
  2021-03-01  7:44       ` anatoly techtonik
  2021-03-01 17:34         ` Junio C Hamano
  2021-03-01 18:06         ` Elijah Newren
@ 2021-03-01 20:02         ` Ævar Arnfjörð Bjarmason
  2021-03-02 22:23           ` anatoly techtonik
  2 siblings, 1 reply; 19+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-01 20:02 UTC (permalink / raw)
  To: anatoly techtonik; +Cc: Elijah Newren, Git Mailing List


On Mon, Mar 01 2021, anatoly techtonik wrote:

> On Sun, Feb 28, 2021 at 1:34 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>> I think Elijah means that in the general case people are using fast
>> export/import to export/import between different systems or in
>> combination with a utility like git-filter-repo.
>>
>> In those cases users are also changing the content of the repository, so
>> the hashes will change, invalidating signatures.
>>
>> But there's also cases where e.g. you don't modify the history, or only
>> part of it, and could then preserve these headers. I think there's no
>> inherent reason not to do so, just that nobody's cared enough to submit
>> patches etc.
>
> Is fast-export/import the only way to filter information in `git`? Maybe there
> is a slow json-export/import tool that gives a complete representation of all
> events in a repository? Or API that can be used to serialize and import that
> stream?

Aside from other things mentioned & any issues in fast export/import in
this thread, if you want round-trip correctness you're not going to want
JSON-anything. It's not capable of representing arbitrary binary data.

But in any case, it's not the fast-export format that's the issue, but
how the tools in git.git are munging/rewriting/omitting the repository
data in question...


* Re: Round-tripping fast-export/import changes commit hashes
  2021-03-01 18:06         ` Elijah Newren
@ 2021-03-01 20:04           ` Ævar Arnfjörð Bjarmason
  2021-03-01 20:17             ` Elijah Newren
  2021-03-02 22:12           ` anatoly techtonik
  1 sibling, 1 reply; 19+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-01 20:04 UTC (permalink / raw)
  To: Elijah Newren; +Cc: anatoly techtonik, Git Mailing List


On Mon, Mar 01 2021, Elijah Newren wrote:

> On Sun, Feb 28, 2021 at 11:44 PM anatoly techtonik <techtonik@gmail.com> wrote:
>>
>> On Sun, Feb 28, 2021 at 1:34 PM Ævar Arnfjörð Bjarmason
>> <avarab@gmail.com> wrote:
>> >
>> > I think Elijah means that in the general case people are using fast
>> > export/import to export/import between different systems or in
>> > combination with a utility like git-filter-repo.
>> >
>> > In those cases users are also changing the content of the repository, so
>> > the hashes will change, invalidating signatures.
>> >
>> > But there's also cases where e.g. you don't modify the history, or only
>> > part of it, and could then preserve these headers. I think there's no
>> > inherent reason not to do so, just that nobody's cared enough to submit
>> > patches etc.
>>
>> Is fast-export/import the only way to filter information in `git`? Maybe there
>> is a slow json-export/import tool that gives a complete representation of all
>> events in a repository? Or API that can be used to serialize and import that
>> stream?
>>
>> If no, then I'd like to take a look at where header filtering and serialization
>> takes place. My C skills are at the "hello world" level, so I am not sure I can
>> write a patch. But I can write the logic in Python and ask somebody to port
>> that.
>
> If you are intent on keeping signatures because you know they are
> still valid, then you already know you aren't modifying any
> blobs/trees/commits leading up to those signatures.  If that is the
> case, perhaps you should just avoid exporting the signature or
> anything it depends on, and just export the stuff after that point.
> You can do this with fast-export's --reference-excluded-parents option
> and pass it an exclusion range.  For example:
>
>    git fast-export --reference-excluded-parents ^master~5 --all
>
> and then pipe that through fast-import.
>
>
> In general, I think if fast-export or fast-import are lacking features
> you want, we should add them there, but I don't see how adding
> signature reading to fast-import and signature exporting to
> fast-export makes sense in general.  Even if you assume fast-import
> can process all the bits it is sent (e.g. you extend it to support
> commits without an author, tags without a tagger, signed objects, any
> other extended commit headers), and even if you add flags to
> fast-export to die if there are any bits it doesn't recognize and to
> export all pieces of blobs/trees/tags (e.g. don't add missing authors,
> don't re-encode messages in UTF-8, don't use grafts or replace
> objects, keep extended headers such as signatures, etc.), then it
> still couldn't possibly work in all cases in general.  For example, if
> you had a repository with unusual objects made by ancient or broken
> git versions (such as tree entries in the wrong sort order, or tree
> entries that recorded modes of 040000 instead of 40000 for trees or
> something with perms other than 100644 or 100755 for files), then when
> fast-import goes to recreate these objects using the canonical format
> they will no longer have the same hash and your commit signatures will
> get invalidated.  Other git commands will also refuse to create
> objects with those oddities, even if git accepts ancient objects that
> have them.
>
> So, it's basically impossible to have a "complete representation of
> all events in a repository" that do what you want except for the
> *original* binary format.  (But if you really want to see the original
> binary format, maybe `git cat-file --batch` will be handy to you.)
>
> But I think fast-export's --reference-excluded-parents might come in
> handy for you and let you do what you want.

...to add to that line of thinking, it's also a completely valid
technique to just completely rewrite your repository, then (re-)push the
old signed tags to refs/tags/*.

By default they won't be pulled down as they won't reference commits on
branches you're fetching, and you can also stick them somewhere else
than refs/tags/*, e.g. refs/legacy-tags/*.

None of the commit history will be the same, but the content (mostly)
will, which is usually what matters when checking out an old tag.

Of course this hack has little benefit over just keeping a foo-old.git
repo around, and moving on with new history in your new foo.git.
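As a sketch of that namespace trick (throwaway repos; the refspec and
the refs/legacy-tags/* name are illustrative, any non-clashing namespace
works):

```shell
#!/bin/sh
# Sketch: fetch the pre-rewrite tags into refs/legacy-tags/* of the
# rewritten repo, so they survive without clashing with new tags.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q old
echo v1 > old/f
git -C old add f
git -C old -c user.name=a -c user.email=a@example.com commit -q -m 'v1'
git -C old tag v1
git init -q new
git -C new -c user.name=a -c user.email=a@example.com \
    commit -q --allow-empty -m 'rewritten history'
# old tags land in a side namespace; --no-tags keeps refs/tags/ clean
git -C new fetch -q --no-tags ../old '+refs/tags/*:refs/legacy-tags/*'
git -C new show-ref legacy-tags/v1
```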


* Re: Round-tripping fast-export/import changes commit hashes
  2021-03-01 20:04           ` Ævar Arnfjörð Bjarmason
@ 2021-03-01 20:17             ` Elijah Newren
  0 siblings, 0 replies; 19+ messages in thread
From: Elijah Newren @ 2021-03-01 20:17 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: anatoly techtonik, Git Mailing List

On Mon, Mar 1, 2021 at 12:04 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Mon, Mar 01 2021, Elijah Newren wrote:
>
> > On Sun, Feb 28, 2021 at 11:44 PM anatoly techtonik <techtonik@gmail.com> wrote:
> >>
> >> On Sun, Feb 28, 2021 at 1:34 PM Ævar Arnfjörð Bjarmason
> >> <avarab@gmail.com> wrote:
> >> >
> >> > I think Elijah means that in the general case people are using fast
> >> > export/import to export/import between different systems or in
> >> > combination with a utility like git-filter-repo.
> >> >
> >> > In those cases users are also changing the content of the repository, so
> >> > the hashes will change, invalidating signatures.
> >> >
> >> > But there's also cases where e.g. you don't modify the history, or only
> >> > part of it, and could then preserve these headers. I think there's no
> >> > inherent reason not to do so, just that nobody's cared enough to submit
> >> > patches etc.
> >>
> >> Is fast-export/import the only way to filter information in `git`? Maybe there
> >> is a slow json-export/import tool that gives a complete representation of all
> >> events in a repository? Or API that can be used to serialize and import that
> >> stream?
> >>
> >> If no, then I'd like to take a look at where header filtering and serialization
> >> takes place. My C skills are at the "hello world" level, so I am not sure I can
> >> write a patch. But I can write the logic in Python and ask somebody to port
> >> that.
> >
> > If you are intent on keeping signatures because you know they are
> > still valid, then you already know you aren't modifying any
> > blobs/trees/commits leading up to those signatures.  If that is the
> > case, perhaps you should just avoid exporting the signature or
> > anything it depends on, and just export the stuff after that point.
> > You can do this with fast-export's --reference-excluded-parents option
> > and pass it an exclusion range.  For example:
> >
> >    git fast-export --reference-excluded-parents ^master~5 --all
> >
> > and then pipe that through fast-import.
> >
> >
> > In general, I think if fast-export or fast-import are lacking features
> > you want, we should add them there, but I don't see how adding
> > signature reading to fast-import and signature exporting to
> > fast-export makes sense in general.  Even if you assume fast-import
> > can process all the bits it is sent (e.g. you extend it to support
> > commits without an author, tags without a tagger, signed objects, any
> > other extended commit headers), and even if you add flags to
> > fast-export to die if there are any bits it doesn't recognize and to
> > export all pieces of blobs/trees/tags (e.g. don't add missing authors,
> > don't re-encode messages in UTF-8, don't use grafts or replace
> > objects, keep extended headers such as signatures, etc.), then it
> > still couldn't possibly work in all cases in general.  For example, if
> > you had a repository with unusual objects made by ancient or broken
> > git versions (such as tree entries in the wrong sort order, or tree
> > entries that recorded modes of 040000 instead of 40000 for trees or
> > something with perms other than 100644 or 100755 for files), then when
> > fast-import goes to recreate these objects using the canonical format
> > they will no longer have the same hash and your commit signatures will
> > get invalidated.  Other git commands will also refuse to create
> > objects with those oddities, even if git accepts ancient objects that
> > have them.
> >
> > So, it's basically impossible to have a "complete representation of
> > all events in a repository" that do what you want except for the
> > *original* binary format.  (But if you really want to see the original
> > binary format, maybe `git cat-file --batch` will be handy to you.)
> >
> > But I think fast-export's --reference-excluded-parents might come in
> > handy for you and let you do what you want.
>
> ...to add to that line of thinking, it's also a completely valid
> > technique to just completely rewrite your repository, then (re-)push the
> old signed tags to refs/tags/*.

The repository in question didn't have any signed tags, just a signed commit.

> By default they won't be pulled down as they won't reference commits on
> branches you're fetching, and you can also stick them somewhere else
> than refs/tags/*, e.g. refs/legacy-tags/*.
>
> None of the commit history will be the same, but the content (mostly)
> will, which is usually what matters when checking out an old tag.
>
> Of course this hack has little benefit over just keeping a foo-old.git
> repo around, and moving on with new history in your new foo.git.


* Re: Round-tripping fast-export/import changes commit hashes
  2021-03-01 17:34         ` Junio C Hamano
@ 2021-03-02 21:52           ` anatoly techtonik
  2021-03-03  7:13             ` Johannes Sixt
  0 siblings, 1 reply; 19+ messages in thread
From: anatoly techtonik @ 2021-03-02 21:52 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, Elijah Newren, Git Mailing List

On Mon, Mar 1, 2021 at 8:34 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> It is just that the output stream of fast-export is designed to be
> "filtered" and the expected use case is to modify the stream somehow
> before feeding it to fast-import.  And because every object name and
> commit & tag signature depends on everything that they can reach,
> even a single bit change in an earlier part of the history will
> invalidate any and all signatures on objects that can reach it.  So
> instead of originally-signed objects whose signatures are now
> invalid, "fast-export | fast-import" pipeline would give you
> originally-signed objects whose signatures are stripped.

I need to merge two unrelated repos and I am using `reposurgeon`
http://www.catb.org/~esr/reposurgeon/repository-editing.html
to do this while preserving timestamps and commit order. Its model of
operation is to read revisions into memory from git using fast-export,
operate on them, and then rebuild the stream back into a git repo with
fast-import. The problem is that in the exported dump the information
is already lost, and the resulting commits are "not mergeable".
Basically all GitHub repositories where people have edited `README.md`
online are "not mergeable" after this point, because commits made
through GitHub's web UI are signed.

For my use case, where I just need to attach another branch in
time without altering the original commits in any way, `reposurgeon`
cannot be used.

> Admittedly, there is a narrow use case where such a signature
> invalidation is not an issue.  If you run fast-export and feed that
> straight into fast-import without doing any modification to the
> stream, then you are getting a bit-for-bit identical copy.

I did just that and signatures got stripped, altering history.

git -C protonfixes fast-export --all --reencode=no |
    (cd protoimported && git fast-import)

-- 
anatoly t.


* Re: Round-tripping fast-export/import changes commit hashes
  2021-03-01 18:06         ` Elijah Newren
  2021-03-01 20:04           ` Ævar Arnfjörð Bjarmason
@ 2021-03-02 22:12           ` anatoly techtonik
  1 sibling, 0 replies; 19+ messages in thread
From: anatoly techtonik @ 2021-03-02 22:12 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Ævar Arnfjörð Bjarmason, Git Mailing List

On Mon, Mar 1, 2021 at 9:06 PM Elijah Newren <newren@gmail.com> wrote:
> On Sun, Feb 28, 2021 at 11:44 PM anatoly techtonik <techtonik@gmail.com> wrote:
> For example:
>
>    git fast-export --reference-excluded-parents ^master~5 --all
>
> and then pipe that through fast-import.

That may come in handy, but if certain parents are excluded from the
stream, it becomes impossible to find them in order to reference them
and attach branches to them.

> Other git commands will also refuse to create
> objects with those oddities, even if git accepts ancient objects that
> have them.

Are there any `lint` commands that can detect and warn about those
oddities?

> (But if you really want to see the original
> binary format, maybe `git cat-file --batch` will be handy to you.)

Looks good. Is there a way to import it back? And how hard would it
be to write a parser for it? Is there a specification for its fields?
-- 
anatoly t.


* Re: Round-tripping fast-export/import changes commit hashes
  2021-03-01 20:02         ` Ævar Arnfjörð Bjarmason
@ 2021-03-02 22:23           ` anatoly techtonik
  0 siblings, 0 replies; 19+ messages in thread
From: anatoly techtonik @ 2021-03-02 22:23 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Elijah Newren, Git Mailing List

On Mon, Mar 1, 2021 at 11:02 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> Aside from other things mentioned & any issues in fast export/import in
> this thread, if you want round-trip correctness you're not going to want
> JSON-anything. It's not capable of representing arbitrary binary data.

Yes, binary data would need to be explicitly represented in base64 or
a similar encoding, just as ordinary strings need escape characters.
-- 
anatoly t.


* Re: Round-tripping fast-export/import changes commit hashes
  2021-03-02 21:52           ` anatoly techtonik
@ 2021-03-03  7:13             ` Johannes Sixt
  2021-03-04  0:55               ` Junio C Hamano
  0 siblings, 1 reply; 19+ messages in thread
From: Johannes Sixt @ 2021-03-03  7:13 UTC (permalink / raw)
  To: anatoly techtonik
  Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Git Mailing List

Am 02.03.21 um 22:52 schrieb anatoly techtonik:
> For my use case, where I just need to attach another branch in
> time without altering original commits in any way, `reposurgeon`
> cannot be used.

What do you mean by "attach another branch in time"? Because if you
really do not want to alter original commits in any way, perhaps you
only want `git fetch /the/other/repository master:the-other-one-s-master`?

-- Hannes


* Re: Round-tripping fast-export/import changes commit hashes
  2021-03-03  7:13             ` Johannes Sixt
@ 2021-03-04  0:55               ` Junio C Hamano
  2021-08-09 15:45                 ` anatoly techtonik
  0 siblings, 1 reply; 19+ messages in thread
From: Junio C Hamano @ 2021-03-04  0:55 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: anatoly techtonik, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Git Mailing List

Johannes Sixt <j6t@kdbg.org> writes:

> Am 02.03.21 um 22:52 schrieb anatoly techtonik:
>> For my use case, where I just need to attach another branch in
>> time without altering original commits in any way, `reposurgeon`
> >> cannot be used.
>
> What do you mean by "attach another branch in time"? Because if you
> really do not want to alter original commits in any way, perhaps you
> only want `git fetch /the/other/repository master:the-other-one-s-master`?

Yeah, I had the same impression.  If a bit-for-bit identical copy of
the original history is needed, then fetching from the original
repository (either directly or via a bundle) would be a much simpler
and performant way.

Thanks.


* Re: Round-tripping fast-export/import changes commit hashes
  2021-03-04  0:55               ` Junio C Hamano
@ 2021-08-09 15:45                 ` anatoly techtonik
  2021-08-09 18:15                   ` Elijah Newren
  0 siblings, 1 reply; 19+ messages in thread
From: anatoly techtonik @ 2021-08-09 15:45 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Sixt, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Git Mailing List

On Thu, Mar 4, 2021 at 3:56 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Johannes Sixt <j6t@kdbg.org> writes:
>
> > Am 02.03.21 um 22:52 schrieb anatoly techtonik:
> >> For my use case, where I just need to attach another branch in
> >> time without altering original commits in any way, `reposurgeon`
> > >> cannot be used.
> >
> > What do you mean by "attach another branch in time"? Because if you
> > really do not want to alter original commits in any way, perhaps you
> > only want `git fetch /the/other/repository master:the-other-one-s-master`?
>
> Yeah, I had the same impression.  If a bit-for-bit identical copy of
> the original history is needed, then fetching from the original
> repository (either directly or via a bundle) would be a much simpler
> and performant way.

The goal is to have an editable stream that, if left unedited, stays
bit-for-bit identical after the round trip, so that external tools like
`reposurgeon` could operate on that stream and be audited.

Right now, because the repository
https://github.com/simons-public/protonfixes contains a signed commit
near the very start of its history, even a plain fast-export piped into
fast-import with git itself fails that check.

I understand that patching `git` to add `--complete` to fast-import is
realistically beyond my coding abilities, and my only option is to parse
the binary stream produced by `git cat-file --batch`, which I also won't
be able to do without a specification.

P.S. I am resurrecting the old thread, because my problem with editing
the history of the repository with an external tool still cannot be solved.
-- 
anatoly t.


* Re: Round-tripping fast-export/import changes commit hashes
  2021-08-09 15:45                 ` anatoly techtonik
@ 2021-08-09 18:15                   ` Elijah Newren
  2021-08-10 15:51                     ` anatoly techtonik
  0 siblings, 1 reply; 19+ messages in thread
From: Elijah Newren @ 2021-08-09 18:15 UTC (permalink / raw)
  To: anatoly techtonik
  Cc: Junio C Hamano, Johannes Sixt,
	Ævar Arnfjörð Bjarmason, Git Mailing List

On Mon, Aug 9, 2021 at 8:45 AM anatoly techtonik <techtonik@gmail.com> wrote:
>
> On Thu, Mar 4, 2021 at 3:56 AM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > Johannes Sixt <j6t@kdbg.org> writes:
> >
> > > Am 02.03.21 um 22:52 schrieb anatoly techtonik:
> > >> For my use case, where I just need to attach another branch in
> > >> time without altering original commits in any way, `reposurgeon`
> > > >> cannot be used.
> > >
> > > What do you mean by "attach another branch in time"? Because if you
> > > really do not want to alter original commits in any way, perhaps you
> > > only want `git fetch /the/other/repository master:the-other-one-s-master`?
> >
> > Yeah, I had the same impression.  If a bit-for-bit identical copy of
> > the original history is needed, then fetching from the original
> > repository (either directly or via a bundle) would be a much simpler
> > and performant way.
>
> The goal is to have an editable stream, which, if left without edits, would
> be bit-by-bit identical, so that external tools like `reposurgeon` could
> operate on that stream and be audited.

There were some patches proposed some months back[1] to make
fast-import allow importing signed commits...except that they
unconditionally kept the signatures and didn't do any validation,
which would have resulted in invalid signatures if any edits happened.
I suggested adding signature verification (which would allow options
like erroring out if they didn't match, or dropping signatures when
they didn't match but keeping them otherwise).  That'd help usecases
like yours.  The author wasn't interested in implementing that
suggestion (and it's a low priority for me that I may never get around
to).  The series also wasn't pushed through and eventually was
dropped.

However, that wouldn't fully solve your stated goal.  As already
mentioned earlier in this thread, I don't think your stated goal is
realistic; the only complete bit-for-bit identical representation of
the repository is the original binary format.

Your stated goal here, however, isn't required for solving the usecase
you present.

[1] https://lore.kernel.org/git/20210430232537.1131641-1-lukeshu@lukeshu.com/

> Right now, because the repository
> https://github.com/simons-public/protonfixes contains a signed commit
> right from the start, the simple fast-export and fast-import with git itself
> fails the check.

Yes, and I mentioned several other reasons why a round-trip from
fast-export through fast-import cannot be relied upon to preserve
object hashes.

> I understand that patching `git` to add `--complete` to fast-import is
> realistically beyond my coding abilities, and my only option is to parse

It's more patching than that which would be required:
(1) It'd be both fast-export and fast-import that would need patching,
not just fast-import.
(2) --complete is a bit of a misnomer too, because it's not just
get-all-the-data, it's keep-the-data-in-the-original-format.  If
objects had modes of 040000 instead of 40000, despite meaning the same
thing, you'd have to prevent canonicalization and store them as the
original recorded value or you'd get a different hash.  Ditto for
commit messages with extra data after a NUL byte, and a variety of
other possible issues.
(3) fast-export works by looking for the relevant bits it knows how to
export.  You'd have to redesign it to fully parse every bit of data in
each object it looks at, throw errors if it didn't recognize any, and
make sure it exports all the bits.  That might be difficult since it's
hard to know how to future proof it.  How do you guarantee you've
printed every field in a commit struct, when that struct might gain
new fields in the future?  (This is especially challenging since
fast-export/fast-import might not be considered core tools, or at
least don't get as much attention as the "truly core" parts of git;
see https://lore.kernel.org/git/xmqq36mxdnpz.fsf@gitster-ct.c.googlers.com/)
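The mode-canonicalization point in (2) can be illustrated without git at all: an object id is just the SHA-1 of `<type> SP <size> NUL <body>`, so two tree encodings that mean the same subdirectory but spell the mode differently hash to different ids. A sketch (the entry name and sub-tree id below are made up):

```python
import hashlib

def git_object_id(obj_type: bytes, body: bytes) -> str:
    """Compute a git (SHA-1) object id: SHA-1 over '<type> <size>\\0<body>'."""
    header = obj_type + b" " + str(len(body)).encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

# A tree entry is "<mode> <name>\0<20-byte oid>".  These two bodies mean
# the same subdirectory, but one spells the mode with a leading zero.
sub_oid = bytes.fromhex("aa" * 20)          # hypothetical sub-tree id
canonical = b"40000 dir\x00" + sub_oid
padded    = b"040000 dir\x00" + sub_oid
print(git_object_id(b"tree", canonical) == git_object_id(b"tree", padded))  # False
```

Re-exporting the padded form canonicalizes it, and the hash changes even though nothing semantic did.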

> the binary stream produced by `git cat-file --batch`, which I also won't
> be able to do without specification.

The specification is already available in the manual.  Just run `git
cat-file --help` to see it.  Let me quote part of it for you:

       For example, --batch without a custom format would produce:

           <sha1> SP <type> SP <size> LF
           <contents> LF
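A minimal parser for that format is short; the sketch below walks a `--batch` byte stream (a hand-built literal here, standing in for real `git cat-file --batch` output):

```python
def parse_cat_file_batch(stream: bytes) -> list[tuple[str, str, bytes]]:
    """Split a `git cat-file --batch` stream into (oid, type, contents)
    records, per the '<oid> SP <type> SP <size> LF <contents> LF' format."""
    records = []
    pos = 0
    while pos < len(stream):
        nl = stream.index(b"\n", pos)
        oid, otype, size = stream[pos:nl].split(b" ")
        body = stream[nl + 1 : nl + 1 + int(size)]
        records.append((oid.decode(), otype.decode(), body))
        pos = nl + 1 + int(size) + 1   # skip the LF that follows <contents>
    return records

# a hand-built two-record stream: a 6-byte blob and a 3-byte blob
demo = b"1" * 40 + b" blob 6\nhello\n\n" + b"2" * 40 + b" blob 3\nab\n\n"
print(parse_cat_file_batch(demo))
```

The size field is what makes binary-safe parsing possible: the reader consumes exactly that many content bytes rather than scanning for a delimiter.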

> P.S. I am resurrecting the old thread, because my problem with editing
> the history of the repository with an external tool still can not be solved.

Sure it can, just use fast-export's --reference-excluded-parents
option and don't export commits you know you won't need to change.

Or, if for some reason you are really set on exporting everything and
then editing, then go ahead and create the full fast-export output,
including with all your edits, and then post-process it manually
before feeding to fast-import.  In particular, in the post-processing
step find the commits that were problematic that you know won't be
modified, such as your signed commit.  Then go edit that fast-export
dump and (a) remove the dump of the no-longer-signed signed commit
(because you don't want it), and (b) replace any references to the
no-longer-signed-commit (e.g. "from :12") to instead use the hash of
the actual original signed commit (e.g. "from
d3d24b63446c7d06586eaa51764ff0c619113f09").  If you do that, then git
fast-import will just build the new commits on the existing signed
commit instead of on some new commit that is missing the signature.
Technically, you can even skip step (a), as all it will do is produce
an extra commit in your repository that isn't used and thus will be
garbage collected later.
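Step (b) can be sketched as a small substitution over the stream text (the mark number and commit hash are the hypothetical ":12" / signed-commit example from the paragraph above):

```python
def pin_parent(stream: str, mark: int, original_oid: str) -> str:
    """Replace parent references to fast-export mark :N with the original
    commit hash, so fast-import builds on the existing signed commit.
    Naive: a robust tool must skip `data` payloads, which could contain
    lines that happen to look like parent references."""
    for keyword in ("from", "merge"):
        stream = stream.replace(f"{keyword} :{mark}\n",
                                f"{keyword} {original_oid}\n")
    return stream

# hypothetical stream fragment, mirroring the example above
snippet = "commit refs/heads/master\nmark :13\nfrom :12\n"
print(pin_parent(snippet, 12, "d3d24b63446c7d06586eaa51764ff0c619113f09"))
```

Including the trailing newline in the pattern keeps `:12` from accidentally matching `:120`.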


* Re: Round-tripping fast-export/import changes commit hashes
  2021-08-09 18:15                   ` Elijah Newren
@ 2021-08-10 15:51                     ` anatoly techtonik
  2021-08-10 17:57                       ` Elijah Newren
  0 siblings, 1 reply; 19+ messages in thread
From: anatoly techtonik @ 2021-08-10 15:51 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Junio C Hamano, Johannes Sixt,
	Ævar Arnfjörð Bjarmason, Git Mailing List

On Mon, Aug 9, 2021 at 9:15 PM Elijah Newren <newren@gmail.com> wrote:
>
> The author wasn't interested in implementing that
> suggestion (and it's a low priority for me that I may never get around
> to).  The series also wasn't pushed through and eventually was
> dropped.

What would it take to validate the commit signature? Isn't it the
same as validating a signed tag? Is it possible to merge at least the
`fast-export` part? The effect of the round trip would be the same, but
at least external tools would be able to detect signed commits and warn
users.

> [1] https://lore.kernel.org/git/20210430232537.1131641-1-lukeshu@lukeshu.com/

> Yes, and I mentioned several other reasons why a round-trip from
> fast-export through fast-import cannot be relied upon to preserve
> object hashes.

Yes, I understand that. What would be the recommended way to detect
which commits would change as a result of the round-trip? It will then
be possible to warn users in `reposurgeon`'s `lint` command.

> (3) fast-export works by looking for the relevant bits it knows how to
> export.  You'd have to redesign it to fully parse every bit of data in
> each object it looks at, throw errors if it didn't recognize any, and
> make sure it exports all the bits.  That might be difficult since it's
> hard to know how to future proof it.  How do you guarantee you've
> printed every field in a commit struct, when that struct might gain
> new fields in the future?  (This is especially challenging since
> fast-export/fast-import might not be considered core tools, or at
> least don't get as much attention as the "truly core" parts of git;
> see https://lore.kernel.org/git/xmqq36mxdnpz.fsf@gitster-ct.c.googlers.com/)

Looks like the only way to make it forward compatible is to introduce
some kind of versioning and a validation schema, like protobuf. Otherwise,
writing an importer and exporter for each and every thing that may be
encountered in a git stream is unrealistic, yes.

> > P.S. I am resurrecting the old thread, because my problem with editing
> > the history of the repository with an external tool still can not be solved.
>
> Sure it can, just use fast-export's --reference-excluded-parents
> option and don't export commits you know you won't need to change.

How does `--reference-excluded-parents` help to read signed commits?

`reposurgeon` needs all commits to select those that are needed by
different criteria. It is hard to tell which commits are not important without
reading and processing them first.

> Or, if for some reason you are really set on exporting everything and
> then editing, then go ahead and create the full fast-export output,
> including with all your edits, and then post-process it manually
> before feeding to fast-import.  In particular, in the post-processing
> step find the commits that were problematic that you know won't be
> modified, such as your signed commit.  Then go edit that fast-export
> dump and (a) remove the dump of the no-longer-signed signed commit
> (because you don't want it), and (b) replace any references to the
> no-longer-signed-commit (e.g. "from :12") to instead use the hash of
> the actual original signed commit (e.g. "from
> d3d24b63446c7d06586eaa51764ff0c619113f09").  If you do that, then git
> fast-import will just build the new commits on the existing signed
> commit instead of on some new commit that is missing the signature.
> Technically, you can even skip step (a), as all it will do is produce
> an extra commit in your repository that isn't used and thus will be
> garbage collected later.

The problem is detecting the problematic signed commits in the first
place, because as I understand it, `fast-export` gives no indication
that commits were signed before the export.
-- 
anatoly t.


* Re: Round-tripping fast-export/import changes commit hashes
  2021-08-10 15:51                     ` anatoly techtonik
@ 2021-08-10 17:57                       ` Elijah Newren
  0 siblings, 0 replies; 19+ messages in thread
From: Elijah Newren @ 2021-08-10 17:57 UTC (permalink / raw)
  To: anatoly techtonik
  Cc: Junio C Hamano, Johannes Sixt,
	Ævar Arnfjörð Bjarmason, Git Mailing List

On Tue, Aug 10, 2021 at 8:51 AM anatoly techtonik <techtonik@gmail.com> wrote:
>
> On Mon, Aug 9, 2021 at 9:15 PM Elijah Newren <newren@gmail.com> wrote:
> >
> > The author wasn't interested in implementing that
> > suggestion (and it's a low priority for me that I may never get around
> > to).  The series also wasn't pushed through and eventually was
> > dropped.
>
> What would it take to validate the commit signature?

I'm not familiar with any of the gpg libraries, and don't even have an
active gpg key.  So, I don't know.  Some quick grepping shows that we
have gpg-interface.[ch], so we have some functions we can apparently
call.

> Isn't it the same as validating a signed tag?

gpg signatures of tags are somewhat different than gpg signatures of commits:

* gpg signatures for tags are simply part of the annotated tag message
* gpg signatures for commits are stored in a separate commit header,
not just as extra text at the end of the commit message

This gpg signature handling for tags means that fast-import isn't even
aware of whether the tag is signed; it simply sees a commit message
and records it.  fast-export also would have been unaware and just
exported them as-is if someone hadn't written some special parsing for
it.  fast-import would need to do similar special parsing to become
aware of whether the tags are signed or not.  For now, fast-import
just keeps any tag messages as-is, and thus potentially writes invalid
tag signatures.  (The only way people have to control this is at the
fast-export side with the --signed-tags flag, which gives you the
choices of abort, strip, or keep the signatures even though they'll
likely be wrong.)  If fast-import were to gain knowledge of tag
signatures and an ability to validate them, it could offer smarter
options like keep-if-valid-and-discard-otherwise.

In contrast, the fact that gpg signatures for commits have to be
recorded as a separate commit header means they cannot be recorded in
fast-import without additional code changes.  And both the fast-export
and fast-import sides have to be made aware of and specially handle
the commit signatures for them to even get propagated, let alone
validated.
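Given that layout, detecting whether a commit carries a signature only requires scanning the raw commit's header section (everything before the first blank line) for a `gpgsig` field. A sketch on raw bytes as printed by `git cat-file commit <oid>`; the sample object below is fabricated:

```python
def commit_has_gpgsig(raw: bytes) -> bool:
    """True if a raw commit object carries a gpgsig (or gpgsig-sha256)
    header.  Continuation lines of the multi-line header start with a
    space, so they never match; neither does the commit message, which
    sits after the first blank line."""
    headers = raw.split(b"\n\n", 1)[0]
    return any(line.startswith((b"gpgsig ", b"gpgsig-sha256 "))
               for line in headers.split(b"\n"))

signed = (b"tree " + b"0" * 40 + b"\n"
          + b"parent " + b"1" * 40 + b"\n"
          + b"author A <a@example.com> 1600000000 +0000\n"
          + b"committer A <a@example.com> 1600000000 +0000\n"
          + b"gpgsig -----BEGIN PGP SIGNATURE-----\n"
          + b" (signature body, continuation lines indented by one space)\n"
          + b" -----END PGP SIGNATURE-----\n"
          + b"\nUpdate README.md\n")
print(commit_has_gpgsig(signed))  # True
```

Nothing like this exists in the fast-export stream itself, which is exactly why the header has to be inspected on the raw object.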

> Is it possible to merge at least the `fast-export` part? The effect
> of the round trip would be the same, but at least external tools would
> be able to detect signed commits and warn users.

The fact that it wasn't merged suggests there was some issue raised in
feedback that wasn't addressed.  I don't remember if that was the case
or not, but someone would have to find out, address any remaining
issues pointed out by feedback, and champion it through.

Personally, I don't like shoving a half solution through and think
there needs to be validation on the fast-import side added at the same
time, but others may disagree with me.  I have plenty of other
projects to work on, though, so whoever does the work will more likely
be the ones to decide.

> > [1] https://lore.kernel.org/git/20210430232537.1131641-1-lukeshu@lukeshu.com/
>
> > Yes, and I mentioned several other reasons why a round-trip from
> > fast-export through fast-import cannot be relied upon to preserve
> > object hashes.
>
> Yes, I understand that. What would be the recommended way to detect
> which commits would change as a result of the round-trip? It will then
> be possible to warn users in `reposurgeon`'s `lint` command.

There is no function or command that would check that kind of thing
short of doing the round-trip.  I provided a list of reasons IDs could
change as a starting point in case anyone wanted to try to write a
function or command that could check, and to point out that it is a
long list and might grow in the future.

I think practically, if you're doing a one-shot export (as I
originally assumed from your email), that you'd find out and then just
manually fix things up by hand.  If your goal is writing or changing a
general purpose filtering tool, then I'd suggest instead using the
alternate technique I outlined in the other thread you started at [2].

[2] https://lore.kernel.org/git/CABPp-BH4dcsW52immJpTjgY5LjaVfKrY9MaUOnKT3byi2tBPpg@mail.gmail.com/

> > (3) fast-export works by looking for the relevant bits it knows how to
> > export.  You'd have to redesign it to fully parse every bit of data in
> > each object it looks at, throw errors if it didn't recognize any, and
> > make sure it exports all the bits.  That might be difficult since it's
> > hard to know how to future proof it.  How do you guarantee you've
> > printed every field in a commit struct, when that struct might gain
> > new fields in the future?  (This is especially challenging since
> > fast-export/fast-import might not be considered core tools, or at
> > least don't get as much attention as the "truly core" parts of git;
> > see https://lore.kernel.org/git/xmqq36mxdnpz.fsf@gitster-ct.c.googlers.com/)
>
> Looks like the only way to make it forward compatible is to introduce
> some kind of versioning and a validation schema, like protobuf. Otherwise,
> writing an importer and exporter for each and every thing that may be
> encountered in a git stream is unrealistic, yes.
>
> > > P.S. I am resurrecting the old thread, because my problem with editing
> > > the history of the repository with an external tool still can not be solved.
> >
> > Sure it can, just use fast-export's --reference-excluded-parents
> > option and don't export commits you know you won't need to change.
>
> How does `--reference-excluded-parents` help to read signed commits?

It doesn't.  I was assuming you were doing a one-shot export, namely
of the repository you linked to,
https://github.com/simons-public/protonfixes, and that you already
knew which commits were not going to be changed (because you pointed
them out in your email to the list) -- and in fact that it was only a
single commit affected, as you mentioned.

Armed with that knowledge, you could just export the parts of the
repository AFTER that commit, and use --reference-excluded-parents to
make sure the fast-export stream built upon them rather than squashing
all changes up to that point into the first commit in the stream.

> `reposurgeon` needs all commits to select those that are needed by
> different criteria. It is hard to tell which commits are not important without
> reading and processing them first.

Right, so you aren't trying to just handle this one repository, but
modify/create a general purpose tool that does so.  See my response in
the other thread you started, again at [2] above.

> > Or, if for some reason you are really set on exporting everything and
> > then editing, then go ahead and create the full fast-export output,
> > including with all your edits, and then post-process it manually
> > before feeding to fast-import.  In particular, in the post-processing
> > step find the commits that were problematic that you know won't be
> > modified, such as your signed commit.  Then go edit that fast-export
> > dump and (a) remove the dump of the no-longer-signed signed commit
> > (because you don't want it), and (b) replace any references to the
> > no-longer-signed-commit (e.g. "from :12") to instead use the hash of
> > the actual original signed commit (e.g. "from
> > d3d24b63446c7d06586eaa51764ff0c619113f09").  If you do that, then git
> > fast-import will just build the new commits on the existing signed
> > commit instead of on some new commit that is missing the signature.
> > Technically, you can even skip step (a), as all it will do is produce
> > an extra commit in your repository that isn't used and thus will be
> > garbage collected later.
>
> The problem is detecting the problematic signed commits in the first
> place, because as I understand it, `fast-export` gives no indication
> that commits were signed before the export.

Signed commits is just one issue, and you'll have to add special code
to handle a bunch of other special cases if you go down this route.
I'd rephrase the problem.  You want to know when _your tool_ (e.g.
reposurgeon since you refer to it multiple times; I'm guessing you're
contributing to it?) has not modified a commit or any of its
ancestors, and when it hasn't, then _your tool_ should remove that
commit from the fast-export stream and replace any references to it by
the original commit's object id.  I outlined how to do this in [2],
referenced above, making use of the --show-original-ids flag to
fast-export.  If you do that, then for any commits which you haven't
modified (including not modifying any of its ancestors), then you'll
keep the same commits as-is with no stripping of gpg-signatures or
canonicalization of objects, so that you'll have the exact same commit
IDs.  Further, you can do this today, without any changes to git
fast-export or git fast-import.
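The bookkeeping that `--show-original-ids` enables can be sketched as follows (the `original-oid` line follows each `mark` line in the stream; the demo stream below is made up, and a real parser would also need to skip `data` payloads):

```python
def mark_to_original_oid(stream: str) -> dict[int, str]:
    """Map fast-export marks to the original object ids recorded by
    `git fast-export --show-original-ids`.  Naive line scan: a robust
    tool must skip `data` payloads."""
    mapping: dict[int, str] = {}
    mark = None
    for line in stream.splitlines():
        if line.startswith("mark :"):
            mark = int(line[len("mark :"):])
        elif line.startswith("original-oid ") and mark is not None:
            mapping[mark] = line.split(" ", 1)[1]
            mark = None
    return mapping

# fabricated fragment: one blob and one commit, each with an original-oid
demo = ("blob\nmark :1\noriginal-oid " + "a" * 40 + "\ndata 3\nhi\n\n"
        + "commit refs/heads/master\nmark :2\noriginal-oid " + "b" * 40 + "\n")
print(mark_to_original_oid(demo))
```

With that map, any unmodified commit's mark references can be rewritten to the original hash before the stream reaches fast-import.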


end of thread, other threads:[~2021-08-10 18:25 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-27 12:31 Round-tripping fast-export/import changes commit hashes anatoly techtonik
2021-02-27 17:48 ` Elijah Newren
2021-02-28 10:00   ` anatoly techtonik
2021-02-28 10:34     ` Ævar Arnfjörð Bjarmason
2021-03-01  7:44       ` anatoly techtonik
2021-03-01 17:34         ` Junio C Hamano
2021-03-02 21:52           ` anatoly techtonik
2021-03-03  7:13             ` Johannes Sixt
2021-03-04  0:55               ` Junio C Hamano
2021-08-09 15:45                 ` anatoly techtonik
2021-08-09 18:15                   ` Elijah Newren
2021-08-10 15:51                     ` anatoly techtonik
2021-08-10 17:57                       ` Elijah Newren
2021-03-01 18:06         ` Elijah Newren
2021-03-01 20:04           ` Ævar Arnfjörð Bjarmason
2021-03-01 20:17             ` Elijah Newren
2021-03-02 22:12           ` anatoly techtonik
2021-03-01 20:02         ` Ævar Arnfjörð Bjarmason
2021-03-02 22:23           ` anatoly techtonik
