git@vger.kernel.org mailing list mirror (one of many)
* Round-tripping fast-export/import changes commit hashes
@ 2021-02-27 12:31 anatoly techtonik
  2021-02-27 17:48 ` Elijah Newren
  0 siblings, 1 reply; 21+ messages in thread
From: anatoly techtonik @ 2021-02-27 12:31 UTC (permalink / raw)
  To: Git Mailing List

Hi.

I can't get the same commit hashes after a fast-export and then fast-import of
this repository, without making any edits: https://github.com/simons-public/protonfixes
I have no idea what causes this or how to prevent it from happening. Are
there any workarounds?

What did you do before the bug happened? (Steps to reproduce your issue)

  #!/bin/bash

  git clone https://github.com/simons-public/protonfixes.git
  git -C protonfixes log --format=oneline | tail -n 4

  git init protoimported
  git -C protonfixes fast-export --all --reencode=no | (cd protoimported && git fast-import)
  git -C protoimported log --format=oneline | tail -n 4

What did you expect to happen? (Expected behavior)

  Expect imported repo to match exported.

What happened instead? (Actual behavior)

  All hashes are different; the imported repo diverged at the second commit.

What's different between what you expected and what actually happened?

  The log of hashes from initial repo:

    + git -C protonfixes log --format=oneline
    + tail -n 4
    1c0cf2c8e742e673dba9fd1a09afd12a25c25571 Update README.md
    367d61f9b2a799accbdaeed5d64f9be914ca0f7a Updated zip link
    d3d24b63446c7d06586eaa51764ff0c619113f09 Update README.md
    7a43ca89ff7a70127ac9ca0f10b6eaaa34f2f69c Initial commit

  The log from imported repo:

    + git -C protoimported log --format=oneline
    + tail -n 4
    a27ec5d2e4c562f40e693e0b4149959d2b69bf21 Update README.md
    e59cf92be79c47984e9f94bfad912e5a29dfa5e0 Updated zip link
    fb6498f62af783d2e943770f90bc642cf5c9ec9c Update README.md
    7a43ca89ff7a70127ac9ca0f10b6eaaa34f2f69c Initial commit

[System Info]
git version:
git version 2.31.0.rc0
cpu: x86_64
built from commit: 225365fb5195e804274ab569ac3cc4919451dc7f
sizeof-long: 8
sizeof-size_t: 8
shell-path: /bin/sh
uname: Linux 5.8.0-43-generic #49-Ubuntu SMP Fri Feb 5 03:01:28 UTC 2021 x86_64
compiler info: gnuc: 10.2
libc info: glibc: 2.32
$SHELL (typically, interactive shell): /usr/bin/zsh


[Enabled Hooks]
not run from a git repository - no hooks to show
-- 
anatoly t.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Round-tripping fast-export/import changes commit hashes
  2021-02-27 12:31 Round-tripping fast-export/import changes commit hashes anatoly techtonik
@ 2021-02-27 17:48 ` Elijah Newren
  2021-02-28 10:00   ` anatoly techtonik
  0 siblings, 1 reply; 21+ messages in thread
From: Elijah Newren @ 2021-02-27 17:48 UTC (permalink / raw)
  To: anatoly techtonik; +Cc: Git Mailing List

Hi,

On Sat, Feb 27, 2021 at 4:37 AM anatoly techtonik <techtonik@gmail.com> wrote:
>
> Hi.
>
> I can't get the same commit hashes after a fast-export and then fast-import of
> this repository, without making any edits: https://github.com/simons-public/protonfixes
> I have no idea what causes this or how to prevent it from happening. Are
> there any workarounds?

Your second commit is signed.  Fast-export strips any extended headers
on commits, such as GPG signatures, because there's no way to keep
them in general.  In the special case that you aren't making *any*
changes to the repository and will import it as-is, you could
theoretically keep the signatures, but you don't need fast-export in
such a case so no one ever bothered to implement commit signature
handling in fast-export and fast-import.  If you make any changes
whatsoever to the commits before the signature (including importing
them to a different system), then the signature would be invalid.
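The stripping can be seen directly in the raw object. A minimal sketch with a throwaway repo (the commit created here is unsigned, so no `gpgsig` header appears; on a signed commit it would sit between the standard headers and the blank line before the message):

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo"
git -C "$tmp/repo" -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m init
# Print the raw commit object; a GPG-signed commit would additionally
# carry a "gpgsig" extended header here, which fast-export drops.
git -C "$tmp/repo" cat-file commit HEAD
```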

You probably don't want to hear this, but there are no workarounds.

There are also other things that will prevent a simple fast-export |
fast-import pipeline from preserving your history as-is besides signed
commits (most of these are noted in the "Inherited Limitations"
section over at
https://htmlpreview.github.io/?https://github.com/newren/git-filter-repo/blob/docs/html/git-filter-repo.html):

  * any other form of extended header; fast-export only looks for the
headers it knows and exports those
  * grafts and replace objects will just get rewritten (and if they
cause any cycles, those cycles and anything depending on them are
dropped)
  * commits without an author will be given one matching the committer
(hopefully you don't have these, but if you do...)
  * tags that are missing a tagger are also a problem (hopefully you
don't have these, but if you do...)
  * annotated or signed tags outside the refs/tags/ namespace will get
renamed weirdly
  * commits by default are re-encoded into UTF-8, though I notice you
did pass --reencode=no to handle this

Hope that at least explains things for you, even if it doesn't give
you a workaround or a solution.


* Re: Round-tripping fast-export/import changes commit hashes
  2021-02-27 17:48 ` Elijah Newren
@ 2021-02-28 10:00   ` anatoly techtonik
  2021-02-28 10:34     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 21+ messages in thread
From: anatoly techtonik @ 2021-02-28 10:00 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List

On Sat, Feb 27, 2021 at 8:49 PM Elijah Newren <newren@gmail.com> wrote:
>
> Your second commit is signed.  Fast-export strips any extended headers
> on commits, such as GPG signatures, because there's no way to keep
> them in general.

Why is it not possible to encode them with base64 and insert into the
stream?

> There are also other things that will prevent a simple fast-export |
> fast-import pipeline from preserving your history as-is besides signed
> commits (most of these are noted in the "Inherited Limitations"
> section over at
> https://htmlpreview.github.io/?https://github.com/newren/git-filter-repo/blob/docs/html/git-filter-repo.html):

Is there any way to check what commits will be altered as a result of
`fast-export` and why? Right now I don't see that it is reported.

> Hope that at least explains things for you, even if it doesn't give
> you a workaround or a solution.

Thanks. That is very helpful to know.

The reason I am asking is that I tried to merge two repos with
`reposurgeon`, which operates on `fast-export` data. It basically
merges a GitHub wiki into the main repo.

After successfully merging them I still cannot send a PR, because
the stripped info produces a huge number of changes.
It can be seen here:

https://github.com/simons-public/protonfixes/compare/master...techtonik:master

I tracked this behaviour in `reposurgeon` in this issue
https://gitlab.com/esr/reposurgeon/-/issues/344
-- 
anatoly t.


* Re: Round-tripping fast-export/import changes commit hashes
  2021-02-28 10:00   ` anatoly techtonik
@ 2021-02-28 10:34     ` Ævar Arnfjörð Bjarmason
  2021-03-01  7:44       ` anatoly techtonik
  0 siblings, 1 reply; 21+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-02-28 10:34 UTC (permalink / raw)
  To: anatoly techtonik; +Cc: Elijah Newren, Git Mailing List


On Sun, Feb 28 2021, anatoly techtonik wrote:

> On Sat, Feb 27, 2021 at 8:49 PM Elijah Newren <newren@gmail.com> wrote:
>>
>> Your second commit is signed.  Fast-export strips any extended headers
>> on commits, such as GPG signatures, because there's no way to keep
>> them in general.
>
> Why is it not possible to encode them with base64 and insert into the
> stream?

I think Elijah means that in the general case people are using fast
export/import to export/import between different systems or in
combination with a utility like git-filter-repo.

In those cases users are also changing the content of the repository, so
the hashes will change, invalidating signatures.

But there's also cases where e.g. you don't modify the history, or only
part of it, and could then preserve these headers. I think there's no
inherent reason not to do so, just that nobody's cared enough to submit
patches etc.

>> There are also other things that will prevent a simple fast-export |
>> fast-import pipeline from preserving your history as-is besides signed
>> commits (most of these are noted in the "Inherited Limitations"
>> section over at
>> https://htmlpreview.github.io/?https://github.com/newren/git-filter-repo/blob/docs/html/git-filter-repo.html):
>
> Is there any way to check what commits will be altered as a result of
> `fast-export` and why? Right now I don't see that it is reported.

I don't think so, but not being very familiar with fast export/import I
don't see why it shouldn't have some option to not munge data like that,
or to report it, if someone cared enough to track those issues & patch
it...

>> Hope that at least explains things for you, even if it doesn't give
>> you a workaround or a solution.
>
> Thanks. That is very helpful to know.
>
> The reason I am asking is because I tried to merge two repos with
> `reposurgeon` which operates on `fast-export` data. It is basically
> merging GitHub wiki into main repo,
>
> After successfully merging them I still can not send a PR, because
> it produces a huge amount of changes, because of the stripped info.
> It can be seen here:
>
> https://github.com/simons-public/protonfixes/compare/master...techtonik:master
>
> I tracked this behaviour in `reposurgeon` in this issue
> https://gitlab.com/esr/reposurgeon/-/issues/344



* Re: Round-tripping fast-export/import changes commit hashes
  2021-02-28 10:34     ` Ævar Arnfjörð Bjarmason
@ 2021-03-01  7:44       ` anatoly techtonik
  2021-03-01 17:34         ` Junio C Hamano
                           ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: anatoly techtonik @ 2021-03-01  7:44 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Elijah Newren, Git Mailing List

On Sun, Feb 28, 2021 at 1:34 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> I think Elijah means that in the general case people are using fast
> export/import to export/import between different systems or in
> combination with a utility like git-filter-repo.
>
> In those cases users are also changing the content of the repository, so
> the hashes will change, invalidating signatures.
>
> But there's also cases where e.g. you don't modify the history, or only
> part of it, and could then preserve these headers. I think there's no
> inherent reason not to do so, just that nobody's cared enough to submit
> patches etc.

Is fast-export/import the only way to filter information in `git`? Maybe there
is a slow json-export/import tool that gives a complete representation of all
events in a repository? Or API that can be used to serialize and import that
stream?

If not, then I'd like to take a look at where header filtering and serialization
take place. My C skills are at the "hello world" level, so I am not sure I can
write a patch. But I can write the logic in Python and ask somebody to port
that.
-- 
anatoly t.


* Re: Round-tripping fast-export/import changes commit hashes
  2021-03-01  7:44       ` anatoly techtonik
@ 2021-03-01 17:34         ` Junio C Hamano
  2021-03-02 21:52           ` anatoly techtonik
  2021-03-01 18:06         ` Elijah Newren
  2021-03-01 20:02         ` Ævar Arnfjörð Bjarmason
  2 siblings, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2021-03-01 17:34 UTC (permalink / raw)
  To: anatoly techtonik
  Cc: Ævar Arnfjörð Bjarmason, Elijah Newren,
	Git Mailing List

anatoly techtonik <techtonik@gmail.com> writes:

> Is fast-export/import the only way to filter information in `git`? Maybe there
> is a slow json-export/import tool that gives a complete representation of all
> events in a repository? Or API that can be used to serialize and import that
> stream?

I do not think representation is a problem.

It is just that the output stream of fast-export is designed to be
"filtered" and the expected use case is to modify the stream somehow
before feeding it to fast-import.  And because every object name and
commit & tag signature depends on everything that they can reach,
even a single bit change in an earlier part of the history will
invalidate any and all signatures on objects that can reach it.  So
instead of originally-signed objects whose signatures are now
invalid, "fast-export | fast-import" pipeline would give you
originally-signed objects whose signatures are stripped.

Admittedly, there is a narrow use case where such a signature
invalidation is not an issue.  If you run fast-export and feed that
straight into fast-import without doing any modification to the
stream, then you are getting a bit-for-bit identical copy.

But "git clone --mirror" is a much better way to get such a
bit-for-bit identical history and objects.  And if you want to do so
with sneakernet, you can create a bundle file, sneakernet it to your
destination, and then clone from the bundle.
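That bundle route can be sketched end to end with a throwaway repo (paths here are made up; a single empty commit stands in for real history):

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/src"
git -C "$tmp/src" -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m one
# Bundle everything (HEAD included so a clone can check out),
# sneakernet the single file anywhere, then clone from it:
git -C "$tmp/src" bundle create "$tmp/repo.bundle" HEAD --all
git clone -q "$tmp/repo.bundle" "$tmp/dst"
# The copy is bit-for-bit identical, so the hashes match:
git -C "$tmp/src" rev-parse HEAD
git -C "$tmp/dst" rev-parse HEAD
```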

So...



* Re: Round-tripping fast-export/import changes commit hashes
  2021-03-01  7:44       ` anatoly techtonik
  2021-03-01 17:34         ` Junio C Hamano
@ 2021-03-01 18:06         ` Elijah Newren
  2021-03-01 20:04           ` Ævar Arnfjörð Bjarmason
  2021-03-02 22:12           ` anatoly techtonik
  2021-03-01 20:02         ` Ævar Arnfjörð Bjarmason
  2 siblings, 2 replies; 21+ messages in thread
From: Elijah Newren @ 2021-03-01 18:06 UTC (permalink / raw)
  To: anatoly techtonik
  Cc: Ævar Arnfjörð Bjarmason, Git Mailing List

On Sun, Feb 28, 2021 at 11:44 PM anatoly techtonik <techtonik@gmail.com> wrote:
>
> On Sun, Feb 28, 2021 at 1:34 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
> >
> > I think Elijah means that in the general case people are using fast
> > export/import to export/import between different systems or in
> > combination with a utility like git-filter-repo.
> >
> > In those cases users are also changing the content of the repository, so
> > the hashes will change, invalidating signatures.
> >
> > But there's also cases where e.g. you don't modify the history, or only
> > part of it, and could then preserve these headers. I think there's no
> > inherent reason not to do so, just that nobody's cared enough to submit
> > patches etc.
>
> Is fast-export/import the only way to filter information in `git`? Maybe there
> is a slow json-export/import tool that gives a complete representation of all
> events in a repository? Or API that can be used to serialize and import that
> stream?
>
> If no, then I'd like to take a look at where header filtering and serialization
> takes place. My C skills are at the "hello world" level, so I am not sure I can
> write a patch. But I can write the logic in Python and ask somebody to port
> that.

If you are intent on keeping signatures because you know they are
still valid, then you already know you aren't modifying any
blobs/trees/commits leading up to those signatures.  If that is the
case, perhaps you should just avoid exporting the signature or
anything it depends on, and just export the stuff after that point.
You can do this with fast-export's --reference-excluded-parents option
and pass it an exclusion range.  For example:

   git fast-export --reference-excluded-parents ^master~5 --all

and then pipe that through fast-import.
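A runnable sketch of that suggestion, with a throwaway repo and a hypothetical cutoff of main~1 standing in for master~5 (the target repo must already hold the excluded history, e.g. via a clone):

```shell
set -e
tmp=$(mktemp -d)
git init -q -b main "$tmp/src"
for m in one two three; do
  git -C "$tmp/src" -c user.name=t -c user.email=t@example.com \
      commit -q --allow-empty -m "$m"
done
# A bare copy that already holds everything up to main~1:
git clone -q --bare "$tmp/src" "$tmp/dst.git"
git -C "$tmp/dst.git" update-ref refs/heads/main \
    "$(git -C "$tmp/src" rev-parse main~1)"
# Export only the newer history; excluded parents are referenced
# by their existing hash instead of being re-emitted:
git -C "$tmp/src" fast-export --reference-excluded-parents ^main~1 --all |
  git -C "$tmp/dst.git" fast-import --quiet
# Because the untouched commits cross the pipe unmodified,
# the branch tips agree:
git -C "$tmp/src" rev-parse main
git -C "$tmp/dst.git" rev-parse main
```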


In general, I think if fast-export or fast-import are lacking features
you want, we should add them there, but I don't see how adding
signature reading to fast-import and signature exporting to
fast-export makes sense in general.  Even if you assume fast-import
can process all the bits it is sent (e.g. you extend it to support
commits without an author, tags without a tagger, signed objects, any
other extended commit headers), and even if you add flags to
fast-export to die if there are any bits it doesn't recognize and to
export all pieces of blobs/trees/tags (e.g. don't add missing authors,
don't re-encode messages in UTF-8, don't use grafts or replace
objects, keep extended headers such as signatures, etc.), then it
still couldn't possibly work in all cases in general.  For example, if
you had a repository with unusual objects made by ancient or broken
git versions (such as tree entries in the wrong sort order, or tree
entries that recorded modes of 040000 instead of 40000 for trees or
something with perms other than 100644 or 100755 for files), then when
fast-import goes to recreate these objects using the canonical format
they will no longer have the same hash and your commit signatures will
get invalidated.  Other git commands will also refuse to create
objects with those oddities, even if git accepts ancient objects that
have them.

So, it's basically impossible to have a "complete representation of
all events in a repository" that does what you want except for the
*original* binary format.  (But if you really want to see the original
binary format, maybe `git cat-file --batch` will be handy to you.)

But I think fast-export's --reference-excluded-parents might come in
handy for you and let you do what you want.
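A minimal sketch of that `cat-file --batch` mode, using a throwaway repo:

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo"
git -C "$tmp/repo" -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m init
# Feed object IDs on stdin; get back "<oid> <type> <size>" plus the
# raw, byte-exact object contents -- the original binary format:
git -C "$tmp/repo" rev-parse HEAD | git -C "$tmp/repo" cat-file --batch
```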


* Re: Round-tripping fast-export/import changes commit hashes
  2021-03-01  7:44       ` anatoly techtonik
  2021-03-01 17:34         ` Junio C Hamano
  2021-03-01 18:06         ` Elijah Newren
@ 2021-03-01 20:02         ` Ævar Arnfjörð Bjarmason
  2021-03-02 22:23           ` anatoly techtonik
  2 siblings, 1 reply; 21+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-01 20:02 UTC (permalink / raw)
  To: anatoly techtonik; +Cc: Elijah Newren, Git Mailing List


On Mon, Mar 01 2021, anatoly techtonik wrote:

> On Sun, Feb 28, 2021 at 1:34 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>> I think Elijah means that in the general case people are using fast
>> export/import to export/import between different systems or in
>> combination with a utility like git-filter-repo.
>>
>> In those cases users are also changing the content of the repository, so
>> the hashes will change, invalidating signatures.
>>
>> But there's also cases where e.g. you don't modify the history, or only
>> part of it, and could then preserve these headers. I think there's no
>> inherent reason not to do so, just that nobody's cared enough to submit
>> patches etc.
>
> Is fast-export/import the only way to filter information in `git`? Maybe there
> is a slow json-export/import tool that gives a complete representation of all
> events in a repository? Or API that can be used to serialize and import that
> stream?

Aside from other things mentioned & any issues in fast export/import in
this thread, if you want round-trip correctness you're not going to want
JSON-anything. It's not capable of representing arbitrary binary data.

But in any case, it's not the fast-export format that's the issue, but
how the tools in git.git are munging/rewriting/omitting the repository
data in question...


* Re: Round-tripping fast-export/import changes commit hashes
  2021-03-01 18:06         ` Elijah Newren
@ 2021-03-01 20:04           ` Ævar Arnfjörð Bjarmason
  2021-03-01 20:17             ` Elijah Newren
  2021-03-02 22:12           ` anatoly techtonik
  1 sibling, 1 reply; 21+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-01 20:04 UTC (permalink / raw)
  To: Elijah Newren; +Cc: anatoly techtonik, Git Mailing List


On Mon, Mar 01 2021, Elijah Newren wrote:

> On Sun, Feb 28, 2021 at 11:44 PM anatoly techtonik <techtonik@gmail.com> wrote:
>>
>> On Sun, Feb 28, 2021 at 1:34 PM Ævar Arnfjörð Bjarmason
>> <avarab@gmail.com> wrote:
>> >
>> > I think Elijah means that in the general case people are using fast
>> > export/import to export/import between different systems or in
>> > combination with a utility like git-filter-repo.
>> >
>> > In those cases users are also changing the content of the repository, so
>> > the hashes will change, invalidating signatures.
>> >
>> > But there's also cases where e.g. you don't modify the history, or only
>> > part of it, and could then preserve these headers. I think there's no
>> > inherent reason not to do so, just that nobody's cared enough to submit
>> > patches etc.
>>
>> Is fast-export/import the only way to filter information in `git`? Maybe there
>> is a slow json-export/import tool that gives a complete representation of all
>> events in a repository? Or API that can be used to serialize and import that
>> stream?
>>
>> If no, then I'd like to take a look at where header filtering and serialization
>> takes place. My C skills are at the "hello world" level, so I am not sure I can
>> write a patch. But I can write the logic in Python and ask somebody to port
>> that.
>
> If you are intent on keeping signatures because you know they are
> still valid, then you already know you aren't modifying any
> blobs/trees/commits leading up to those signatures.  If that is the
> case, perhaps you should just avoid exporting the signature or
> anything it depends on, and just export the stuff after that point.
> You can do this with fast-export's --reference-excluded-parents option
> and pass it an exclusion range.  For example:
>
>    git fast-export --reference-excluded-parents ^master~5 --all
>
> and then pipe that through fast-import.
>
>
> In general, I think if fast-export or fast-import are lacking features
> you want, we should add them there, but I don't see how adding
> signature reading to fast-import and signature exporting to
> fast-export makes sense in general.  Even if you assume fast-import
> can process all the bits it is sent (e.g. you extend it to support
> commits without an author, tags without a tagger, signed objects, any
> other extended commit headers), and even if you add flags to
> fast-export to die if there are any bits it doesn't recognize and to
> export all pieces of blobs/trees/tags (e.g. don't add missing authors,
> don't re-encode messages in UTF-8, don't use grafts or replace
> objects, keep extended headers such as signatures, etc.), then it
> still couldn't possibly work in all cases in general.  For example, if
> you had a repository with unusual objects made by ancient or broken
> git versions (such as tree entries in the wrong sort order, or tree
> entries that recorded modes of 040000 instead of 40000 for trees or
> something with perms other than 100644 or 100755 for files), then when
> fast-import goes to recreate these objects using the canonical format
> they will no longer have the same hash and your commit signatures will
> get invalidated.  Other git commands will also refuse to create
> objects with those oddities, even if git accepts ancient objects that
> have them.
>
> > So, it's basically impossible to have a "complete representation of
> > all events in a repository" that does what you want except for the
> *original* binary format.  (But if you really want to see the original
> binary format, maybe `git cat-file --batch` will be handy to you.)
>
> But I think fast-export's --reference-excluded-parents might come in
> handy for you and let you do what you want.

...to add to that line of thinking, it's also a completely valid
technique to just completely rewrite your repository, then (re-)push the
old signed tags to refs/tags/*.

By default they won't be pulled down as they won't reference commits on
branches you're fetching, and you can also stick them somewhere else
than refs/tags/*, e.g. refs/legacy-tags/*.

None of the commit history will be the same, but the content (mostly)
will, which is usually what matters when checking out an old tag.

Of course this hack has little benefit over just keeping a foo-old.git
repo around, and moving on with new history in your new foo.git.
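That tag-preservation idea can be sketched with throwaway repos (names and paths here are made up): fetch the old tags into refs/legacy-tags/* so they sit alongside the rewritten history:

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/old"
git -C "$tmp/old" -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m old-history
git -C "$tmp/old" tag v1.0
git init -q "$tmp/new"
git -C "$tmp/new" -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m rewritten-history
# Pull the old tags into a separate namespace so they neither collide
# with new tags nor get fetched by default alongside the new history:
git -C "$tmp/new" fetch -q "$tmp/old" 'refs/tags/*:refs/legacy-tags/*'
git -C "$tmp/new" for-each-ref refs/legacy-tags
```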


* Re: Round-tripping fast-export/import changes commit hashes
  2021-03-01 20:04           ` Ævar Arnfjörð Bjarmason
@ 2021-03-01 20:17             ` Elijah Newren
  0 siblings, 0 replies; 21+ messages in thread
From: Elijah Newren @ 2021-03-01 20:17 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: anatoly techtonik, Git Mailing List

On Mon, Mar 1, 2021 at 12:04 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Mon, Mar 01 2021, Elijah Newren wrote:
>
> > On Sun, Feb 28, 2021 at 11:44 PM anatoly techtonik <techtonik@gmail.com> wrote:
> >>
> >> On Sun, Feb 28, 2021 at 1:34 PM Ævar Arnfjörð Bjarmason
> >> <avarab@gmail.com> wrote:
> >> >
> >> > I think Elijah means that in the general case people are using fast
> >> > export/import to export/import between different systems or in
> >> > combination with a utility like git-filter-repo.
> >> >
> >> > In those cases users are also changing the content of the repository, so
> >> > the hashes will change, invalidating signatures.
> >> >
> >> > But there's also cases where e.g. you don't modify the history, or only
> >> > part of it, and could then preserve these headers. I think there's no
> >> > inherent reason not to do so, just that nobody's cared enough to submit
> >> > patches etc.
> >>
> >> Is fast-export/import the only way to filter information in `git`? Maybe there
> >> is a slow json-export/import tool that gives a complete representation of all
> >> events in a repository? Or API that can be used to serialize and import that
> >> stream?
> >>
> >> If no, then I'd like to take a look at where header filtering and serialization
> >> takes place. My C skills are at the "hello world" level, so I am not sure I can
> >> write a patch. But I can write the logic in Python and ask somebody to port
> >> that.
> >
> > If you are intent on keeping signatures because you know they are
> > still valid, then you already know you aren't modifying any
> > blobs/trees/commits leading up to those signatures.  If that is the
> > case, perhaps you should just avoid exporting the signature or
> > anything it depends on, and just export the stuff after that point.
> > You can do this with fast-export's --reference-excluded-parents option
> > and pass it an exclusion range.  For example:
> >
> >    git fast-export --reference-excluded-parents ^master~5 --all
> >
> > and then pipe that through fast-import.
> >
> >
> > In general, I think if fast-export or fast-import are lacking features
> > you want, we should add them there, but I don't see how adding
> > signature reading to fast-import and signature exporting to
> > fast-export makes sense in general.  Even if you assume fast-import
> > can process all the bits it is sent (e.g. you extend it to support
> > commits without an author, tags without a tagger, signed objects, any
> > other extended commit headers), and even if you add flags to
> > fast-export to die if there are any bits it doesn't recognize and to
> > export all pieces of blobs/trees/tags (e.g. don't add missing authors,
> > don't re-encode messages in UTF-8, don't use grafts or replace
> > objects, keep extended headers such as signatures, etc.), then it
> > still couldn't possibly work in all cases in general.  For example, if
> > you had a repository with unusual objects made by ancient or broken
> > git versions (such as tree entries in the wrong sort order, or tree
> > entries that recorded modes of 040000 instead of 40000 for trees or
> > something with perms other than 100644 or 100755 for files), then when
> > fast-import goes to recreate these objects using the canonical format
> > they will no longer have the same hash and your commit signatures will
> > get invalidated.  Other git commands will also refuse to create
> > objects with those oddities, even if git accepts ancient objects that
> > have them.
> >
> > So, it's basically impossible to have a "complete representation of
> > all events in a repository" that does what you want except for the
> > *original* binary format.  (But if you really want to see the original
> > binary format, maybe `git cat-file --batch` will be handy to you.)
> >
> > But I think fast-export's --reference-excluded-parents might come in
> > handy for you and let you do what you want.
>
> ...to add to that line of thinking, it's also a completely valid
> technique to just completely rewrite your repository, then (re-)push the
> old signed tags to refs/tags/*.

The repository in question didn't have any signed tags, just a signed commit.

> By default they won't be pulled down as they won't reference commits on
> branches you're fetching, and you can also stick them somewhere else
> than refs/tags/*, e.g. refs/legacy-tags/*.
>
> None of the commit history will be the same, but the content (mostly)
> will, which is usually what matters when checking out an old tag.
>
> Of course this hack has little benefit over just keeping a foo-old.git
> repo around, and moving on with new history in your new foo.git.


* Re: Round-tripping fast-export/import changes commit hashes
  2021-03-01 17:34         ` Junio C Hamano
@ 2021-03-02 21:52           ` anatoly techtonik
  2021-03-03  7:13             ` Johannes Sixt
  0 siblings, 1 reply; 21+ messages in thread
From: anatoly techtonik @ 2021-03-02 21:52 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, Elijah Newren,
	Git Mailing List

On Mon, Mar 1, 2021 at 8:34 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> It is just that the output stream of fast-export is designed to be
> "filtered" and the expected use case is to modify the stream somehow
> before feeding it to fast-import.  And because every object name and
> commit & tag signature depends on everything that they can reach,
> even a single bit change in an earlier part of the history will
> invalidate any and all signatures on objects that can reach it.  So
> instead of originally-signed objects whose signatures are now
> invalid, "fast-export | fast-import" pipeline would give you
> originally-signed objects whose signatures are stripped.

I need to merge two unrelated repos and I am using `reposurgeon`
http://www.catb.org/~esr/reposurgeon/repository-editing.html
to do this while preserving timestamps and commit order. Its model of
operation is to read revisions into memory from git using
fast-export, operate on them, and then rebuild the stream back
into a git repo with fast-import. The problem is that in the exported
dump the information is already lost, and the resulting commits are
"not mergeable". Basically, all GitHub repositories where people
edited `README.md` online become "not mergeable" after this point,
because all GitHub-edited commits are signed.

For my use case, where I just need to attach another branch in
time without altering the original commits in any way, `reposurgeon`
cannot be used.

> Admittedly, there is a narrow use case where such a signature
> invalidation is not an issue.  If you run fast-export and feed that
> straight into fast-import without doing any modification to the
> stream, then you are getting a bit-for-bit identical copy.

I did just that and signatures got stripped, altering history.

  git -C protonfixes fast-export --all --reencode=no | (cd protoimported && git fast-import)

-- 
anatoly t.


* Re: Round-tripping fast-export/import changes commit hashes
  2021-03-01 18:06         ` Elijah Newren
  2021-03-01 20:04           ` Ævar Arnfjörð Bjarmason
@ 2021-03-02 22:12           ` anatoly techtonik
  1 sibling, 0 replies; 21+ messages in thread
From: anatoly techtonik @ 2021-03-02 22:12 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Ævar Arnfjörð Bjarmason, Git Mailing List

On Mon, Mar 1, 2021 at 9:06 PM Elijah Newren <newren@gmail.com> wrote:
> On Sun, Feb 28, 2021 at 11:44 PM anatoly techtonik <techtonik@gmail.com> wrote:
> For example:
>
>    git fast-export --reference-excluded-parents ^master~5 --all
>
> and then pipe that through fast-import.

That may come in handy, but if certain parents are excluded, it will be
impossible to find them in order to reference them and attach branches to them.

> Other git commands will also refuse to create
> objects with those oddities, even if git accepts ancient objects that
> have them.

Are there any `lint` commands that can detect and warn about those
oddities?

> (But if you really want to see the original
> binary format, maybe `git cat-file --batch` will be handy to you.)

Looks good. Is there a way to import it back? And how hard would it be
to write a parser for it? Is there a specification for its fields?
-- 
anatoly t.

* Re: Round-tripping fast-export/import changes commit hashes
  2021-03-01 20:02         ` Ævar Arnfjörð Bjarmason
@ 2021-03-02 22:23           ` anatoly techtonik
  0 siblings, 0 replies; 21+ messages in thread
From: anatoly techtonik @ 2021-03-02 22:23 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Elijah Newren, Git Mailing List

On Mon, Mar 1, 2021 at 11:02 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> Aside from other things mentioned & any issues in fast export/import in
> this thread, if you want round-trip correctness you're not going to want
> JSON-anything. It's not capable of representing arbitrary binary data.

Yes, binary data needs to be explicitly represented in base64 or a similar
encoding, just as ordinary strings will need escape symbols.
-- 
anatoly t.

* Re: Round-tripping fast-export/import changes commit hashes
  2021-03-02 21:52           ` anatoly techtonik
@ 2021-03-03  7:13             ` Johannes Sixt
  2021-03-04  0:55               ` Junio C Hamano
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Sixt @ 2021-03-03  7:13 UTC (permalink / raw)
  To: anatoly techtonik
  Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Git Mailing List

Am 02.03.21 um 22:52 schrieb anatoly techtonik:
> For my use case, where I just need to attach another branch in
> time without altering original commits in any way, `reposurgeon`
> cannot be used.

What do you mean by "attach another branch in time"? Because if you
really do not want to alter original commits in any way, perhaps you
only want `git fetch /the/other/repository master:the-other-one-s-master`?

-- Hannes

* Re: Round-tripping fast-export/import changes commit hashes
  2021-03-03  7:13             ` Johannes Sixt
@ 2021-03-04  0:55               ` Junio C Hamano
  2021-08-09 15:45                 ` anatoly techtonik
  0 siblings, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2021-03-04  0:55 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: anatoly techtonik, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Git Mailing List

Johannes Sixt <j6t@kdbg.org> writes:

> Am 02.03.21 um 22:52 schrieb anatoly techtonik:
>> For my use case, where I just need to attach another branch in
>> time without altering original commits in any way, `reposurgeon`
>> cannot be used.
>
> What do you mean by "attach another branch in time"? Because if you
> really do not want to alter original commits in any way, perhaps you
> only want `git fetch /the/other/repository master:the-other-one-s-master`?

Yeah, I had the same impression.  If a bit-for-bit identical copy of
the original history is needed, then fetching from the original
repository (either directly or via a bundle) would be a much simpler
and performant way.

Thanks.

* Re: Round-tripping fast-export/import changes commit hashes
  2021-03-04  0:55               ` Junio C Hamano
@ 2021-08-09 15:45                 ` anatoly techtonik
  2021-08-09 18:15                   ` Elijah Newren
  0 siblings, 1 reply; 21+ messages in thread
From: anatoly techtonik @ 2021-08-09 15:45 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Sixt, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Git Mailing List

On Thu, Mar 4, 2021 at 3:56 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Johannes Sixt <j6t@kdbg.org> writes:
>
> > Am 02.03.21 um 22:52 schrieb anatoly techtonik:
> >> For my use case, where I just need to attach another branch in
> >> time without altering original commits in any way, `reposurgeon`
> >> cannot be used.
> >
> > What do you mean by "attach another branch in time"? Because if you
> > really do not want to alter original commits in any way, perhaps you
> > only want `git fetch /the/other/repository master:the-other-one-s-master`?
>
> Yeah, I had the same impression.  If a bit-for-bit identical copy of
> the original history is needed, then fetching from the original
> repository (either directly or via a bundle) would be a much simpler
> and performant way.

The goal is to have an editable stream, which, if left without edits, would
be bit-by-bit identical, so that external tools like `reposurgeon` could
operate on that stream and be audited.

Right now, because the repository
https://github.com/simons-public/protonfixes contains a signed commit
right from the start, the simple fast-export and fast-import with git itself
fails the check.

I understand that patching `git` to add `--complete` to fast-import is
realistically beyond my coding abilities, and my only option is to parse
the binary stream produced by `git cat-file --batch`, which I also won't
be able to do without a specification.

P.S. I am resurrecting the old thread, because my problem with editing
the history of the repository with an external tool still cannot be solved.
-- 
anatoly t.

* Re: Round-tripping fast-export/import changes commit hashes
  2021-08-09 15:45                 ` anatoly techtonik
@ 2021-08-09 18:15                   ` Elijah Newren
  2021-08-10 15:51                     ` anatoly techtonik
  0 siblings, 1 reply; 21+ messages in thread
From: Elijah Newren @ 2021-08-09 18:15 UTC (permalink / raw)
  To: anatoly techtonik
  Cc: Junio C Hamano, Johannes Sixt,
	Ævar Arnfjörð Bjarmason, Git Mailing List

On Mon, Aug 9, 2021 at 8:45 AM anatoly techtonik <techtonik@gmail.com> wrote:
>
> On Thu, Mar 4, 2021 at 3:56 AM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > Johannes Sixt <j6t@kdbg.org> writes:
> >
> > > Am 02.03.21 um 22:52 schrieb anatoly techtonik:
> > >> For my use case, where I just need to attach another branch in
> > >> time without altering original commits in any way, `reposurgeon`
> > >> cannot be used.
> > >
> > > What do you mean by "attach another branch in time"? Because if you
> > > really do not want to alter original commits in any way, perhaps you
> > > only want `git fetch /the/other/repository master:the-other-one-s-master`?
> >
> > Yeah, I had the same impression.  If a bit-for-bit identical copy of
> > the original history is needed, then fetching from the original
> > repository (either directly or via a bundle) would be a much simpler
> > and performant way.
>
> The goal is to have an editable stream, which, if left without edits, would
> be bit-by-bit identical, so that external tools like `reposurgeon` could
> operate on that stream and be audited.

There were some patches proposed some months back[1] to make
fast-import allow importing signed commits...except that they
unconditionally kept the signatures and didn't do any validation,
which would have resulted in invalid signatures if any edits happened.
I suggested adding signature verification (which would allow options
like erroring out if they didn't match, or dropping signatures when
they didn't match but keeping them otherwise).  That'd help usecases
like yours.  The author wasn't interested in implementing that
suggestion (and it's a low priority for me that I may never get around
to).  The series also wasn't pushed through and eventually was
dropped.

However, that wouldn't fully solve your stated goal.  As already
mentioned earlier in this thread, I don't think your stated goal is
realistic; the only complete bit-for-bit identical representation of
the repository is the original binary format.

Your stated goal here, however, isn't required for solving the usecase
you present.

[1] https://lore.kernel.org/git/20210430232537.1131641-1-lukeshu@lukeshu.com/

> Right now, because the repository
> https://github.com/simons-public/protonfixes contains a signed commit
> right from the start, the simple fast-export and fast-import with git itself
> fails the check.

Yes, and I mentioned several other reasons why a round-trip from
fast-export through fast-import cannot be relied upon to preserve
object hashes.

> I understand that patching `git` to add `--complete` to fast-import is
> realistically beyond my coding abilities, and my only option is to parse

More patching than that would be required:
(1) It'd be both fast-export and fast-import that would need patching,
not just fast-import.
(2) --complete is a bit of a misnomer too, because it's not just
get-all-the-data, it's keep-the-data-in-the-original-format.  If
objects had modes of 040000 instead of 40000, despite meaning the same
thing, you'd have to prevent canonicalization and store them as the
original recorded value or you'd get a different hash.  Ditto for
commit messages with extra data after a NUL byte, and a variety of
other possible issues.
(3) fast-export works by looking for the relevant bits it knows how to
export.  You'd have to redesign it to fully parse every bit of data in
each object it looks at, throw errors if it didn't recognize any, and
make sure it exports all the bits.  That might be difficult since it's
hard to know how to future proof it.  How do you guarantee you've
printed every field in a commit struct, when that struct might gain
new fields in the future?  (This is especially challenging since
fast-export/fast-import might not be considered core tools, or at
least don't get as much attention as the "truly core" parts of git;
see https://lore.kernel.org/git/xmqq36mxdnpz.fsf@gitster-ct.c.googlers.com/)

> the binary stream produced by `git cat-file --batch`, which I also won't
> be able to do without specification.

The specification is already available in the manual.  Just run `git
cat-file --help` to see it.  Let me quote part of it for you:

       For example, --batch without a custom format would produce:

           <sha1> SP <type> SP <size> LF
           <contents> LF

> P.S. I am resurrecting the old thread, because my problem with editing
> the history of the repository with an external tool still cannot be solved.

Sure it can, just use fast-export's --reference-excluded-parents
option and don't export commits you know you won't need to change.

Or, if for some reason you are really set on exporting everything and
then editing, then go ahead and create the full fast-export output,
including with all your edits, and then post-process it manually
before feeding to fast-import.  In particular, in the post-processing
step find the commits that were problematic that you know won't be
modified, such as your signed commit.  Then go edit that fast-export
dump and (a) remove the dump of the no-longer-signed signed commit
(because you don't want it), and (b) replace any references to the
no-longer-signed-commit (e.g. "from :12") to instead use the hash of
the actual original signed commit (e.g. "from
d3d24b63446c7d06586eaa51764ff0c619113f09").  If you do that, then git
fast-import will just build the new commits on the existing signed
commit instead of on some new commit that is missing the signature.
Technically, you can even skip step (a), as all it will do is produce
an extra commit in your repository that isn't used and thus will be
garbage collected later.

* Re: Round-tripping fast-export/import changes commit hashes
  2021-08-09 18:15                   ` Elijah Newren
@ 2021-08-10 15:51                     ` anatoly techtonik
  2021-08-10 17:57                       ` Elijah Newren
  0 siblings, 1 reply; 21+ messages in thread
From: anatoly techtonik @ 2021-08-10 15:51 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Junio C Hamano, Johannes Sixt,
	Ævar Arnfjörð Bjarmason, Git Mailing List

On Mon, Aug 9, 2021 at 9:15 PM Elijah Newren <newren@gmail.com> wrote:
>
> The author wasn't interested in implementing that
> suggestion (and it's a low priority for me that I may never get around
> to).  The series also wasn't pushed through and eventually was
> dropped.

What would it take to validate the commit signature? Isn't it the same as
validating a tag signature? Is it possible to merge at least the `--fast-export`
part? The effect of the roundtrip would be the same, but at least external
tools would be able to detect signed commits and warn users.

> [1] https://lore.kernel.org/git/20210430232537.1131641-1-lukeshu@lukeshu.com/

> Yes, and I mentioned several other reasons why a round-trip from
> fast-export through fast-import cannot be relied upon to preserve
> object hashes.

Yes, I understand that. What would be the recommended way to detect
which commits would change as a result of the round-trip? It will then
be possible to warn users in the `reposurgeon` `lint` command.

> (3) fast-export works by looking for the relevant bits it knows how to
> export.  You'd have to redesign it to fully parse every bit of data in
> each object it looks at, throw errors if it didn't recognize any, and
> make sure it exports all the bits.  That might be difficult since it's
> hard to know how to future proof it.  How do you guarantee you've
> printed every field in a commit struct, when that struct might gain
> new fields in the future?  (This is especially challenging since
> fast-export/fast-import might not be considered core tools, or at
> least don't get as much attention as the "truly core" parts of git;
> see https://lore.kernel.org/git/xmqq36mxdnpz.fsf@gitster-ct.c.googlers.com/)

Looks like the only way to make it forward compatible is to introduce
some kind of versioning and a validation schema like protobuf. Otherwise
writing an importer and exporter for everything that might be
encountered in a git stream may be unrealistic, yes.

> > P.S. I am resurrecting the old thread, because my problem with editing
> > the history of the repository with an external tool still cannot be solved.
>
> Sure it can, just use fast-export's --reference-excluded-parents
> option and don't export commits you know you won't need to change.

How does `--reference-excluded-parents` help to read signed commits?

`reposurgeon` needs all commits to select those that are needed by
different criteria. It is hard to tell which commits are not important without
reading and processing them first.

> Or, if for some reason you are really set on exporting everything and
> then editing, then go ahead and create the full fast-export output,
> including with all your edits, and then post-process it manually
> before feeding to fast-import.  In particular, in the post-processing
> step find the commits that were problematic that you know won't be
> modified, such as your signed commit.  Then go edit that fast-export
> dump and (a) remove the dump of the no-longer-signed signed commit
> (because you don't want it), and (b) replace any references to the
> no-longer-signed-commit (e.g. "from :12") to instead use the hash of
> the actual original signed commit (e.g. "from
> d3d24b63446c7d06586eaa51764ff0c619113f09").  If you do that, then git
> fast-import will just build the new commits on the existing signed
> commit instead of on some new commit that is missing the signature.
> Technically, you can even skip step (a), as all it will do is produce
> an extra commit in your repository that isn't used and thus will be
> garbage collected later.

The problem is detecting the problematic signed commits, because as I
understand it, `fast-export` gives no indication of whether commits were
signed before the export.
-- 
anatoly t.

* Re: Round-tripping fast-export/import changes commit hashes
  2021-08-10 15:51                     ` anatoly techtonik
@ 2021-08-10 17:57                       ` Elijah Newren
  2022-12-11 18:30                         ` anatoly techtonik
  0 siblings, 1 reply; 21+ messages in thread
From: Elijah Newren @ 2021-08-10 17:57 UTC (permalink / raw)
  To: anatoly techtonik
  Cc: Junio C Hamano, Johannes Sixt,
	Ævar Arnfjörð Bjarmason, Git Mailing List

On Tue, Aug 10, 2021 at 8:51 AM anatoly techtonik <techtonik@gmail.com> wrote:
>
> On Mon, Aug 9, 2021 at 9:15 PM Elijah Newren <newren@gmail.com> wrote:
> >
> > The author wasn't interested in implementing that
> > suggestion (and it's a low priority for me that I may never get around
> > to).  The series also wasn't pushed through and eventually was
> > dropped.
>
> What would it take to validate the commit signature?

I'm not familiar with any of the gpg libraries, and don't even have an
active gpg key.  So, I don't know.  Some quick grepping shows that we
have gpg-interface.[ch], so we have some functions we can apparently
call.

> Isn't it the same as validating a tag signature?

gpg signatures of tags are somewhat different than gpg signatures of commits:

* gpg signatures for tags are simply part of the annotated tag message
* gpg signatures for commits are stored in a separate commit header,
not just as extra text at the end of the commit message

This gpg signature handling for tags means that fast-import isn't even
aware of whether the tag is signed; it simply sees a commit message
and records it.  fast-export also would have been unaware and just
exported them as-is if someone hadn't written some special parsing for
it.  fast-import would need to do similar special parsing to become
aware of whether the tags are signed or not.  For now, fast-import
just keeps any tag messages as-is, and thus potentially writes invalid
tag signatures.  (The only way people have to control this is at the
fast-export side with the --signed-tags flag, which gives you the
choices of abort, strip, or keep the signatures even though they'll
likely be wrong.)  If fast-import were to gain knowledge of tag
signatures and an ability to validate them, it could offer smarter
options like keep-if-valid-and-discard-otherwise.

In contrast, the fact that gpg signatures for commits have to be
recorded as a separate commit header means they cannot be recorded in
fast-import without additional code changes.  And both the fast-export
and fast-import sides have to be made aware of and specially handle
the commit signatures for them to even get propagated, let alone
validated.

> Is it possible to merge at least the `--fast-export`
> part? The effect of the roundtrip would be the same, but at least external
> tools would be able to detect signed commits and warn users.

The fact that it wasn't merged suggests there was some issue raised in
feedback that wasn't addressed.  I don't remember if that was the case
or not, but someone would have to find out, address any remaining
issues pointed out by feedback, and champion it through.

Personally, I don't like shoving a half solution through and think
there needs to be validation on the fast-import side added at the same
time, but others may disagree with me.  I have plenty of other
projects to work on, though, so whoever does the work will more likely
be the ones to decide.

> > [1] https://lore.kernel.org/git/20210430232537.1131641-1-lukeshu@lukeshu.com/
>
> > Yes, and I mentioned several other reasons why a round-trip from
> > fast-export through fast-import cannot be relied upon to preserve
> > object hashes.
>
> Yes, I understand that. What would be the recommended way to detect
> which commits would change as a result of the round-trip? It will then
> be possible to warn users in the `reposurgeon` `lint` command.

There is no function or command that would check that kind of thing
short of doing the round-trip.  I provided a list of reasons IDs could
change as a starting point in case anyone wanted to try to write a
function or command that could check, and to point out that it is a
long list and might grow in the future.

I think practically, if you're doing a one-shot export (as I
originally assumed from your email), that you'd find out and then just
manually fix things up by hand.  If your goal is writing or changing a
general purpose filtering tool, then I'd suggest instead using the
alternate technique I outlined in the other thread you started at [2].

[2] https://lore.kernel.org/git/CABPp-BH4dcsW52immJpTjgY5LjaVfKrY9MaUOnKT3byi2tBPpg@mail.gmail.com/

> > (3) fast-export works by looking for the relevant bits it knows how to
> > export.  You'd have to redesign it to fully parse every bit of data in
> > each object it looks at, throw errors if it didn't recognize any, and
> > make sure it exports all the bits.  That might be difficult since it's
> > hard to know how to future proof it.  How do you guarantee you've
> > printed every field in a commit struct, when that struct might gain
> > new fields in the future?  (This is especially challenging since
> > fast-export/fast-import might not be considered core tools, or at
> > least don't get as much attention as the "truly core" parts of git;
> > see https://lore.kernel.org/git/xmqq36mxdnpz.fsf@gitster-ct.c.googlers.com/)
>
> Looks like the only way to make it forward compatible is to introduce
> some kind of versioning and a validation schema like protobuf. Otherwise
> writing an importer and exporter for everything that might be
> encountered in a git stream may be unrealistic, yes.
>
> > > P.S. I am resurrecting the old thread, because my problem with editing
> > > the history of the repository with an external tool still cannot be solved.
> >
> > Sure it can, just use fast-export's --reference-excluded-parents
> > option and don't export commits you know you won't need to change.
>
> How does `--reference-excluded-parents` help to read signed commits?

It doesn't.  I was assuming you were doing a one-shot export, namely
of the repository you linked to,
https://github.com/simons-public/protonfixes, and that you already
knew which commits were not going to be changed (because you pointed
them out in your email to the list) -- and in fact that it was only a
single commit affected, as you mentioned.

Armed with that knowledge, you could just export the parts of the
repository AFTER that commit, and use --reference-excluded-parents to
make sure the fast-export stream built upon them rather than squashing
all changes up to that point into the first commit in the stream.

> `reposurgeon` needs all commits to select those that are needed by
> different criteria. It is hard to tell which commits are not important without
> reading and processing them first.

Right, so you aren't trying to just handle this one repository, but
modify/create a general purpose tool that does so.  See my response in
the other thread you started, again at [2] above.

> > Or, if for some reason you are really set on exporting everything and
> > then editing, then go ahead and create the full fast-export output,
> > including with all your edits, and then post-process it manually
> > before feeding to fast-import.  In particular, in the post-processing
> > step find the commits that were problematic that you know won't be
> > modified, such as your signed commit.  Then go edit that fast-export
> > dump and (a) remove the dump of the no-longer-signed signed commit
> > (because you don't want it), and (b) replace any references to the
> > no-longer-signed-commit (e.g. "from :12") to instead use the hash of
> > the actual original signed commit (e.g. "from
> > d3d24b63446c7d06586eaa51764ff0c619113f09").  If you do that, then git
> > fast-import will just build the new commits on the existing signed
> > commit instead of on some new commit that is missing the signature.
> > Technically, you can even skip step (a), as all it will do is produce
> > an extra commit in your repository that isn't used and thus will be
> > garbage collected later.
>
> The problem is detecting the problematic signed commits, because as I
> understand it, `fast-export` gives no indication of whether commits were
> signed before the export.

Signed commits is just one issue, and you'll have to add special code
to handle a bunch of other special cases if you go down this route.
I'd rephrase the problem.  You want to know when _your tool_ (e.g.
reposurgeon since you refer to it multiple times; I'm guessing you're
contributing to it?) has not modified a commit or any of its
ancestors, and when it hasn't, then _your tool_ should remove that
commit from the fast-export stream and replace any references to it by
the original commit's object id.  I outlined how to do this in [2],
referenced above, making use of the --show-original-ids flag to
fast-export.  If you do that, then for any commits which you haven't
modified (including not modifying any of its ancestors), then you'll
keep the same commits as-is with no stripping of gpg-signatures or
canonicalization of objects, so that you'll have the exact same commit
IDs.  Further, you can do this today, without any changes to git
fast-export or git fast-import.

* Re: Round-tripping fast-export/import changes commit hashes
  2021-08-10 17:57                       ` Elijah Newren
@ 2022-12-11 18:30                         ` anatoly techtonik
  2023-01-13  7:21                           ` Elijah Newren
  0 siblings, 1 reply; 21+ messages in thread
From: anatoly techtonik @ 2022-12-11 18:30 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Junio C Hamano, Johannes Sixt,
	Ævar Arnfjörð Bjarmason, Git Mailing List

On Tue, Aug 10, 2021 at 8:58 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Tue, Aug 10, 2021 at 8:51 AM anatoly techtonik <techtonik@gmail.com> wrote:
> >
> > On Mon, Aug 9, 2021 at 9:15 PM Elijah Newren <newren@gmail.com> wrote:
> > >
>
> [2] https://lore.kernel.org/git/CABPp-BH4dcsW52immJpTjgY5LjaVfKrY9MaUOnKT3byi2tBPpg@mail.gmail.com/
>
> Signed commits is just one issue, and you'll have to add special code
> to handle a bunch of other special cases if you go down this route.
> I'd rephrase the problem.  You want to know when _your tool_ (e.g.
> reposurgeon since you refer to it multiple times; I'm guessing you're
> contributing to it?) has not modified a commit or any of its
> ancestors, and when it hasn't, then _your tool_ should remove that
> commit from the fast-export stream and replace any references to it by
> the original commit's object id.  I outlined how to do this in [2],
> referenced above, making use of the --show-original-ids flag to
> fast-export.  If you do that, then for any commits which you haven't
> modified (including not modifying any of its ancestors), then you'll
> keep the same commits as-is with no stripping of gpg-signatures or
> canonicalization of objects, so that you'll have the exact same commit
> IDs.  Further, you can do this today, without any changes to git
> fast-export or git fast-import.

It took me a while to process the reply. Let's recap.

I want to make a roundtrip export/import of
https://github.com/simons-public/protonfixes, which should yield exactly
the same repository.

# --- fast-export to exported.txt
git clone https://github.com/simons-public/protonfixes
git -C protonfixes fast-export --all > exported.txt
# --- check revision of the repo
git -C protonfixes rev-parse HEAD
# 681411ba8ceb5d2d790e674eb7a5b98951d426e6

# --- fast-import into new repo
git init newrepo
git -C newrepo fast-import < exported.txt
# --- checking revision of the new repo
git -C newrepo rev-parse HEAD
# 9888762d7857d9721f0c354e7fc187a199754a4b

Hashes don't match. The roundtrip fails.


Let's see if --reference-excluded-parents helps.

# --- export below produces the same export stream as above
git -C protonfixes fast-export --reference-excluded-parents --all > exported_parents.txt


Because fast-import/fast-export don't work, you propose to keep the old
repo around until it is clear which commits I am going to modify. Then
make a new fast-export starting from the first commit I am going to
modify, with the --reference-excluded-parents flag. Is that correct so far?

Then, given this partial export and the old repo, how do I init the new
repo so that fast-import can apply its tail there?

What if I modify multiple commits, but don't know which of their parents
came first? And when I touch commits from different branches, how do I
recreate their parent history intact in one repo?

-- 
anatoly t.

* Re: Round-tripping fast-export/import changes commit hashes
  2022-12-11 18:30                         ` anatoly techtonik
@ 2023-01-13  7:21                           ` Elijah Newren
  0 siblings, 0 replies; 21+ messages in thread
From: Elijah Newren @ 2023-01-13  7:21 UTC (permalink / raw)
  To: anatoly techtonik
  Cc: Junio C Hamano, Johannes Sixt,
	Ævar Arnfjörð Bjarmason, Git Mailing List

On Sun, Dec 11, 2022 at 10:30 AM anatoly techtonik <techtonik@gmail.com> wrote:
>
> On Tue, Aug 10, 2021 at 8:58 PM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Tue, Aug 10, 2021 at 8:51 AM anatoly techtonik <techtonik@gmail.com> wrote:
> > >
> > > On Mon, Aug 9, 2021 at 9:15 PM Elijah Newren <newren@gmail.com> wrote:
> > > >
> >
> > [2] https://lore.kernel.org/git/CABPp-BH4dcsW52immJpTjgY5LjaVfKrY9MaUOnKT3byi2tBPpg@mail.gmail.com/
> >
> > Signed commits is just one issue, and you'll have to add special code
> > to handle a bunch of other special cases if you go down this route.
> > I'd rephrase the problem.  You want to know when _your tool_ (e.g.
> > reposurgeon since you refer to it multiple times; I'm guessing you're
> > contributing to it?) has not modified a commit or any of its
> > ancestors, and when it hasn't, then _your tool_ should remove that
> > commit from the fast-export stream and replace any references to it by
> > the original commit's object id.  I outlined how to do this in [2],
> > referenced above, making use of the --show-original-ids flag to
> > fast-export.  If you do that, then for any commits which you haven't
> > modified (including not modifying any of its ancestors), then you'll
> > keep the same commits as-is with no stripping of gpg-signatures or
> > canonicalization of objects, so that you'll have the exact same commit
> > IDs.  Further, you can do this today, without any changes to git
> > fast-export or git fast-import.
>
> Took me a while to process the reply. Let's recap.
>
> I want to make a roundtrip export/import of
> https://github.com/simons-public/protonfixes, which should yield exactly
> the same repository.

As I've stated a few times in the thread, this request of yours is
simply impossible for general repositories ([1] contains the best
summary of the reasons).  For the specific repository in question, the
only relevant roadblock is the presence of a signed commit which
happens to be a root commit.  That opens the door to some workarounds
that could be used with this specific repository.

[1] https://lore.kernel.org/git/CABPp-BGDB6jj+Et44D6D22KXprB89dNpyS_AAu3E8vOCtVaW1A@mail.gmail.com/

I provided two workarounds you could try to use for your specific case
at [2] and [3], one of which you ask about below.

[2] https://lore.kernel.org/git/CABPp-BE=9wzF6_VypoR-uEPHsLWdV7zyE13FOgLK0h8NOcMz3g@mail.gmail.com/
[3]  https://lore.kernel.org/git/CABPp-BH4dcsW52immJpTjgY5LjaVfKrY9MaUOnKT3byi2tBPpg@mail.gmail.com/

> # --- fast-export to exported.txt
> git clone https://github.com/simons-public/protonfixes
> git -C protonfixes fast-export --all > exported.txt
> # --- check revision of the repo
> git -C protonfixes rev-parse HEAD
> # 681411ba8ceb5d2d790e674eb7a5b98951d426e6
>
> # --- fast-import into new repo
> git init newrepo
> git -C newrepo fast-import < exported.txt
> # --- checking revision of the new repo
> git -C newrepo rev-parse HEAD
> # 9888762d7857d9721f0c354e7fc187a199754a4b
>
> Hashes don't match. The roundtrip fails.

As expected, given that one of your commits is signed.

> Let's see if --reference-excluded-parents helps.
>
> # --- export below produces the same export stream as above
> git -C protonfixes fast-export --reference-excluded-parents --all >
> exported_parents.txt

--reference-excluded-parents only has effect if there are excluded
parents.  You didn't exclude any parents, so obviously adding this
flag isn't going to change anything.  You should instead first
clone/fetch the part of history up to the commits you want to keep
intact (e.g. the signed commits), and then run a command like
   git -C protonfixes fast-export --reference-excluded-parents \
       ^${BASECOMMIT1} ^${BASECOMMIT2} ^${BASECOMMITN} --all |
       git -C newrepo fast-import

Note that the examples I gave you (e.g. [2] above) all used some
excluded references (e.g. "^master~5").
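To see the flag's effect in isolation, here is a hypothetical toy demo
(a throwaway repository with invented names, not your protonfixes
clone): without an excluded range the flag is a no-op, and with one the
stream refers to the excluded parent by its full object id.

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d); cd "$tmp"

# Toy two-commit repository (all names invented for this sketch).
git init -q -b master demo
git -C demo -c user.name=A -c user.email=a@example.com \
    commit -q --allow-empty -m base
git -C demo -c user.name=A -c user.email=a@example.com \
    commit -q --allow-empty -m tip
base=$(git -C demo rev-parse master~1)

# Without the flag, the excluded parent is simply dropped and "tip"
# is exported as a new root commit.
git -C demo fast-export ^master~1 master > plain.txt

# With the flag, the stream names the excluded parent by object id.
git -C demo fast-export --reference-excluded-parents ^master~1 master > ref.txt
grep "from $base" ref.txt
```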

> Because fast-import/fast-export don't work

You have not yet identified a bug in either, so I disagree with this comment.

>, you propose to keep the old
> repo around until it is clear which commits I am going to modify.


This framing looks really weird to me.  You have posed your
problem in the form of doing some kind of export/import operation,
which is fine.  However, in order to do an export operation, you
obviously need the repository in order to export it.  So why are you
calling out that you keep the repo around until you run the
fast-export command?

Anyway, that aside...

I was just saying that:
  (1) signed commits exist as a way to assure other users that the
commits have not been modified;
  (2) fast-export and fast-import exist to allow you to modify history
in some fashion (and are separate steps so people can edit the stream
between running the two commands);
  (3) the above two imply that if you still want users to be able to
verify the signed commits, then signed commits should NOT be sent
through fast-export and fast-import;
  (4) therefore, if you want the signed commits kept as-is, you should
simply fetch the history up to and including those commits, and only
send the remainder of the history through fast-export/fast-import.

But I will add here one additional thing:

If you're weaving repositories together, that likely changes the
parent(s) of some of the commits.  Once you change the parent(s) of a
commit, that alone changes the commit and invalidates any signature it
has.  In your case you seem to only have a root commit that is signed,
and if you keep that signed commit as a root commit, you can avoid
this problem.  But, in general, if signed commits are involved in the
weaving such that they gain new parents, then what you want to do is
simply impossible; you will not be able to keep the signatures in such
a case (and the commit ids will change as well).

> Then
> make a new fast-export starting from the first commit I am going to
> modify with --reference-excluded-parents flag. Is that correct so far?

You have the basic idea, but you are making things excessively complex
with one detail here -- it does not need to start with the first
commit you are going to modify; it can start earlier.  You can simply
export all commits after the one(s) you know you don't want to change.
For example, if the history looks like this:

A---B---C---D---E---F

and commits A and B are the only signed commits (which you want to
preserve) and commit D is the first one you are going to modify, you
could still run fast-export on "^A ^B F" (i.e. C, D, E, and F in this
case) -- that will also include C, but C isn't signed and round-trips
without problems, so it doesn't hurt to include it.
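As a hypothetical sketch of that A-through-F history (a toy repository
of empty commits; every name here is invented), exporting with the
signed commits excluded yields exactly the four newer commits in the
stream:

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d); cd "$tmp"

# Linear history A---B---C---D---E---F, as in the diagram above.
git init -q -b master demo
for msg in A B C D E F; do
    git -C demo -c user.name=A -c user.email=a@example.com \
        commit -q --allow-empty -m "$msg"
done

# Exclude B and its ancestor A (master~4 and below); export C, D, E, F.
git -C demo fast-export ^master~4 master > stream.txt
exported=$(grep -c '^commit ' stream.txt)
echo "$exported commits exported"
```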

> Then given this partial export and old repo, how to init the new repo
> that fast-import can apply its tail there?

Flag the signed commit(s) with a branch or branches of some sort, then
fetch just those branches into the new repo.

> What if I have multiple commits that I modify, but I don't know which
> of their parents was first?

I wouldn't bother trying to figure out which one(s) is/are first.
(You could do some revision walking to figure that out, but then you'd
have to fetch not just the history of the signed commits you want to
keep, but everything prior to whichever commit(s) you first want to
modify.)

Instead, I'd just do the easier thing I noted above -- use the signed
commits as exclusion markers.

> And when I touch commits from different
> branches, how to recreate their parent history intact in one repo?

Place temporary branches pointing directly to each of the signed
commits you want to keep intact (which also implies you are keeping
all the history behind those commits intact as well), then run:

  git -C newrepo fetch PATH_OR_URL_OF_OLD_REPO ${TEMPBRANCH1} \
      ${TEMPBRANCH2} ${TEMPBRANCHN}

Then use the earlier suggestion of

  git -C protonfixes fast-export --reference-excluded-parents \
      ^${TEMPBRANCH1} ^${TEMPBRANCH2} ^${TEMPBRANCHN} --all |
      git -C newrepo fast-import

to get the remainder of the history exported/imported.
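Putting the pieces together, here is a hypothetical end-to-end sketch
on a throwaway two-commit repository (a stand-in for protonfixes; all
names below are invented).  The root commit plays the role of the
signed commit kept intact via a temporary branch, and the re-imported
tip ends up with the same commit id:

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d); cd "$tmp"

# Throwaway source repository with two commits.
git init -q -b master old
git -C old -c user.name=A -c user.email=a@example.com \
    -c commit.gpgsign=false \
    commit -q --allow-empty -m 'root (pretend this one is signed)'
git -C old -c user.name=A -c user.email=a@example.com \
    -c commit.gpgsign=false \
    commit -q --allow-empty -m 'second'

# Flag the commit to keep intact with a temporary branch...
git -C old branch keep master~1

# ...and fetch just that branch into the new repository.
git init -q -b master new
git -C new fetch -q ../old keep:refs/heads/keep

# Export everything after it; the excluded parent is referenced by its
# literal object id, which fast-import resolves against the objects
# fetched above.
git -C old fast-export --reference-excluded-parents ^keep master |
    git -C new fast-import --quiet

git -C old rev-parse master
git -C new rev-parse master
```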



I will also add that since you are interested in attempting to
round-trip through fast-export/fast-import and still end up with the
same hashes (ignoring a few fundamental shortcomings mentioned earlier
in this thread that won't always permit this to work), you can at
least get closer by adding "--reencode=no" to fast-export (so that it
doesn't alter commit messages) and setting core.ignorecase=false for
at least the fast-import invocation (so that fast-import doesn't make
files which differ only in case clobber each other while importing).
But, again, that only addresses two issues out of half a dozen; see
the link at [1] earlier in this email.
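For instance, "--reencode=no" alone already rescues commits whose
messages are recorded in a non-UTF-8 encoding.  A hypothetical toy demo
(invented names) where a round trip preserves the commit id of an
ISO-8859-1 commit:

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d); cd "$tmp"

# One commit whose message is stored as ISO-8859-1 ("café" in latin1),
# which records an "encoding" header in the commit object.
git init -q -b master old
git -C old -c user.name=A -c user.email=a@example.com \
    -c commit.gpgsign=false -c i18n.commitEncoding=ISO-8859-1 \
    commit -q --allow-empty -m "$(printf 'caf\351')"

# --reencode=no passes the message bytes and encoding header through
# unchanged, so fast-import can recreate the identical commit.
git init -q -b master new
git -C old fast-export --reencode=no --all | git -C new fast-import --quiet

git -C old rev-parse master
git -C new rev-parse master
```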


end of thread, other threads:[~2023-01-13  7:27 UTC | newest]

Thread overview: 21+ messages
2021-02-27 12:31 Round-tripping fast-export/import changes commit hashes anatoly techtonik
2021-02-27 17:48 ` Elijah Newren
2021-02-28 10:00   ` anatoly techtonik
2021-02-28 10:34     ` Ævar Arnfjörð Bjarmason
2021-03-01  7:44       ` anatoly techtonik
2021-03-01 17:34         ` Junio C Hamano
2021-03-02 21:52           ` anatoly techtonik
2021-03-03  7:13             ` Johannes Sixt
2021-03-04  0:55               ` Junio C Hamano
2021-08-09 15:45                 ` anatoly techtonik
2021-08-09 18:15                   ` Elijah Newren
2021-08-10 15:51                     ` anatoly techtonik
2021-08-10 17:57                       ` Elijah Newren
2022-12-11 18:30                         ` anatoly techtonik
2023-01-13  7:21                           ` Elijah Newren
2021-03-01 18:06         ` Elijah Newren
2021-03-01 20:04           ` Ævar Arnfjörð Bjarmason
2021-03-01 20:17             ` Elijah Newren
2021-03-02 22:12           ` anatoly techtonik
2021-03-01 20:02         ` Ævar Arnfjörð Bjarmason
2021-03-02 22:23           ` anatoly techtonik
