how to (integrity) verify a whole git repo

git@vger.kernel.org mailing list mirror (one of many)

* how to (integrity) verify a whole git repo
@ 2020-04-21  4:45 Christoph Anton Mitterer
  2020-04-21  6:53 ` Jonathan Nieder
  2020-04-21 19:14 ` Junio C Hamano
  0 siblings, 2 replies; 7+ messages in thread
From: Christoph Anton Mitterer @ 2020-04-21  4:45 UTC (permalink / raw)
  To: git

Hi.

It seems I couldn't really find any definitive answer on the
following:

How to cryptographically verify the integrity of a whole git repo (i.e.
all its commits/blobs/etc. in the history)?

Assume e.g. I have the kernel sources and want to do some bisection.
One has also retrieved Linus' and GregKH's key via some trusted path
and assumes that SHA1 is more or less still safe enough ;-)

1) Of course there is git verify-tag and verify-commit, which check GPG
signatures, but these alone check, AFAIU, only the respective
tag/commit.

How to check everything else? Is it enough to git fsck --full?

Everything earlier in the history of a verified tag/commit should be
cryptographically safe (assuming SHA1 would be still secure enough),
right?

2) But this of course won't show me anything which is in the repo but
not earlier in the history of the tag/commit I've checked, right?!
Is there a way to e.g. have everything dropped which is not verifiable
via some signed commit/tag?

3) I'd assume that normal operations like checkout/bisect/etc. notice
if some SHA1 sum doesn't match. So once I've verified say kernel v.5.6
tag, I could checkout everything in the history of that and be sure it
wasn't modified, right?
But of course this wouldn't include e.g. other stable versions, like
v5.5.13.

Thanks,
Chris.

* Re: how to (integrity) verify a whole git repo
  2020-04-21  4:45 how to (integrity) verify a whole git repo Christoph Anton Mitterer
@ 2020-04-21  6:53 ` Jonathan Nieder
  2020-04-21 14:42   ` Christoph Anton Mitterer
  2020-04-21 19:14 ` Junio C Hamano
  1 sibling, 1 reply; 7+ messages in thread
From: Jonathan Nieder @ 2020-04-21  6:53 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: git

Hi Christoph,

Christoph Anton Mitterer wrote:

> How to cryptographically verify the integrity of a whole git repo (i.e.
> all its commits/blobs/etc. in the history)?

This happens automatically as part of fetch.  When you fetch, the
objects' content is transferred over the wire but not their names.  The
name of each object is a hash of its content.  Thus, whenever you
address an object by its name, you are using its verified identity.
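
A minimal sketch of what "the name is a hash of the content" means for a
blob, assuming a SHA-1 repository and that sha1sum (GNU coreutils) is
installed. The helper name blob_id_by_hand is made up for illustration; it
hashes the header "blob <size>", a NUL byte, and then the file content.

```shell
# Hypothetical helper: recompute a blob's object name without asking git.
# Assumes SHA-1 object names and the sha1sum tool from GNU coreutils.
blob_id_by_hand() {
    # $1: path to a file
    size=$(wc -c < "$1" | tr -d ' ')
    # Object name = sha1("blob <size>" NUL <content>)
    { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}
```

For any file, the output should match `git hash-object --no-filters <file>`,
which is why a receiver can recompute every name instead of trusting names
sent by the other side.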

> Assume e.g. I have the kernel sources and want to do some bisection.
> One has also retrieved Linus' and GregKH's key via some trusted path
> and assumes that SHA1 is more or less still safe enough ;-)
>
> 1) Of course there is git verify-tag and verify-commit, which check GPG
> signatures, but these alone check, AFAIU, only the respective
> tag/commit.

Tag and commit object content include the object ids for the objects
they reference, so (assuming we are using a strong hash) their name
is enough to verify all content reachable from them.

In other words, it's a Merkle tree.
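
To see the Merkle structure directly, one can print the object ids that a
commit's own content refers to. This is only an illustrative sketch; the
function name commit_links is hypothetical, and it relies on the commit
header ending at the first blank line.

```shell
# Hypothetical helper: list the tree and parent ids stored inside a commit.
commit_links() {
    # $1: any commit-ish (e.g. HEAD or a tag name)
    # Stop at the first blank line, which separates header from message.
    git cat-file -p "$1" | sed -n '/^$/q; /^tree /p; /^parent /p'
}
```

Because these ids are part of the hashed content, changing any tree, blob or
ancestor would change the commit's own name as well.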

> How to check everything else? Is it enough to git fsck --full?

fsck is helpful for checking that objects are valid --- that they
don't reference any objects you don't have, that their format is
correct, and so on.  So it's good to run (or you can use the
transfer.fsckObjects setting to run fsck as part of the clone or fetch
operation).
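
As a rough sketch, the two options mentioned above could be wrapped like
this. The function names are made up for illustration; the git commands and
the transfer.fsckObjects setting are the ones named in this message.

```shell
# Hypothetical wrappers around the checks described above.
check_repo() {
    # $1: path to the repository; non-zero exit status means fsck found problems
    git -C "$1" fsck --full
}

enable_fetch_fsck() {
    # $1: path to the repository; makes later fetches into it run fsck checks
    git -C "$1" config transfer.fsckObjects true
}
```

Note that this checks validity and connectivity only; it says nothing about
who created the objects.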

Thanks and hope that helps,
Jonathan

* Re: how to (integrity) verify a whole git repo
  2020-04-21  6:53 ` Jonathan Nieder
@ 2020-04-21 14:42   ` Christoph Anton Mitterer
  2020-04-21 16:19     ` Konstantin Ryabitsev
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Anton Mitterer @ 2020-04-21 14:42 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git

Hey Jonathan.

On Mon, 2020-04-20 at 23:53 -0700, Jonathan Nieder wrote:
> This happens automatically as part of fetch.  When you fetch, the
> objects' content is transferred over the wire but not their
> names.  The
> name of each object is a hash of its content.  Thus, whenever you
> address an object by its name, you are using its verified identity.

Okay maybe I wasn't clear enough :D (mixing up integrity and
authenticity).

I'd guess that what you describe here is, that effectively the chain of
all SHA1 hashes is computed when one does fetch, right?

But this alone doesn't guarantee cryptographic authenticity, e.g. as in
"that's the kernel sources as released by Linus".

> Tag and commit object content include the object ids for the objects
> they reference, so (assuming we are using a strong hash) their name
> is enough to verify all content reachable from them.
> 
> In other words, it's a Merkle tree.

And for (cryptographically) checking the authenticity of that tree,
wouldn't I need to verify the signatures on its leaves?

Taking again the kernel as an example:
If I clone the repo (or fsck it later), then all I know is that there
was no corruption, if all the tips are correct, since they start
the chain of hash sums to all other objects.

But an attacker could have just forged these tips.
So for checking authenticity, I need to verify some signatures on them.

Now if I check e.g. Linus' signature on tag v5.6, I should know that
everything earlier (in the tree, not chronologically) than that tag is
authentic.

But not e.g. any commits on top of v.5.6 (which aren't either signed
themselves or protected by another tag "above" them).
Neither any commits never reached from v.5.6, e.g. later stable patches
like anything from above v.5.5 (which is again below v.5.6) up to 
v.5.5.13, which is not.

So from my understanding, to use only commits that are authentic by the
kernel upstream developers, I'd need to verify all these tips.. and throw
away everything which is not reachable by one of them.

Is that somehow possible?

Thanks,
Chris.

* Re: how to (integrity) verify a whole git repo
  2020-04-21 14:42   ` Christoph Anton Mitterer
@ 2020-04-21 16:19     ` Konstantin Ryabitsev
  2020-04-23 18:12       ` Christoph Anton Mitterer
  0 siblings, 1 reply; 7+ messages in thread
From: Konstantin Ryabitsev @ 2020-04-21 16:19 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: Jonathan Nieder, git

On Tue, Apr 21, 2020 at 04:42:16PM +0200, Christoph Anton Mitterer wrote:
> Taking again the kernel as an example:
> If I clone the repo (or fsck it later), then all I know is that there
> was no corruption, if all the tips are correct, since they start
> the chain of hash sums to all other objects.

Notably, there is normally only one branch in torvalds/linux.git, and 
that's "master". So, there's only one tip.

> But an attacker could have just forged these tips.
> So for checking authenticity, I need to verify some signatures on them.
> 
> Now if I check e.g. Linus' signature on tag v5.6, I should know that
> everything earlier (in the tree, not chronologically) than that tag is
> authentic.

Yes, verifying a signature on a tag tells you that all commits are 
bit-for-bit exactly the same as on Linus's workstation where he created 
the signature.
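
A small sketch of how one might check whether a given commit is covered by
such a tag, assuming the tag's signature has already been checked with
git verify-tag against a keyring one trusts. The function name in_history_of
is hypothetical.

```shell
# Hypothetical helper: is commit $2 part of the history that tag $1 points at?
in_history_of() {
    # $1: tag (or any commit-ish) already verified, $2: commit to test
    # Exit status 0 means $2 is an ancestor of (or identical to) $1.
    git merge-base --is-ancestor "$2" "$1^{commit}"
}
```

Usage would look like `git verify-tag v5.6 && in_history_of v5.6 <commit>`;
only when both succeed is the commit covered by the signed tag.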

> But not e.g. any commits on top of v.5.6 (which aren't either signed
> themselves or protected by another tag "above" them).

This is mostly true, yes.

> Neither any commits never reached from v.5.6, e.g. later stable patches
> like anything from above v.5.5 (which is again below v.5.6) up to 
> v.5.5.13, which is not.

Stable commits would be in the stable tree, and those tags are signed by 
Greg Kroah-Hartman.

> So from my understanding, to use only commits that are authentic by the
> kernel upstream developers, I'd need to verify all these tips.. and throw
> away everything which is not reachable by one of them.
> 
> Is that somehow possible?

You probably don't care about commits that arrive between releases, so 
effectively you are already doing that? Even if you have loose objects 
that aren't reachable from your current tip (e.g. you only care about 
objects in the stable branch linux-5.6.y), it's not like they are going 
to "poison" your tree, so removing them is just a garbage collection 
operation at best.

## Minor attestation rant

I would argue that your premise of "authenticity" is wrong. The best 
that we are currently able to offer is a guarantee that, at the point 
where the tag was signed, the tree is bit-for-bit exact to the tree the 
way it exists on Linus Torvalds' (or Greg KH's) workstation.

However, both Linus and Greg merge code from tens of thousands of other 
contributors and it's important to keep in mind that their tag 
signatures do not offer any kind of attestation proof of the code's 
actual authorship or origin. Looking for such proof would be 
near-impossible -- even if we had a universally accepted mechanism to do 
cryptographic attestation of all patches and commits, normal maintainer 
operations would necessarily break this chain:

- maintainers insert their own trailers into commit messages
  (Signed-off-by, Tested-by, Acked-by, etc).
- maintainers reorder and edit patches that they receive from individual 
  contributors -- for typos, minor stylistic cleanups, extra comments, 
  etc.
- maintainers routinely rebase patches they receive before they can 
  submit them to be merged into mainline.

Full code attestation is possible in projects where all commits are 
forks and merges -- for example, many Git**b/Gerrit projects could be 
set up to require full cryptographic attestation of commits, if all 
operations are forks, pull requests, and merges. However, it would be 
impossible to force this development paradigm onto the Linux kernel -- 
it would be extremely disruptive and require massive individual effort 
to overhaul every maintainer's workflow. Furthermore, many maintainers 
would reject this approach because they would disagree about the main 
premise behind the effort -- that cryptographically signing every commit 
offers enough tangible benefit to be worth it.

Let me expound on the last point. There are some 15,000 personas who 
have committed code to the Linux kernel (a persona could be the same 
person committing code from different commercial entities -- 
jdoe@google.com vs jdoe@redhat.com). Even if we assume that each commit 
is signed, we then must have a way to perform some kind of meaningful 
verification, right?

- Where do we get all the public keys required for such a task?
- How do we handle cases where a key has expired or worse, has been 
  revoked by the developer? This can't invalidate their past commits, 
  because it's impossible to re-sign those.
- How do we bootstrap distributed trust without relying on someone being 
  a Fundamentally Non-corruptible Person? It's certainly not me -- I 
  have close relatives living under, shall we say, regimes with loose 
  standards when it comes to personal freedoms.
- How much trust should we be putting into cryptographic signatures?  
  Linux developers aren't necessarily that much better about keeping 
  their workstations protected against malicious attacks, so they are 
  just as vulnerable to having their private keys stolen as anyone else.

For this reason, Linux maintainers use either a zero-trust approach, or 
a last-leg trust approach:

- Submaintainers don't put much trust into *who* wrote the code and 
  review all submissions they receive as potentially containing security 
  bugs (intentional or not); their job is to review the code and pass it 
  up the chain to maintainers.
- if maintainers receive pull requests from submaintainers, then they 
  *may* check cryptographic signatures on the trees they pull. I am 
  trying to encourage all maintainers to do this, and I've been working 
  to introduce patch attestation so that maintainers preferring to work 
  with patch series as opposed to pull requests can have similar 
  functionality.
- Linus checks all signatures on trees he pulls from non-kernel.org 
  locations. Unfortunately, I've not been able to convince him that he 
  should check them on stuff he pulls from kernel.org as well (and he 
  has his own reasons for that).

So, all of this is to say that as the person cloning linux.git you are 
merely the last link in the chain of "trusting the maintainer before 
you." In your case that maintainer is Linus (or Greg KH), and you have 
to agree that, in the end, "having a tree that is bit-for-bit identical 
with what Linus has" is a pretty good assurance that it's as "authentic 
Linux" as it gets.

-K

* Re: how to (integrity) verify a whole git repo
  2020-04-21  4:45 how to (integrity) verify a whole git repo Christoph Anton Mitterer
  2020-04-21  6:53 ` Jonathan Nieder
@ 2020-04-21 19:14 ` Junio C Hamano
  2020-04-23  4:02   ` Christoph Anton Mitterer
  1 sibling, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2020-04-21 19:14 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: git

Christoph Anton Mitterer <calestyo@scientia.net> writes:

> How to check everything else? Is it enough to git fsck --full?
>
> Everything earlier in the history of a verified tag/commit should be
> cryptographically safe (assuming SHA1 would be still secure enough),
> right?

Correct.

> 2) But this of course won't show me anything which is in the repo but
> not earlier in the history of the tag/commit I've checked, right?!
> Is there a way to e.g. have everything dropped which is not verifiable
> via some signed commit/tag?

You can compute the commits that are not reachable from any of the
signed tags.

    git rev-list --all --not $list_tags_and_commits_you_trust_here

will enumerate all the commits that are not reachable from those
tags.
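
One possible way to build that list, sketched under the assumption that the
GnuPG keyring in use contains only the keys you trust (git verify-tag
accepts a good signature from any key it can find). Both function names are
made up for illustration.

```shell
# Hypothetical helper: print every tag whose signature git verify-tag accepts.
# Assumes the active GnuPG keyring holds only keys you actually trust.
verified_tags() {
    git for-each-ref --format='%(refname)' refs/tags |
    while read -r ref; do
        if git verify-tag "$ref" >/dev/null 2>&1; then
            printf '%s\n' "$ref"
        fi
    done
}

# Hypothetical helper: commits not reachable from any of the given refs.
untrusted_commits() {
    # $@: tags and commits you trust
    git rev-list --all --not "$@"
}
```

Something like `untrusted_commits $(verified_tags)` would then print the
commits that no verified tag covers.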

But your "have everything dropped" is a fuzzy notion and you must be
more precise to define what you want.  Imagine this history:

    ----o-----o-----L-----x----x-----x-----x-----x----x HEAD (master)
                                          /
                                         /
                                        /
                   ... ------o----o----G

where you have two people you trust (Linus and Greg), HEAD is the
tip of your 'master' branch, probably you fetched from Linus, L and
G are the two recent tags Linus and Greg signed.

If you enumerate commits that are not reachable from L or G, you'll
get all commits that are marked with 'x'.  Commits marked with 'o'
are reachable from either 'L' or 'G', and you would want to keep
them.

Now, you need to define what you mean by "have everything dropped".
You can remove commits 'x' but then after that where would your
'master' branch point at?  There is no good answer to that question.

What you could do is remove all branches and tags except for the
signed tags you trust from your repository and then run "git repack" on
the repository.  Then there will be tags that point at L and G but
you'd be discarding 'master' (which is not signed) and repack will
discard all 'x' in the sample history illustrated above.
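
A rough sketch of that procedure, best tried on a throwaway copy such as one
made with git clone --mirror, since it deletes refs and objects for good.
The function name keep_only_refs is hypothetical, and git gc --prune=now is
used here in place of a bare "git repack" so that loose unreachable objects
are removed as well.

```shell
# Hypothetical helper: keep only the named refs, then drop whatever became
# unreachable. Destructive; assumes it runs inside a disposable (bare) clone.
keep_only_refs() {
    # $@: full ref names to keep, e.g. refs/tags/v5.6
    git for-each-ref --format='%(refname)' |
    while read -r ref; do
        keep=no
        for k in "$@"; do
            [ "$ref" = "$k" ] && keep=yes
        done
        [ "$keep" = yes ] || git update-ref -d "$ref"
    done
    # Reflogs would otherwise keep the old commits reachable.
    git reflog expire --expire=now --all
    git gc --quiet --prune=now
}
```

Afterwards a branch can be re-created at a trusted point if one is wanted,
for example with `git branch master v5.6`.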

* Re: how to (integrity) verify a whole git repo
  2020-04-21 19:14 ` Junio C Hamano
@ 2020-04-23  4:02   ` Christoph Anton Mitterer
  0 siblings, 0 replies; 7+ messages in thread
From: Christoph Anton Mitterer @ 2020-04-23  4:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Hey Junio.

On Tue, 2020-04-21 at 12:14 -0700, Junio C Hamano wrote:
> You can compute the commits that are not reachable from any of the
> signed tags.
> 
>     git rev-list --all --not $list_tags_and_commits_you_trust_here
> 
> will enumerate all the commits that are not reachable from those
> tags.

And by "not reachable" you mean "commits which were not before the
commits I trust" ("before" obviously again not in terms of their date,
but their position in the tree)?

> But your "have everything dropped" is a fuzzy notion and you must be
> more precise to define what you want.  Imagine this history:
> 
> 
>     ----o-----o-----L-----x----x-----x-----x-----x----x HEAD (master)
>                                           /
>                                          /
>                                         /
>                    ... ------o----o----G
> 
> where you have two people you trust (Linus and Greg), HEAD is the
> tip of your 'master' branch, probably you fetched from Linus, L and
> G are the two recent tags Linus and Greg signed.
> 
> If you enumerate commits that are not reachable from L or G, you'll
> get all commits that are marked with 'x'.  Commits marked with 'o'
> are reachable from either 'L' or 'G', and you would want to keep
> them.

That seems to be more or less what I'd want.

> Now, you need to define what you mean by "have everything dropped".
> You can remove commits 'x' but then after that where would your
> 'master' branch point at?  There is no good answer to that question.

Hmm well naively I'd have said master should point to L, assuming
Greg's branch was merged into it and assuming git knows which branch
was the one merged into.

Of course that would leave Greg's branch possibly dangling at G.

Maybe one could handle such cases like this:
    ----o-----o-----L-----x----x-----x-----x-----x----x HEAD (master)
                                          /
                                         /
                                        /
                    ... -----t----o----G

If the former branch name can be determined (from the commit message?),
recreate it.
If not, the commits from Greg's branch could be either left 
unreachable or maybe, with some special option, could be pointed at by
some newly created branch-name foo-1 or whatever.
If Greg's branch contains a commit pointed to by tag (here named t), at
least this would be reachable anyway.

But I guess for the use case I'm thinking about, unreachable commits
wouldn't be that much of a problem.

> What you could do is remove all branches and tags except for the
> signed tags you trust from your repository and then run "git repack" on
> the repository.  Then there will be tags that point at L and G but
> you'd be discarding 'master' (which is not signed) and repack will
> discard all 'x' in the sample history illustrated above.

Well one could probably just manually set master to some reasonable
commit, i.e. the one which was likely anyway master at some point in
time, until Linus added further commits.

Is there an easy (like for people who don't dream in git ;-) ) and
ideally fast way to do all this?

I would have guessed that a command which does this more or less out of
the box, might be quite helpful for security conscious people. The
scenario shouldn't be so rare:

- one clones a repo, where commits are usually not signed, but tags are
- one has a number of trusted people and can even securely retrieve
  their keys (in my case, Debian ships Linus' and Greg's key in the
  source package of the kernel)
- one needs to work with the repo, including any older states in the
  history (in my case it's trying to bisect the - for me - showstopper
  bug: https://bugzilla.kernel.org/show_bug.cgi?id=207245 )
- one doesn't want to use anything which is not signed by trusted
  people, so basically one wants a repo, as if it would have just been
  cloned when all branches/etc. were at the state of a signed tag (or
  commit).

So I have something like the (stable)kernel repo which looks a bit like
(with (L) and (G) indicating who signed):
               x---x---x--- foo
              /
    ----o-----v.5.5(L)----o----o-----v.5.6(L)----x-----x----x master
               \                     \
                \                     o----v.5.6.1(G)---o----v.5.6.1(G)
                 \
                  o----o----v.5.5.1(G)---o---o---v.5.5.1(G)---x---x

A command like:
git drop-unsigned-stuff --trusted-key 00411886 --trusted-key 6092693E

would end up in this (and even garbage-collect all unreachable stuff
already, unless one uses some special option):

    ----o-----v.5.5(L)----o----o-----v.5.6(L) master
               \                     \
                \                     o----v.5.6.1(G)---o----v.5.6.1(G)
                 \
                  o----o----v.5.5.1(G)---o---o---v.5.5.1(G)

So with that repo, unless I fetch something new, I could be sure,
everything I have or I could potentially checkout was at some time
trusted by someone I trust.

In the example above, a branch (foo) which is completely unsigned would
consequently be dropped completely.

In earlier days, most projects released their (signed) sources as some
tarball,...many nowadays just set (and sometimes even sign) some git
tag (which is great)... but with the old tarball one could have been
sure that everything in it is trusted (if one trusts the signer), which
with git is of course less simple.
So for such cases I would have liked a simple way to get rid of
everything untrusted.

But probably my use case is just too exotic, otherwise git would
already have a helper command for it ^^

Cheers,
Chris.

* Re: how to (integrity) verify a whole git repo
  2020-04-21 16:19     ` Konstantin Ryabitsev
@ 2020-04-23 18:12       ` Christoph Anton Mitterer
  0 siblings, 0 replies; 7+ messages in thread
From: Christoph Anton Mitterer @ 2020-04-23 18:12 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: git

On Tue, 2020-04-21 at 12:19 -0400, Konstantin Ryabitsev wrote:
> > So from my understanding, to use only commits that are authentic by
> > the
> > kernel upstream developers, I'd need to verify all these tips.. and
> > throw
> > away everything which is not reachable by one of them.
> > 
> > Is that somehow possible?
> 
> You probably don't care about commits that arrive between releases, 

No, I guess not.

> so 
> effectively you are already doing that? Even if you have loose
> objects 
> that aren't reachable from your current tip (e.g. you only care
> about 
> objects in the stable branch linux-5.6.y), it's not like they are
> going 
> to "poison" your tree, so removing them is just a garbage collection 
> operation at best.

Well it's clear that any "loose" objects (in the sense of "not part of
the history by something that is signed and that I trust") don't poison
the tree before whatever I trust... but of course only if one never
accidentally uses anything of them.

For Linus' kernel and Greg's stable kernel repos this is probably
rather unlikely, since they do not contain much which is not signed by
one of the two.
But for other projects one might have many development branches or
other stuff which might not be signed at all.

When one then "works" on such a repo one doesn't always want to check
whether the current stuff one uses is signed and trusted or not.

> I would argue that your premise of "authenticity" is wrong. The best 
> that we are currently able to offer is a guarantee that, at the
> point 
> where the tag was signed, the tree is bit-for-bit exact to the tree
> the 
> way it exists on Linus Torvalds' (or Greg KH's) workstation.

And isn't that already something? :-D

It means at the least that no simple MitM was possible, in contrast to
when I just git clone git://whatever .

> However, both Linus and Greg merge code from tens of thousands of
> other 
> contributors and it's important to keep in mind that their tag 
> signatures do not offer any kind of attestation proof of the code's 
> actual authorship or origin.

Sure... but this is anyway the case... and nothing which one could
easily change or improve.

The best thing in terms of authenticity one can possibly get is being
able to have the repo exactly the same as it is considered correct at its
canonical upstream.

Everything better would require full trust paths and mutual signing
between all participating developers - which would surely be nice to
have, but is probably a completely other question.

Also, there are many much smaller projects, where things would be much
easier.

> 
> - Submaintainers don't put much trust into *who* wrote the code and 
>   review all submissions they receive as potentially containing
> security 
>   bugs (intentional or not); their job is to review the code and pass
> it 
>   up the chain to maintainers.
> - if maintainers receive pull requests from submaintainers, then
> they 
>   *may* check cryptographic signatures on the trees they pull. I am 
>   trying to encourage all maintainers to do this, and I've been
> working 
>   to introduce patch attestation so that maintainers preferring to
> work 
>   with patch series as opposed to pull requests can have similar 
>   functionality.
> - Linus checks all signatures on trees he pulls from non-kernel.org 
>   locations. Unfortunately, I've not been able to convince him that
> he 
>   should check them on stuff he pulls from kernel.org as well (and
> he 
>   has his own reasons for that).

But all this gives already quite some trust into the whole thing.

> So, all of this is to say that as the person cloning linux.git you
> are 
> merely the last link in the chain of "trusting the maintainer before 
> you." In your case that maintainer is Linus (or Greg KH), and you
> have 
> to agree that, in the end, "having a tree that is bit-for-bit
> identical 
> with what Linus has" is a pretty good assurance that it's as
> "authentic 
> Linux" as it gets.

Exactly... it's at least not much worse (if at all) than taking e.g. my
pre-compiled distro kernel, for which the sources are like not better
checked or more securely retrieved than when I clone Linus' git and
verify the tags.

My main concern was really to ideally "throw away" everything which
wasn't protected by a set of certain keys,... so that I wouldn't
accidentally use it.

Thanks,
Chris.
