git, monorepos, and access control

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* git, monorepos, and access control
@ 2018-12-05 20:13 Coiner, John
  2018-12-05 20:34 ` Ævar Arnfjörð Bjarmason
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Coiner, John @ 2018-12-05 20:13 UTC (permalink / raw)
  To: git@vger.kernel.org

Hi,

I'm an engineer with AMD. I'm looking at whether we could switch our 
internal version control to a monorepo, possibly one based on git and 
VFSForGit.

One obstacle to moving AMD to git/VFSForGit is the lack of access 
control support in git. AMD has a lot of data whose distribution must be 
limited. Sometimes it's a legal requirement, eg. CPU core designs are 
covered by US export control laws and not all employees may see them. 
Sometimes it's a contractual obligation, as when a third party shares 
data with us and we agree only to share this data with certain 
employees. Any hypothetical AMD monorepo should be able to securely deny 
read access in certain subtrees to users without required permissions.

Has anyone looked at adding access control to git, at a per-directory 
granularity? Is this a feature that the git community would possibly 
welcome?

Here's my rough thinking about how it might work:
  - an administrator can designate that a tree object requires zero or 
more named privileges to read
  - when a mortal user attempts to retrieve the tree object, a hook 
allows the server to check if the user has a given privilege. The hook 
can query an arbitrary user/group data base, LDAP or whatever. The 
details of this check are mostly in the hook; git only knows about 
abstract named privileges.
  - if the user has permission, everything goes as normal.
  - if the user lacks permission, they get a DeniedTree object which 
might carry some metadata about what permissions would be needed to see 
more. The DeniedTree lacks the real tree's entries. (TBD, how do we 
render a denied tree in the workspace? An un-writable directory 
containing only a GITDENIED file with some user friendly error message?)
  - hashes are secret. If the hashes from a protected tree leak, the 
data also leaks. No check on the server prevents it from handing out 
contents for correctly-guessed hashes.
  - mortal users shouldn't be able to alter permissions. Of course, 
mortal users will often modify tree objects that carry permissions. So 
the server should enforce that a user isn't pushing updates that alter 
permissions on the same logical directory.

I would welcome your feedback on whether this idea makes technical 
sense, and whether the feature could ever be a fit for git.

You might ask what alternatives we are looking at. At our scale, we'd 
really want a version control system that implements a virtual 
filesystem. That already limits us to ClearCase, VFSForGit, and maybe 
Vesta among public ones.  Am I missing any? We would also want one that 
permits branching enormous numbers of files without creating enormous 
amounts of data in the repo -- git gets that right, and perforce (our 
status quo) does not. That's how I got onto the idea of adding read 
authorization to git.

Thanks,

John

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: git, monorepos, and access control
  2018-12-05 20:13 git, monorepos, and access control Coiner, John
@ 2018-12-05 20:34 ` Ævar Arnfjörð Bjarmason
  2018-12-05 20:43   ` Derrick Stolee
  2018-12-05 21:01 ` Jeff King
  2018-12-05 22:37 ` Ævar Arnfjörð Bjarmason
  2 siblings, 1 reply; 17+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-12-05 20:34 UTC (permalink / raw)
  To: Coiner, John; +Cc: git@vger.kernel.org

On Wed, Dec 05 2018, Coiner, John wrote:

> I'm an engineer with AMD. I'm looking at whether we could switch our
> internal version control to a monorepo, possibly one based on git and
> VFSForGit.
>
> One obstacle to moving AMD to git/VFSForGit is the lack of access
> control support in git. AMD has a lot of data whose distribution must be
> limited. Sometimes it's a legal requirement, eg. CPU core designs are
> covered by US export control laws and not all employees may see them.
> Sometimes it's a contractual obligation, as when a third party shares
> data with us and we agree only to share this data with certain
> employees. Any hypothetical AMD monorepo should be able to securely deny
> read access in certain subtrees to users without required permissions.
>
> Has anyone looked at adding access control to git, at a per-directory
> granularity? Is this a feature that the git community would possibly
> welcome?
>
> Here's my rough thinking about how it might work:
>   - an administrator can designate that a tree object requires zero or
> more named privileges to read
>   - when a mortal user attempts to retrieve the tree object, a hook
> allows the server to check if the user has a given privilege. The hook
> can query an arbitrary user/group data base, LDAP or whatever. The
> details of this check are mostly in the hook; git only knows about
> abstract named privileges.
>   - if the user has permission, everything goes as normal.
>   - if the user lacks permission, they get a DeniedTree object which
> might carry some metadata about what permissions would be needed to see
> more. The DeniedTree lacks the real tree's entries. (TBD, how do we
> render a denied tree in the workspace? An un-writable directory
> containing only a GITDENIED file with some user friendly error message?)
>   - hashes are secret. If the hashes from a protected tree leak, the
> data also leaks. No check on the server prevents it from handing out
> contents for correctly-guessed hashes.
>   - mortal users shouldn't be able to alter permissions. Of course,
> mortal users will often modify tree objects that carry permissions. So
> the server should enforce that a user isn't pushing updates that alter
> permissions on the same logical directory.
>
> I would welcome your feedback on whether this idea makes technical
> sense, and whether the feature could ever be a fit for git.
>
> You might ask what alternatives we are looking at. At our scale, we'd
> really want a version control system that implements a virtual
> filesystem. That already limits us to ClearCase, VFSForGit, and maybe
> Vesta among public ones.  Am I missing any? We would also want one that
> permits branching enormous numbers of files without creating enormous
> amounts of data in the repo -- git gets that right, and perforce (our
> status quo) does not. That's how I got onto the idea of adding read
> authorization to git.

All of what you've described is possible to implement in git, but as far
as I know there's no existing implementation of it.

Microsoft's GVFS probably comes closest, and they're actively
upstreaming bits of that, but as far as I know that doesn't in any way
try to solve this "users XYZ can't even list such-and-such a tree"
problem.

Google has much the same problem with Android, i.e. there's some parts
of the giant checkout of multiple repos the "repo" tool manages that
aren't available to some of Google's employees, or only to some partners
etc. They solve this by splitting those at repo boundaries, and are
working on moving that to submodules. Perhaps this would work in your
case? It's not as easy to work with, but could be made to work with
existing software.

In case you haven't, read "SECURITY" in the git-fetch(1) manpage,
although if that could be fixed to work for the case you're describing.

In a hypothetical implementation that does this, you are always going to
need to have something where you hand off the SHA-1 of the
"secret-stuff-here/" top-level tree object to clients that can't access
any of the sub-trees or blobs from that directory, since they're going
to need it to construct a commit object consisting of
"stuff-they-can-access/" plus the current state of "secret-stuff-here/"
to push upstream. But you can hide the blobs & trees under
"secret-stuff-here/" and their SHA-1s.

Note also that in the general case your "is this blob secret to user
xyz" or "is this tree secret to user xyz" will need to deal with there
happening to be some blob or tree in "secret-stuff-here/" that just so
happens to share the exact same tree/blob as something they need under
"stuff-they-can-access/". Think a directory that only contains a
".gitignore" file, or a BSD "LICENSE" file.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: git, monorepos, and access control
  2018-12-05 20:34 ` Ævar Arnfjörð Bjarmason
@ 2018-12-05 20:43   ` Derrick Stolee
  2018-12-05 20:58     ` Duy Nguyen
  0 siblings, 1 reply; 17+ messages in thread
From: Derrick Stolee @ 2018-12-05 20:43 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Coiner, John; +Cc: git@vger.kernel.org

On 12/5/2018 3:34 PM, Ævar Arnfjörð Bjarmason wrote:
> On Wed, Dec 05 2018, Coiner, John wrote:
>
>> I'm an engineer with AMD. I'm looking at whether we could switch our
>> internal version control to a monorepo, possibly one based on git and
>> VFSForGit.
>>
>> Has anyone looked at adding access control to git, at a per-directory
>> granularity? Is this a feature that the git community would possibly
>> welcome?
> All of what you've described is possible to implement in git, but as far
> as I know there's no existing implementation of it.
>
> Microsoft's GVFS probably comes closest, and they're actively
> upstreaming bits of that, but as far as I know that doesn't in any way
> try to solve this "users XYZ can't even list such-and-such a tree"
> problem.
(Avar has a lot of good ideas in his message, so I'm just going to add
on a few here.)

This directory-level security is not a goal for VFS for Git, and I don't
see itbecoming a priority as it breaks a number of design decisions we
made in our object storage and communication models.

The best I can think about when considering Git as an approach would be
to use submodules for your security-related content, and then have server-
side security for access to those repos. Of course, submodules are not
supported in VFS for Git, either.

The Gerrit service has _branch_ level security, which is related to the
reachability questions that a directory security would need. However,
the problem is quite different. Gerrit does have a lot of experience in
dealing with submodules, though, so that's probably a good place to
start.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: git, monorepos, and access control
  2018-12-05 20:43   ` Derrick Stolee
@ 2018-12-05 20:58     ` Duy Nguyen
  2018-12-05 21:12       ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 17+ messages in thread
From: Duy Nguyen @ 2018-12-05 20:58 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Ævar Arnfjörð Bjarmason, John.Coiner,
	Git Mailing List

On Wed, Dec 5, 2018 at 9:46 PM Derrick Stolee <stolee@gmail.com> wrote:
> This directory-level security is not a goal for VFS for Git, and I don't
> see itbecoming a priority as it breaks a number of design decisions we
> made in our object storage and communication models.
>
> The best I can think about when considering Git as an approach would be
> to use submodules for your security-related content, and then have server-
> side security for access to those repos. Of course, submodules are not
> supported in VFS for Git, either.

Another option is builtin per-blob encryption (maybe with just
clean/smudge filter), then access control will be about obtaining the
decryption key (*) and we don't break object storage and
communication. Of course pack delta compression becomes absolutely
useless. But that is perhaps an acceptable trade off.

(*) Git will not cache the key in any shape or form. Whenever it needs
to deflate an encrypted blob, it asks for the key from a separate
daemon. This guy handles all the access control.

> The Gerrit service has _branch_ level security, which is related to the
> reachability questions that a directory security would need. However,
> the problem is quite different. Gerrit does have a lot of experience in
> dealing with submodules, though, so that's probably a good place to
> start.
>
> Thanks,
> -Stolee
-- 
Duy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: git, monorepos, and access control
  2018-12-05 20:13 git, monorepos, and access control Coiner, John
  2018-12-05 20:34 ` Ævar Arnfjörð Bjarmason
@ 2018-12-05 21:01 ` Jeff King
  2018-12-06  0:23   ` brian m. carlson
                     ` (2 more replies)
  2018-12-05 22:37 ` Ævar Arnfjörð Bjarmason
  2 siblings, 3 replies; 17+ messages in thread
From: Jeff King @ 2018-12-05 21:01 UTC (permalink / raw)
  To: Coiner, John; +Cc: git@vger.kernel.org

On Wed, Dec 05, 2018 at 08:13:16PM +0000, Coiner, John wrote:

> One obstacle to moving AMD to git/VFSForGit is the lack of access 
> control support in git. AMD has a lot of data whose distribution must be 
> limited. Sometimes it's a legal requirement, eg. CPU core designs are 
> covered by US export control laws and not all employees may see them. 
> Sometimes it's a contractual obligation, as when a third party shares 
> data with us and we agree only to share this data with certain 
> employees. Any hypothetical AMD monorepo should be able to securely deny 
> read access in certain subtrees to users without required permissions.
> 
> Has anyone looked at adding access control to git, at a per-directory 
> granularity? Is this a feature that the git community would possibly 
> welcome?

In my opinion this feature is so contrary to Git's general assumptions
that it's likely to create a ton of information leaks of the supposedly
protected data.

For instance, Git is very eager to try to find delta-compression
opportunities between objects, even if they don't have any relationship
within the tree structure. So imagine I want to know the contents of
tree X. I push up a tree Y similar to X, then fetch it back, falsely
claiming to have X but not Y. If the server generates a delta, that may
reveal information about X (which you can then iterate to send Y', and
so on, treating the server as an oracle until you've guessed the content
of X).

You could work around that by teaching the server to refuse to use "X"
in any way when the client does not have the right permissions. But:

  - that's just one example; there are probably a number of such leaks

  - you're fighting an uphill battle against the way Git is implemented;
    there's no single point to enforce this kind of segmentation

The model that fits more naturally with how Git is implemented would be
to use submodules. There you leak the hash of the commit from the
private submodule, but that's probably obscure enough (and if you're
really worried, you can add a random nonce to the commit messages in the
submodule to make their hashes unguessable).

> I would welcome your feedback on whether this idea makes technical 
> sense, and whether the feature could ever be a fit for git.

Sorry I don't have a more positive response. What you want to do is
perfectly reasonable, but I just think it's a mismatch with how Git
works (and because of the security impact, one missed corner case
renders the whole thing useless).

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: git, monorepos, and access control
  2018-12-05 20:58     ` Duy Nguyen
@ 2018-12-05 21:12       ` Ævar Arnfjörð Bjarmason
  2018-12-05 23:42         ` Coiner, John
  0 siblings, 1 reply; 17+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-12-05 21:12 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Derrick Stolee, John.Coiner, Git Mailing List


On Wed, Dec 05 2018, Duy Nguyen wrote:

> On Wed, Dec 5, 2018 at 9:46 PM Derrick Stolee <stolee@gmail.com> wrote:
>> This directory-level security is not a goal for VFS for Git, and I don't
>> see itbecoming a priority as it breaks a number of design decisions we
>> made in our object storage and communication models.
>>
>> The best I can think about when considering Git as an approach would be
>> to use submodules for your security-related content, and then have server-
>> side security for access to those repos. Of course, submodules are not
>> supported in VFS for Git, either.
>
> Another option is builtin per-blob encryption (maybe with just
> clean/smudge filter), then access control will be about obtaining the
> decryption key (*) and we don't break object storage and
> communication. Of course pack delta compression becomes absolutely
> useless. But that is perhaps an acceptable trade off.

Right, this is another option, but from what John described wouldn't
work in this case. "Any hypothetical AMD monorepo should be able to
securely deny read access in certain subtrees to users without required
permissions".

I.e. in this case there will be a
secret-stuff-here/ryzen-microcode.code.encrypted or whatever,
unauthorized users can't see the content, but they can see from the
filename that it exists, and from "git log" who works on it.

It'll also baloon in size on the server-side since we can't delta any of
these objects, they'll all be X sized encrypted binaries.

> (*) Git will not cache the key in any shape or form. Whenever it needs
> to deflate an encrypted blob, it asks for the key from a separate
> daemon. This guy handles all the access control.
>
>> The Gerrit service has _branch_ level security, which is related to the
>> reachability questions that a directory security would need. However,
>> the problem is quite different. Gerrit does have a lot of experience in
>> dealing with submodules, though, so that's probably a good place to
>> start.
>>
>> Thanks,
>> -Stolee

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: git, monorepos, and access control
  2018-12-05 20:13 git, monorepos, and access control Coiner, John
  2018-12-05 20:34 ` Ævar Arnfjörð Bjarmason
  2018-12-05 21:01 ` Jeff King
@ 2018-12-05 22:37 ` Ævar Arnfjörð Bjarmason
  2 siblings, 0 replies; 17+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-12-05 22:37 UTC (permalink / raw)
  To: Coiner, John; +Cc: git@vger.kernel.org

On Wed, Dec 05 2018, Coiner, John wrote:

I forgot to mention this in my initial reply in
<878t13zp8y.fsf@evledraar.gmail.com>, but on a re-reading I re-spotted
this:

>   - hashes are secret. If the hashes from a protected tree leak, the
> data also leaks. No check on the server prevents it from handing out
> contents for correctly-guessed hashes.

This is a thing I know *way* less about so maybe I'm completely wrong,
but even if we have all the rest of the things outlined in your post to
support this, isn't this part going to be susceptible to timing attacks?

We'll do more work if you send a SHA-1 during negotiation that shares a
prefix with an existing SHA-1, since we need to binary search & compare
further. SHA-1 is 160 bits which gives you a huge space of potential
hashes, but not if I can try one bit at a time working from the start of
the hash to work my way to a valid existing hash stored on the server.

Of course that assumes a way to do this over the network, it'll be on
AMD's internal network so much faster than average, but maybe this is
completely implausible.

NetSpectre was different and relied on executing code on the remote
computer in a sandbox, not waiting for network roundtrips for each try,
so maybe this would be a non-issue.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: git, monorepos, and access control
  2018-12-05 21:12       ` Ævar Arnfjörð Bjarmason
@ 2018-12-05 23:42         ` Coiner, John
  2018-12-06  7:23           ` Jeff King
  0 siblings, 1 reply; 17+ messages in thread
From: Coiner, John @ 2018-12-05 23:42 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Duy Nguyen
  Cc: Derrick Stolee, Git Mailing List, peff@peff.net

inline...

On 12/05/2018 04:12 PM, Ævar Arnfjörð Bjarmason wrote:
> On Wed, Dec 05 2018, Duy Nguyen wrote:
>
>>
>> Another option is builtin per-blob encryption (maybe with just
>> clean/smudge filter), then access control will be about obtaining the
>> decryption key (*) and we don't break object storage and
>> communication. Of course pack delta compression becomes absolutely
>> useless. But that is perhaps an acceptable trade off.
> Right, this is another option, but from what John described wouldn't
> work in this case. "Any hypothetical AMD monorepo should be able to
> securely deny read access in certain subtrees to users without required
> permissions".
>
> I.e. in this case there will be a
> secret-stuff-here/ryzen-microcode.code.encrypted or whatever,
> unauthorized users can't see the content, but they can see from the
> filename that it exists, and from "git log" who works on it.

Ah, clean/smudge has potential. It locates the security boundary 
entirely outside the git service and outside the repo. No big software 
engineering project is needed. It should work in VFSForGit just as it 
does in plain old git. Beautiful!

For our use case, it might be OK for commit logs, branch names, and 
filenames to be world-readable. Commit logs are probably the greatest 
concern; and perhaps we could address that through other means.

Maybe it's possible to design a text-to-text cipher that still permits 
git to store compact diffs of ciphered text, with only modest impact on 
security?

Thank you all for your thoughtful replies.

>
> This is a thing I know *way* less about so maybe I'm completely wrong,
> but even if we have all the rest of the things outlined in your post to
> support this, isn't this part going to be susceptible to timing attacks?

That's a good point. It's not easy to estimate exposure to something 
like this.

> For instance, Git is very eager to try to find delta-compression
> opportunities between objects, even if they don't have any relationship
> within the tree structure. So imagine I want to know the contents of
> tree X. I push up a tree Y similar to X, then fetch it back, falsely
> claiming to have X but not Y. If the server generates a delta, that may
> reveal information about X (which you can then iterate to send Y', and
> so on, treating the server as an oracle until you've guessed the content
> of X).
Another good point. I wouldn't have thought of either of these attacks. 
You're scaring me (appropriately) about the risks of adding security to 
a previously-unsecured interface. Let me push on the smudge/clean 
approach and maybe that will bear fruit.

Thanks again.

- John


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: git, monorepos, and access control
  2018-12-05 21:01 ` Jeff King
@ 2018-12-06  0:23   ` brian m. carlson
  2018-12-06  1:08   ` Junio C Hamano
  2018-12-06 20:08   ` Johannes Schindelin
  2 siblings, 0 replies; 17+ messages in thread
From: brian m. carlson @ 2018-12-06  0:23 UTC (permalink / raw)
  To: Jeff King; +Cc: Coiner, John, git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 1288 bytes --]

On Wed, Dec 05, 2018 at 04:01:05PM -0500, Jeff King wrote:
> You could work around that by teaching the server to refuse to use "X"
> in any way when the client does not have the right permissions. But:
> 
>   - that's just one example; there are probably a number of such leaks
> 
>   - you're fighting an uphill battle against the way Git is implemented;
>     there's no single point to enforce this kind of segmentation
> 
> The model that fits more naturally with how Git is implemented would be
> to use submodules. There you leak the hash of the commit from the
> private submodule, but that's probably obscure enough (and if you're
> really worried, you can add a random nonce to the commit messages in the
> submodule to make their hashes unguessable).

I tend to agree that submodules are the way to go here over a monorepo.
Not only do you have much better access control opportunities, in
general, smaller repositories are going to perform better and be easier
to work with than monorepos. While VFS for Git is a great technology,
using a set of smaller repositories is going to outperform it every
time, both on developer machines, and on your source code hosting
platform.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: git, monorepos, and access control
  2018-12-05 21:01 ` Jeff King
  2018-12-06  0:23   ` brian m. carlson
@ 2018-12-06  1:08   ` Junio C Hamano
  2018-12-06  7:20     ` Jeff King
  2018-12-06 20:08   ` Johannes Schindelin
  2 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2018-12-06  1:08 UTC (permalink / raw)
  To: Jeff King; +Cc: Coiner, John, git@vger.kernel.org

Jeff King <peff@peff.net> writes:

> In my opinion this feature is so contrary to Git's general assumptions
> that it's likely to create a ton of information leaks of the supposedly
> protected data.
> ...

Yup, with s/implemented/designed/, I agree all you said here
(snipped).

> Sorry I don't have a more positive response. What you want to do is
> perfectly reasonable, but I just think it's a mismatch with how Git
> works (and because of the security impact, one missed corner case
> renders the whole thing useless).

Yup, again.

Storing source files encrypted and decrypting with smudge filter
upon checkout (and those without the access won't get keys and will
likely to use sparse checkout to exclude these priviledged sources)
is probably the only workaround that does not involve submodules.
Viewing "diff" and "log -p" would still be a challenge, which
probably could use the same filter as smudge for textconv.

I wonder (and this is the primary reason why I am responding to you)
if it is common enough wish to use the same filter for smudge and
textconv?  So far, our stance (which can be judged from the way the
clean/smudge filters are named) has been that the in-repo
representation is the canonical, and the representation used in the
checkout is ephemeral, and that is why we run "diff", "grep",
etc. over the in-repo representation, but the "encrypted in repo,
decrypted in checkout" abuse would be helped by an option to do the
reverse---find changes and look substrings in the representation
used in the checkout.  I am not sure if there are other use cases
that is helped by such an option.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: git, monorepos, and access control
  2018-12-06  1:08   ` Junio C Hamano
@ 2018-12-06  7:20     ` Jeff King
  2018-12-06  9:17       ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff King @ 2018-12-06  7:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Coiner, John, git@vger.kernel.org

On Thu, Dec 06, 2018 at 10:08:57AM +0900, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > In my opinion this feature is so contrary to Git's general assumptions
> > that it's likely to create a ton of information leaks of the supposedly
> > protected data.
> > ...
> 
> Yup, with s/implemented/designed/, I agree all you said here
> (snipped).

Heh, yeah, I actually scratched my head over what word to use. I think
Git _could_ be written in a way that is both compatible with existing
repositories (i.e., is still recognizably Git) and is careful about
object access control. But either way, what we have now is not close to
that.

> > Sorry I don't have a more positive response. What you want to do is
> > perfectly reasonable, but I just think it's a mismatch with how Git
> > works (and because of the security impact, one missed corner case
> > renders the whole thing useless).
> 
> Yup, again.
> 
> Storing source files encrypted and decrypting with smudge filter
> upon checkout (and those without the access won't get keys and will
> likely to use sparse checkout to exclude these priviledged sources)
> is probably the only workaround that does not involve submodules.
> Viewing "diff" and "log -p" would still be a challenge, which
> probably could use the same filter as smudge for textconv.

I suspect there are going to be some funny corner cases there. I use:

  [diff "gpg"]
  textconv = gpg -qd --no-tty

which works pretty well, but it's for files which are _never_ decrypted
by Git. So they're encrypted in the working tree too, and I don't use
clean/smudge filters.

If the files are already decrypted in the working tree, then running
them through gpg again would be the wrong thing. I guess for a diff
against the working tree, we would always do a "clean" operation to
produce the encrypted text, and then decrypt the result using textconv.
Which would work, but is rather slow.

> I wonder (and this is the primary reason why I am responding to you)
> if it is common enough wish to use the same filter for smudge and
> textconv?  So far, our stance (which can be judged from the way the
> clean/smudge filters are named) has been that the in-repo
> representation is the canonical, and the representation used in the
> checkout is ephemeral, and that is why we run "diff", "grep",
> etc. over the in-repo representation, but the "encrypted in repo,
> decrypted in checkout" abuse would be helped by an option to do the
> reverse---find changes and look substrings in the representation
> used in the checkout.  I am not sure if there are other use cases
> that is helped by such an option.

Hmm. Yeah, I agree with your line of reasoning here. I'm not sure how
common it is. This is the first I can recall it. And personally, I have
never really used clean/smudge filters myself, beyond some toy
experiments.

The other major user of that feature I can think of is LFS. There Git
ends up diffing the LFS pointers, not the big files. Which arguably is
the wrong thing (you'd prefer to see the actual file contents diffed),
but I think nobody cares in practice because large files generally don't
have readable diffs anyway.

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: git, monorepos, and access control
  2018-12-05 23:42         ` Coiner, John
@ 2018-12-06  7:23           ` Jeff King
  0 siblings, 0 replies; 17+ messages in thread
From: Jeff King @ 2018-12-06  7:23 UTC (permalink / raw)
  To: Coiner, John
  Cc: Ævar Arnfjörð Bjarmason, Duy Nguyen,
	Derrick Stolee, Git Mailing List

On Wed, Dec 05, 2018 at 11:42:09PM +0000, Coiner, John wrote:

> > For instance, Git is very eager to try to find delta-compression
> > opportunities between objects, even if they don't have any relationship
> > within the tree structure. So imagine I want to know the contents of
> > tree X. I push up a tree Y similar to X, then fetch it back, falsely
> > claiming to have X but not Y. If the server generates a delta, that may
> > reveal information about X (which you can then iterate to send Y', and
> > so on, treating the server as an oracle until you've guessed the content
> > of X).
> Another good point. I wouldn't have thought of either of these attacks. 
> You're scaring me (appropriately) about the risks of adding security to 
> a previously-unsecured interface. Let me push on the smudge/clean 
> approach and maybe that will bear fruit.

If you do look into that approach, check out how git-lfs works. In fact,
you might even be able to build around lfs itself. It's already putting
placeholder objects into the repository, and then faulting them in from
external storage. All you would need to do is lock down access to that
external storage, which is typically accessed via http.

(That all assumes you're OK with sharing the actual filenames with
everybody, and just restricting access to the blob contents. There's no
way to clean/smudge a whole subtree. For that you'd have to use
submodules).

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: git, monorepos, and access control
  2018-12-06  7:20     ` Jeff King
@ 2018-12-06  9:17       ` Ævar Arnfjörð Bjarmason
  2018-12-06  9:30         ` Jeff King
  0 siblings, 1 reply; 17+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-12-06  9:17 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Coiner, John, git@vger.kernel.org


On Thu, Dec 06 2018, Jeff King wrote:

> On Thu, Dec 06, 2018 at 10:08:57AM +0900, Junio C Hamano wrote:
>
>> Jeff King <peff@peff.net> writes:
>>
>> > In my opinion this feature is so contrary to Git's general assumptions
>> > that it's likely to create a ton of information leaks of the supposedly
>> > protected data.
>> > ...
>>
>> Yup, with s/implemented/designed/, I agree all you said here
>> (snipped).
>
> Heh, yeah, I actually scratched my head over what word to use. I think
> Git _could_ be written in a way that is both compatible with existing
> repositories (i.e., is still recognizably Git) and is careful about
> object access control. But either way, what we have now is not close to
> that.
>
>> > Sorry I don't have a more positive response. What you want to do is
>> > perfectly reasonable, but I just think it's a mismatch with how Git
>> > works (and because of the security impact, one missed corner case
>> > renders the whole thing useless).
>>
>> Yup, again.
>>
>> Storing source files encrypted and decrypting with smudge filter
>> upon checkout (and those without the access won't get keys and will
>> likely to use sparse checkout to exclude these priviledged sources)
>> is probably the only workaround that does not involve submodules.
>> Viewing "diff" and "log -p" would still be a challenge, which
>> probably could use the same filter as smudge for textconv.
>
> I suspect there are going to be some funny corner cases there. I use:
>
>   [diff "gpg"]
>   textconv = gpg -qd --no-tty
>
> which works pretty well, but it's for files which are _never_ decrypted
> by Git. So they're encrypted in the working tree too, and I don't use
> clean/smudge filters.
>
> If the files are already decrypted in the working tree, then running
> them through gpg again would be the wrong thing. I guess for a diff
> against the working tree, we would always do a "clean" operation to
> produce the encrypted text, and then decrypt the result using textconv.
> Which would work, but is rather slow.
>
>> I wonder (and this is the primary reason why I am responding to you)
>> if it is common enough wish to use the same filter for smudge and
>> textconv?  So far, our stance (which can be judged from the way the
>> clean/smudge filters are named) has been that the in-repo
>> representation is the canonical, and the representation used in the
>> checkout is ephemeral, and that is why we run "diff", "grep",
>> etc. over the in-repo representation, but the "encrypted in repo,
>> decrypted in checkout" abuse would be helped by an option to do the
>> reverse---find changes and look substrings in the representation
>> used in the checkout.  I am not sure if there are other use cases
>> that is helped by such an option.
>
> Hmm. Yeah, I agree with your line of reasoning here. I'm not sure how
> common it is. This is the first I can recall it. And personally, I have
> never really used clean/smudge filters myself, beyond some toy
> experiments.
>
> The other major user of that feature I can think of is LFS. There Git
> ends up diffing the LFS pointers, not the big files. Which arguably is
> the wrong thing (you'd prefer to see the actual file contents diffed),
> but I think nobody cares in practice because large files generally don't
> have readable diffs anyway.

I don't use this either, but I can imagine people who use binary files
via clean/smudge would be well served by dumping out textual metadata of
the file for diffing instead of showing nothing.

E.g. for a video file I might imagine having lines like:

    duration-seconds: 123
    camera-model: Shiny Thingamabob

Then when you check in a new file your "git diff" will show (using
normal diff view) that:

   - duration-seconds: 123
   + duration-seconds: 321
    camera-model: Shiny Thingamabob

etc.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: git, monorepos, and access control
  2018-12-06  9:17       ` Ævar Arnfjörð Bjarmason
@ 2018-12-06  9:30         ` Jeff King
  0 siblings, 0 replies; 17+ messages in thread
From: Jeff King @ 2018-12-06  9:30 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Junio C Hamano, Coiner, John, git@vger.kernel.org

On Thu, Dec 06, 2018 at 10:17:24AM +0100, Ævar Arnfjörð Bjarmason wrote:

> > The other major user of that feature I can think of is LFS. There Git
> > ends up diffing the LFS pointers, not the big files. Which arguably is
> > the wrong thing (you'd prefer to see the actual file contents diffed),
> > but I think nobody cares in practice because large files generally don't
> > have readable diffs anyway.
> 
> I don't use this either, but I can imagine people who use binary files
> via clean/smudge would be well served by dumping out textual metadata of
> the file for diffing instead of showing nothing.
> 
> E.g. for a video file I might imagine having lines like:
> 
>     duration-seconds: 123
>     camera-model: Shiny Thingamabob
> 
> Then when you check in a new file your "git diff" will show (using
> normal diff view) that:
> 
>    - duration-seconds: 123
>    + duration-seconds: 321
>     camera-model: Shiny Thingamabob

I think that's orthogonal to clean/smudge, though. Neither the in-repo
nor on-disk formats are going to show that kind of output. For that
you'd want a separate textconv filter (and fun fact: showing exif data
was actually the original use case for which I wrote textconv).

If you are using something like LFS, using textconv on top is a little
trickier, because we'd always feed the filter the LFS pointer file, not
the actual data contents. Doing the "reversal" that Junio suggested
would fix that. Or with the code as it is, you can simply define your
filter to convert the LFS pointer data into the real content. I don't
really use LFS, but it looks like:

  [diff "mp4"]
  textconv = git lfs smudge | extract-metadata

would probably work.

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: git, monorepos, and access control
  2018-12-05 21:01 ` Jeff King
  2018-12-06  0:23   ` brian m. carlson
  2018-12-06  1:08   ` Junio C Hamano
@ 2018-12-06 20:08   ` Johannes Schindelin
  2018-12-06 22:15     ` Stefan Beller
  2018-12-06 22:59     ` Coiner, John
  2 siblings, 2 replies; 17+ messages in thread
From: Johannes Schindelin @ 2018-12-06 20:08 UTC (permalink / raw)
  To: Jeff King; +Cc: Coiner, John, git@vger.kernel.org

Hi,

On Wed, 5 Dec 2018, Jeff King wrote:

> The model that fits more naturally with how Git is implemented would be
> to use submodules. There you leak the hash of the commit from the
> private submodule, but that's probably obscure enough (and if you're
> really worried, you can add a random nonce to the commit messages in the
> submodule to make their hashes unguessable).

I hear myself frequently saying: "Friends don't let friends use
submodules". It's almost like: "Some people think their problem is solved
by using submodules. Only now they have two problems."

There are big reasons, after all, why some companies go for monorepos: it
is not for lack of trying to go with submodules, it is the problems that
were incurred by trying to treat entire repositories the same as single
files (or even trees): they are just too different.

In a previous life, I also tried to go for submodules, was burned, and had
to restart the whole thing. We ended up with something that might work in
this instance, too, although our use case was not need-to-know type of
encapsulation. What we went for was straight up modularization.

What I mean is that we split the project up into over 100 individual
projects that are now all maintained in individual repositories, and they
are connected completely outside of Git, via a dependency management
system (in this case, Maven, although that is probably too Java-centric
for AMD's needs).

I just wanted to throw that out here: if you can split up your project
into individual projects, it might make sense not to maintain them as
submodules but instead as individual repositories whose artifacts are
uploaded into a central, versioned artifact store (Maven, NuGet, etc). And
those artifacts would then be retrieved by the projects that need them.

I figure that that scheme might work for you better than submodules: I
could imagine that you need to make the build artifacts available even to
people who are not permitted to look at the corresponding source code,
anyway.

Ciao,
Johannes

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: git, monorepos, and access control
  2018-12-06 20:08   ` Johannes Schindelin
@ 2018-12-06 22:15     ` Stefan Beller
  2018-12-06 22:59     ` Coiner, John
  1 sibling, 0 replies; 17+ messages in thread
From: Stefan Beller @ 2018-12-06 22:15 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jeff King, John.Coiner, git

On Thu, Dec 6, 2018 at 12:09 PM Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> Hi,
>
> On Wed, 5 Dec 2018, Jeff King wrote:
>
> > The model that fits more naturally with how Git is implemented would be
> > to use submodules. There you leak the hash of the commit from the
> > private submodule, but that's probably obscure enough (and if you're
> > really worried, you can add a random nonce to the commit messages in the
> > submodule to make their hashes unguessable).
>
> I hear myself frequently saying: "Friends don't let friends use
> submodules". It's almost like: "Some people think their problem is solved
> by using submodules. Only now they have two problems."

Blaming tools for their lack of evolution/development is not necessarily the
right approach. I recall having patches rejected on this very mailing list
that fixed obvious but minor good things like whitespaces and coding style,
because it *might* produce merge conflicts. Would that situation warrant me
to blame the lacks in the merge algorithm, or could you imagine a better
way out? (No need to answer, it's purely to demonstrate that blaming
tooling is not always the right approach; only sometimes it may be)

> There are big reasons, after all, why some companies go for monorepos: it
> is not for lack of trying to go with submodules, it is the problems that
> were incurred by trying to treat entire repositories the same as single
> files (or even trees): they are just too different.

We could change that in more places.

One example you might think of is the output of git-status that displays
changed files. And in case of submodules it would just show
"submodule changes", which we already differentiate into "dirty tree" and
"different sha1 at HEAD".
Instead we could have the output of all changed files recursively in the
superprojects git-status output.

Another example is the diff machinery, which already knows some
basics such as embedding submodule logs or actual diffs.

> In a previous life, I also tried to go for submodules, was burned, and had
> to restart the whole thing. We ended up with something that might work in
> this instance, too, although our use case was not need-to-know type of
> encapsulation. What we went for was straight up modularization.

So this is a "Fix the data instead of the tool", which seems to be a local
optimization (i.e. you only have to do it once, such that it is cheaper to
do than fixing the tool for that workgroup)
... and because everyone does that the tool never gets fixed.

> What I mean is that we split the project up into over 100 individual
> projects that are now all maintained in individual repositories, and they
> are connected completely outside of Git, via a dependency management
> system (in this case, Maven, although that is probably too Java-centric
> for AMD's needs).

Once you have the dependency management system in place, you
will encounter the rare case of still wanting to change things across
repository boundaries at the same time. Submodules offer that, which
is why Android wants to migrate off of the repo tool, and there it seems
natural to go for submodules.

> I just wanted to throw that out here: if you can split up your project
> into individual projects, it might make sense not to maintain them as
> submodules but instead as individual repositories whose artifacts are
> uploaded into a central, versioned artifact store (Maven, NuGet, etc). And
> those artifacts would then be retrieved by the projects that need them.

This is cool and industry standard. But once you happen to run in a bug
that involves 2 new artifacts (but each of the new artifacts work fine
on their own), then you'd wish for something like "git-bisect but across
repositories". Submodules (in theory) allow for fine grained bisection
across these repository boundaries, I would think.

> I figure that that scheme might work for you better than submodules: I
> could imagine that you need to make the build artifacts available even to
> people who are not permitted to look at the corresponding source code,
> anyway.

This is a sensible suggestion, as they probably don't want to ramp up
development on submodules. :-)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: git, monorepos, and access control
  2018-12-06 20:08   ` Johannes Schindelin
  2018-12-06 22:15     ` Stefan Beller
@ 2018-12-06 22:59     ` Coiner, John
  1 sibling, 0 replies; 17+ messages in thread
From: Coiner, John @ 2018-12-06 22:59 UTC (permalink / raw)
  To: Johannes Schindelin, Jeff King; +Cc: git@vger.kernel.org

Johannes,

Thanks for your feedback.

I'm not looking closely at submodules, as it's my understanding that 
VFSForGit does not support them. A VFS would be a killer feature for us. 
If VFSForGit were to support submodules, we'd look at them. They would 
provide access control in a way that's clearly nonabusive. I hear you on 
the drawbacks.

AMD today looks more like the 100 independent repos you describe, except 
we don't have automation at the delivery arcs. Integrations are manual 
and thus not particularly frequent.

I'm probably biased toward favoring a monorepo, which I've seen applied 
at a former employer, versus continuous delivery. That's due to lack of 
personal familiarity with CD -- not any real objections.

Thanks,

John

On 12/06/2018 03:08 PM, Johannes Schindelin wrote:
> Hi,
>
> On Wed, 5 Dec 2018, Jeff King wrote:
>
>> The model that fits more naturally with how Git is implemented would be
>> to use submodules. There you leak the hash of the commit from the
>> private submodule, but that's probably obscure enough (and if you're
>> really worried, you can add a random nonce to the commit messages in the
>> submodule to make their hashes unguessable).
> I hear myself frequently saying: "Friends don't let friends use
> submodules". It's almost like: "Some people think their problem is solved
> by using submodules. Only now they have two problems."
>
> There are big reasons, after all, why some companies go for monorepos: it
> is not for lack of trying to go with submodules, it is the problems that
> were incurred by trying to treat entire repositories the same as single
> files (or even trees): they are just too different.
>
> In a previous life, I also tried to go for submodules, was burned, and had
> to restart the whole thing. We ended up with something that might work in
> this instance, too, although our use case was not need-to-know type of
> encapsulation. What we went for was straight up modularization.
>
> What I mean is that we split the project up into over 100 individual
> projects that are now all maintained in individual repositories, and they
> are connected completely outside of Git, via a dependency management
> system (in this case, Maven, although that is probably too Java-centric
> for AMD's needs).
>
> I just wanted to throw that out here: if you can split up your project
> into individual projects, it might make sense not to maintain them as
> submodules but instead as individual repositories whose artifacts are
> uploaded into a central, versioned artifact store (Maven, NuGet, etc). And
> those artifacts would then be retrieved by the projects that need them.
>
> I figure that that scheme might work for you better than submodules: I
> could imagine that you need to make the build artifacts available even to
> people who are not permitted to look at the corresponding source code,
> anyway.
>
> Ciao,
> Johannes


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-12-06 22:59 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-05 20:13 git, monorepos, and access control Coiner, John
2018-12-05 20:34 ` Ævar Arnfjörð Bjarmason
2018-12-05 20:43   ` Derrick Stolee
2018-12-05 20:58     ` Duy Nguyen
2018-12-05 21:12       ` Ævar Arnfjörð Bjarmason
2018-12-05 23:42         ` Coiner, John
2018-12-06  7:23           ` Jeff King
2018-12-05 21:01 ` Jeff King
2018-12-06  0:23   ` brian m. carlson
2018-12-06  1:08   ` Junio C Hamano
2018-12-06  7:20     ` Jeff King
2018-12-06  9:17       ` Ævar Arnfjörð Bjarmason
2018-12-06  9:30         ` Jeff King
2018-12-06 20:08   ` Johannes Schindelin
2018-12-06 22:15     ` Stefan Beller
2018-12-06 22:59     ` Coiner, John
2018-12-05 22:37 ` Ævar Arnfjörð Bjarmason

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).