git@vger.kernel.org mailing list mirror (one of many)
From: Philip Oakley <philipoakley@iee.email>
To: Damien Robert <damien.olivier.robert@gmail.com>,
	James Ramsay <james@jramsay.com.au>
Cc: git@vger.kernel.org
Subject: Re: [TOPIC 3/17] Obliterate
Date: Mon, 16 Mar 2020 20:01:13 +0000
Message-ID: <5cab1530-f8b6-cef3-7b93-48fad410a160@iee.email>
In-Reply-To: <20200315221940.bdgi5mluxuetq2lz@doriath>

Hi Damien, James (and the 4 others who voted for the topic),

I had been thinking about 'missing' blobs for a long while as an earlier
'partial clone' concept (unpublished).

On 15/03/2020 22:19, Damien Robert wrote:
> From James Ramsay, Thu 12 Mar 2020 at 14:57:24 (+1100) :
>> 6. Elijah: replace refs helps, but not supported by hosts like GitHub etc
>>     a. Stolee: breaks commit graph because of generation numbers.
>>     b. Replace refs for blobs, then special packfile, there were edge cases.
> I am interested in more details on how to handle this using replace.
>
> My situation: coworkers push big files by mistake, I don't want to rewrite
> history because they are not too well versed with git, but I want to keep
> *my* repo clean.
>
> Partial solution:
> - identify the large blobs (easy)
> - write a replace ref (easy):
>   $ git replace b5f74037bb91 $(git hash-object -w -t blob /dev/null)
>   and replace the file (if it is still in the repo) by an empty file.

Here, my idea was to create a deliberately malformed blob object that
would allow self reference to say "this blob is deliberately missing".
(i.e. the same content would exist under two oids, one valid, one
invalid). The change would require extra code (more below).
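
For context, a blob's oid is the SHA-1 of the header "blob <size>\0"
followed by the content, so an object whose stored content does not hash
to its own name is what fsck would call malformed. A minimal Python
sketch of that calculation (the function name blob_oid is illustrative,
not part of Git):

```python
import hashlib


def blob_oid(content: bytes) -> str:
    """Return the SHA-1 oid Git assigns to a blob with this content."""
    # Git hashes "blob <decimal length>\0" followed by the raw content.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# The empty blob, i.e. what `git hash-object -t blob /dev/null` names.
print(blob_oid(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Storing any other content under that name would make the object fail
this check, which is the property the idea above relies on.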

Managing the verification of the replacement is a bigger problem,
especially if already pushed to a server.

> Now the pain points start:
> - first the index does not handle replace (I think), so the replaced file
>   appears as changed in git status, even though e.g. git diff shows nothing.
>
> => Solution: configure .git/info/sparse-checkout
>
> - secondly, I want to remove the large blob from my repo.
>
> Ideally I'd like to repack everything but filter this blob, except that
> repack does not understand --filter. So I need to use `git pack-objects`
> directly and then do the naming and clean up that repack usually does
> manually, which is error prone.
>
> Furthermore, while `git pack-objects` accepts --filter, I can only filter on
> blob size, not blob oid. (there is filter=sparse:oid where I could reuse my
> sparse checkout file, but I would need to make a blob of it first). And if I
> have one large file I want to keep, I cannot filter by blob size.
>
> Another solution would be to use `git unpack-objects` to unpack all objects
> (except I would need to do that in an empty git dir), remove the blob, and
> then repack everything.
>
> Am I missing a simpler solution?
>
> - finally, checking out a ref including the replaced (now missing) blob
>   gives error messages of the form:
> error: invalid object 100644 b5f74037bb91c45606b233b0ad6aad86f8e3875e for 'Silverman-Height-NonTorsion.pdf'
>
> On the one hand it is reassuring that git checks that the real object
> (rather than only the replaced object) is still there, on the other hand it
> would be nice to ask git to completely forget about the original object
> (except fsck of course).
>
> Thanks,
> Damien

My notes on the "13. Obliterate" ideas.

1. If the object is in the wild & is dangerous : Stop: Failed: Damage
limitation.
2. If the object is external, but still tame : Seek and recapture;
either treat as internal, or treat as wild [1].
3. The object is in captivity, even if distributed around an enclosure.
Proceed to vaccination [4].
4. Create a new blob object with exact content "Git revoke: <oid>" (or
similar). This object includes the embedded object type coding as part of
the object format. This object is/becomes part of the git signature/oid
commit hierarchy. This should (ultimately) be on 'master' branch as it
is the verifier for the obliteration.
5. In the old revoked object <oid>, replace the object content (after
zlib etc) with the same content as created in step 4. This deliberately
_malformed_ object would normally cause fsck to barf. See [6].
6. However here we/fsck would detect the length and prefix of the
(barfed) object contents and so determine its oid (the oid of the
content). This results in an oid equal to that found in 4. which can be
looked up and determined to be a self referral to this obliterated oid,
so an fsck 'pass:obliterated' result is returned. This content could
actually be stored in any removed file if checked out!
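
Steps 4 to 6 above can be sketched as follows. This is only an
illustration under stated assumptions: the object store is a plain
in-memory dict rather than a real .git/objects directory, the marker
wording "Git revoke: <oid>" is taken from step 4, and the names revoke
and check are made up for the example (no such Git commands exist).

```python
import hashlib

PREFIX = b"Git revoke: "  # marker wording from step 4 ("or similar")


def blob_oid(content: bytes) -> str:
    """SHA-1 oid of a blob: hash of 'blob <len>\\0' plus the content."""
    return hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()


def revoke(store: dict, old_oid: str) -> str:
    """Steps 4 and 5: add the marker blob, then overwrite the revoked object."""
    marker = PREFIX + old_oid.encode()
    marker_oid = blob_oid(marker)
    store[marker_oid] = marker  # step 4: valid object under its own oid
    store[old_oid] = marker     # step 5: same content under the revoked oid
    return marker_oid


def check(store: dict, oid: str) -> str:
    """Step 6: fsck-like classification of a single object."""
    content = store[oid]
    actual = blob_oid(content)
    if actual == oid:
        return "ok"
    # Malformed: accept only a marker that names this oid and that also
    # exists as a valid object under the oid of its own content.
    if content == PREFIX + oid.encode() and store.get(actual) == content:
        return "pass:obliterated"
    return "corrupt"
```

For example, after store[old] = data and revoke(store, old), calling
check(store, old) yields 'pass:obliterated', while any other mismatch
between name and content is still reported as 'corrupt'.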

Consequences:
Packs and other served object contents no longer contain the revoked oid.
Hygiene/vaccination needs to be applied to other distributed recipients of
the former defective object.

Possible attacks: an attacker removes other important commits/blobs/trees
by adding a 'revoke' which propagates to other users. Mitigation: separate
the hygiene cycle from the initial server revocation.

For trees(?) and commits the message is "revoke <oid> <use-oid>". But
where to 'hold' the commit & tree? (Maybe require that a tree revoke is
treated as a commit revoke, so the new tree is got for free.) We
still need the new commit to be walked by fsck/gc, and the old oid
contents to be gc'd.
For a 'commit' revocation it (the new msg/trees/revision) could maybe be
a 2nd (or third parent after {0}) so a 'normal' walk finds it, but
probably that's just a recipe for disaster.
Maybe a revocation reflog that doesn't expire? Or one that can be rebuilt
(fsck would extend its lost/found to include a revoked list).

The new (XY) problem is now one of tying in the new revoked blob to the
'old' commit/tree hierarchy which only handles tracked files! Maybe it's
a .revoked file (like a .gitignore) which has a list of the old oids and
has actual blobs attached under a .revoked tree.

Also need to make sure that re-packing is done if the blob/tree/commit
was a delta-base at the point of obliteration. Also need to prompt the
local user, just in case it's a spoof! Plus need a way of 'sending' the
revocation (and a flag for what to do about a fetch pack containing a
revocation for which we have the original, esp. if we have it as a pack
that will take a long time to recreate). Need a way of writing the
'defective' object (more code).

Newhash transition. When histories are rewritten, then the obliterated
artefacts are truly removed. For new repos using the newhash then the
revocation mechanism is essentially the same other than extending the
nominal size of the revocation objects.
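
As a rough illustration of that size change, the sketch below builds the
marker text for the same blob under SHA-1 and SHA-256. It assumes the
"Git revoke: <oid>" wording from step 4 and that the new hash is applied
to the same "blob <size>\0" header plus content; the helper name
marker_for is made up for the example.

```python
import hashlib

PREFIX = b"Git revoke: "  # assumed marker wording, 12 bytes


def marker_for(content: bytes, algo: str) -> bytes:
    """Build the revocation marker naming a blob under the given hash."""
    data = b"blob %d\x00" % len(content) + content
    oid = hashlib.new(algo, data).hexdigest()
    return PREFIX + oid.encode()


print(len(marker_for(b"", "sha1")))    # 52 (12 + 40 hex digits)
print(len(marker_for(b"", "sha256")))  # 76 (12 + 64 hex digits)
```

Only the length of the embedded oid changes; the marker structure stays
the same.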

Perhaps use the 'submodule' commit object type (i.e. 'stuff held
elsewhere') for the holder of the revoked ID (for commits & trees).
This could be locked into the history (details not fully thought
through..).

If there is a design error within Git, it's the lack of an 'after the
fact' redaction mechanism (and how it is spread across branches and
distributed users/servers) - not easy.

Philip
