git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Stefan Beller <sbeller@google.com>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs
Date: Tue, 11 Jul 2017 15:02:09 -0700	[thread overview]
Message-ID: <CAGZ79kZ8i0Xqp2Yft19JKzPFMUdbzEDNJ01zER9RoM8ZzKPu8A@mail.gmail.com> (raw)
In-Reply-To: <f9c7d4b3f800ea31e85e4897ee7048fec1e3c2f0.1499800530.git.jonathantanmy@google.com>

On Tue, Jul 11, 2017 at 12:48 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> Currently, Git does not support repos with very large numbers of blobs
> or repos that wish to minimize manipulation of certain blobs (for
> example, because they are very large) very well, even if the user
> operates mostly on part of the repo, because Git is designed on the
> assumption that every blob referenced by a tree object is available
> somewhere in the repo storage.
>
> As a first step to reducing this problem, introduce the concept of
> promised blobs. Each Git repo can contain a list of promised blobs and
> their sizes at $GIT_DIR/objects/promisedblob. This patch contains
> functions to query them; functions for creating and modifying that file
> will be introduced in later patches.
>
> A repository that is missing a blob but has that blob promised is not
> considered to be in error, so also teach fsck this.

Here I wondered what this file looks like, in a later patch you
add documentation:

  +objects/promisedblob::
  +       This file records the sha1 object names and sizes of promised
  +       blobs.
  +

but that leaves more fundamental questions:
* Is that a list of sha1s, separated by LF? (CRLF? How would Windows
  users interact with it, are they supposed to ever modify this file?)
* Similar to .git/packed-refs, would we want to have a first line
  that has some sort of comment?
* In the first question I assumed a linear list, will that be a sorted list,
  or do we want to have some fancy data structure here?
  (critbit tree, bloom filter)
* is the contents in ASCII or binary? (What are the separators?)
* In the future I presume we'd want to quickly answer "Is X in the
  promised blobs list?" so would it be possible (in the future) to
  improve the searching e.g. binary search?
* ... I'll read on to see my questions answered, but I'd guess
  others would prefer to see it in the docs. :)


> + * A mmap-ed byte array of size (missing_blob_nr * ENTRY_SIZE). Each
> + * ENTRY_SIZE-sized entry consists of the SHA-1 of the promised blob and its
> + * 64-bit size in network byte order. The entries are sorted in ascending SHA-1
> + * order.

This seems to be the answer to the docs. :)
So binary representation, the size variable takes a fixed amount, such
that the n-th element in the file is at n * ENTRY_SIZE.


> +       if (st.st_size % ENTRY_SIZE) {
> +               die("Size of %s is not a multiple of %d", filename, ENTRY_SIZE);
> +       }

So it looks as if the file format is an array of ENTRY_SIZE.

Similar to other files, would we want to prefix the file with
a 4 letter magic number and a version number?

> +       prepare_promised_blobs();
> +       result = sha1_entry_pos(promised_blobs, ENTRY_SIZE, 0, 0,
> +                               promised_blob_nr, promised_blob_nr, oid->hash);

IIRC, this is a binary search.

> +int for_each_promised_blob(each_promised_blob_fn cb, void *data)
> +{
> +       struct object_id oid;
> +       int i, r;
> +
> +       prepare_promised_blobs();
> +       for (i = 0; i < promised_blob_nr; i++) {
> +               memcpy(oid.hash, &promised_blobs[i * ENTRY_SIZE],
> +                      GIT_SHA1_RAWSZ);
> +               r = cb(&oid, data);
> +               if (r)
> +                       return r;
> +       }
> +       return 0;
> +}
> diff --git a/promised-blob.h b/promised-blob.h
> new file mode 100644
> index 000000000..a303ea1ff
> --- /dev/null
> +++ b/promised-blob.h
> @@ -0,0 +1,14 @@
> +#ifndef PROMISED_BLOB_H
> +#define PROMISED_BLOB_H
> +
> +/*
> + * Returns 1 if oid is the name of a promised blob. If size is not NULL, also
> + * returns its size.
> + */
> +extern int is_promised_blob(const struct object_id *oid,
> +                           unsigned long *size);
> +
> +typedef int each_promised_blob_fn(const struct object_id *oid, void *data);
> +int for_each_promised_blob(each_promised_blob_fn, void *);
> +
> +#endif
> diff --git a/t/t3907-promised-blob.sh b/t/t3907-promised-blob.sh
> new file mode 100755
> index 000000000..827072004
> --- /dev/null
> +++ b/t/t3907-promised-blob.sh
> @@ -0,0 +1,29 @@
> +#!/bin/sh
> +
> +test_description='promised blobs'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'fsck fails on missing blobs' '
> +       rm -rf repo &&

We should not need to delete the repo here, this is the first test?

> +       rm repo/.git/objects/$(echo $HASH | cut -c1-2)/$(echo $HASH | cut -c3-40) &&

Later down the road, do we want to have a
(plumbing) command to move an object from
standard blob to promised blob (as of now I'd think
it would perform this rm call as well as an insert into
the promised blobs file) ?
(Well, we'd also have to think about how to get objects
out of a pack)

With such a command you can easily write your own custom
filter to free up blobs again.


> +       test_must_fail git -C repo fsck

    test_i18n_grep missing out ?

maybe, too? (Maybe that is already tested elsewhere,
so no need for it)


> +'
> +
> +test_expect_success '...but succeeds if it is a promised blob' '
> +       printf "%s%016x" "$HASH" "$(wc -c <repo/1.t)" |
> +               hex_pack >repo/.git/objects/promisedblob &&
> +       git -C repo fsck
> +'
> +
> +test_expect_success '...but fails again with GIT_IGNORE_PROMISED_BLOBS' '
> +       GIT_IGNORE_PROMISED_BLOBS=1 test_must_fail git -C repo fsck &&
> +       unset GIT_IGNORE_PROMISED_BLOBS
> +'

This patch makes sense, so far.

Thanks,
Stefan

  reply	other threads:[~2017-07-11 22:02 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-11 19:48 [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs") Jonathan Tan
2017-07-11 19:48 ` [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs Jonathan Tan
2017-07-11 22:02   ` Stefan Beller [this message]
2017-07-19 23:37     ` Jonathan Tan
2017-07-12 17:29   ` Jeff Hostetler
2017-07-12 19:28     ` Jonathan Nieder
2017-07-13 14:48       ` Jeff Hostetler
2017-07-13 15:05         ` Jeff Hostetler
2017-07-13 19:39     ` Jonathan Tan
2017-07-14 20:03       ` Jeff Hostetler
2017-07-14 21:30         ` Jonathan Nieder
2017-07-11 19:48 ` [RFC PATCH 2/3] sha1-array: support appending unsigned char hash Jonathan Tan
2017-07-11 22:06   ` Stefan Beller
2017-07-19 23:56     ` Jonathan Tan
2017-07-20  0:06       ` Stefan Beller
2017-07-11 19:48 ` [RFC PATCH 3/3] sha1_file: add promised blob hook support Jonathan Tan
2017-07-11 22:38   ` Stefan Beller
2017-07-12 17:40   ` Ben Peart
2017-07-12 20:38     ` Jonathan Nieder
2017-07-16 15:23 ` [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs") Philip Oakley
2017-07-17 17:43   ` Ben Peart
2017-07-25 20:48     ` Philip Oakley
2017-07-17 18:03   ` Jonathan Nieder
2017-07-29 12:51     ` Philip Oakley
2017-07-20  0:21 ` [RFC PATCH v2 0/4] Partial clone: promised objects (not only blobs) Jonathan Tan
2017-07-20  0:21 ` [RFC PATCH v2 1/4] object: remove "used" field from struct object Jonathan Tan
2017-07-20  0:36   ` Stefan Beller
2017-07-20  0:55     ` Jonathan Tan
2017-07-20 17:44       ` Ben Peart
2017-07-20 21:20   ` Junio C Hamano
2017-07-20  0:21 ` [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects Jonathan Tan
2017-07-20 18:07   ` Stefan Beller
2017-07-20 19:17     ` Jonathan Tan
2017-07-20 19:58   ` Ben Peart
2017-07-20 21:13     ` Jonathan Tan
2017-07-21 16:24       ` Ben Peart
2017-07-21 20:33         ` Jonathan Tan
2017-07-25 15:10           ` Ben Peart
2017-07-29 13:26             ` Philip Oakley
2017-07-20  0:21 ` [RFC PATCH v2 3/4] sha1-array: support appending unsigned char hash Jonathan Tan
2017-07-20  0:21 ` [RFC PATCH v2 4/4] sha1_file: support promised object hook Jonathan Tan
2017-07-20 18:23   ` Stefan Beller
2017-07-20 20:58     ` Ben Peart
2017-07-20 21:18       ` Jonathan Tan
2017-07-21 16:27         ` Ben Peart

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAGZ79kZ8i0Xqp2Yft19JKzPFMUdbzEDNJ01zER9RoM8ZzKPu8A@mail.gmail.com \
    --to=sbeller@google.com \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).