git@vger.kernel.org mailing list mirror (one of many)
* Design of multiple hash support
From: brian m. carlson @ 2018-11-05  1:00 UTC
  To: git

I'm currently working on getting Git to support multiple hash algorithms
in the same binary (SHA-1 and SHA-256).  In order to have a fully
functional binary, we'll need to have some way of indicating to certain
commands (such as init and show-index) that they should assume a certain
hash algorithm.

There are basically two approaches I can take.  The first is to provide
each command that needs to learn about this with its own --hash
argument.  So we'd have:

  git init --hash=sha256
  git show-index --hash=sha256 <some-file

The other alternative is that we provide a global option to git, which
is parsed by all programs, like so:

  git --hash=sha256 init
  git --hash=sha256 show-index <some-file
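
To make the two spellings concrete, here is a minimal Python sketch
(not Git's actual option parser, which is written in C).  The program
name git-sketch, the helper names, and the rule that the per-command
option wins when both are given are all assumptions made for
illustration only.

```python
import argparse

HASHES = ("sha1", "sha256")

def build_parser():
    # Second approach: one global --hash parsed before the subcommand.
    parser = argparse.ArgumentParser(prog="git-sketch")
    parser.add_argument("--hash", dest="global_hash", choices=HASHES)
    sub = parser.add_subparsers(dest="command", required=True)
    # First approach: each command that needs it has its own --hash.
    for name in ("init", "show-index"):
        cmd = sub.add_parser(name)
        cmd.add_argument("--hash", dest="command_hash", choices=HASHES)
    return parser

def chosen_hash(argv):
    args = build_parser().parse_args(argv)
    # Assumption: per-command option first, then global, then SHA-1.
    algo = args.command_hash or args.global_hash or "sha1"
    return args.command, algo
```

Either spelling ends up selecting the same algorithm here; the
difference lies in where the option is documented and parsed.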

There's also the question of what we want to call the option.  The
obvious name is --hash, which is intuitive and straightforward.
However, the transition plan names the config option
extensions.objectFormat, so --object-format is also a possibility.  If
we ever decide to support, say, zstd compression instead of zlib, we
could leverage the same option (say, --object-format=sha256:zstd) and
avoid the need for an additional option.  This might be planning for a
future that never occurs, though.
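
As a sketch of how a combined value such as sha256:zstd might be split,
assuming a "hash[:compression]" syntax: the function name, the zlib
default, and zstd support itself are hypothetical and only follow the
example given above.

```python
KNOWN_HASHES = {"sha1", "sha256"}
# zstd is hypothetical; only zlib is discussed as existing today.
KNOWN_COMPRESSION = {"zlib", "zstd"}

def parse_object_format(value):
    # Split "hash[:compression]"; compression defaults to zlib.
    hash_name, sep, compression = value.partition(":")
    if not sep:
        compression = "zlib"
    if hash_name not in KNOWN_HASHES:
        raise ValueError(f"unknown hash algorithm: {hash_name!r}")
    if compression not in KNOWN_COMPRESSION:
        raise ValueError(f"unknown compression: {compression!r}")
    return hash_name, compression
```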

I'd like to write this code in the way most acceptable to the list, so
I'd appreciate input from others on what they'd like to see in the final
series.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

* Re: Design of multiple hash support
From: Junio C Hamano @ 2018-11-05  2:36 UTC
  To: brian m. carlson; +Cc: git

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> I'm currently working on getting Git to support multiple hash algorithms
> in the same binary (SHA-1 and SHA-256).  In order to have a fully
> functional binary, we'll need to have some way of indicating to certain
> commands (such as init and show-index) that they should assume a certain
> hash algorithm.
>
> There are basically two approaches I can take.  The first is to provide
> each command that needs to learn about this with its own --hash
> argument.  So we'd have:
>
>   git init --hash=sha256
>   git show-index --hash=sha256 <some-file
>
> The other alternative is that we provide a global option to git, which
> is parsed by all programs, like so:
>
>   git --hash=sha256 init
>   git --hash=sha256 show-index <some-file

I am assuming that "show-index" above is a typo for something like
"hash-object"?

It is hard to answer the question without knowing exactly what
"(to) support multiple hash algorithms" means.  For example, inside
today's repository, what should this command do?

	git --hash=sha256 cat-file commit HEAD

It can work this way:

 - read HEAD, discover that I am on 'master' branch, read refs/heads/master
   to learn the object name in 40-hex, realize that it cannot be
   sha256 and report "corrupt ref".

Or it can work this way:

 - read repository format, realize it is a good old sha1 repository.

 - do the usual thing to get to read_object() to read the commit
   object data for the commit at HEAD, doing all of it in sha1.

 - in the commit object data, locate references to other objects
   that use sha1 name.

 - replace these sha1 references with their sha256 counterparts and
   show the result.
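
The two behaviours can be sketched roughly as follows.  This is an
illustration in Python, not Git code: the function names are made up,
the lookup table from SHA-1 to SHA-256 names is a hypothetical stand-in
for whatever translation mechanism a real implementation would use, and
only the tree and parent headers of a commit are handled.

```python
HEX_LEN = {"sha1": 40, "sha256": 64}

def check_ref_value(hex_name, algo):
    # First behaviour: a name of the wrong length is simply an error.
    if len(hex_name) != HEX_LEN[algo]:
        raise ValueError("corrupt ref")
    return hex_name

def translate_commit(commit_text, sha1_to_sha256):
    # Second behaviour: rewrite object references in the headers.
    headers, sep, message = commit_text.partition("\n\n")
    out = []
    for line in headers.split("\n"):
        key, _, value = line.partition(" ")
        if key in ("tree", "parent"):
            check_ref_value(value, "sha1")
            line = f"{key} {sha1_to_sha256[value]}"
        out.append(line)
    # The commit message is passed through untouched.
    return "\n".join(out) + sep + message
```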

I am guessing that you are doing the former as a good first step, in
which case, as an option that changes/affects the behaviour of git
globally, I think "git --hash=sha256" would make sense, like other
global options like --literal-pathspecs and --no-replace-objects.

Thanks.

* Re: Design of multiple hash support
From: Stefan Beller @ 2018-11-05 18:03 UTC
  To: Junio C Hamano; +Cc: brian m. carlson, git

On Sun, Nov 4, 2018 at 6:36 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
>
> > I'm currently working on getting Git to support multiple hash algorithms
> > in the same binary (SHA-1 and SHA-256).  In order to have a fully
> > functional binary, we'll need to have some way of indicating to certain
> > commands (such as init and show-index) that they should assume a certain
> > hash algorithm.
> >
> > There are basically two approaches I can take.  The first is to provide
> > each command that needs to learn about this with its own --hash
> > argument.  So we'd have:
> >
> >   git init --hash=sha256
> >   git show-index --hash=sha256 <some-file
> >
> > The other alternative is that we provide a global option to git, which
> > is parsed by all programs, like so:
> >
> >   git --hash=sha256 init
> >   git --hash=sha256 show-index <some-file
>
> I am assuming that "show-index" above is a typo for something like
> "hash-object"?

Actually both seem plausible, as both do not require
RUN_SETUP, which means they cannot rely on the
extensions.objectFormat setting.

When having a global setting, would that override the configured
object format extension in a repository, or do we error out?

So maybe

  git -c extensions.objectFormat=sha256 init

is the way to go, for now? (Are repository format extensions parsed
just like normal config, such that non-RUN_SETUP commands
can rely on the (non-)existence to determine whether to use
the default or the given hash function?)

> It is hard to answer the question without knowing exactly what
> "(to) support multiple hash algorithms" means.  For example, inside
> today's repository, what should this command do?
>
>         git --hash=sha256 cat-file commit HEAD

There is a section "Object names on the command line"
in Documentation/technical/hash-function-transition.txt
and I assume that this is before the "dark launch"
phase, so I would expect the latter to work (no error
but conversion/translation on the fly) eventually as a goal.
But the former might be in scope of one series.

> It can work this way:
>
>  - read HEAD, discover that I am on 'master' branch, read refs/heads/master
>    to learn the object name in 40-hex, realize that it cannot be
>    sha256 and report "corrupt ref".
>
> Or it can work this way:
>
>  - read repository format, realize it is a good old sha1 repository.
>
>  - do the usual thing to get to read_object() to read the commit
>    object data for the commit at HEAD, doing all of it in sha1.
>
>  - in the commit object data, locate references to other objects
>    that use sha1 name.
>
>  - replace these sha1 references with their sha256 counterparts and
>    show the result.
>
> I am guessing that you are doing the former as a good first step, in
> which case, as an option that changes/affects the behaviour of git
> globally, I think "git --hash=sha256" would make sense, like other
> global options like --literal-pathspecs and --no-replace-objects.
>
> Thanks.

* Re: Design of multiple hash support
From: Duy Nguyen @ 2018-11-05 19:03 UTC
  To: brian m. carlson, Git Mailing List

On Mon, Nov 5, 2018 at 2:02 AM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
>
> I'm currently working on getting Git to support multiple hash algorithms
> in the same binary (SHA-1 and SHA-256).  In order to have a fully
> functional binary, we'll need to have some way of indicating to certain
> commands (such as init and show-index) that they should assume a certain
> hash algorithm.
>
> There are basically two approaches I can take.  The first is to provide
> each command that needs to learn about this with its own --hash
> argument.  So we'd have:
>
>   git init --hash=sha256
>   git show-index --hash=sha256 <some-file
>
> The other alternative is that we provide a global option to git, which
> is parsed by all programs, like so:
>
>   git --hash=sha256 init
>   git --hash=sha256 show-index <some-file
>

I suppose this is about the "no repository/standalone" mode, because

 - it's hard to pass global arguments down to builtin commands (we
often have to rely on global variables which are on the way out)

 - global options confuse new people and are also harder to reorder (if
you forget it, you have to alt-b all the way back to near the
beginning of the command line and add it there, instead of near the
end)

 - there aren't that many standalone commands

I'm leaning towards "git foo --hash=".

> There's also the question of what we want to call the option.  The
> obvious name is --hash, which is intuitive and straightforward.
> However, the transition plan names the config option
> extensions.objectFormat, so --object-format is also a possibility.  If
> we ever decide to support, say, zstd compression instead of zlib, we
> could leverage the same option (say, --object-format=sha256:zstd) and
> avoid the need for an additional option.  This might be planning for a
> future that never occurs, though.

--object-format is less vague than --hash. The downside is it's longer
(more to type) but I'm counting on git-completion.bash and the guess
that people rarely need to use this option.
-- 
Duy

* Re: Design of multiple hash support
From: Jonathan Nieder @ 2018-11-05 22:00 UTC
  To: Duy Nguyen; +Cc: brian m. carlson, Git Mailing List

Hi,

Duy Nguyen wrote:
> On Mon, Nov 5, 2018 at 2:02 AM brian m. carlson
> <sandals@crustytoothpaste.net> wrote:

>> There are basically two approaches I can take.  The first is to provide
>> each command that needs to learn about this with its own --hash
>> argument.  So we'd have:
>>
>>   git init --hash=sha256
>>   git show-index --hash=sha256 <some-file
>>
>> The other alternative is that we provide a global option to git, which
>> is parsed by all programs, like so:
>>
>>   git --hash=sha256 init
>>   git --hash=sha256 show-index <some-file
[...]
> I'm leaning towards "git foo --hash=".

Can you say a little more about the semantics of the option?  For
commands like "git init", I tend to agree with Duy here, since it
allows each command's manual to describe what the option means in the
context of that command.

For "git show-index", ideally Git should use the object format named
in the idx file.

>> There's also the question of what we want to call the option.  The
>> obvious name is --hash, which is intuitive and straightforward.
>> However, the transition plan names the config option
>> extensions.objectFormat,
[...]
> --object-format is less vague than --hash. The downside is it's longer
> (more to type) but I'm counting on git-completion.bash and the guess
> that people rarely need to use this option.

Agreed.  --object-format makes more sense to me than --hash, since
it's more precise about what the option affects.

Thanks for looking into this.

Sincerely,
Jonathan

* Re: Design of multiple hash support
From: brian m. carlson @ 2018-11-05 23:54 UTC
  To: Stefan Beller; +Cc: Junio C Hamano, git

On Mon, Nov 05, 2018 at 10:03:21AM -0800, Stefan Beller wrote:
> On Sun, Nov 4, 2018 at 6:36 PM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> >
> > > I'm currently working on getting Git to support multiple hash algorithms
> > > in the same binary (SHA-1 and SHA-256).  In order to have a fully
> > > functional binary, we'll need to have some way of indicating to certain
> > > commands (such as init and show-index) that they should assume a certain
> > > hash algorithm.
> > >
> > > There are basically two approaches I can take.  The first is to provide
> > > each command that needs to learn about this with its own --hash
> > > argument.  So we'd have:
> > >
> > >   git init --hash=sha256
> > >   git show-index --hash=sha256 <some-file
> > >
> > > The other alternative is that we provide a global option to git, which
> > > is parsed by all programs, like so:
> > >
> > >   git --hash=sha256 init
> > >   git --hash=sha256 show-index <some-file
> >
> > I am assuming that "show-index" above is a typo for something like
> > "hash-object"?

> Actually both seem plausible, as both do not require
> RUN_SETUP, which means they cannot rely on the
> extensions.objectFormat setting.

Correct.  In general, I assume that commands that want a repository will
use the repository for that information.  There are a small number of
programs, such as init, that need to either set up a repository (without
reference to another repository) or need to inspect files without
necessarily being in a repository.

For example, we will want to have a way of indicating which hash we
would like to use in a fresh repository.  I am for the moment assuming
that we're in a stage 4 configuration: that is, that we're all SHA-1 or
all SHA-256.  A clone will provide this for us, but a git init will not.

Also, our pack index v3 format knows about which hash algorithm is in
use, but packs are not labeled with the algorithm they use.  This isn't
really a problem in normal use, since we always know from context which
algorithm is in use, but we'll need to indicate to index-pack (which
technically need not run in a repository) which algorithm it should use.

show-index will eventually learn to parse the index itself to learn
which algorithms are in use, so it is technically not required here.
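
To illustrate why index-pack needs to be told the algorithm: as far as
I understand the current pack layout, a pack ends with a checksum of
everything before it, and nothing in the header says which hash
produced that checksum.  The Python sketch below (names are
illustrative, not Git's) can only check the trailer once the caller
supplies the algorithm, because the trailer length depends on it.

```python
import hashlib

HASH_CTORS = {"sha1": hashlib.sha1, "sha256": hashlib.sha256}

def pack_trailer_ok(pack_bytes, algo):
    # The trailer length depends on the algorithm: 20 or 32 bytes.
    ctor = HASH_CTORS[algo]
    size = ctor().digest_size
    if len(pack_bytes) <= size:
        return False
    body, trailer = pack_bytes[:-size], pack_bytes[-size:]
    # The trailer must equal the hash of all preceding bytes.
    return ctor(body).digest() == trailer
```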

> When having a global setting, would that override the configured
> object format extension in a repository, or do we error out?
> 
> So maybe
> 
>   git -c extensions.objectFormat=sha256 init
> 
> is the way to go, for now? (Are repository format extensions parsed
> just like normal config, such that non-RUN_SETUP commands
> can rely on the (non-)existence to determine whether to use
> the default or the given hash function?)

The extensions callbacks are only handled in check_repo_format, so they
necessarily require a repository.  This is not new with my code.

Furthermore, one would have to specify "-c
core.repositoryformatversion=1" as well, as extensions require that
version in order to have any effect.
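
The dependency between the two settings can be sketched like this in
Python.  It is a simplification under stated assumptions: the
configuration is modelled as a plain dictionary with lowercased keys,
the function name is made up, and the real check in check_repo_format
does considerably more.

```python
def effective_object_format(config):
    # config maps lowercased "section.key" names to string values.
    version = int(config.get("core.repositoryformatversion", "0"))
    if version < 1:
        # Extensions have no effect without format version 1.
        return "sha1"
    return config.get("extensions.objectformat", "sha1")
```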

My current approach for the testsuite is to have git init honor a new
GIT_DEFAULT_HASH environment variable so we need not modify every place
in the testsuite that calls git init (of which there are many).  That
may or may not be greeted with joy by reviewers, but it seemed to be the
minimum viable approach.
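
A rough Python model of that lookup, for illustration only: the
function name is hypothetical, and the precedence shown (an explicit
option first, then GIT_DEFAULT_HASH, then SHA-1) is an assumption, not
something settled in this thread.

```python
import os

def default_hash_for_init(option=None, environ=None):
    # Allow tests to pass a fake environment instead of os.environ.
    environ = os.environ if environ is None else environ
    # Assumed precedence: option, then GIT_DEFAULT_HASH, then SHA-1.
    choice = option or environ.get("GIT_DEFAULT_HASH") or "sha1"
    if choice not in ("sha1", "sha256"):
        raise ValueError(f"unknown hash algorithm: {choice!r}")
    return choice
```

With this shape, a test suite can export the variable once rather than
passing an option to every git init call.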

> There is a section "Object names on the command line"
> in Documentation/technical/hash-function-transition.txt
> and I assume that this is before the "dark launch"
> phase, so I would expect the latter to work (no error
> but conversion/translation on the fly) eventually as a goal.
> But the former might be in scope of one series.

Currently, I'm not implementing the stage 1-3 implementations.  I'm
merely going from the point where we have a binary that does only
SHA-256 and cannot perform SHA-1 operations at all to a stage 4
implementation, where the binary can do either, but a repository is
wholly one or the other.

> > It can work this way:
> >
> >  - read HEAD, discover that I am on 'master' branch, read refs/heads/master
> >    to learn the object name in 40-hex, realize that it cannot be
> >    sha256 and report "corrupt ref".
> >
> > Or it can work this way:
> >
> >  - read repository format, realize it is a good old sha1 repository.
> >
> >  - do the usual thing to get to read_object() to read the commit
> >    object data for the commit at HEAD, doing all of it in sha1.
> >
> >  - in the commit object data, locate references to other objects
> >    that use sha1 name.
> >
> >  - replace these sha1 references with their sha256 counterparts and
> >    show the result.
> >
> > I am guessing that you are doing the former as a good first step, in
> > which case, as an option that changes/affects the behaviour of git
> > globally, I think "git --hash=sha256" would make sense, like other
> > global options like --literal-pathspecs and --no-replace-objects.

Right now, we always read the repository configuration when possible,
and honor that.  I'm not planning, even when we have a full
implementation, to let the configuration of input and output format be
modified by command-line options.  That's a configuration of the
repository in the current transition plan, and I have no intention of
changing that (apart from possibly honoring "git -c").
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

* Re: Design of multiple hash support
From: brian m. carlson @ 2018-11-06  0:13 UTC
  To: Jonathan Nieder; +Cc: Duy Nguyen, Git Mailing List

On Mon, Nov 05, 2018 at 02:00:42PM -0800, Jonathan Nieder wrote:
> Hi,
> 
> Duy Nguyen wrote:
> > On Mon, Nov 5, 2018 at 2:02 AM brian m. carlson
> > <sandals@crustytoothpaste.net> wrote:
> 
> >> There are basically two approaches I can take.  The first is to provide
> >> each command that needs to learn about this with its own --hash
> >> argument.  So we'd have:
> >>
> >>   git init --hash=sha256
> >>   git show-index --hash=sha256 <some-file
> >>
> >> The other alternative is that we provide a global option to git, which
> >> is parsed by all programs, like so:
> >>
> >>   git --hash=sha256 init
> >>   git --hash=sha256 show-index <some-file
> [...]
> > I'm leaning towards "git foo --hash=".
> 
> Can you say a little more about the semantics of the option?  For
> commands like "git init", I tend to agree with Duy here, since it
> allows each command's manual to describe what the option means in the
> context of that command.

Sure.  The semantics for git init are "produce a repository with this
hash algorithm".  The semantics for git index-pack are "the pack I want
you to index uses this hash algorithm".  Essentially, more generically,
the semantics are "the repository or data object uses this hash
algorithm".

> For "git show-index", ideally Git should use the object format named
> in the idx file.

I agree that will be the eventual goal.  It will also be what I ship in
the final series, in all likelihood.  I have most of pack v3
implemented, but it's not complete yet.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204
