git@vger.kernel.org mailing list mirror (one of many)
From: Patrick Steinhardt <ps@pks.im>
To: shejialuo <shejialuo@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: Discuss GSoC: Implement consistency checks for refs
Date: Wed, 6 Mar 2024 15:45:17 +0100
Message-ID: <ZeiBfVyTCHUywliI@tanuki>
In-Reply-To: <ZehtpMtxPLuYYmgO@ArchLinux>


On Wed, Mar 06, 2024 at 09:20:36PM +0800, shejialuo wrote:
> Hi All,
> 
> I am interested in the "Implement consistency checks for refs" GSoC idea.
> However, implementing a feature is much harder. So I want to ask you some
> questions to better work on it.

Sure!

> As [1] shows, I think the idea is easy to understand. We need to ensure
> the consistency of the refs. The current `git-fsck` only checks the
> connectivity from ref to the object file. There is a possibility that the
> ref itself could be corrupted. And we should avoid it through this project.

I know this is splitting hairs, but git-fsck(1) doesn't give us the
tools to avoid corruption. It only gives us the tools to detect it after
the fact.

> I have read some of the source code. Based on what I have learned, I know
> there are two backends. One is files and the other is reftable. I have
> no idea about the reftable currently. So for now, I will focus on the
> files backend.

Yeah, the "reftable" backend is new in the Git v2.45 release cycle, so
it's totally expected that most people have no idea about it. It's also
part of the motivation for this project though. Because as you noted, it
is a binary format that is thus not as readily parseable by a human as
the old "files" backend. This makes it much more important to provide
the tooling to detect whether things look as expected.

> I think the principle behind the `git-fsck` is that it will traverse
> every object file, read its content and use SHA-1 to hash the content
> and compare the value with the stored ref value. So if we want to add
> consistency checks for refs, we may need to add a new file to store the
> last commit state (and not only the last commit state: do we need to
> consider the stash state?). However, from my perspective, it's a bad
> idea to use a file to store the refs' states, and we cannot use the
> object file to check whether the ref is corrupted.

I agree 100%: tracking ref states in a secondary database is not a
good idea.
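
As a point of reference for the paragraph quoted above, here is a
minimal sketch of how an object can be verified against its own name.
It assumes a SHA-1 repository and is not how fsck.c is structured; the
function names are made up for illustration. An object ID is the hash
of a short header ("<type> <size>" plus a NUL byte) followed by the
object's content. A ref only stores an object ID and has nothing to be
hashed against, which is why ref checks have to look at structure
instead.

```python
import hashlib


def git_object_id(obj_type: str, payload: bytes) -> str:
    """Return the SHA-1 object ID for an object of the given type.

    The hashed bytes are "<type> <size>\\0" followed by the content.
    """
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def object_is_intact(expected_oid: str, obj_type: str, payload: bytes) -> bool:
    """Hypothetical helper: does the content still hash to its own name?"""
    return git_object_id(obj_type, payload) == expected_oid


# The empty blob has a well-known ID in SHA-1 repositories.
EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
print(object_is_intact(EMPTY_BLOB, "blob", b""))
```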

> So this is my first question: what mechanism should we use to provide
> consistency? And to what extent should consistency be checked? I think
> this mechanism should be general for both text-based and binary-based refs.

The exact extent will need some discussion. What's clear is that it does
not need to be perfect from the beginning, and we are sure to discover
more checks over time that may make sense.

Some ideas off the top of my head:

  - generic
    - Ensure that all ref names are conformant.
    - Ensure that there are no directory/file conflicts for the ref
      names.
  - files
    - Ensure that "packed-refs" is well-formatted.
    - Ensure that refs in "packed-refs" are ordered lexicographically.
    - Check for corrupted loose refs in "refs/".
  - reftable
    - Ensure that there are no garbage files in "reftable/".
    - Ensure that "tables.list" is well-formatted.
    - Ensure that each table is well-formatted.
    - Ensure that refs in each table are ordered correctly.
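
To make a few of the checks listed above concrete, here is a rough
sketch in Python. Everything in it is an assumption made for
illustration: the function names are hypothetical, the name check only
approximates a subset of the rules documented in
git-check-ref-format(1), and the ordering check assumes bytewise
comparison of ref names. It operates on a plain list of ref names
rather than on a real repository.

```python
import re

# A few forbidden patterns: control characters and space, "~", "^", ":",
# "?", "*", "[", backslash, ".." and "@{". This is not the full rule set.
_FORBIDDEN = re.compile(r"[\x00-\x20\x7f~^:?*\[\\]|\.\.|@\{")


def refname_is_conformant(name: str) -> bool:
    """Approximate check of a ref name against some of the naming rules."""
    if not name or name == "@" or name.endswith("."):
        return False
    if _FORBIDDEN.search(name):
        return False
    for component in name.split("/"):
        # Empty components cover leading, trailing and doubled slashes.
        if not component or component.startswith(".") or component.endswith(".lock"):
            return False
    return True


def find_df_conflicts(names):
    """Return (shorter, longer) pairs where one ref is a directory prefix of another."""
    present = set(names)
    conflicts = []
    for name in sorted(present):
        parts = name.split("/")
        for i in range(1, len(parts)):
            prefix = "/".join(parts[:i])
            if prefix in present:
                conflicts.append((prefix, name))
    return conflicts


def first_unsorted(names):
    """Return the first adjacent pair that is out of order or duplicated, else None."""
    encoded = [n.encode("utf-8") for n in names]
    for prev, cur in zip(encoded, encoded[1:]):
        if prev >= cur:
            return prev.decode("utf-8"), cur.decode("utf-8")
    return None


refs = ["refs/heads/foo", "refs/heads/foo/bar", "refs/heads/main"]
print(find_df_conflicts(refs))
```

With the "files" backend, "refs/heads/foo" would have to be both a file
and a directory for the two refs in the example to coexist, which is
what the second function reports.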

This list is not exhaustive; there may of course be other checks that
make sense. Any additional ideas from you or other interested students
are welcome.

For what it's worth, not all of the checks need to be implemented as
part of GSoC. At a minimum, it should result in the infra to allow for
backend-specific checks and a couple of checks for at least one of the
backends.
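
One possible shape for such infrastructure, sketched in Python purely
for illustration (the actual project would be written in C inside Git,
and every name below is hypothetical): a registry maps each backend to
its checks, generic checks run for every backend, and each check
reports problems as messages. The example check assumes that
"packed-refs" records look like "<hex oid> <refname>" or "^<hex oid>"
for a peeled tag, with header lines starting with "#".

```python
import os
import re
from typing import Callable, Dict, Iterable, List

# A check takes the path to the Git directory and yields problem messages.
Check = Callable[[str], Iterable[str]]

# Hypothetical registry: "generic" checks run regardless of the backend.
CHECKS: Dict[str, List[Check]] = {"generic": [], "files": [], "reftable": []}


def register(backend: str) -> Callable[[Check], Check]:
    """Decorator that adds a check to the given backend's list."""
    def decorator(fn: Check) -> Check:
        CHECKS[backend].append(fn)
        return fn
    return decorator


def run_ref_checks(git_dir: str, backend: str) -> List[str]:
    """Run the generic checks plus those of one backend ("files" or "reftable")."""
    problems: List[str] = []
    for check in CHECKS["generic"] + CHECKS.get(backend, []):
        problems.extend(check(git_dir))
    return problems


# Assumed object ID lengths: 40 hex digits (SHA-1) or 64 (SHA-256).
_OID = r"(?:[0-9a-f]{40}|[0-9a-f]{64})"
_RECORD = re.compile(rf"^(?:{_OID} \S+|\^{_OID})$")


@register("files")
def packed_refs_is_well_formatted(git_dir: str) -> Iterable[str]:
    path = os.path.join(git_dir, "packed-refs")
    if not os.path.exists(path):
        return  # a repository without packed-refs is fine
    with open(path, encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if line.startswith("#"):
                continue  # header line
            if not _RECORD.match(line):
                yield f"packed-refs:{lineno}: malformed line"
```

Calling run_ref_checks(git_dir, "files") would then return the list of
problems found. Adding a "reftable" check only means registering
another function, which keeps the backend-specific parts separate from
the code that drives them.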

> And I have a more general question: I think I need to understand `fsck.c`
> and of course the reftable format. However, I am unsure whether I need
> to understand the ref internals. Could you please provide me with more
> information to make this idea clearer?

You will certainly need to learn about ref internals a bit. There are
some common rules and restrictions that are important in order to figure
out what we want to check in the first place. Understanding the
"reftable" format would be great, but you may also get away with only
implementing generic or "files"-backend specific consistency checks.
This depends on the scope you are aiming for.

Patrick

> Thanks,
> Jialuo
> 
> [1] https://lore.kernel.org/git/ZakIPEytlxHGCB9Y@tanuki/
