git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [RFC PATCH] midx.c: clean up .rev file
@ 2022-06-22 11:50 haoyurenzhuxia
  2022-06-22 15:56 ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 9+ messages in thread
From: haoyurenzhuxia @ 2022-06-22 11:50 UTC (permalink / raw)
  To: git; +Cc: avarab, me, derrickstolee, dyroneteng, Xia XiaoWen

From: Xia XiaoWen <haoyurenzhuxia@gmail.com>

The command: `git multi-pack-index write --bitmap` will create 3
files in `objects/pack/`:
    * multi-pack-index
    * multi-pack-index-*.bitmap
    * multi-pack-index-*.rev

But if the command is terminated by the user (such as Ctl-C) or
the system, the midx reverse index file (`multi-pack-index-*.rev`)
is not removed and still exists in `objects/pack/`:

    $ GIT_TEST_MIDX_WRITE_REV=1 git multi-pack-index write --bitmap
    Selecting bitmap commits: 133020, done.
    Building bitmaps:   0% (3/331)
    ^C^C

    $ tree objects/pack/
    objects/pack/
    ├── multi-pack-index-3b048d1b965842cd866e10b6ec1a3035dbede0a5.rev
    ├── pack-b7d425f1b01727d5f364f5d9fbab2d1900fcd5c0.idx
    └── pack-b7d425f1b01727d5f364f5d9fbab2d1900fcd5c0.pack

This patch resolves this by adding a cleanup handler to the sigchain.

Signed-off-by: Xia XiaoWen <haoyurenzhuxia@gmail.com>
---
 midx.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/midx.c b/midx.c
index 5f0dd386b0..6586051a62 100644
--- a/midx.c
+++ b/midx.c
@@ -17,6 +17,7 @@
 #include "refs.h"
 #include "revision.h"
 #include "list-objects.h"
+#include "sigchain.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -41,6 +42,8 @@
 
 #define PACK_EXPIRED UINT_MAX
 
+static struct strbuf rev_filename = STRBUF_INIT;
+
 const unsigned char *get_midx_checksum(struct multi_pack_index *m)
 {
 	return m->data + m->data_len - the_hash_algo->rawsz;
@@ -884,21 +887,29 @@ static uint32_t *midx_pack_order(struct write_midx_context *ctx)
 	return pack_order;
 }
 
+static void remove_rev_file_on_signal(int signo)
+{
+	if (unlink(rev_filename.buf))
+		die_errno(_("failed to remove %s"), rev_filename.buf);
+
+	sigchain_pop(signo);
+	raise(signo);
+}
+
 static void write_midx_reverse_index(char *midx_name, unsigned char *midx_hash,
 				     struct write_midx_context *ctx)
 {
-	struct strbuf buf = STRBUF_INIT;
 	const char *tmp_file;
 
-	strbuf_addf(&buf, "%s-%s.rev", midx_name, hash_to_hex(midx_hash));
+	strbuf_addf(&rev_filename, "%s-%s.rev", midx_name, hash_to_hex(midx_hash));
 
 	tmp_file = write_rev_file_order(NULL, ctx->pack_order, ctx->entries_nr,
 					midx_hash, WRITE_REV);
 
-	if (finalize_object_file(tmp_file, buf.buf))
+	if (finalize_object_file(tmp_file, rev_filename.buf))
 		die(_("cannot store reverse index file"));
 
-	strbuf_release(&buf);
+	sigchain_push_common(remove_rev_file_on_signal);
 }
 
 static void clear_midx_files_ext(const char *object_dir, const char *ext,
-- 
2.34.1.52.g9efd6505a3.dirty.agit.6.5.6


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] midx.c: clean up .rev file
  2022-06-22 11:50 [RFC PATCH] midx.c: clean up .rev file haoyurenzhuxia
@ 2022-06-22 15:56 ` Ævar Arnfjörð Bjarmason
  2022-06-22 17:53   ` Junio C Hamano
  2022-06-27  3:53   ` Xiaowen Xia
  0 siblings, 2 replies; 9+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-06-22 15:56 UTC (permalink / raw)
  To: haoyurenzhuxia; +Cc: git, me, derrickstolee, dyroneteng


On Wed, Jun 22 2022, haoyurenzhuxia@gmail.com wrote:

> From: Xia XiaoWen <haoyurenzhuxia@gmail.com>
>
> The command: `git multi-pack-index write --bitmap` will create 3
> files in `objects/pack/`:
>     * multi-pack-index
>     * multi-pack-index-*.bitmap
>     * multi-pack-index-*.rev
>
> But if the command is terminated by the user (such as Ctl-C) or
> the system, the midx reverse index file (`multi-pack-index-*.rev`)
> is not removed and still exists in `objects/pack/`:
>
>     $ GIT_TEST_MIDX_WRITE_REV=1 git multi-pack-index write --bitmap
>     Selecting bitmap commits: 133020, done.
>     Building bitmaps:   0% (3/331)
>     ^C^C
>
>     $ tree objects/pack/
>     objects/pack/
>     ├── multi-pack-index-3b048d1b965842cd866e10b6ec1a3035dbede0a5.rev
>     ├── pack-b7d425f1b01727d5f364f5d9fbab2d1900fcd5c0.idx
>     └── pack-b7d425f1b01727d5f364f5d9fbab2d1900fcd5c0.pack
>
> This patch resolves this by adding a cleanup handler to the sigchain.
>
> Signed-off-by: Xia XiaoWen <haoyurenzhuxia@gmail.com>
> ---
>  midx.c | 19 +++++++++++++++----
>  1 file changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/midx.c b/midx.c
> index 5f0dd386b0..6586051a62 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -17,6 +17,7 @@
>  #include "refs.h"
>  #include "revision.h"
>  #include "list-objects.h"
> +#include "sigchain.h"
>  
>  #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
>  #define MIDX_VERSION 1
> @@ -41,6 +42,8 @@
>  
>  #define PACK_EXPIRED UINT_MAX
>  
> +static struct strbuf rev_filename = STRBUF_INIT;

Is the rest of this API thread safe, and no longer is because of this?
You're doing this because...

>  const unsigned char *get_midx_checksum(struct multi_pack_index *m)
>  {
>  	return m->data + m->data_len - the_hash_algo->rawsz;
> @@ -884,21 +887,29 @@ static uint32_t *midx_pack_order(struct write_midx_context *ctx)
>  	return pack_order;
>  }
>  
> +static void remove_rev_file_on_signal(int signo)
> +{
> +	if (unlink(rev_filename.buf))
> +		die_errno(_("failed to remove %s"), rev_filename.buf);
> +
> +	sigchain_pop(signo);
> +	raise(signo);

We need to handle this signalling.

I wonder if we could (ab)use the lockfile.c/tempfile.c API instead here,
and get the signal handling, cleanup etc. for free.

Also, the commit message doesn't really say *why*, i.e. in cmd_repack()
we've suffered from this already, but don't we have "git gc" cleaning
these up? Maybe not (I didn't check), but maybe that was the previous
assumption...

I mean, I think it makes sense to clean these up, but are we doing the
same for the other X.* files for the X.pack? Should we?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] midx.c: clean up .rev file
  2022-06-22 15:56 ` Ævar Arnfjörð Bjarmason
@ 2022-06-22 17:53   ` Junio C Hamano
  2022-06-22 18:13     ` Taylor Blau
  2022-06-27  3:53   ` Xiaowen Xia
  1 sibling, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2022-06-22 17:53 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: haoyurenzhuxia, git, me, derrickstolee, dyroneteng

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> On Wed, Jun 22 2022, haoyurenzhuxia@gmail.com wrote:
>
>> From: Xia XiaoWen <haoyurenzhuxia@gmail.com>
>>
>> The command: `git multi-pack-index write --bitmap` will create 3
>> files in `objects/pack/`:
>>     * multi-pack-index
>>     * multi-pack-index-*.bitmap
>>     * multi-pack-index-*.rev
>>
>> But if the command is terminated by the user (such as Ctl-C) or
>> the system, the midx reverse index file (`multi-pack-index-*.rev`)
>> is not removed and still exists in `objects/pack/`:
>>
>>     $ GIT_TEST_MIDX_WRITE_REV=1 git multi-pack-index write --bitmap
>>     Selecting bitmap commits: 133020, done.
>>     Building bitmaps:   0% (3/331)
>>     ^C^C
>>
>>     $ tree objects/pack/
>>     objects/pack/
>>     ├── multi-pack-index-3b048d1b965842cd866e10b6ec1a3035dbede0a5.rev
>>     ├── pack-b7d425f1b01727d5f364f5d9fbab2d1900fcd5c0.idx
>>     └── pack-b7d425f1b01727d5f364f5d9fbab2d1900fcd5c0.pack
>> ...
> Also, the commit message doesn't really say *why*, i.e. in cmd_repack()
> we've suffered from this already, but don't we have "git gc" cleaning
> these up? Maybe not (I didn't check), but maybe that was the previous
> assumption...

Exactly my thought.  Well said.

The _only_ case that might matter is if the next "write --bitmap" gets
confused ir there is a leftover file(s) from a previous run, but then
such a bug should be fixed on the reading side, I would think.

Thanks.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] midx.c: clean up .rev file
  2022-06-22 17:53   ` Junio C Hamano
@ 2022-06-22 18:13     ` Taylor Blau
  2022-06-22 19:58       ` Junio C Hamano
  2022-06-23 12:38       ` Teng Long
  0 siblings, 2 replies; 9+ messages in thread
From: Taylor Blau @ 2022-06-22 18:13 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, haoyurenzhuxia, git,
	derrickstolee, dyroneteng

On Wed, Jun 22, 2022 at 10:53:24AM -0700, Junio C Hamano wrote:
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
> > On Wed, Jun 22 2022, haoyurenzhuxia@gmail.com wrote:
> >
> >> From: Xia XiaoWen <haoyurenzhuxia@gmail.com>
> >>
> >> The command: `git multi-pack-index write --bitmap` will create 3
> >> files in `objects/pack/`:
> >>     * multi-pack-index
> >>     * multi-pack-index-*.bitmap
> >>     * multi-pack-index-*.rev
> >>
> >> But if the command is terminated by the user (such as Ctl-C) or
> >> the system, the midx reverse index file (`multi-pack-index-*.rev`)
> >> is not removed and still exists in `objects/pack/`:
> >>
> >>     $ GIT_TEST_MIDX_WRITE_REV=1 git multi-pack-index write --bitmap
> >>     Selecting bitmap commits: 133020, done.
> >>     Building bitmaps:   0% (3/331)
> >>     ^C^C
> >>
> >>     $ tree objects/pack/
> >>     objects/pack/
> >>     ├── multi-pack-index-3b048d1b965842cd866e10b6ec1a3035dbede0a5.rev
> >>     ├── pack-b7d425f1b01727d5f364f5d9fbab2d1900fcd5c0.idx
> >>     └── pack-b7d425f1b01727d5f364f5d9fbab2d1900fcd5c0.pack
> >> ...
> > Also, the commit message doesn't really say *why*, i.e. in cmd_repack()
> > we've suffered from this already, but don't we have "git gc" cleaning
> > these up? Maybe not (I didn't check), but maybe that was the previous
> > assumption...
>
> Exactly my thought.  Well said.

`repack` relies on `git multi-pack-index write --bitmap` to do the
actual work here. A few things worth noting:

  - the MIDX file itself is written using a lock_file, so it is
    atomically moved into place, and the temporary file is either
    removed, or cleaned up automatically with a sigchain handler on
    process death

  - the bitmap (written in bitmap_writer_finish(), which is the path for
    both single- and multi-pack bitmaps) is written to a temporary file
    and moved into place after the bitmaps are written.

    ...but this temporary file isn't automatically cleaned up, so it
    could stick around after process death. Luckily the race window here
    is pretty small, since all of the bitmaps have been computed already
    and are held in memory.

    This is probably worth a cleanup on its own, too.

  - unless GIT_TEST_MIDX_WRITE_REV=1 is in your environment, we won't
    *write* a .rev file, hence this is pretty rare to deal with in
    practice.

So I think there are two things worth doing here:

  - make sure that the temporary file used to stage the .bitmap is a
    lock_file
  - use a temporary file to stage the .rev file (when forced to write
    one), and ensure that that too is a lock_file

This is mostly the same as Ævar's original suggestion, just with a
larger scope. But I agree that it's the right direction nonetheless.

> The _only_ case that might matter is if the next "write --bitmap" gets
> confused ir there is a leftover file(s) from a previous run, but then
> such a bug should be fixed on the reading side, I would think.

It shouldn't in practice. We'll recognize that the .rev file is garbage
if it's checksum doesn't match the MIDX's, and we squashed the bug where
changing the object *order* (but not the objects themselves) doesn't
change the MIDX checksum (it does now).

We won't write over an existing .rev file with the right name, but we'll
always prefer to read the .rev data from the MIDX itself, if it exists.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] midx.c: clean up .rev file
  2022-06-22 18:13     ` Taylor Blau
@ 2022-06-22 19:58       ` Junio C Hamano
  2022-06-22 21:31         ` Taylor Blau
  2022-06-23 12:38       ` Teng Long
  1 sibling, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2022-06-22 19:58 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Ævar Arnfjörð Bjarmason, haoyurenzhuxia, git,
	derrickstolee, dyroneteng

Taylor Blau <me@ttaylorr.com> writes:

>   - the MIDX file itself is written using a lock_file, so it is
>     atomically moved into place, and the temporary file is either
>     removed, or cleaned up automatically with a sigchain handler on
>     process death

Good.

>   - the bitmap (written in bitmap_writer_finish(), which is the path for
>     both single- and multi-pack bitmaps) is written to a temporary file
>     and moved into place after the bitmaps are written.
>
>     ...but this temporary file isn't automatically cleaned up, so it
>     could stick around after process death. Luckily the race window here
>     is pretty small, since all of the bitmaps have been computed already
>     and are held in memory.
>
>     This is probably worth a cleanup on its own, too.

As long as the "temporary file" is clearly a temporary file that
"gc" can recognize and get rid of, it would be OK, I would think.

>   - unless GIT_TEST_MIDX_WRITE_REV=1 is in your environment, we won't
>     *write* a .rev file, hence this is pretty rare to deal with in
>     practice.

OK, but if we were to write one, we should do the same "write into a
temporary, rename it in place" dance, right?  Or is a separate .rev
file is pretty much a thing of last decade that we do not have to
worry too much about?

> So I think there are two things worth doing here:
>
>   - make sure that the temporary file used to stage the .bitmap is a
>     lock_file

Yes.

>   - use a temporary file to stage the .rev file (when forced to write
>     one), and ensure that that too is a lock_file

Yes.

Thanks.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] midx.c: clean up .rev file
  2022-06-22 19:58       ` Junio C Hamano
@ 2022-06-22 21:31         ` Taylor Blau
  2022-06-27  5:05           ` Xiaowen Xia
  0 siblings, 1 reply; 9+ messages in thread
From: Taylor Blau @ 2022-06-22 21:31 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Taylor Blau, Ævar Arnfjörð Bjarmason,
	haoyurenzhuxia, git, derrickstolee, dyroneteng

On Wed, Jun 22, 2022 at 12:58:42PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> >   - the MIDX file itself is written using a lock_file, so it is
> >     atomically moved into place, and the temporary file is either
> >     removed, or cleaned up automatically with a sigchain handler on
> >     process death
>
> Good.
>
> >   - the bitmap (written in bitmap_writer_finish(), which is the path for
> >     both single- and multi-pack bitmaps) is written to a temporary file
> >     and moved into place after the bitmaps are written.
> >
> >     ...but this temporary file isn't automatically cleaned up, so it
> >     could stick around after process death. Luckily the race window here
> >     is pretty small, since all of the bitmaps have been computed already
> >     and are held in memory.
> >
> >     This is probably worth a cleanup on its own, too.
>
> As long as the "temporary file" is clearly a temporary file that
> "gc" can recognize and get rid of, it would be OK, I would think.

Yeah, I think either is fine (though it would be slightly nicer to have
the sigchain code clean it up ahead of time when possible). In the
meantime, nothing is broken, though.

> >   - unless GIT_TEST_MIDX_WRITE_REV=1 is in your environment, we won't
> >     *write* a .rev file, hence this is pretty rare to deal with in
> >     practice.
>
> OK, but if we were to write one, we should do the same "write into a
> temporary, rename it in place" dance, right?  Or is a separate .rev
> file is pretty much a thing of last decade that we do not have to
> worry too much about?

Right. Technically we link() it into place, cf. our discussion about it
here:

    https://lore.kernel.org/git/xmqq5yqeghck.fsf@gitster.g/

But in general this is not really at all considered common anymore,
since we expect all new writes to have the reverse index embedded in the
MIDX itself.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] midx.c: clean up .rev file
  2022-06-22 18:13     ` Taylor Blau
  2022-06-22 19:58       ` Junio C Hamano
@ 2022-06-23 12:38       ` Teng Long
  1 sibling, 0 replies; 9+ messages in thread
From: Teng Long @ 2022-06-23 12:38 UTC (permalink / raw)
  To: me; +Cc: avarab, derrickstolee, dyroneteng, git, gitster, haoyurenzhuxia

On Wed, Wed, 22 Jun 2022 14:13:47 -0400, Taylor Blay wrote:

> It shouldn't in practice. We'll recognize that the .rev file is garbage
> if it's checksum doesn't match the MIDX's, and we squashed the bug where
> changing the object *order* (but not the objects themselves) doesn't
> change the MIDX checksum (it does now).
> 
> We won't write over an existing .rev file with the right name, but we'll
> always prefer to read the .rev data from the MIDX itself, if it exists.

Yes, this is what I want to say.

On the patch itself, I think it may worth to remove the tmp rev file at earlier
time because there may reduce doubt about whether the file remaining after
the interruption will have side effects (but we know it won't for now ,
because regenerating midx and rev will ensure an strict match based on the
checksum).

Thanks.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] midx.c: clean up .rev file
  2022-06-22 15:56 ` Ævar Arnfjörð Bjarmason
  2022-06-22 17:53   ` Junio C Hamano
@ 2022-06-27  3:53   ` Xiaowen Xia
  1 sibling, 0 replies; 9+ messages in thread
From: Xiaowen Xia @ 2022-06-27  3:53 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Git List, Taylor Blau, derrickstolee, dyroneteng

Thanks for your attention.

> Is the rest of this API thread safe, and no longer is because of this?
> You're doing this because...

* The `gettext()` and `fputs()` which calls in `die_errno(_())` are seem
highly unlikely async-signal-safety, this may cause problems in signal
handlers, I should remove it.
* This is thread-safe. The *.rev file will no longer be read/written after
creating the signal handler by `sigchain_push_common()`.
* On the other hand, the command: `git multi-pack-index write --bitmap` will
create an lockfile named `multi-pack-index.lock` and the signal handler
will not affect the execution of the same command or git-process.

The purpose of this patch is the same as what Teng Long said, to clean up the
files generated by the process when the process exits abnormally.

Thanks.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] midx.c: clean up .rev file
  2022-06-22 21:31         ` Taylor Blau
@ 2022-06-27  5:05           ` Xiaowen Xia
  0 siblings, 0 replies; 9+ messages in thread
From: Xiaowen Xia @ 2022-06-27  5:05 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason, Git List,
	derrickstolee, dyroneteng

> But in general this is not really at all considered common anymore,
> since we expect all new writes to have the reverse index embedded in the
> MIDX itself.

You mean, *.rev file will be embedded in MIDX files in the future, and *.rev
file don't exist anymore?
If this is the plan, the patch is helpless, ignore it.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-06-27  5:07 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-22 11:50 [RFC PATCH] midx.c: clean up .rev file haoyurenzhuxia
2022-06-22 15:56 ` Ævar Arnfjörð Bjarmason
2022-06-22 17:53   ` Junio C Hamano
2022-06-22 18:13     ` Taylor Blau
2022-06-22 19:58       ` Junio C Hamano
2022-06-22 21:31         ` Taylor Blau
2022-06-27  5:05           ` Xiaowen Xia
2022-06-23 12:38       ` Teng Long
2022-06-27  3:53   ` Xiaowen Xia

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).