From: Eric Wong <e@80x24.org>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: git@vger.kernel.org, Jeff King <peff@peff.net>,
Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH] update-server-info: avoid needless overwrites
Date: Sun, 12 May 2019 00:38:59 +0000
Message-ID: <20190512003859.GA22361@dcvr>
In-Reply-To: <87v9ygwoj0.fsf@evledraar.gmail.com>
Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> On Sat, May 11 2019, Eric Wong wrote:
> > +static int files_differ(FILE *fp, const char *path)
> > +{
> > +	struct stat st;
> > +	git_hash_ctx c;
> > +	struct object_id oid_old, oid_new;
> > +	struct strbuf tmp = STRBUF_INIT;
> > +	long new_len = ftell(fp);
> > +
> > +	if (new_len < 0 || stat(path, &st) < 0)
> > +		return 1;
> > +	if (!S_ISREG(st.st_mode))
> > +		return 1;
> > +	if ((off_t)new_len != st.st_size)
> > +		return 1;
> > +
> > +	rewind(fp);
> > +	if (strbuf_fread(&tmp, (size_t)new_len, fp) != (size_t)new_len)
> > +		return 1;
> > +	the_hash_algo->init_fn(&c);
> > +	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
> > +	the_hash_algo->final_fn(oid_new.hash, &c);
> > +	strbuf_release(&tmp);
> > +
> > +	if (strbuf_read_file(&tmp, path, (size_t)st.st_size) < 0)
> > +		return 1;
> > +	the_hash_algo->init_fn(&c);
> > +	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
> > +	the_hash_algo->final_fn(oid_old.hash, &c);
> > +	strbuf_release(&tmp);
> > +
> > +	return hashcmp(oid_old.hash, oid_new.hash);
> > +}
>
> This way of doing it just seems so weirdly convoluted. Read them one at
> a time, compute the SHA-1, just to see if they're different. Why not
> something closer to a plain memcmp():
>
> static int files_differ(FILE *fp, const char *path)
> {
> 	struct strbuf old = STRBUF_INIT, new = STRBUF_INIT;
> 	long new_len = ftell(fp);
> 	int diff = 1;
>
> 	rewind(fp);
> 	if (strbuf_fread(&new, (size_t)new_len, fp) != (size_t)new_len)
> 		goto release_return;
> 	if (strbuf_read_file(&old, path, 0) < 0)
> 		goto release_return;
>
> 	diff = strbuf_cmp(&old, &new);
>
> release_return:
> 	strbuf_release(&old);
> 	strbuf_release(&new);
>
> 	return diff;
> }
>
> I.e. optimize for code simplicity with something close to a dumb "cmp
> --silent". I'm going to make the bold claim that nobody using "dumb
> http" is going to be impacted by the performance of reading their
> for-each-ref or for-each-pack dump, hence opting for not even bothering to
> stat() to get the size before reading.
I've actually been trying to improve dumb HTTP for more cases
(since it's much cheaper than smart HTTP in server memory/CPU).
> Because really, if we were *trying* to micro-optimize this for time or
> memory use there's much better ways, e.g. reading the old file and
> memcmp() as we go and stream the "generate" callback, but I just don't
> see the point of trying in this case.
I was actually heading down that route, but wasn't sure if this
idea would be accepted at all (and I've been trying to stay away
from using non-scripting languages).
I don't like slurping all of info/refs into memory at all, so maybe
a streaming memcmp of the existing file is worth doing...
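
To make the streaming idea concrete, here is a minimal sketch in plain
C with stdio rather than git's strbuf API. The name stream_differs()
and the 8 KiB chunk size are assumptions for illustration, not code
from the patch. It compares the freshly written temporary file against
the existing file one chunk at a time, so neither file is held in
memory in full:

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of git): return 0 if the full contents
 * of "fp" match the file at "path" byte for byte, and non-zero if they
 * differ or if any I/O error occurs.
 */
static int stream_differs(FILE *fp, const char *path)
{
	char a[8192], b[8192];
	FILE *old = fopen(path, "rb");
	int differ = 1;

	if (!old)
		return 1;	/* missing or unreadable: treat as different */
	/* fseek() also flushes pending writes on a "w+" stream */
	if (fseek(fp, 0L, SEEK_SET) != 0)
		goto out;

	for (;;) {
		size_t na = fread(a, 1, sizeof(a), fp);
		size_t nb = fread(b, 1, sizeof(b), old);

		/* differing chunk lengths or bytes: stop early */
		if (na != nb || memcmp(a, b, na))
			goto out;
		if (na < sizeof(a)) {
			/* both streams ended; equal unless a read failed */
			differ = ferror(fp) || ferror(old);
			goto out;
		}
	}
out:
	fclose(old);
	return differ;
}
```

This stops at the first mismatching chunk, so a changed file is usually
detected without reading either file to the end.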
> >  /*
> >   * Create the file "path" by writing to a temporary file and renaming
> >   * it into place. The contents of the file come from "generate", which
> >   * should return non-zero if it encounters an error.
> >   */
> > -static int update_info_file(char *path, int (*generate)(FILE *))
> > +static int update_info_file(char *path, int (*generate)(FILE *), int force)
> >  {
> >  	char *tmp = mkpathdup("%s_XXXXXX", path);
>
> Unrelated to this patch, but I hadn't thought about this nasty race
> condition. We recommend users run this from the "post-update" (or
> "post-receive") hook, and don't juggle the lock along with the ref
> update, thus due to the vagaries of scheduling you can end up with two
> concurrent ref updates where the "old" one wins.
>
> But I guess that brings me back to something close to "nobody with that
> sort of update rate is using 'dumb http'" :)
>
> For this to work properly we'd need to take some sort of global "ref
> update/pack update" lock, and I guess at that point this "cmp" case
> would be a helper similar to commit_lock_file_to(),
> i.e. a commit_lock_file_to_if_different().
Worth a separate patch, at some point, I think. I'm not too
familiar with the existing locking in git, actually...
Along those lines, I think repack/gc should automatically
update objects/info/packs if the file already exists.
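
As a rough illustration of the "commit only if different" idea, here is
a minimal sketch in plain C. The names paths_differ() and
commit_if_different() are hypothetical and this does not use git's
lockfile API; it also takes no lock, so it does not address the race
described above. It assumes POSIX rename() semantics, where renaming
over an existing file replaces it atomically:

```c
#include <stdio.h>
#include <string.h>

/* Return 0 if both files are byte-identical, 1 if they differ or cannot be read. */
static int paths_differ(const char *a_path, const char *b_path)
{
	char a[8192], b[8192];
	FILE *fa = fopen(a_path, "rb");
	FILE *fb = fopen(b_path, "rb");
	int differ = 1;

	while (fa && fb) {
		size_t na = fread(a, 1, sizeof(a), fa);
		size_t nb = fread(b, 1, sizeof(b), fb);

		if (na != nb || memcmp(a, b, na))
			break;
		if (na < sizeof(a)) {
			/* both files ended; equal unless a read failed */
			differ = ferror(fa) || ferror(fb);
			break;
		}
	}
	if (fa)
		fclose(fa);
	if (fb)
		fclose(fb);
	return differ;
}

/*
 * Move "tmp" over "dest" only when their contents differ; otherwise
 * discard "tmp" and leave "dest" (and its mtime) untouched.
 * Returns 0 on success, non-zero on error.
 */
static int commit_if_different(const char *tmp, const char *dest)
{
	if (!paths_differ(tmp, dest))
		return remove(tmp);
	return rename(tmp, dest);
}
```

Leaving an unchanged destination file alone keeps its mtime stable,
which is the point of avoiding needless overwrites.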
> >  	int ret = -1;
> >  	int fd = -1;
> >  	FILE *fp = NULL, *to_close;
> > +	int do_update;
> >
> >  	safe_create_leading_directories(path);
> >  	fd = git_mkstemp_mode(tmp, 0666);
> >  	if (fd < 0)
> >  		goto out;
> > -	to_close = fp = fdopen(fd, "w");
> > +	to_close = fp = fdopen(fd, "w+");
> >  	if (!fp)
> >  		goto out;
> >  	fd = -1;
> >  	ret = generate(fp);
> >  	if (ret)
> >  		goto out;
> > +
> > +	do_update = force || files_differ(fp, path);
> > [...]
> >
> > -static int update_info_refs(void)
> > +static int update_info_refs(int force)
>
> So, I was going to say "shouldn't we update the docs?" which for --force
> say "Update the info files from scratch.".
>
> But reading through it that "from scratch" wording is from c743e6e3c0
> ("Add a link from update-server-info documentation to repository
> layout.", 2005-09-04).
Yes, that wording is awkward and I can update it. But maybe making
it undocumented is sufficient and would save us the trouble
of describing it :)
"--force" might be seen as a performance optimization for cases
where you're certain the result will differ, but I'm not
sure if that's worth mentioning in the manpage.
> There it was a reference to a bug since fixed in 60d0526aaa ("Unoptimize
> info/refs creation.", 2005-09-14), and then removed from the docs in
> c5fe5b6de9 ("Remove obsolete bug warning in man git-update-server-info",
> 2009-04-25).
>
> Then in b3223761c8 ("update_info_refs(): drop unused force parameter",
> 2019-04-05) Jeff removed the unused-for-a-decade "force" param.
>
> So having gone through the history I think we're better off just
> dropping the --force argument entirely, or at least changing the
> docs.
I can update the docs, or make it undocumented. Command-line
compatibility needs to remain in case there are scripts
using it.