git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: John Cai via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, John Cai <johncai86@gmail.com>
Subject: Re: [PATCH] promisor-remote.c: use oidset for deduplication
Date: Fri, 14 Jan 2022 13:11:57 +0100	[thread overview]
Message-ID: <220114.86wnj2se41.gmgdl@evledraar.gmail.com> (raw)
In-Reply-To: <pull.1187.git.git.1642105926064.gitgitgadget@gmail.com>


On Thu, Jan 13 2022, John Cai via GitGitGadget wrote:

> From: John Cai <johncai86@gmail.com>
>
> swap out oid_array for oidset in promisor-remote.c since we get
> a deduplicated set of oids to operate on.
>
> swap our calls for oid_array for oidset in callers of
> promisor_remote_get_direct().
>
> Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>

FWIW we had an off-list discussion about oidset v.s. a custom hashmap to
do the same off-list, not about s/oid_array/oidset/ in this area.

> Signed-off-by: John Cai <johncai86@gmail.com>
> ---
>     promisor-remote.c: use oidset for deduplication
>     
>     

Extra \n between $subject and $body.

>     swap out oid_array for oidset in promisor-remote.c since we get a

Nit: s/swap/Swap/, i.e. start the sentence with a capital letter...

>     deduplicated set of oids to operate on. Any direct fetch we can save
>     will be well worth it.
>     
>     oidset does use a slightly larger memory footprint than oid_array, so
>     this change is a tradeoff between memory footprint and network calls.
>     
>     What I'm not sure about is if it's worth swapping out all the calls of
>     oid_array for oidset in callers of promisor_remote_get_direct().

From what I understand of GGG I think you updated only the summary on
the PR but not the commit itself, the latter is what would go into
git.git. Here the commit should be updated so we get this message.

The part after the "---" is usually just used in this project for ad-hoc
list-only comment.

>  builtin/index-pack.c   |  9 +++---
>  builtin/pack-objects.c |  9 +++---
>  diff.c                 | 13 +++-----
>  diffcore-rename.c      | 12 +++----
>  diffcore.h             |  3 +-
>  merge-ort.c            |  8 ++---
>  object-file.c          |  4 ++-
>  promisor-remote.c      | 72 ++++++++++++++----------------------------
>  promisor-remote.h      |  6 ++--
>  read-cache.c           |  9 +++---
>  10 files changed, 57 insertions(+), 88 deletions(-)

Rather than inline comments, a comment that applies to all of these:

The difference between these two APIs is thaht oidset is hash-backed,
and you'd insert into it and we de-duplicate any duplicate OIDs on-the-fly.

The oid_array is just an realloc()'d "struct object_id *oid". On
insertion you can insert duplicates, but it has the ability to track
"I've sorted these", and "let's iterate over this sorted, and de-dup any
duplicates".

We have the two APIs for a reason, but I don't know in any of these
cases whether this change is safe.

Does e.g. index-pack.c always receive de-duplicated OIDs and we were
wasting CPU cycles using an oidset?

Do some of these like pack-objects.c receive de-duplicated OIDs from
e.g. "git repack" *now*, but we just lack test coverage to see that
they're happy to get duplicate OIDs on stdin (e.g. manually from a
user), and this would introduce a bug?

But most importantly is it worth it? What's the rationale for the
change? Less CPU/memory use? Getting e.g. "hyperfine" or "/usr/bin/time
-v" output for those (if so) would be valuable.

  parent reply	other threads:[~2022-01-14 12:20 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-13 20:32 [PATCH] promisor-remote.c: use oidset for deduplication John Cai via GitGitGadget
2022-01-13 23:45 ` Junio C Hamano
2022-01-14 12:11 ` Ævar Arnfjörð Bjarmason [this message]
2022-01-14 19:14   ` Junio C Hamano
2022-01-14 23:12     ` Junio C Hamano
2022-01-24 22:55   ` John Cai
2022-01-25 19:17     ` Jonathan Tan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=220114.86wnj2se41.gmgdl@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=johncai86@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).