Re: [GSoC] Git Blog 4


From: ZheNing Hu <adlternative@gmail.com>
To: Christian Couder <christian.couder@gmail.com>
Cc: Git List <git@vger.kernel.org>,
	Junio C Hamano <gitster@pobox.com>,
	Hariom verma <hariom18599@gmail.com>, Jeff King <peff@peff.net>
Subject: Re: [GSoC] Git Blog 4
Date: Tue, 15 Jun 2021 16:59:31 +0800
Message-ID: <CAOLTT8QS7bG5M2Ro+vApUDtOgjxgrpUg5Mgp+tOQgyqwpENN1Q@mail.gmail.com>
In-Reply-To: <CAP8UFD0jiZuPvO-oYXw9PmVQ56tpYc9nxUxAjPQrc2f1qwEqUQ@mail.gmail.com>

On Mon, Jun 14, 2021 at 4:02 PM Christian Couder <christian.couder@gmail.com> wrote:
>
> On Sun, Jun 13, 2021 at 4:17 PM ZheNing Hu <adlternative@gmail.com> wrote:
>
> > In addition, some scripts like `printf "%b" "a\0b\0c" >blob1` will
> > be truncated at the first NUL on a 32-bit machine, but it performs
> > well on 64-bit machines, and NUL is normally stored in the file.
> > This made me think that Git's file decompression had an error on
> > a 32-bit machine before I used Ubuntu32's docker container to
> > clone the git repository and In-depth analysis of bugs... In the end,
> > I used `printf "a\0b\0c"` to make 32-bit machines not truncated
> > in NUL. Is there a better way to write binary data onto a file than
> > `printf` and `echo`?
>
> You might want to take a look at t/t4058-diff-duplicates.sh which has
> the following:
>
> # make_tree_entry <mode> <mode> <sha1>
> #
> # We have to rely on perl here because not all printfs understand
> # hex escapes (only octal), and xxd is not portable.
> make_tree_entry () {
>        printf '%s %s\0' "$1" "$2" &&
>        perl -e 'print chr(hex($_)) for ($ARGV[0] =~ /../g)' "$3"
> }
>

Yes, Perl can indeed do this, and perhaps Python can too.
With Python, however, portability might be a concern.
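
For comparison, a minimal Python sketch of the same idea. It is only an
illustration, not something taken from Git's test suite: the helper names
`write_bytes` and `hex_to_bytes` are made up here, and it assumes a
Python 3 interpreter is available on the test machine, which is exactly
the portability question raised above.

```python
import os
import tempfile


def write_bytes(path, data):
    """Write raw bytes, NUL bytes included, with no shell escape handling."""
    # Binary mode: no newline or encoding translation is applied.
    with open(path, "wb") as f:
        f.write(data)


def hex_to_bytes(hex_string):
    """Convert a hex string such as '610062' to bytes, like the perl one-liner."""
    return bytes.fromhex(hex_string)


if __name__ == "__main__":
    # Hypothetical scratch location for the demo blob.
    path = os.path.join(tempfile.mkdtemp(), "blob1")
    write_bytes(path, b"a\x00b\x00c")
    with open(path, "rb") as f:
        print(f.read())
```

Because the bytes never pass through a shell `printf` or `echo`, the result
does not depend on how a particular shell interprets `\0`.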

> > Since I am a newbie to docker, I would like to know if there is any
> > way to run the Git's Github CI program remotely or locally?
>
> There are scripts in the ci/ directory, but yeah it could help if
> there was a README there.
>

Thanks, I think I know how to use them now.
As you pointed out in your other message, GitHub-Travis CI is exactly what I need.

> > In the second half of this week, I tried to make `cat-file` reuse the
> > logic of `ref-filter`. I have to say that this is a very difficult process.
> > "rebase -i" again and again to repair the content of previous commits.
> > squeeze commits, split commits, modify commit messages... Finally, I
> > submitted the patches to the Git mailing list in
> > [[PATCH 0/8] [GSOC][RFC] cat-file: reuse `ref-filter`
> > logic](https://lore.kernel.org/git/pull.980.git.1623496458.gitgitgadget@gmail.com/).
> > Now `cat-file` has learned most of the atoms in `ref-filter`. I am very
> > happy to be able to make git support richer functions through my own code.
> >
> > Regrettably, `git cat-file --batch --batch-all-objects` seems to take up
> > a huge amount of memory on a large repo such as git.git, and it will
> > be killed by Linux's oom.
>
> In the cover letter of your patch series you say:
>
> "There is still an unresolved issue: performance overhead is very large, so
> that when we use:
>
> git cat-file --batch --batch-all-objects >/dev/null
>
> on git.git, it may fail."
>
> Is this the same issue? Is it only a memory issue, or is your patch
> series also making things slower?
>

Yes, both refer to the same problem: the memory usage is too high.
Of course, I should check for memory leaks first. However, the problem is
mainly caused by the change in how cat-file prints object data.

The original cat-file needs only a few copies (a single one) in
read_object_file() or stream_blob(); now cat-file needs four (or more)
copies, in oid_object_info_extended(), grab_sub_body_contents(),
append_atom(), and pop_stack_element().

> > This is mainly because we will make a large
> > number of copies of the object's raw data. The original `git cat-file`
> > uses `read_object_file()` or `stream_blob()` to output the object's
> > raw data, but in `ref-filter`, we have to use `v->s` to copy the object's
> > data, it is difficult to eliminate `v->s` and print the output directly to the
> > final output buffer. Because we may have atoms like `%(if)`, `%(else)`
> > that need to use buffers on the stack to build the final output string
> > layer by layer,
>
> What does layer by layer mean here?
>

When multiple nested %(if) and %(else) atoms are used, the data may be
copied into the "previous level" buffer of the stack through
pop_stack_element().
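
A toy model of this "layer by layer" copying, as a sketch only. It is not
the ref-filter code: the `FormatStack` class and its methods are
hypothetical, and the only thing it tries to show is that each closed
nesting level adds one more copy of the data into the level below.

```python
class FormatStack:
    """Toy model of a stack of output buffers, one per nesting level."""

    def __init__(self):
        self.levels = [bytearray()]  # the bottom level is the final output
        self.copies = 0              # how many times data has been copied

    def push(self):
        """Open a nested level, as an opening atom such as %(if) would."""
        self.levels.append(bytearray())

    def append(self, data):
        """Copy an atom's value into the current (innermost) level."""
        self.levels[-1] += data
        self.copies += 1

    def pop(self):
        """Close a level: copy its whole buffer into the previous level."""
        top = self.levels.pop()
        self.levels[-1] += top
        self.copies += 1

    def result(self):
        """Return the final output once every nested level is closed."""
        if len(self.levels) != 1:
            raise ValueError("unclosed nested level")
        return bytes(self.levels[0])


if __name__ == "__main__":
    stack = FormatStack()
    stack.push()                    # outer nested level
    stack.push()                    # inner nested level
    stack.append(b"object data")    # first copy, into the inner buffer
    stack.pop()                     # second copy, inner -> outer
    stack.pop()                     # third copy, outer -> final output
    print(stack.copies, stack.result())
```

In this model the same object data is copied once per nesting level on top
of the initial copy, which is why deeper nesting makes the overhead worse.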

> > or the `cmp_ref_sorting()` needs to use `v->s` to
> > compare two refs. In short, it is very difficult for `ref-filter` to reduce
> > copy overhead. I even thought about using the string pool API
> > `memintern()` to replace `xmemdupz()`, but it seems that the effect
> > is not obvious. A large number of objects' data will still reside in memory,
> > so this may not be a good method.
>
> Would it be possible to keep the data for a limited number of objects,
> then print everything related to these objects, free their data and
> start again with another limited number of objects?
>

Is the point of a "limited number of objects" to reduce the overhead of
free()? That may be a good solution. But I wonder: could we simply release
the memory of each object right after printing it, instead of freeing
everything together as ref_array_clear() does?
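
The two strategies can be compared with a small sketch. This is a
simplified model under stated assumptions, not Git code: `load_object` is
a hypothetical stand-in for reading an object's data, and "peak" just
counts how many objects are held in memory at once. A batch size of 1
corresponds to releasing each object right after printing it.

```python
import io


def load_object(oid):
    """Hypothetical stand-in for reading one object's data into memory."""
    return b"data for " + oid.encode() + b"\n"


def print_all_then_free(oids, out):
    """Load every object first, print them all, then release them together."""
    held = [load_object(oid) for oid in oids]
    peak = len(held)  # every object is resident at the same time
    for data in held:
        out.write(data)
    held.clear()
    return peak


def print_in_batches(oids, out, batch_size):
    """Load, print, and release a limited number of objects at a time."""
    peak = 0
    for start in range(0, len(oids), batch_size):
        held = [load_object(oid) for oid in oids[start:start + batch_size]]
        peak = max(peak, len(held))
        for data in held:
            out.write(data)
        held.clear()  # release this batch before loading the next one
    return peak


if __name__ == "__main__":
    oids = ["a", "b", "c", "d", "e"]
    print(print_all_then_free(oids, io.BytesIO()))
    print(print_in_batches(oids, io.BytesIO(), 2))
    print(print_in_batches(oids, io.BytesIO(), 1))
```

Both approaches produce the same output; they differ only in how many
objects stay resident at once and in how often memory is released.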

> > Anyway, stay confident. I can solve these difficult problems with
> > the help of mentors and reviewers. `:)`
>
> Sure :-)

Thanks!
--
ZheNing Hu
