From: Junio C Hamano <junkio@cox.net>
To: Geert Bosch <bosch@gnat.com>
Cc: git@vger.kernel.org
Subject: Re: PATCH: New diff-delta.c implementation (updated)
Date: Thu, 27 Apr 2006 21:28:30 -0700
Message-ID: <7vfyjyfhn5.fsf@assigned-by-dhcp.cox.net>
In-Reply-To: <7v1wvigzka.fsf@assigned-by-dhcp.cox.net> (Junio C. Hamano's message of "Thu, 27 Apr 2006 20:16:05 -0700")

Junio C Hamano <junkio@cox.net> writes:

> In the kernel repository (my checkout is near the tip of the
> source tree), the largest files are fs/nls/nls_cp949.c (900kB,
> Korean character encoding), drivers/usb/misc/emi62_fw_s.h
> (800kB, Emagic firmware blob), arch/m68k/ifpsp060/src/fpsp.S
> (750kB, floating point emulation?), and they are nowhere near
> the size where your algorithm really should shine.
>
> We would probably want some internal logic that says "if we see
> that blobs larger than X MB are involved in the packing, we
> should use this version of diff-delta, otherwise the other one."
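
The selection rule proposed in the quoted text could be sketched roughly as below. This is an illustration in Python, not git's C code. The function name `choose_delta_impl`, the constant `LARGE_BLOB_THRESHOLD`, and its 1 MiB value are assumptions made for the example; the message leaves "X MB" unspecified. The labels "nico" and "geert" follow the labels used for the two benchmark runs in this message.

```python
# Hypothetical sketch: pick a diff-delta implementation from the blob sizes
# that take part in a packing run. The threshold is a placeholder value;
# the message only says "X MB".
LARGE_BLOB_THRESHOLD = 1 * 1024 * 1024  # assumed value, in bytes


def choose_delta_impl(blob_sizes, threshold=LARGE_BLOB_THRESHOLD):
    """Return "geert" if any blob is larger than the threshold, else "nico"."""
    if any(size > threshold for size in blob_sizes):
        return "geert"
    return "nico"
```

With the assumed 1 MiB threshold, the three largest kernel files listed above (900kB, 800kB, 750kB) would all stay on the existing implementation, while the 2MB+ tarball blobs of a synthetic workload would select the new one.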

Third impression, synthetic workload.  A sequence of revisions of a
single-file project, where the file is a tarball of the git.git tree
(that is, "git-tar-tree vX.Y.Z >tarball"), 120 objects or so (1 commit
per rev, 1 tree to hold 1 blob).  The (uncompressed) sizes of the 40
blobs in the pack range from 2.06MB to 2.86MB (average 2.30MB).
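
A history like this could be rebuilt along the following lines. This is a sketch under assumptions, not the script used for the measurements: it uses `git archive`, the current replacement for `git-tar-tree`, and the function names, repository paths, and tag list are hypothetical. It only builds the one-tarball-per-revision history; the packing step is left out because the message does not show how it was invoked. POSIX-style paths are assumed.

```python
import os
import subprocess


def workload_commands(source_repo, work_repo, tags):
    """Return the git commands that commit one tarball of source_repo per tag."""
    source_repo = os.path.abspath(source_repo)
    work_repo = os.path.abspath(work_repo)
    tarball = os.path.join(work_repo, "tarball")
    commands = [["git", "init", work_repo]]
    for tag in tags:
        # Export the tagged tree as an uncompressed tar file into the work repo.
        commands.append(["git", "-C", source_repo, "archive", "--format=tar",
                         f"--output={tarball}", tag])
        # Each revision then adds one commit, one tree and one blob.
        commands.append(["git", "-C", work_repo, "add", "tarball"])
        commands.append(["git", "-C", work_repo, "commit", "-q", "-m", tag])
    return commands


def run_workload(source_repo, work_repo, tags):
    """Execute the commands; needs git installed and a commit identity set."""
    for command in workload_commands(source_repo, work_repo, tags):
        subprocess.run(command, check=True)
```

Calling `run_workload("git.git", "synthetic", tags)` with a list of release tags would produce three new objects per tag, which matches the "1 commit per rev, 1 tree to hold 1 blob" layout described above.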

(Nico)
Total 123, written 123 (delta 38), reused 0 (delta 0)
67.26user 1.03system 1:08.76elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+136066minor)pagefaults 0swaps

1822079 pack-nico-26989d516c62197592d0d52db24dfc6a58b633eb.pack


(Geert)
Total 123, written 123 (delta 38), reused 0 (delta 0)
67.23user 1.35system 1:09.25elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+164124minor)pagefaults 0swaps

1683139 pack-geert-26989d516c62197592d0d52db24dfc6a58b633eb.pack

That's a roughly 8% smaller pack (1683139 vs. 1822079 bytes) in
the same time, which is quite impressive.  But I am _very_
unhappy about this particular synthetic workload.  I wonder if
there are projects with many large blobs that are updated often,
so that we can use them as a yardstick.  Maybe the Wine people
have icons, background images and sounds?  But I suspect you
would not update them that often.
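
For reference, the size comparison can be checked from the two pack sizes reported above. The snippet below is only a sanity check of the arithmetic; the helper name `size_reduction_percent` is made up for this example.

```python
# Pack sizes in bytes, copied from the two runs reported above.
nico_pack_bytes = 1822079
geert_pack_bytes = 1683139


def size_reduction_percent(baseline, candidate):
    """Percentage by which candidate is smaller than baseline."""
    return 100.0 * (baseline - candidate) / baseline


reduction = size_reduction_percent(nico_pack_bytes, geert_pack_bytes)
print(f"{reduction:.1f}% smaller")  # prints "7.6% smaller"
```

The exact figure is a little under the rounded 8%, and the user CPU times of the two runs (67.26s and 67.23s) are effectively equal.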

Thinking about it, it does not make much sense, at least to me,
to store large tarballs or binary blobs or whatnot in an SCM (we
are _not_ in the archival business) and keep track of their
changes.  The tarball is out of the question -- it is not a source
(in the GPL sense of the word -- it is not the preferred form for
making modifications; you modify the constituent files and bundle
up the result as a new tarball).  Graphics images, perhaps.
