git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 3/4] cat-file: add --batch-disk-sizes option
Date: Sun, 7 Jul 2013 14:19:25 -0400	[thread overview]
Message-ID: <20130707181925.GA20085@sigill.intra.peff.net> (raw)
In-Reply-To: <7vtxk645vp.fsf@alter.siamese.dyndns.org>

On Sun, Jul 07, 2013 at 10:49:46AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Perhaps we need
> >
> >   git cat-file --batch-format="%(disk-size) %(object)"
> >
> > or similar.
> 
> I agree with your reasoning.  It may be simpler to give an interface
> to ask for which pieces of info, e.g. --batch-cols=size,disksize,
> without giving the readers a flexible "format".

Yeah, that is probably a lot more sane. That would be sufficient for my
use, I doubt anyone really wants the full format, and it would be easy
to add it later if we are wrong. It would also be easy to add other
items from the sha1_object_info_extended list, too (e.g.,
loose/cached/packed).

I'll do that in my re-roll.

> > +NOTE: The on-disk size reported is accurate, but care should be taken in
> > +drawing conclusions about which refs or objects are responsible for disk
> > +usage. [...]
> 
> This is a good note to leave to the readers. I was wondering how
> valid to accuse that B is taking a lot of space compared to C when
> you have three objects A, B and C (in decreasing order of on-disk
> footprint) when A is huge and C is a small delta against A and B is
> independent.  The role of A and C in their delta chain could easily
> be swapped during the next full repack and then C will appear a lot
> larger than B.

Yeah. I exercise a lot of human analysis when I use this tool myself.
What I am usually looking for is that somebody has forked a 100M repo,
and then dumped 2G of extra data on top. Those cases are not all that
hard to spot, and would not usually change too much in a repack.

> It might be interesting to measure the total disk footprint of an
> entire delta "family" (the objects that delta against the same
> base).  You may find out that hello.c with a manageable size have
> very many revisions and overall have a larger on-disk footprint than
> a single copy of unchanging help.mov clip used in the documentation
> does, which may be an interesting observation to make.

Yeah, that is an interesting stat, though I have not had a need for it
myself. Certainly you could do:

  git rev-list --objects --all |
  grep ' hello.c$' |
  cut -d' ' -f1 |
  git cat-file --batch-disk-sizes

to see hello.c's size. But I cannot think offhand of a way to get the
list of objects that are in a delta chain together (potentially crossing
path boundaries), short of parsing verfiy-pack output myself. I think it
is orthogonal to this patch, though. This exposes more information about
objects themselves; it would be up to another patch to help discover and
narrow the list of interesting objects.

-Peff

  reply	other threads:[~2013-07-07 18:19 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-07-07 10:01 [RFC/PATCH 0/4] cat-file --batch-disk-sizes Jeff King
2013-07-07 10:03 ` [PATCH 1/4] zero-initialize object_info structs Jeff King
2013-07-07 17:34   ` Junio C Hamano
2013-07-07 10:04 ` [PATCH 2/4] teach sha1_object_info_extended a "disk_size" query Jeff King
2013-07-07 10:09 ` [PATCH 3/4] cat-file: add --batch-disk-sizes option Jeff King
2013-07-07 17:49   ` Junio C Hamano
2013-07-07 18:19     ` Jeff King [this message]
2013-07-08 11:04     ` Duy Nguyen
2013-07-08 12:00       ` Ramkumar Ramachandra
2013-07-08 13:13         ` Duy Nguyen
2013-07-08 13:37           ` Ramkumar Ramachandra
2013-07-09  2:55             ` Duy Nguyen
2013-07-09 10:32               ` Ramkumar Ramachandra
2013-07-10 11:16             ` Jeff King
2013-07-08 16:40           ` Junio C Hamano
2013-07-10 11:04     ` Jeff King
2013-07-11 16:35       ` Junio C Hamano
2013-07-07 21:15   ` brian m. carlson
2013-07-10 10:57     ` Jeff King
2013-07-07 10:14 ` [PATCH 4/4] pack-revindex: radix-sort the revindex Jeff King
2013-07-07 23:52   ` Shawn Pearce
2013-07-08  7:57     ` Jeff King
2013-07-08 15:38       ` Shawn Pearce
2013-07-08 20:50   ` Brandon Casey
2013-07-08 21:35     ` Brandon Casey
2013-07-10 10:57       ` Jeff King
2013-07-10 10:52     ` Jeff King
2013-07-10 11:34 ` [PATCHv2 00/10] cat-file formats/on-disk sizes Jeff King
2013-07-10 11:35   ` [PATCH 01/10] zero-initialize object_info structs Jeff King
2013-07-10 11:35   ` [PATCH 02/10] teach sha1_object_info_extended a "disk_size" query Jeff King
2013-07-10 11:36   ` [PATCH 03/10] t1006: modernize output comparisons Jeff King
2013-07-10 11:38   ` [PATCH 04/10] cat-file: teach --batch to stream blob objects Jeff King
2013-07-10 11:38   ` [PATCH 05/10] cat-file: refactor --batch option parsing Jeff King
2013-07-10 11:45   ` [PATCH 06/10] cat-file: add --batch-check=<format> Jeff King
2013-07-10 11:57     ` Eric Sunshine
2013-07-10 14:51     ` Ramkumar Ramachandra
2013-07-11 11:24       ` Jeff King
2013-07-10 11:46   ` [PATCH 07/10] cat-file: add %(objectsize:disk) format atom Jeff King
2013-07-10 11:48   ` [PATCH 08/10] cat-file: split --batch input lines on whitespace Jeff King
2013-07-10 15:29     ` Ramkumar Ramachandra
2013-07-11 11:36       ` Jeff King
2013-07-11 17:42         ` Junio C Hamano
2013-07-11 20:45         ` [PATCHv3 " Jeff King
2013-07-10 11:50   ` [PATCH 09/10] pack-revindex: use unsigned to store number of objects Jeff King
2013-07-10 11:55   ` [PATCH 10/10] pack-revindex: radix-sort the revindex Jeff King
2013-07-10 12:00     ` Jeff King
2013-07-10 13:17     ` Ramkumar Ramachandra
2013-07-11 11:03       ` Jeff King
2013-07-10 17:10     ` Brandon Casey
2013-07-11 11:17       ` Jeff King
2013-07-11 12:16     ` [PATCHv3 " Jeff King
2013-07-11 21:12       ` Brandon Casey

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130707181925.GA20085@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --subject='Re: [PATCH 3/4] cat-file: add --batch-disk-sizes option' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).