git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Torsten Bögershausen" <tboegi@web.de>
To: Ramsay Jones <ramsay@ramsayjones.plus.com>
Cc: Junio C Hamano <gitster@pobox.com>,
	Martin Koegler <martin.koegler@chello.at>,
	git@vger.kernel.org, Johannes.Schindelin@gmx.de
Subject: Re: [PATCH 1/9] Convert pack-objects to size_t
Date: Mon, 14 Aug 2017 22:32:35 +0200	[thread overview]
Message-ID: <20170814203235.GA4293@tor.lan> (raw)
In-Reply-To: <c49628e0-6a57-34d6-6727-f8111b80cbab@ramsayjones.plus.com>

On Mon, Aug 14, 2017 at 08:31:50PM +0100, Ramsay Jones wrote:
> 
> 
> On 14/08/17 18:08, Junio C Hamano wrote:
> > Junio C Hamano <gitster@pobox.com> writes:
> > 
> >> One interesting question is which of these two types we should use
> >> for the size of objects Git uses.  
> >>
> >> Most of the "interesting" operations done by Git require that the
> >> thing is in core as a whole before we can do anything (e.g. compare
> >> two such things to produce delta, have one in core and apply patch),
> >> so it is tempting that we deal with size_t, but at the lowest level
> >> to serve as a SCM, i.e. recording the state of a file at each
> >> version, we actually should be able to exceed the in-core
> >> limit---both "git add" of a huge file whose contents would not fit
> >> in-core and "git checkout" of a huge blob whose inflated contents
> >> would not fit in-core should (in theory, modulo bugs) be able to
> >> exercise the streaming interface to handle such case without holding
> >> everything in-core at once.  So from that point of view, even size_t
> >> may not be the "correct" type to use.
> > 
> > A few additions to the above observations.
> > 
> >  - We have varint that encodes how far the location from a delta
> >    representation of an object to its base object in the packfile.
> >    Both encoding and decoding sides in the current code use off_t to
> >    represent this offset, so we can already reference an object that
> >    is far in the same packfile as a base.
> > 
> >  - I think it is OK in practice to limit the size of individual
> >    objects to size_t (i.e. on 32-bit arch, you cannot interact with
> >    a repository with an object whose size exceeds 4GB).  Using off_t
> >    would allow occasional ultra-huge objects that can only be added
> >    and checked in via the streaming API on such a platform, but I
> >    suspect that it may become too much of a hassle to maintain.
> 
> In a previous comment, I said that (on 32-bit Linux) it was likely
> that an object of > 4GB could not be handled correctly anyway. (more
> likely > 2GB). This was based on the code from (quite some) years ago.
> In particular, before you added the "streaming API". So, maybe a 32-bit
> arch _should_ be able to handle objects as large as the LFS API allows.
> (Ignoring, for the moment, that I think anybody who puts files of that
> size into an SCM probably gets what they deserve. :-P ).

In general, yes.
I had a patch that allowed a 32 bit linux to commit a file >4GB.
There is however a restriction: The file must not yet be known to the
index. Otherwise the "diff" machinery kicks in, and tries to
compare the "old" and the "new" versions, which means that -both-
must fit into "memory" at the same time. Memory means here the available
address space rather then real memory.
So there may be room for improvements for 32 bit systems, but that is another
story, which can be developped independent of the ulong->size_t conversion.

> 
> The two patches I commented on, however, changed the type of some
> variables from off_t to size_t. In general, the patches did not
> seem to make anything worse, but these type changes could potentially
> do harm. Hence my comment. (I still haven't tried the patches on my
> 32-bit Linux system. I only boot it up about once a week, and I would
> rather wait until the patches are in the 'pu' branch before testing).
> 
> ATB,
> Ramsay Jones
> 

And here (in agreement with the answer from Junio):
We should not change any off_t into size_t, since that means a possible
downgrade.
Whenever an off_t is converted into size_t, a function in git-compat-util.h is
used:

static inline size_t xsize_t(off_t len)
{
	if (len > (size_t) len)
		die("Cannot handle files this big");
	return (size_t)len;
}

Having written this, it may be a good idea to define a similar function
xulong() which is used when we do calls into zlib.

Thanks to Martin for working on this.
If there is a branch on a public repo, I can e.g. run the test suite on
different systems.

  parent reply	other threads:[~2017-08-14 20:32 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-12  8:47 [PATCH 1/9] Convert pack-objects to size_t Martin Koegler
2017-08-12  8:47 ` [PATCH 2/9] Convert index-pack " Martin Koegler
2017-08-12 13:51   ` Ramsay Jones
2017-08-12  8:47 ` [PATCH 3/9] Convert unpack-objects " Martin Koegler
2017-08-12 14:07   ` Martin Ågren
2017-08-13 18:25     ` Martin Koegler
2017-08-12  8:47 ` [PATCH 4/9] Convert archive functions " Martin Koegler
2017-08-12  8:47 ` [PATCH 5/9] Convert various things " Martin Koegler
2017-08-12 13:27   ` Martin Ågren
2017-08-13 17:48     ` Martin Koegler
2017-08-12  8:47 ` [PATCH 6/9] Use size_t for config parsing Martin Koegler
2017-08-12  8:47 ` [PATCH 7/9] Convert ref-filter to size_t Martin Koegler
2017-08-12  8:47 ` [PATCH 8/9] Convert tree-walk " Martin Koegler
2017-08-12  8:47 ` [PATCH 9/9] Convert xdiff-interface " Martin Koegler
2017-08-12  9:59 ` [PATCH 1/9] Convert pack-objects " Torsten Bögershausen
2017-08-13 18:27   ` Martin Koegler
2017-08-12 13:47 ` Ramsay Jones
2017-08-13 18:30   ` Martin Koegler
2017-08-13 19:45     ` Ramsay Jones
2017-08-13 22:15       ` Junio C Hamano
2017-08-14 17:08         ` Junio C Hamano
2017-08-14 19:31           ` Ramsay Jones
2017-08-14 19:58             ` Junio C Hamano
2017-08-14 20:32             ` Torsten Bögershausen [this message]
2017-08-15  0:30               ` Ramsay Jones
2017-08-16 20:22           ` Martin Koegler
2017-08-17 10:38             ` Torsten Bögershausen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170814203235.GA4293@tor.lan \
    --to=tboegi@web.de \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=martin.koegler@chello.at \
    --cc=ramsay@ramsayjones.plus.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).