git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Dana How" <danahow@gmail.com>
To: "Linus Torvalds" <torvalds@linux-foundation.org>
Cc: "David Lang" <david.lang@digitalinsight.com>,
	"Sam Vilain" <sam@vilain.net>,
	"Git Mailing List" <git@vger.kernel.org>,
	"Junio C Hamano" <junkio@cox.net>,
	danahow@gmail.com
Subject: Re: [PATCH 5/6] Teach "fsck" not to follow subproject links
Date: Thu, 12 Apr 2007 11:32:57 -0700	[thread overview]
Message-ID: <56b7f5510704121132g3961060amb394978bb49093e6@mail.gmail.com> (raw)
In-Reply-To: <Pine.LNX.4.64.0704111903060.4061@woody.linux-foundation.org>

On 4/11/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wed, 11 Apr 2007, David Lang wrote:
> > this is why I was suggesting a --multiple-project option to let you tell fsck
> > about all of the repositories that it needs to look for refs in.
>
> Well, just from a personal observation:
>  - I would *personally* actually refuse to share objects with anybody
>    else.
>
> I just find the idea too scary. Somebody doing something bad to their
> object store by mistake (running "git prune" without realizing that there
> are *my* objects there too, or just deciding that they want to play with
> the object directory by hand, or running a new fancy experimental importer
> that has a subtle bug wrt object handling or anything like that).
>
> I'll endorse use "alternates" files, but partly because I know the main
> project is safe (any alternates usage is in the "satellite" clones anyway,
> and they will never write to the alternate object directory), and partly
> because at least for the kernel, we don't have branches that get reset in
> the main project, so there's no reason to fear that a "git repack -a -d"
> will ever screw up any of the satellite repositories even by mistake.
>
> But for git projects, even alternates isn't safe, in case somebody bases
> their own work on a version of "pu" that eventually goes away (even with
> reflogs, pruning *eventually* takes place).
>
> So I tend to think that alternates and shared object directories are
> really for "temporary" stuff, or for *managed* repositories that are at
> git *hosting* sites (eg repo.or.cz), and where there is some other safety
> involved, ie users don't actually access the object directories directly
> in any way.
>
> So I've at least personally come to the conclusion that for a *developer*
> (as opposed to a hosting site!), shared object directories just never make
> sense. The downsides are just too big. Even alternates is something where
> you just need to be fairly careful!

These arguments all seem pretty convincing to me --
maybe the problem is that I'm not a "*developer*" right now.
Instead I'm part of a multi-developer *site*.
Below I talk about a possible way we could use git
without changing it (since I recognize this would be a minority usage pattern).

We use perforce to manage a mixed hardware/software project
(I'm the 55GB check-out guy, remember?).  We have at least 3 different
kinds of data with different usage patterns, and using perforce for
everything in one centralized server was not the best solution.

Each user ("client") has their own worktree and the perforce
repository is on a shared central server.  You can consider perforce
to have the equivalent of git's index, but it is stored on the server,
in one file ("db.have") covering all clients.  Obviously that becomes a
bottleneck -- and recently db.have got larger than the total cache RAM on
the server, which really slowed things down until we moved to a larger
server.  But repository architecture aside,  the real problem has been
perforce's usability.  Frequently one contributor,  having gotten ahead
of the team,  needs to share this more recent work with only a few
people.  This could be done with p4 branching,  but this is really clunky.
So instead the work is pushed out (submitted) to everyone, causing
instability; this is partially remedied by doing it in smaller chunks.
Another perforce problem is that tagging consumes a lot of server
space (and may slow things down as well).

Some of this data will stay in perforce, some will move into revision
control built-in to some of our other tools, and I'd like to try to move some
of it into git.  The main attraction for the last group is the lightweight
branching that would allow early/tentative work to be easily shared.
I think the subproject work currently being discussed is going to
be very helpful as well -- the perforce equivalent is chaotic.

We could give each user a work tree and an object repository,
and then have a "release" repository.  Unfortunately,  this would be
slower to use than the current perforce "solution": users would check
in to their local repository, at the speed of gzip, anyone checking
it out would do so at the speed of gzip, and all work would need
to be resubmitted (using perforce jargon here) to the central repo,
again at the speed of gzip.  Currently, people either submit or
check out from the central repo, and it's all done at the speed of
a network copy.  This speed issue is important because of
the size of a commit we'd like to share (but not yet release):
about 40 files, half of them control files of several KB each, 1/4 of them
design files of several MB each, and the last 1/4 detailed design
files 100X larger.  These 40 files will reference (include) 50 others
of several KB each sprinkled through-out the hierarchy, a few of which
might have changed.  And yes, almost all of these are generated files,
but the generation time, and the instability of the tool and script environment,
preclude forcing the other users to regenerate them, like you would
with a .o file.

So, there are 2 alternative set-ups. In one, everyone uses a shared
object repository (everyone's .git/objects is a symlink to it). In this
repository, objects/. , objects/?? , objects/pack , and objects/info all
have "sticky" set, and we do the appropriate machinations to make
all files read-only. There would be an additional phantom user "git"
who owns the shared object repository (the only user whose .git/objects
is not a symlink).  Users would commit to their own repositories,
which would write data to the shared object repository and
update their refs (e.g. HEAD). To "release", push to the ~git repository.
This push would be like a current push -- fast-forward only, figure out the list
of objects that need to be transmitted -- but instead of transmitting the
objects, change their ownership to ~git and then update ~git's refs.
Since users can share local commits, maybe the ~git ownership
change should happen at commit time.  This all seems do-able
without change in git; instead I'd add a few bash wrapper scripts
(and see below for fsck and pack/prune).

Another setup is like the previous, but make the central repo have
its own hidden object repository. You would push to it using the
standard git command.

Finally, users could run git-fsck [with misleading output];
they could run git-prune{,-packed}, but these commands wouldn't
be able to delete anything.  If we don't want users to pack,
then ~git/.git/objects/pack would be writable only by ~git.
So basically, normal people wouldn't do the things in this paragraph.

To do meaningful and safe fsck/prune on the shared repository
as ~git,  I'd add some scripting.  If you require all users'
GIT_DIR's to look like /home/USER/*/.git , then you can get all
their refs and do a meaningful fsck.  If not, you could do a fsck
--unreachable as ~git and filter the result by date and/or type.
(This sort of corresponds to abandoned changesets in perforce.)
Once you have an fsck method you like, its filtered output (i.e.,
--unreachable objects you want to keep) can be fed to git-prune.

Care would also be required with git-repack/git-prune-packed,
but it seems mostly addressable with scheduling.

If I proceed down this path,  I'd like to implement this procedure
without any change in git's .c or .sh files.  It's clear this is a
minority use and should not depend on anything being maintained
for it inside git.  I would write a few bash scripts and a README/HOWTO
for possible inclusion in contrib.

BTW,
has anyone ever thought of writing an "Administrator's Manual" for git?

Thanks,
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell

  parent reply	other threads:[~2007-04-12 18:33 UTC|newest]

Thread overview: 101+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-04-10  4:12 [PATCH 0/6] Initial subproject support (RFC?) Linus Torvalds
     [not found] ` <Pi ne.LNX.4.64.0704092115020.6730@woody.linux-foundation.org>
2007-04-10  4:13 ` [PATCH 1/6] diff-lib: use ce_mode_from_stat() rather than messing with modes manually Linus Torvalds
2007-04-10  4:13 ` [PATCH 2/6] Avoid overflowing name buffer in deep directory structures Linus Torvalds
2007-04-10  4:14 ` [PATCH 3/6] Add 'resolve_gitlink_ref()' helper function Linus Torvalds
2007-04-10  9:38   ` Alex Riesen
2007-04-10 14:58     ` Linus Torvalds
2007-04-10 15:35       ` Alex Riesen
2007-04-10 15:52         ` Linus Torvalds
2007-04-10 15:57           ` Alex Riesen
2007-04-10 16:16             ` Linus Torvalds
2007-04-10 15:54       ` Josef Weidendorfer
2007-04-10  4:14 ` [PATCH 4/6] Add "S_IFDIRLNK" file mode infrastructure for git links Linus Torvalds
2007-04-10  4:15 ` [PATCH 5/6] Teach "fsck" not to follow subproject links Linus Torvalds
2007-04-11 22:41   ` Sam Vilain
2007-04-11 22:48     ` Linus Torvalds
2007-04-11 22:59       ` Sam Vilain
2007-04-11 23:16         ` Linus Torvalds
2007-04-11 23:05           ` David Lang
2007-04-11 23:53             ` Linus Torvalds
2007-04-11 23:30               ` David Lang
2007-04-12  2:14                 ` Linus Torvalds
2007-04-12  2:30                   ` Junio C Hamano
2007-04-12 17:18                   ` David Lang
2007-04-12 18:32                   ` Dana How [this message]
2007-04-12 19:17                     ` Linus Torvalds
2007-04-13  9:00                       ` Rogan Dawes
2007-04-13 15:23                         ` Linus Torvalds
2007-04-15  6:50                       ` Dana How
2007-04-12  0:00               ` Dana How
2007-04-12  0:03               ` Sam Vilain
2007-04-12  0:34           ` Junio C Hamano
2007-04-12  1:52             ` Linus Torvalds
2007-04-12  2:00               ` Junio C Hamano
2007-04-12  2:06                 ` Junio C Hamano
2007-04-12  2:28                   ` Linus Torvalds
2007-04-11 23:30         ` Dana How
2007-04-10  4:20 ` [PATCH 6/6] Teach core object handling functions about gitlinks Linus Torvalds
2007-04-10  8:40   ` Frank Lichtenheld
2007-04-10 11:31     ` Alex Riesen
2007-04-10 14:55     ` Linus Torvalds
2007-04-10 16:28   ` Josef Weidendorfer
2007-04-10 16:50     ` Alex Riesen
2007-04-10 17:23       ` Josef Weidendorfer
2007-04-10 18:45     ` Linus Torvalds
2007-04-10 19:04       ` Andy Parkins
2007-04-10 19:20         ` Linus Torvalds
2007-04-10 20:19           ` Junio C Hamano
2007-04-10 20:33             ` Linus Torvalds
2007-04-12  0:12               ` Sam Vilain
2007-04-12  0:35                 ` Martin Waitz
2007-04-12  2:01                 ` Linus Torvalds
2007-04-12  3:56                   ` Sam Vilain
2007-04-10 19:41         ` David Lang
2007-04-10 20:06         ` Junio C Hamano
2007-04-10 19:29       ` Josef Weidendorfer
2007-04-10 19:45         ` Linus Torvalds
2007-04-11 23:47           ` Sam Vilain
2007-04-12  0:13             ` Linus Torvalds
2007-04-12  0:42       ` Torgil Svensson
2007-04-12  0:56         ` Martin Waitz
2007-04-12 21:23           ` Torgil Svensson
2007-04-11 23:36     ` Sam Vilain
2007-04-11  8:06   ` Martin Waitz
2007-04-11  8:29     ` Alex Riesen
2007-04-11  8:36       ` Martin Waitz
2007-04-11  8:49         ` Alex Riesen
2007-04-11  9:20           ` Martin Waitz
2007-04-11  9:15         ` Junio C Hamano
2007-04-11 10:03           ` Martin Waitz
2007-04-11 20:01             ` Junio C Hamano
2007-04-11 22:19               ` Martin Waitz
2007-04-11 22:36                 ` Linus Torvalds
2007-04-11  9:47     ` Andy Parkins
2007-04-11 11:31       ` Martin Waitz
2007-04-11 15:16     ` Linus Torvalds
2007-04-11 22:49       ` Sam Vilain
2007-04-11 23:54       ` Martin Waitz
2007-04-12  1:57         ` Brian Gernhardt
2007-04-12 15:12         ` Josef Weidendorfer
2007-04-10  4:46 ` [PATCH 0/6] Initial subproject support (RFC?) Linus Torvalds
2007-04-10 13:04   ` Alex Riesen
2007-04-10 15:13     ` Linus Torvalds
2007-04-10 15:48       ` Alex Riesen
2007-04-10 16:07         ` Linus Torvalds
2007-04-10 16:43           ` Alex Riesen
2007-04-10 19:32           ` Junio C Hamano
2007-04-10 20:11             ` Linus Torvalds
2007-04-10 20:52               ` Junio C Hamano
2007-04-10 21:02                 ` Sam Ravnborg
2007-04-10 21:27                   ` Junio C Hamano
2007-04-10 21:03                 ` Nicolas Pitre
2007-04-15 23:21                   ` J. Bruce Fields
2007-04-11  8:08                 ` David Kågedal
2007-04-11  9:32                   ` Junio C Hamano
2007-04-15 23:25                     ` J. Bruce Fields
2007-04-11  8:32     ` Martin Waitz
2007-04-11  8:42       ` Alex Riesen
2007-04-11  8:57         ` Martin Waitz
2007-04-10 13:39   ` [PATCH] allow git-update-index work on subprojects Alex Riesen
2007-04-10 23:19     ` [PATCH] Allow " Alex Riesen
2007-04-11  2:55       ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=56b7f5510704121132g3961060amb394978bb49093e6@mail.gmail.com \
    --to=danahow@gmail.com \
    --cc=david.lang@digitalinsight.com \
    --cc=git@vger.kernel.org \
    --cc=junkio@cox.net \
    --cc=sam@vilain.net \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).