* "Garbage collect" old commits in git repository to free disk space
@ 2020-02-18 0:29 Tomas Mudrunka
From: Tomas Mudrunka @ 2020-02-18 0:29 UTC (permalink / raw)
To: git
Hello,
Is there a safe way to garbage collect old commits from a git repository?
Let's say that I want to keep only the last 100 commits and throw
everything older away, to achieve a similar goal as git clone --depth=100,
but on the server side. I had partial success with doing a shallow clone
and then converting it to a bare repo while removing the shallow flag from
.git/config. But I didn't like that solution and wasn't really sure what
the consequences might be in terms of data integrity and forward
compatibility with newer git versions.
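Roughly, the procedure I tried looks like this (with a placeholder URL; the
last step is exactly the part I'm unsure about):

```shell
# shallow-clone only the last 100 commits, directly as a bare repo
git clone --bare --depth=100 https://example.com/repo.git trimmed.git
cd trimmed.git
cat shallow     # git records the history boundary commit(s) in this file
rm shallow      # drop the marker; the repo now silently references
                # parent commits it does not actually have
```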
To tell you more about my USE CASE:
I want to create free open-source software similar to Dropbox, but based
on git. My idea is the following:
1.) Automatically pull/commit/push changed files to/from several laptops
to a single git server (and forcefully resolve all conflicts; this will
work unless you plan to use it for software development).
2.) On the central server, maintain tags indicating the latest commits
synchronized to individual laptops.
3.) On the server, delete old commits that are no longer needed by any
laptop to sync its worktree. Once synced, delete these commits on the
laptops as well. (Optionally leave e.g. 1 month or 1 GB of old commits in
case you might need to roll back. Possibly keep the history only on the
server, while deleting it from the clients.)
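One sync cycle for step 1 could be sketched like this (untested; $SYNC_DIR
and the origin/master names are placeholders, and "-X theirs" is just one
blunt way to force-resolve conflicts):

```shell
cd "$SYNC_DIR"
git add -A                                    # stage all changes, incl. deletions
git diff --cached --quiet || \
    git commit -m "sync $(hostname) $(date -u +%Y-%m-%dT%H:%M:%SZ)"
git pull --no-rebase -X theirs origin master  # merge, resolving conflicts remotely
git push origin master                        # publish our state to the server
```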
This way the computers can stay in sync forever without running out of
disk space, because old commits are removed. E.g. if I accidentally add
some very big file to the synced folder and then delete it, it will
eventually get deleted once everybody gets in sync again.
I am aware that this is not something git was designed for, but to me it
seems like it should be more than doable. Do you think any of you could
give me some hints on how to approach this problem, please?
These are some projects which inspired me to explore this route:
https://github.com/presslabs/gitfs
https://www.syncany.org/
https://www.cis.upenn.edu/~bcpierce/unison/
https://etckeeper.branchable.com/
--
Best regards
Tomáš Mudruňka - SPOJE.NET s.r.o.
* Re: "Garbage collect" old commits in git repository to free disk space
From: Jeff King @ 2020-02-18 5:51 UTC (permalink / raw)
To: Tomas Mudrunka; +Cc: git
On Tue, Feb 18, 2020 at 01:29:24AM +0100, Tomas Mudrunka wrote:
> Is there a safe way to garbage collect old commits from a git repository?
> Let's say that I want to keep only the last 100 commits and throw
> everything older away, to achieve a similar goal as git clone --depth=100,
> but on the server side. I had partial success with doing a shallow clone
> and then converting it to a bare repo while removing the shallow flag from
> .git/config. But I didn't like that solution and wasn't really sure what
> the consequences might be in terms of data integrity and forward
> compatibility with newer git versions.
I can't say for sure, but what you did with the shallow file is likely
to bite you later. Shallow repositories are supposed to know where their
boundary cutoffs are, and that information is stored in that file.
The normal answer here is that you'd want to rewrite the history using
grafts and git-filter-branch, or the new git-filter-repo. But...
> To tell you more about my USE CASE:
>
> I want to create free open-source software similar to Dropbox, but based
> on git. My idea is the following:
...I think the rewrite would defeat the purpose, since you're relying on
the stability of the hashes to let all sides of the conversation figure
out when they're in sync.
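For the record, the graft-based truncation would look roughly like this (a
sketch keeping the last 100 commits; git-filter-repo has its own interface
for the same idea):

```shell
new_root=$(git rev-parse HEAD~99)   # oldest commit we want to keep
git replace --graft "$new_root"     # pretend it has no parents
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f -- --all
git replace -d "$new_root"          # drop the now-baked-in replacement ref
git reflog expire --expire=now --all
git gc --prune=now                  # actually reclaim the disk space
```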
> I am aware that this is not something git was designed for, but to me it
> seems like it should be more than doable. Do you think any of you could
> give me some hints on how to approach this problem, please?
I don't know that there's an easy way. Git is close to what you want,
but really is designed to assume the other side has all the reachable
objects. Shallow clones are the feature that's closest to what you
want, but:
- I haven't had good experiences with repeatedly fetching into a
shallow clone. I believe the shallow list can grow because the
client doesn't realize which commits are reachable from others (and
hence are redundant). And I have seen shallow cuts crossing merge
boundaries cause a lot of extra objects to be transferred.
- It's really designed for _some_ repository to have all of the
objects. Fetching out of a shallow repository does work, I think,
but I would guess isn't very well exercised. So I have no idea what
kind of dragons you'd encounter.
Another option would be to periodically rewrite the history to a point
that you think all clients have synced, and then somehow communicate the
rewrite to them (outside of Git, but it sounds like you'd have software
wrapping Git). And then they could all do the identical rewrite and keep
going.
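Since git-filter-branch preserves the original author and committer data,
such a rewrite is deterministic: if every client grafts at the same
agreed-upon commit and reruns the identical commands, they all arrive at
the same new hashes (sketch; $AGREED_COMMIT_ID would come from whatever
software wraps Git):

```shell
agreed_root=$AGREED_COMMIT_ID       # new root, communicated out of band
git replace --graft "$agreed_root"  # same graft point on every machine
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f -- --all
git replace -d "$agreed_root"       # every side now has identical history
```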
Also look at git-annex if you haven't, which I think supports this kind
of history truncation. I don't remember how it works exactly, but I
think that it may not put the blobs into Git at all, but rather just
pointers. So you're free to remove the actual data, but the pointer
remains (and will later say "sorry, I can't get that data for you").
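The pointer idea can be sketched in a few lines of shell (this is NOT
git-annex's real on-disk format, just an illustration of the principle;
the store path is made up):

```shell
store=../blob-store                           # hypothetical out-of-tree store
mkdir -p "$store"
key=$(sha256sum bigfile.bin | cut -d' ' -f1)  # content-address the payload
mv bigfile.bin "$store/$key"                  # the blob lives outside git...
printf 'pointer:sha256:%s\n' "$key" > bigfile.bin
git add bigfile.bin                           # ...only the tiny pointer is tracked
# reclaiming space later is just: rm "$store/$key" (the pointer remains)
```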
-Peff