git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [BUG] `git push` sends unnecessary objects
@ 2023-09-13 22:59 Javier Mora
  2023-09-17 13:21 ` Bagas Sanjaya
  0 siblings, 1 reply; 4+ messages in thread
From: Javier Mora @ 2023-09-13 22:59 UTC (permalink / raw
  To: git

I came across this issue accidentally when trying to move a directory
containing a very large file, and deleting another file in that
directory while I was at it.
It seems to be caused by `pack.useSparse=true` being the default since
v2.27 (which I found out after spending quite a while manually
bisecting and compiling git since I noticed that this didn't happen in
v2.25; commit de3a864 introduces this regression).

* Expected:
    Pushing a commit that moves a file without modifying it shouldn't
require sending a blob object for that file, since the remote server
already has that blob object.
* Observed:
    Pushing a commit that moves a directory containing a file and also
adds/deletes other files in that directory will for some reason also
send blobs for all the files in that directory, even the ones that
were already in the remote.
* Consequences:
    This has a very big impact in push times for very small commits
that just move around files, if those files are very big (I had this
happen with a >100MB file over a problematic connection... yikes!)
* Note:
    The commit introducing the regression does warn about possible
scenarios involving a special arrangement of exact copies across
directories, but these are not "copies", I just moved a file, which
seems like a rather common operation.

Code snippet for reproduction:
```
mkdir TEST_git
cd TEST_git

mkdir -p local remote/origin.git
cd remote/origin.git
git init --bare
cd ../../local
git init
git remote add origin file://"${PWD%/*}"/remote/origin.git

mkdir zig
for i in a b c d e; do
    dd if=/dev/urandom of=zig/"$i" bs=1M count=1
done
git add .
git commit -m 'Add big files'
git push -u origin master
#>> Writing objects: 100% (8/8), 5.00 MiB | 13.27 MiB/s, done.
#^ makes sense: 1 commit + 2 trees (/ and /zig) + 5 files = 8;
#  5 MiB in total for the 5x 1 MiB binary files

git mv zig zag
git commit -m 'Move zig'
git push
#>> Writing objects: 100% (2/2), 233 bytes | 233.00 KiB/s, done.
#^ makes sense: 1 commit + 1 tree (/ renames /zig to /zag) = 2;
#  a,b,c,d,e objects already in remote

git mv zag zog
touch zog/f
git add zog/f
git commit -m 'For great justice'
git push
#>> Writing objects: 100% (9/9), 5.00 MiB | 24.63 MiB/s, done.
#^ It re-uploaded the 5x 1 MiB blobs
#  even though remote already had them.
```

Note that the latter doesn't happen if I use `git -c pack.useSparse=false push`.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [BUG] `git push` sends unnecessary objects
  2023-09-13 22:59 [BUG] `git push` sends unnecessary objects Javier Mora
@ 2023-09-17 13:21 ` Bagas Sanjaya
  2023-11-25 14:54   ` Javier Mora
  0 siblings, 1 reply; 4+ messages in thread
From: Bagas Sanjaya @ 2023-09-17 13:21 UTC (permalink / raw
  To: Javier Mora, Git Mailing List; +Cc: Derrick Stolee, Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 2863 bytes --]

On Wed, Sep 13, 2023 at 11:59:35PM +0100, Javier Mora wrote:
> I came across this issue accidentally when trying to move a directory
> containing a very large file, and deleting another file in that
> directory while I was at it.
> It seems to be caused by `pack.useSparse=true` being the default since
> v2.27 (which I found out after spending quite a while manually
> bisecting and compiling git since I noticed that this didn't happen in
> v2.25; commit de3a864 introduces this regression).
> 
> * Expected:
>     Pushing a commit that moves a file without modifying it shouldn't
> require sending a blob object for that file, since the remote server
> already has that blob object.
> * Observed:
>     Pushing a commit that moves a directory containing a file and also
> adds/deletes other files in that directory will for some reason also
> send blobs for all the files in that directory, even the ones that
> were already in the remote.
> * Consequences:
>     This has a very big impact in push times for very small commits
> that just move around files, if those files are very big (I had this
> happen with a >100MB file over a problematic connection... yikes!)
> * Note:
>     The commit introducing the regression does warn about possible
> scenarios involving a special arrangement of exact copies across
> directories, but these are not "copies", I just moved a file, which
> seems like a rather common operation.
> 
> Code snippet for reproduction:
> ```
> mkdir TEST_git
> cd TEST_git
> 
> mkdir -p local remote/origin.git
> cd remote/origin.git
> git init --bare
> cd ../../local
> git init
> git remote add origin file://"${PWD%/*}"/remote/origin.git
> 
> mkdir zig
> for i in a b c d e; do
>     dd if=/dev/urandom of=zig/"$i" bs=1M count=1
> done
> git add .
> git commit -m 'Add big files'
> git push -u origin master
> #>> Writing objects: 100% (8/8), 5.00 MiB | 13.27 MiB/s, done.
> #^ makes sense: 1 commit + 2 trees (/ and /zig) + 5 files = 8;
> #  5 MiB in total for the 5x 1 MiB binary files
> 
> git mv zig zag
> git commit -m 'Move zig'
> git push
> #>> Writing objects: 100% (2/2), 233 bytes | 233.00 KiB/s, done.
> #^ makes sense: 1 commit + 1 tree (/ renames /zig to /zag) = 2;
> #  a,b,c,d,e objects already in remote
> 
> git mv zag zog
> touch zog/f
> git add zog/f
> git commit -m 'For great justice'
> git push
> #>> Writing objects: 100% (9/9), 5.00 MiB | 24.63 MiB/s, done.
> #^ It re-uploaded the 5x 1 MiB blobs
> #  even though remote already had them.
> ```
> 
> Note that the latter doesn't happen if I use `git -c pack.useSparse=false push`.

I can reproduce this regression on v2.42.0 (self-compiled) on my Debian
testing system.

Cc'ing Derrick and Junio.

Thanks for the report!

-- 
An old man doll... just what I always wanted! - Clara

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [BUG] `git push` sends unnecessary objects
  2023-09-17 13:21 ` Bagas Sanjaya
@ 2023-11-25 14:54   ` Javier Mora
       [not found]     ` <PH0PR00MB1349BB447657A94A8EA90A78A183A@PH0PR00MB1349.namprd00.prod.outlook.com>
  0 siblings, 1 reply; 4+ messages in thread
From: Javier Mora @ 2023-11-25 14:54 UTC (permalink / raw
  To: Bagas Sanjaya; +Cc: Git Mailing List, Derrick Stolee, Junio C Hamano

Apparently if I do that in two commits (one to move the dir, and a
second one add the file), and then push all that after the second
commit, this doesn't happen -- the resulting push will only contain 6
objects (2 commits, 3 trees, and 1 file), and be a few bytes large.

El dom, 17 sept 2023 a las 14:21, Bagas Sanjaya
(<bagasdotme@gmail.com>) escribió:
>
> On Wed, Sep 13, 2023 at 11:59:35PM +0100, Javier Mora wrote:
> > I came across this issue accidentally when trying to move a directory
> > containing a very large file, and deleting another file in that
> > directory while I was at it.
> > It seems to be caused by `pack.useSparse=true` being the default since
> > v2.27 (which I found out after spending quite a while manually
> > bisecting and compiling git since I noticed that this didn't happen in
> > v2.25; commit de3a864 introduces this regression).
> >
> > * Expected:
> >     Pushing a commit that moves a file without modifying it shouldn't
> > require sending a blob object for that file, since the remote server
> > already has that blob object.
> > * Observed:
> >     Pushing a commit that moves a directory containing a file and also
> > adds/deletes other files in that directory will for some reason also
> > send blobs for all the files in that directory, even the ones that
> > were already in the remote.
> > * Consequences:
> >     This has a very big impact in push times for very small commits
> > that just move around files, if those files are very big (I had this
> > happen with a >100MB file over a problematic connection... yikes!)
> > * Note:
> >     The commit introducing the regression does warn about possible
> > scenarios involving a special arrangement of exact copies across
> > directories, but these are not "copies", I just moved a file, which
> > seems like a rather common operation.
> >
> > Code snippet for reproduction:
> > ```
> > mkdir TEST_git
> > cd TEST_git
> >
> > mkdir -p local remote/origin.git
> > cd remote/origin.git
> > git init --bare
> > cd ../../local
> > git init
> > git remote add origin file://"${PWD%/*}"/remote/origin.git
> >
> > mkdir zig
> > for i in a b c d e; do
> >     dd if=/dev/urandom of=zig/"$i" bs=1M count=1
> > done
> > git add .
> > git commit -m 'Add big files'
> > git push -u origin master
> > #>> Writing objects: 100% (8/8), 5.00 MiB | 13.27 MiB/s, done.
> > #^ makes sense: 1 commit + 2 trees (/ and /zig) + 5 files = 8;
> > #  5 MiB in total for the 5x 1 MiB binary files
> >
> > git mv zig zag
> > git commit -m 'Move zig'
> > git push
> > #>> Writing objects: 100% (2/2), 233 bytes | 233.00 KiB/s, done.
> > #^ makes sense: 1 commit + 1 tree (/ renames /zig to /zag) = 2;
> > #  a,b,c,d,e objects already in remote
> >
> > git mv zag zog
> > touch zog/f
> > git add zog/f
> > git commit -m 'For great justice'
> > git push
> > #>> Writing objects: 100% (9/9), 5.00 MiB | 24.63 MiB/s, done.
> > #^ It re-uploaded the 5x 1 MiB blobs
> > #  even though remote already had them.
> > ```
> >
> > Note that the latter doesn't happen if I use `git -c pack.useSparse=false push`.
>
> I can reproduce this regression on v2.42.0 (self-compiled) on my Debian
> testing system.
>
> Cc'ing Derrick and Junio.
>
> Thanks for the report!
>
> --
> An old man doll... just what I always wanted! - Clara


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [EXTERNAL] Re: [BUG] `git push` sends unnecessary objects
       [not found]     ` <PH0PR00MB1349BB447657A94A8EA90A78A183A@PH0PR00MB1349.namprd00.prod.outlook.com>
@ 2023-11-30 13:33       ` Javier Mora
  0 siblings, 0 replies; 4+ messages in thread
From: Javier Mora @ 2023-11-30 13:33 UTC (permalink / raw
  To: Derrick Stolee; +Cc: Bagas Sanjaya, Git Mailing List, Junio C Hamano

> I just want to push back on the word "regression" here as it is an expected result.

Yeah, I wouldn't quite call it a "regression" either; just an
inconvenient side effect of a new feature.

>> However, it is possible that extra objects are added to the pack-file if the included commits contain certain types of direct renames.
>
> [1] https://git-scm.com/docs/git-config#Documentation/git-config.txt-packuseSparse
>
> It is unfortunate that this user hit this problem, but it is easy to work around it

I eventually found that note, but not until I figured out that that
option was the one causing the trouble.  So yes, it is easy to work
around this issue when you know about it.  Maybe this potential corner
case should be given more visibility in the documentation so that it
is easier to spot?  (Not just on the documentation for the option
itself, but maybe in the `git push` one.)

> and the benefits of the sparse algorithm should outweigh these kinds of infrequent issues (in my opinion).

Well, I don't use git THAT often and still got hit by this issue, so
it might not be that uncommon.  And when the files are large, the time
spent sending them through an internet connection will probably
outweigh the processing time saved by skipping some files in the local
processing.

I wonder if it would make sense to try to make the algorithm smarter
to avoid this corner case.  For example, ask the server if it already
has the objects before sending them, or modify the algorithm to look
into trees that have added OR deleted files (to detect that a file has
actually been moved, and thus the server must already have it), or
simply be extra careful when large files are involved.  But I don't
know the details of the algorithm so I'm not sure all those
suggestions are feasible or even make sense at all.


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2023-11-30 13:33 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-13 22:59 [BUG] `git push` sends unnecessary objects Javier Mora
2023-09-17 13:21 ` Bagas Sanjaya
2023-11-25 14:54   ` Javier Mora
     [not found]     ` <PH0PR00MB1349BB447657A94A8EA90A78A183A@PH0PR00MB1349.namprd00.prod.outlook.com>
2023-11-30 13:33       ` [EXTERNAL] " Javier Mora

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).