* [PATCH] git-p4: properly initialize the largefiles
@ 2022-04-26 18:59 Bertin Philippe via GitGitGadget
2022-04-26 20:40 ` Junio C Hamano
0 siblings, 1 reply; 2+ messages in thread
From: Bertin Philippe via GitGitGadget @ 2022-04-26 18:59 UTC (permalink / raw)
To: git; +Cc: Bertin Philippe, tumata
From: tumata <tumata.codes@gmail.com>
This commit takes care of initializing the largeFiles
with the already existing files referenced in the .gitattributes
instead of initializing with an empty set by default.
This makes it so that the 'git p4 sync' command works as
expected with git-lfs.
Signed-off-by: tumata <tumata.codes@gmail.com>
---
git-p4: Properly initialize the largeFiles set so that it includes the
already existing large files
git-p4 sync isn't taking into account existing files that are managed by
git-lfs. This is fine when doing git p4 clone but doing git p4 sync
needs to take into account the files that are already managed by
git-lfs.
This change reads-in the .gitattributes and re-generate the data so that
sync can iteratively add files to the git-lfs. It takes care of
initializing the largeFiles with the already existing files referenced
in the .gitattributes instead of initializing with an empty set by
default.
We had issues cloning the first commit, then syncing the remaining
commits of a really large Perforce repo (making hundreds of smaller sync
instead of one massive sync). The sync was deleting the git-lfs each
time. We were able to fix this issue by first reading the current
content of git-lfs instead of initializing to an empty set.
Signed-off-by: tumata 5684571+tumata@users.noreply.github.com
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1253%2Ftumata%2Fgit-p4-initialize-large-files-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1253/tumata/git-p4-initialize-large-files-v1
Pull-Request: https://github.com/git/git/pull/1253
git-p4.py | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)
diff --git a/git-p4.py b/git-p4.py
index a9b1f904410..576e923a1de 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -1195,9 +1195,28 @@ class LargeFileSystem(object):
"""Base class for large file system support."""
def __init__(self, writeToGitStream):
- self.largeFiles = set()
+ self.largeFiles = self.parseLargeFiles()
self.writeToGitStream = writeToGitStream
+ def parseLargeFiles(self):
+ """Parse large files in order to initially populate 'largeFiles'"""
+ paths = set()
+ try:
+ cmd = ['git', 'show', 'p4/master:.gitattributes']
+ p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
+ if p.returncode:
+ print("Failed to read .gitattributed - error code: " + p.returncode)
+ raise
+ out = p.stdout.readlines()
+ for line in out:
+ if line.startswith('/'):
+ path = line[1:].split(' ', 1)[0]
+ path = path.replace('[[:space:]]', ' ')
+ paths.add(path)
+ except:
+ print("parseLargeFiles: .gitattributes does not appear to exist.")
+ return paths
+
def generatePointer(self, cloneDestination, contentFile):
"""Return the content of a pointer file that is stored in Git instead of
the actual content."""
base-commit: 6cd33dceed60949e2dbc32e3f0f5e67c4c882e1e
--
gitgitgadget
^ permalink raw reply related [flat|nested] 2+ messages in thread
* Re: [PATCH] git-p4: properly initialize the largefiles
2022-04-26 18:59 [PATCH] git-p4: properly initialize the largefiles Bertin Philippe via GitGitGadget
@ 2022-04-26 20:40 ` Junio C Hamano
0 siblings, 0 replies; 2+ messages in thread
From: Junio C Hamano @ 2022-04-26 20:40 UTC (permalink / raw)
To: Bertin Philippe via GitGitGadget; +Cc: git, Bertin Philippe
"Bertin Philippe via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: tumata <tumata.codes@gmail.com>
>
> This commit takes care of initializing the largeFiles
> with the already existing files referenced in the .gitattributes
> instead of initializing with an empty set by default.
>
> This makes it so that the 'git p4 sync' command works as
> expected with git-lfs.
Why is this change needed? How well did the old code work, and how
does the code with this patch work, and what's the difference
between them?
To answer such questions better, we prefer certain flow of logic in
our proposed commit log message. We start by talking about what the
current code does, e.g.
The LargeFileSystem class initializes the .largeFile attribute
to an empty set, ignoring the paths that are already known to be
X. This causes "git p4 sync" subcommand to do Y.
But "git p4 sync" subcommand needs to do Z instead. For that,
populate the .largeFile attribute upfront by readind W from V.
I have little idea what problem this patch is trying to address, so
there are many placeholders in the above, but hopefully you get the
idea.
... the reviewer goes and looks ahead; sees a lot more plausible
explanation below ...
> Signed-off-by: tumata <tumata.codes@gmail.com>
Name mismatch that should have said Bertin Philippe instead?
As Documentation/SubmittingPatches says, the Signed-off-by: line
wants the patch author's real name, and the From: line wants to show
the same name-address that is on the Signed-off-by: line.
> ---
> git-p4: Properly initialize the largeFiles set so that it includes the
> already existing large files
>
> git-p4 sync isn't taking into account existing files that are managed by
> git-lfs. This is fine when doing git p4 clone but doing git p4 sync
> needs to take into account the files that are already managed by
> git-lfs.
>
> This change reads-in the .gitattributes and re-generate the data so that
> sync can iteratively add files to the git-lfs. It takes care of
> initializing the largeFiles with the already existing files referenced
> in the .gitattributes instead of initializing with an empty set by
> default.
>
> We had issues cloning the first commit, then syncing the remaining
> commits of a really large Perforce repo (making hundreds of smaller sync
> instead of one massive sync). The sync was deleting the git-lfs each
> time. We were able to fix this issue by first reading the current
> content of git-lfs instead of initializing to an empty set.
That sounds more like what we want to record in the commit, but
writing them after the three-dash line does not help anybody.
> git-p4.py | 21 ++++++++++++++++++++-
> 1 file changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/git-p4.py b/git-p4.py
> index a9b1f904410..576e923a1de 100755
> --- a/git-p4.py
> +++ b/git-p4.py
> @@ -1195,9 +1195,28 @@ class LargeFileSystem(object):
> """Base class for large file system support."""
>
> def __init__(self, writeToGitStream):
> - self.largeFiles = set()
> + self.largeFiles = self.parseLargeFiles()
> self.writeToGitStream = writeToGitStream
OK, we can see that it used to be an empty set, but now we are
computing something.
> + def parseLargeFiles(self):
> + """Parse large files in order to initially populate 'largeFiles'"""
> + paths = set()
> + try:
> + cmd = ['git', 'show', 'p4/master:.gitattributes']
Curious. How is 'p4/master' branch so special? Is it always, no
matter what branch you are on, correct to read from that branch?
Shouldn't you be reading from the current branch instead?
> + p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
> + if p.returncode:
> + print("Failed to read .gitattributed - error code: " + p.returncode)
ted -> tes???
> + raise
> + out = p.stdout.readlines()
> + for line in out:
So, we read the contents of .gitattributes file line-by-line, and ...
> + if line.startswith('/'):
> + path = line[1:].split(' ', 1)[0]
> + path = path.replace('[[:space:]]', ' ')
> + paths.add(path)
... without checking what attribute each line is setting, we collect
the first token (the "pattern" to match paths, for which attributes
are given to) and add it to the "paths" set. The code does not care
what attribute the line was trying to add, and it just collects the
patterns that appear on each line.
I am having a hard time to see how this could possibly be correct.
* The first token on .gitattributes file could be '/*'
(i.e. match everything at the top of this working tree); is
the set in the .largeFiles attribute prepared to take such a
pattern, instead of concrete paths? Judging from the way the
set is used (e.g. isLargeFile() is fed a path and checks the
set membership, not pattern matching), I do not think so.
* The first token may happen to be the name of a single file but
it may be setting an attribute that has nothing to do with
LFS (e.g. "/hello.txt text"). What good does it do to add
"hello.txt" to the set of largeFiles?
* The first token may happen to be the name of an existing file
that indeed is under LFS, but the path may not be in the
current branch at all, even though it appears in the p4/master
branch. What good does it do to add such a non-existent path
to the set of largeFiles?
> + except:
> + print("parseLargeFiles: .gitattributes does not appear to exist.")
> + return paths
> +
> def generatePointer(self, cloneDestination, contentFile):
> """Return the content of a pointer file that is stored in Git instead of
> the actual content."""
In any case, if this is a patch to fix existing breakage, it should
come with a new test that demonstrates how the corrected command
("git p4 sync"?) behaves.
Thanks.
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2022-04-26 20:41 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-26 18:59 [PATCH] git-p4: properly initialize the largefiles Bertin Philippe via GitGitGadget
2022-04-26 20:40 ` Junio C Hamano
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).