git@vger.kernel.org mailing list mirror (one of many)
* Fastest way to set files date and time to latest commit time of each one
@ 2020-08-29  1:36 Ivan Baldo
  2020-08-29  3:20 ` Junio C Hamano
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Ivan Baldo @ 2020-08-29  1:36 UTC (permalink / raw)
  To: git

  Hello.
  I know this is not standard usage of git, but I need a way to have
more stable dates and times in the files in order to avoid rsync
checksumming.
  So I found this
https://stackoverflow.com/questions/2179722/checking-out-old-file-with-original-create-modified-timestamps/2179876#2179876
and modified it a bit to run in CentOS 7:

IFS="
"
for FILE in $(git ls-files -z | tr '\0' '\n')
do
    TIME=$(git log --pretty=format:%cd -n 1 --date=iso -- "$FILE")
    touch -c -m -d "$TIME" "$FILE"
done

  Unfortunately it takes ages for an 84k-file repo.
  I see the CPU usage is dominated by the git log command.
  I know a way I could split the work across all the CPU threads,
but anyway, I would like to know if you guys and girls know of a
faster way to do this.
  Also I know of other utilities that store the metadata in Git, but I
am trying to avoid that for the moment.
  Thanks a lot in advance!
  Have a nice day.
P.s.: please Cc replies to me.

-- 
Ivan Baldo - ibaldo@gmail.com - http://ibaldo.codigolibre.net/
Freelance C++/PHP programmer and GNU/Linux systems administrator.
The sky isn't the limit!


* Re: Fastest way to set files date and time to latest commit time of each one
  2020-08-29  1:36 Fastest way to set files date and time to latest commit time of each one Ivan Baldo
@ 2020-08-29  3:20 ` Junio C Hamano
  2020-08-29  4:59   ` Raymond E. Pasco
  2020-08-29  4:48 ` Eric Wong
  2020-08-29  6:46 ` Andreas Schwab
  2 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2020-08-29  3:20 UTC (permalink / raw)
  To: Ivan Baldo; +Cc: git

Ivan Baldo <ibaldo@gmail.com> writes:

>   I know this is not standard usage of git, but I need a way to have
> more stable dates and times in the files in order to avoid rsync
> checksumming.

Would you care to elaborate a bit more about the use case?  From
what you wrote, I would assume:

 - The source of the rsync transfer is a git working tree.  It often
   has the checkout of the latest and greatest version, but during
   development, it may switch to an older commit (e.g. to find where
   a regression occurred) or a not-yet-ready commit (e.g. work in
   progress that is not given to upstream).  You check out the
   version you want to sync to the destination before initiating
   rsync.

 - The destination of the rsync transfer is meant to serve as a
   back-up of the latest and greatest, a periodical snapshot of a
   branch, etc., which is NOT controlled by git, and the transfer
   does not happen in the reverse direction [*1*].

Because the working tree of the source repository is used to check
out different versions between rsync sessions, files that did not
change between the commit you sync'ed to the destination last time
and the commit you are about to sync may still have been touched
and have a different timestamp, requiring rsync to check their
contents.

And as a workaround, you are willing to change the workflow to
"touch" the working tree files, immediately before you run the next
rsync, in a predictable way, so that a file whose contents did not
change since the last rsync session ends up with the same timestamp
it had then.  This may break your build the next time you run "make"
in the source working tree (because object files, which are excluded
from your rsync, may have a newer timestamp than the corresponding
source even when they must be recompiled, due to your "touch"ing),
but you are willing to pay the cost of, say, a "make clean" after
"touch"ing.

Is that the kind of use case you have around "rsync"?

To the question "what is the time this file was last modified?",
there is no simple and cheap answer that is easy to explain to
end-users, unless your development is completely linear [*2*].

The loop you showed would be the right one in a linear history, and
with the recent development to record which paths were changed by
each commit in the commit-graph data structure, the script should
run a lot faster than it does with traditional git.
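
For example, something along these lines (a sketch, assuming a git
recent enough to know about changed-path Bloom filters; double-check
the option names against your version) would let the path-limited
"git log -1" calls in your loop skip most commits that did not touch
the path:

    # write the commit-graph once, with changed-path Bloom filters
    git commit-graph write --reachable --changed-paths

    # path-limited queries then get to use the filters automatically
    git log -1 --format=%cd --date=iso -- "$FILE"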


[Footnote]

*1* Otherwise, you'd be just mirror-fetching from the source
    repository.  If that can be arranged, running "git pull --ff-only"
    on the destination side to update from the source side would
    be a lot more efficient than running rsync, I would imagine.
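
    A rough sketch of that alternative (the host and path names are
    made up), assuming the destination machine can reach the source
    repository directly:

        git clone user@source:/srv/project.git destination  # once
        cd destination && git pull --ff-only                # each update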


*2* In a history with merges, two or more branches can touch the
    file at different times during parallel development, and the
    resulting parallel histories then get merged into a single
    history.  When two or more of these parallel histories gave the
    file in question an identical content at different times, and
    the merge result was recorded as the same content, you'd need to
    follow ALL the paths and compare the timestamps of these commits
    to pick one (which one?  the oldest one?  the newest one?  does
    the order of parents in the merge matter?).
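
    One arbitrary policy, purely as an illustration (it side-steps
    the question by ignoring the side branches altogether), would be
    to follow only the first-parent chain:

        git log -1 --first-parent --format=%cd --date=iso -- "$FILE"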



* Re: Fastest way to set files date and time to latest commit time of each one
  2020-08-29  1:36 Fastest way to set files date and time to latest commit time of each one Ivan Baldo
  2020-08-29  3:20 ` Junio C Hamano
@ 2020-08-29  4:48 ` Eric Wong
  2020-09-02 19:28   ` Ivan Baldo
  2020-08-29  6:46 ` Andreas Schwab
  2 siblings, 1 reply; 6+ messages in thread
From: Eric Wong @ 2020-08-29  4:48 UTC (permalink / raw)
  To: Ivan Baldo; +Cc: git

Ivan Baldo <ibaldo@gmail.com> wrote:
>   Hello.
>   I know this is not standard usage of git, but I need a way to have
> more stable dates and times in the files in order to avoid rsync
> checksumming.
>   So I found this
> https://stackoverflow.com/questions/2179722/checking-out-old-file-with-original-create-modified-timestamps/2179876#2179876
> and modified it a bit to run in CentOS 7:
> 
> IFS="
> "
> for FILE in $(git ls-files -z | tr '\0' '\n')
> do
>     TIME=$(git log --pretty=format:%cd -n 1 --date=iso -- "$FILE")
>     touch -c -m -d "$TIME" "$FILE"
> done
> 
>   Unfortunately it takes ages for an 84k-file repo.
>   I see the CPU usage is dominated by the git log command.

running git log for each file isn't necessary.

On Debian, rsync actually ships the `git-set-file-times' script
in /usr/share/doc/rsync/scripts/ which only runs `git log' once
and parses it.

You can also get my (original) version from:
https://yhbt.net/git-set-file-times
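
Roughly, usage looks like this (untested; the install path varies by
distribution, and the shipped copy may need to be copied somewhere
and made executable first):

    cd /path/to/worktree
    perl /usr/share/doc/rsync/scripts/git-set-file-times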

>   I know a way I could split the work across all the CPU threads,
> but anyway, I would like to know if you guys and girls know of a
> faster way to do this.

Much of your overhead is going to be from process spawning.
My Perl version reduces that significantly.

I haven't tried it with 84K files, but it'll have to keep all
those filenames in memory.  I'm not sure if parallelizing
utime() syscalls is worth it, either; maybe it helps on SSD
more than HDD.


* Re: Fastest way to set files date and time to latest commit time of each one
  2020-08-29  3:20 ` Junio C Hamano
@ 2020-08-29  4:59   ` Raymond E. Pasco
  0 siblings, 0 replies; 6+ messages in thread
From: Raymond E. Pasco @ 2020-08-29  4:59 UTC (permalink / raw)
  To: Junio C Hamano, Ivan Baldo; +Cc: git

On Fri Aug 28, 2020 at 11:20 PM EDT, Junio C Hamano wrote:
> - The source of the rsync transfer is a git working tree. It often
> has the checkout of the latest and greatest version, but during
> development, it may switch to an older commit (e.g. to find where
> a regression occurred) or a not-yet-ready commit (e.g. work in
> progress that is not given to upstream). You check out the
> version you want to sync to the destination before initiating
> rsync.

Assuming this is the case, perhaps using a separate worktree that
you touch less often as the source for rsync might save rsync some
bandwidth/CPU spent checking things.
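
A rough sketch of that setup (the worktree path, branch name, and
rsync destination below are made up):

    # one-time setup: a detached checkout used only as the rsync source
    git worktree add --detach ../deploy master

    # just before each sync, update it, then rsync from there
    git -C ../deploy checkout --detach master
    rsync -a ../deploy/ dest:/srv/site/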


* Re: Fastest way to set files date and time to latest commit time of each one
  2020-08-29  1:36 Fastest way to set files date and time to latest commit time of each one Ivan Baldo
  2020-08-29  3:20 ` Junio C Hamano
  2020-08-29  4:48 ` Eric Wong
@ 2020-08-29  6:46 ` Andreas Schwab
  2 siblings, 0 replies; 6+ messages in thread
From: Andreas Schwab @ 2020-08-29  6:46 UTC (permalink / raw)
  To: Ivan Baldo; +Cc: git

I'm using this script:

#!/bin/sh
# List each commit's committer timestamp followed by the paths it
# touched, newest commit first, and set every path's mtime to the
# timestamp of the first (i.e. most recent) commit that names it.
git log --name-only --format=format:%n%ct -- "$@" |
perl -e 'my $do_date = 0; chomp(my $cdup = `git rev-parse --show-cdup`);
    while (<>) {
	chomp;
	if ($do_date) {
	    # after a blank line, the next non-empty line is a timestamp
	    next if ($_ eq "");
	    die "Unexpected $_\n" unless /^[0-9]+$/;
	    $d = $_;
	    $do_date = 0;
	} elsif ($_ eq "") {
	    $do_date = 1;
	} elsif (!defined($seen{$_})) {
	    # the first (most recent) occurrence of a path wins;
	    # $cdup makes the path valid when run from a subdirectory
	    $seen{$_} = 1;
	    utime $d, $d, "$cdup$_";
	}
    }'

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."


* Re: Fastest way to set files date and time to latest commit time of each one
  2020-08-29  4:48 ` Eric Wong
@ 2020-09-02 19:28   ` Ivan Baldo
  0 siblings, 0 replies; 6+ messages in thread
From: Ivan Baldo @ 2020-09-02 19:28 UTC (permalink / raw)
  To: Eric Wong; +Cc: git

  Hello everyone!
  I only just managed to find time to work on this again; sorry for
replying so late, but I wanted to send a single reply with the
conclusion if possible.
  Let me tell you that I feel very humbled by all your replies;
thanks a lot for your time and concern with my inquiry!
  Eric's script is shipped not only in Debian but also in CentOS 7
(and I guess Red Hat 7 too), at
/usr/share/doc/rsync-*/support/git-set-file-times.
  My use case is similar to his: a cluster of identically configured
web servers with autoscaling (tested up to 100 servers), which, when
they boot (or when there is a new version of any of the websites),
rsync the current version from another server.
  So currently, if we build the system image of the web servers in
tandem with the central server, everything works smoothly; the
problem is when we recreate any of the pre-saved images from scratch,
in which case we get the date mismatch and unnecessary rsync
checksumming when they are put into production.
  I will use Eric's script from CentOS 7 as-is from now on, to avoid
the mismatch and to be able to mix pre-saved VM images without issues
(slowness in autoscaling).
  Thanks a lot to you all!
  Let me know if any of you comes to Uruguay; you've got free beers here!
  Have a great day.


On Sat, Aug 29, 2020 at 01:48, Eric Wong (e@yhbt.net) wrote:
>
> Ivan Baldo <ibaldo@gmail.com> wrote:
> >   Hello.
> >   I know this is not standard usage of git, but I need a way to have
> > more stable dates and times in the files in order to avoid rsync
> > checksumming.
> >   So I found this
> > https://stackoverflow.com/questions/2179722/checking-out-old-file-with-original-create-modified-timestamps/2179876#2179876
> > and modified it a bit to run in CentOS 7:
> >
> > IFS="
> > "
> > for FILE in $(git ls-files -z | tr '\0' '\n')
> > do
> >     TIME=$(git log --pretty=format:%cd -n 1 --date=iso -- "$FILE")
> >     touch -c -m -d "$TIME" "$FILE"
> > done
> >
> >   Unfortunately it takes ages for an 84k-file repo.
> >   I see the CPU usage is dominated by the git log command.
>
> running git log for each file isn't necessary.
>
> On Debian, rsync actually ships the `git-set-file-times' script
> in /usr/share/doc/rsync/scripts/ which only runs `git log' once
> and parses it.
>
> You can also get my (original) version from:
> https://yhbt.net/git-set-file-times
>
> >   I know a way I could split the work across all the CPU threads,
> > but anyway, I would like to know if you guys and girls know of a
> > faster way to do this.
>
> Much of your overhead is going to be from process spawning.
> My Perl version reduces that significantly.
>
> I haven't tried it with 84K files, but it'll have to keep all
> those filenames in memory.  I'm not sure if parallelizing
> utime() syscalls is worth it, either; maybe it helps on SSD
> more than HDD.



-- 
Ivan Baldo - ibaldo@gmail.com - http://ibaldo.codigolibre.net/
Freelance C++/PHP programmer and GNU/Linux systems administrator.
The sky isn't the limit!


end of thread

Thread overview: 6+ messages
2020-08-29  1:36 Fastest way to set files date and time to latest commit time of each one Ivan Baldo
2020-08-29  3:20 ` Junio C Hamano
2020-08-29  4:59   ` Raymond E. Pasco
2020-08-29  4:48 ` Eric Wong
2020-09-02 19:28   ` Ivan Baldo
2020-08-29  6:46 ` Andreas Schwab
