From: Jim Meyering
Date: Mon, 16 Jan 2023 00:51:25 -0800
Subject: Re: RFC: git-commit based mtime-reproducible tarballs
To: Simon Josefsson
Cc: Bruno Haible, Paul Eggert, "bug-gnulib@gnu.org List"
In-Reply-To: <875yd6dg8q.fsf@josefsson.org>

On Mon, Jan 16, 2023, 12:41 AM Simon Josefsson via Gnulib discussion list
<bug-gnulib@gnu.org> wrote:

> Bruno Haible writes:
>
> > Paul Eggert wrote:
> >> some users want to "trust but verify" and a reproducible
> >> tarball is easier to audit than a non-reproducible one, so for these
> >> users it can be a win to omit the irrelevant data from the tarball.
> >
> > Reproducibility can be implemented in different ways:
> >   - by omitting irrelevant data from the tarball,
> >   - by having a customized comparison program 'diff', such that
> >     "diff --ignore-irrelevant-metadata contents1 contents2"
> >     would ignore the irrelevant parts.
>
> The problem with a --ignore-irrelevant-metadata approach is that it will
> be a judgement call what is irrelevant, and two projects may have
> different philosophies that are mutually incompatible.
>
> A devil's advocate case: consider a build-system that embeds the
> source-code timestamp information in the binary, and the binary sends off
> a hash of its executable binary to a remote server for verification
> purposes.  In some projects this may be what you want to achieve.  Then
> ignoring this particular metadata will be a critical failure for that
> project.
>
> I think it is a worthy goal to reach a tarball that is deterministically
> and one-way reproducible from git source code [for the same set of tool
> versions].
>
> >> when I do an 'ls
> >> -l' of a source directory that I got from a distribution tarball, it's
> >> useful to see the last time the contents of each source file was changed
> >> upstream.
> >
> > OK, now we're discussing different ways to make a tarball reproducible.
> > That's nice, because Simon's proposal was to make all timestamps equal,
> > and that puts me off.
> > In binutils-2.40.tar.bz2 all files are from 2023-01-14.
> > In android-studio-2021.3.1.17-linux.tar.gz all files are from 2010-01-01.
> > It gives me as a user no idea whether this tarball is 13 years old,
> > 2 years old, or from yesterday.
> >
> > I much prefer Paul's approach, since it still conveys meaningful
> > timestamps:
>
> I agree!
>
> I even wonder if the binutils tarball builds properly on, say, HP-UX then?
>
> >> For TZDB, where users have long wanted reproducibility, I use something
> >> like this in a Makefile recipe for each source file $$file:
> >>
> >>         time=`git log -1 --format='tformat:%ct' $$file` &&
> >>         touch -cmd @$$time $$file
> >
> > That's good for the files that are under version control.
> >
> >> 2. What about platform-independent files that are automatically created
> >> from source files from the repository, and that are shipped in the
> >> release tarball?
> >
> > For these, you could unpack the tarball, see in which order the timestamps
> > are, and then assign artificial timestamps, in the same order but exactly
> > 2 seconds apart. For example, if the tarball contains
> > under version control:
> >   hello.c         2023-01-14 13:28:14
> >   configure.ac    2023-01-01 14:03:07
> > and not under version control:
> >   configure       2023-01-15 04:09:10
> >   config.h.in     2023-01-15 04:05:19
> > then you would determine the
> >   max_timestamp_under_vc = max { 2023-01-14 13:28:14, 2023-01-01 14:03:07 }
> >                          = 2023-01-14 13:28:14
> > and then, since config.h.in is older than configure:
> >   touch -m (max_timestamp_under_vc + 2 seconds) config.h.in
> >   touch -m (max_timestamp_under_vc + 4 seconds) configure
> >
> > You can do this without knowing the Makefile rules or scripts which created
> > config.h.in and configure.
> >
> > The increment of 2 seconds is, of course, for VFAT file systems, which have
> > only 2 seconds of resolution for file modification times.
>
> Clever!
>
> To implement this we would need a dist-hook to do the 'touch -m ...'
> dance on all files.
>
> I somewhat fear that the solution here will be more of a problem than
> the original problem due to the complexity.
>
> Does anyone see a problem with this approach?  Do you think it is a good
> idea?  I like it and don't see any further problems, except for the
> complexity but I don't see a way to reduce it.

I like it, too.
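
For illustration only, the 'touch -m ...' dance quoted above might be sketched in shell roughly as follows. Nothing below was posted in the thread: the function name respace_generated_mtimes and its calling convention are made up for this sketch, and it assumes GNU touch (for -d @SECONDS) and file names that contain no newlines. The caller is assumed to have already computed the newest version-controlled timestamp, in seconds since the epoch, for example from the per-file 'git log' recipe quoted above.

```shell
# Hypothetical helper (not from the thread): give generated files
# artificial mtimes that preserve their relative order, 2 seconds apart,
# starting just after the newest version-controlled timestamp.
#
# usage: respace_generated_mtimes MAX_VC_EPOCH FILE...
#   MAX_VC_EPOCH  newest commit time among tracked files, in seconds
#                 since the epoch
#   FILE...       generated files that are not under version control
#
# Assumes GNU touch (-d @SECONDS) and file names without newlines.
respace_generated_mtimes () {
  t=$1
  shift
  # Nothing to do without files (a bare 'ls' would list ".").
  [ "$#" -gt 0 ] || return 0
  # 'ls -dtr' prints the named files oldest first, so a file that was
  # older than another before (config.h.in vs. configure) stays older.
  ls -dtr -- "$@" | while IFS= read -r f; do
    t=$((t + 2))                # 2-second step, for VFAT resolution
    touch -m -d "@$t" -- "$f"
  done
}
```

With the example from the quoted message, and reading its timestamps as UTC (an assumption), the call would be "respace_generated_mtimes 1673702894 configure config.h.in": config.h.in, being the older file, ends up at max + 2 seconds and configure at max + 4 seconds. In a real dist-hook the file list would have to come from the unpacked distribution directory, which this sketch leaves to the caller.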