git@vger.kernel.org mailing list mirror (one of many)
* Cross-referencing the Git mailing list archive with their corresponding commits in `pu`
@ 2017-02-06 15:34 Johannes Schindelin
  2017-02-06 19:10 ` Junio C Hamano
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Johannes Schindelin @ 2017-02-06 15:34 UTC (permalink / raw)
  To: Josh Triplett; +Cc: git

Hi Josh,

as discussed at the GitMerge, I am trying to come up with tooling that
will allow for substantially less tedious navigation between the local
repository, the mailing list, and what ends up in the `pu` branch.

That tooling would *still* not help lowering the barrier of entry for
contributing to Git by a lot, as it would *still* not address the problem
that mails sent from the most prevalent desktop mail client, as well as
mails sent from the most prevalent web mail client, are simply and
unceremoniously dropped. (This problem was acknowledged by quite a few
nods even at the Contributors' Summit...) But still, we decided to start
*somewhere* and this tooling is what we agreed on.

It is quite a bit harder going than I would like: as we have figured out,
the Subject: line is not a good way to link the commits with the original
mails containing the patches, as commit messages are modified before being
pushed often enough to make this a fragile matching.

So I thought maybe the From: line (from the body, if available, otherwise
from the header) in conjunction with the "Date:" header would work. But a
preliminary study shows that there are 336 From: + Date: combinations in
the Git mailing list archive that are not unique. 71 of these are shared
by three or more mails, and 9 are even shared by more than 10 mails.
This is bad!
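
A minimal sketch of how such a uniqueness count could be reproduced, assuming the archive has already been reduced to one "from<TAB>date" line per mail (that input format and the name `dup_from_date` are illustrative assumptions, not part of the original study):

```shell
#!/bin/sh
# dup_from_date: read "from<TAB>date" lines on stdin and print
# "<count><TAB><from><TAB><date>" for every combination that occurs
# more than once. (Hypothetical helper; the input format is assumed.)
dup_from_date() {
	sort | uniq -c | awk '$1 > 1 {
		n = $1
		sub(/^ *[0-9]+ /, "")   # strip the count column added by uniq -c
		printf "%d\t%s\n", n, $0
	}'
}
```

Piping the result through `wc -l` gives the number of non-unique combinations, and filtering on the first column gives the "three or more" and "more than 10" buckets.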

Unsurprisingly, the top 10 of these cases were obviously caused by the
builtin `git am` bug where it would not reset the author date properly.
Surprisingly, though, there were a few cases from 2005, too.

I had a quick look to find out what was the culprit (looking at the
17-strong patch series "Documentation fixes in response to my previous
listing" by Nikolai Weibull), but I am at a loss there: the mail claims to
be sent by git-send-email and the patches appear to be generated by
git-format-patch as of v0.99.9l, neither of which had a Date:-related bug
back in that time frame. My best guess is that the patches were mishandled
by a tool similar to rebase -i (which entered Git only at v1.5.3).

For details, see:
http://public-inbox.org/git/11340844841342-git-send-email-mailing-lists.git@rawuncut.elitemail.org/
(this is also an example where public-inbox' thread detection went utterly
wrong, including way too many mails in the "thread")

There was even a case of duplicated Date: headers in 2012. Now, this case
is very curious, as there have been 7 mails with an identical Date: header,
but it was not a 7-strong patch series. Instead, it was a 4-strong patch
series that needed three iterations before it was accepted, and the
identical Date: header appears only in v2's patches (*not* in its cover
letter) and it *disappeared* in v3's 4/4, where it was set *back* by a
week (to the Date: it had in v1).

For details, see
http://public-inbox.org/git/cover.1354693001.git.Sebastian.Leske@sleske.name/
and
http://public-inbox.org/git/cover.1354324110.git.Sebastian.Leske@sleske.name/
and
http://public-inbox.org/git/b115a546fa783b4121d118bb8fdb9270443f90fa.1353691892.git.Sebastian.Leske@sleske.name/

This last example also demonstrates a very curious test case for a
different difficulty in trying to reconstruct lost correspondences: the
patch series was applied *twice*, independently of each other. First, on
the day v3 was submitted, it was applied on top of v1.8.1-rc0 (as commits
ee26a6e2b8..dd465ce66f), although it was not merged until v1.8.1-rc3. 22
days later, it was reapplied on top of maint so it could enter v1.8.0.3
(back then, Git still had "patchlevel" versions): c2999adcd5..008c208c2c.

As you can see, there is a many-to-many relationship here, even if you do
leave the *original* branch out of the picture entirely.

Will keep you posted,
Dscho

P.S.: I used public-inbox.org links instead of commit references to the
Git repository containing the mailing list archive, because the format of
said Git repository is so unfavorable that it was determined very quickly
in a discussion between Patrick Reynolds (GitHub) and myself that it would
put totally undue burden on GitHub to mirror it there (compare also Carlos
Nieto's talk at GitMerge titled "Top Ten Worst Repositories to host on
GitHub").


* Re: Cross-referencing the Git mailing list archive with their corresponding commits in `pu`
  2017-02-06 15:34 Cross-referencing the Git mailing list archive with their corresponding commits in `pu` Johannes Schindelin
@ 2017-02-06 19:10 ` Junio C Hamano
  2017-02-09 14:11   ` Lars Schneider
  2017-02-06 20:48 ` Eric Wong
  2017-02-17 17:50 ` Johannes Schindelin
  2 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2017-02-06 19:10 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Josh Triplett, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> So I thought maybe the From: line (from the body, if available, otherwise
> from the header) in conjunction with the "Date:" header would work.

FYI, I use a post-applypatch hook to populate refs/notes/amlog notes
tree when I queue a new patch; I am not sure how well the notes in
it are preserved across rebases, but it could be a good starting
point.  The notes tree is mirrored at git://github.com/git/gitster
repository.

E.g.

$ git show --notes=amlog --stat
commit 2488dcab22cee343fe35d9951160f0966a45fdb3
Author: Patrick Steinhardt <patrick.steinhardt@elego.de>
Date:   Mon Feb 6 14:13:59 2017 +0100

    worktree: fix option descriptions for `prune`
    
    The `verbose` and `expire` options of the `git worktree prune`
    subcommand have wrong descriptions in that they pretend to relate to
    objects. But as the git-worktree(1) correctly states, these options have
    nothing to do with objects but only with worktrees. Fix the description
    accordingly.
    
    Signed-off-by: Patrick Steinhardt <patrick.steinhardt@elego.de>
    Signed-off-by: Junio C Hamano <gitster@pobox.com>

Notes (amlog):
    Message-Id: <c2af75361b7b357fa905ab072bfdc45ad055ca49.1486386803.git.patrick.steinhardt@elego.de>

 builtin/worktree.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
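
A small sketch of how the Message-Id could be pulled out of such a note for scripting, assuming the notes keep the single-line `Message-Id: <...>` shape shown above (the notes are free-form, so this is an assumption, and `amlog_message_id` is a hypothetical helper name):

```shell
#!/bin/sh
# amlog_message_id: read an amlog-style note on stdin and print the
# recorded Message-Id without the surrounding angle brackets.
# Assumes the "Message-Id: <...>" layout seen in the example output.
amlog_message_id() {
	sed -n 's/^Message-Id:[[:space:]]*<\(.*\)>[[:space:]]*$/\1/p'
}

# Possible usage in a clone that has fetched the notes ref, e.g. with
#   git fetch git://github.com/git/gitster refs/notes/amlog:refs/notes/amlog
# followed by
#   git notes --ref=amlog show "$commit" | amlog_message_id
```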


* Re: Cross-referencing the Git mailing list archive with their corresponding commits in `pu`
  2017-02-06 15:34 Cross-referencing the Git mailing list archive with their corresponding commits in `pu` Johannes Schindelin
  2017-02-06 19:10 ` Junio C Hamano
@ 2017-02-06 20:48 ` Eric Wong
  2017-02-06 22:07   ` Jeff King
  2017-02-17 17:50 ` Johannes Schindelin
  2 siblings, 1 reply; 11+ messages in thread
From: Eric Wong @ 2017-02-06 20:48 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Josh Triplett, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> For details, see:
> http://public-inbox.org/git/11340844841342-git-send-email-mailing-lists.git@rawuncut.elitemail.org/
> (this is also an example where public-inbox' thread detection went utterly
> wrong, including way too many mails in the "thread")

Thanks, it should be fixed in an hour or two when reindexing
finishes...

<https://public-inbox.org/meta/20170206200216.GA26676@dcvr/>

but it looks like reindexing is a little buggy in that it reuses
thread IDs, too... (will fix)

The Tor .onion mirrors should be done, first, since they're on
better hardware:
http://hjrcffqmbrq6wope.onion/git/11340844841342-git-send-email-mailing-lists.git@rawuncut.elitemail.org/
http://czquwvybam4bgbro.onion/git/11340844841342-git-send-email-mailing-lists.git@rawuncut.elitemail.org/

> This last example also demonstrates a very curious test case for a
> different difficulty in trying to reconstruct lost correspondences: the
> patch series was applied *twice*, independently of each other. First, on
> the day v3 was submitted, it was applied on top of v1.8.1-rc0 (as commits
> ee26a6e2b8..dd465ce66f), although it was not merged until v1.8.1-rc3. 22
> days later, it was reapplied on top of maint so it could enter v1.8.0.3
> (back then, Git still had "patchlevel" versions): c2999adcd5..008c208c2c.
> 
> As you can see, there is a many-to-many relationship here, even if you do
> leave the *original* branch out of the picture entirely.

Fwiw, I've always seen the search ability of public-inbox as
analogous to rename detection in git; in that it can never be
perfect, but can still be tweaked and improved after-the-fact
and be used more flexibly.

Right now, thread searching in public-inbox is loose in that it
favors overmatching based on Subject in addition to References.
But the actual threading algorithm (for display) is strict,
relying only on References.  But yeah, there can be tweaks to
improve matching and introducing git (code) repository awareness
into the mail search...

> Will keep you posted,

Likewise :>

> P.S.: I used public-inbox.org links instead of commit references to the
> Git repository containing the mailing list archive, because the format of
> said Git repository is so unfavorable that it was determined very quickly
> in a discussion between Patrick Reynolds (GitHub) and myself that it would
> put totally undue burden on GitHub to mirror it there (compare also Carlos
> Nieto's talk at GitMerge titled "Top Ten Worst Repositories to host on
> GitHub").

Any suggestions on how the repository format can be improved?

I haven't hit insurmountable performance problems, even on
low-end hardware; especially since I started storing blob ids in
Xapian itself, avoiding the expensive tree lookup via git.

The main problem seems to be tree size.  Deepening (2/2/36 vs
2/38) might be an option (I think Peff brought that up); but it
might be easier to switch to YYYYMM refs (working like
logrotate) and rely on Xapian to tie the entire thing together.

Some change will definitely be needed for all LKML, but most
projects have less traffic than even git, and should be fine.


But, I am working to undermine centralized messaging systems
(which GitHub and GitLab both are), so they would be wise to
undermine public-inbox all the same ;>


* Re: Cross-referencing the Git mailing list archive with their corresponding commits in `pu`
  2017-02-06 20:48 ` Eric Wong
@ 2017-02-06 22:07   ` Jeff King
  2017-02-07  0:14     ` Eric Wong
  0 siblings, 1 reply; 11+ messages in thread
From: Jeff King @ 2017-02-06 22:07 UTC (permalink / raw)
  To: Eric Wong; +Cc: Johannes Schindelin, Josh Triplett, git

On Mon, Feb 06, 2017 at 08:48:20PM +0000, Eric Wong wrote:

> I haven't hit insurmountable performance problems, even on
> low-end hardware; especially since I started storing blob ids in
> Xapian itself, avoiding the expensive tree lookup via git.

The painful thing is traversing the object graph for clones and fetches.
Bitmaps help, but you still have to generate them.

> The main problem seems to be tree size.  Deepening (2/2/36 vs
> 2/38) might be an option (I think Peff brought that up); but it
> might be easier to switch to YYYYMM refs (working like
> logrotate) and rely on Xapian to tie the entire thing together.

Yes, the hashing is definitely one issue. Some numbers here:

  http://public-inbox.org/git/20160805092805.w3nwv2l6jkbuwlzf@sigill.intra.peff.net/

If you have C commits on a tree with T entries, you have to do C*T hash
lookups for a flat tree (for each commit, you have to see "yup, already
saw that object"). Sharding that across H entries at the top level drops
the tree cost from T to H + T/H (actually, it's a bit worse because we
have to read the secondary tree, too). Sharding again (at H') gets you
H + H' + T/H/H'.

Let's imagine you do one message per commit, so C=T. At 400K messages,
that's about 160 billion hash lookups flat. At H=256, it's about 700
million. If you shard again with H'=256, it's 200 million. After that,
the additive terms start to dominate, and it's not worth going any
further (and also, we're ignoring the extra-tree cost to each level).
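
The estimates above can be recomputed with a short sketch. It implements only the formulas as stated (C*T flat, C*(H + T/H) with one level, C*(H + H' + T/H/H') with two) and, like the text, ignores the cost of reading the extra trees; the function name `hash_lookups` is a hypothetical helper:

```shell
#!/bin/sh
# hash_lookups C T [H1 [H2]]: estimated hash lookups for walking C commits
# over a tree of T entries, flat (no H1), sharded once (H1) or twice (H1, H2).
hash_lookups() {
	awk -v c="$1" -v t="$2" -v h1="${3:-0}" -v h2="${4:-0}" 'BEGIN {
		if (h1 == 0)      per_commit = t                    # flat tree
		else if (h2 == 0) per_commit = h1 + t / h1          # one level
		else              per_commit = h1 + h2 + t / h1 / h2 # two levels
		printf "%.0f\n", c * per_commit
	}'
}

hash_lookups 400000 400000          # flat: the "about 160 billion" case
hash_lookups 400000 400000 256      # H=256: the "about 700 million" case
hash_lookups 400000 400000 256 256  # H=H'=256: the "200 million" case
```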

At that point you're better off to start having fewer commits. I know
that the schema you use does put useful information into the commit
message, but it's also redundant with what's in the messages themselves.
And it sounds like you push most of that out to Xapian anyway.

Imagine your repo had one commit with 400K historical messages, and then
grouped the new messages so that on average we got about 10 messages per
commit (this doesn't seem unrealistic for something that commits every
few minutes; the messages tend to be bunched in time; I ran some
numbers against a 10-minute mark in the earlier message).

Then after another 100K messages, we'd have C=10,001 and T=500K. With
two levels of hashing at 256 each, that's ~5 million hash lookups to
walk the graph. And those numbers would be reasonable for a hosting site
like GitHub.

I don't know what C is for the kernel repo, but I suspect with the right
tuning it could be made into large-but-reasonable.

-Peff


* Re: Cross-referencing the Git mailing list archive with their corresponding commits in `pu`
  2017-02-06 22:07   ` Jeff King
@ 2017-02-07  0:14     ` Eric Wong
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2017-02-07  0:14 UTC (permalink / raw)
  To: Jeff King; +Cc: Johannes Schindelin, Josh Triplett, git

Jeff King <peff@peff.net> wrote:
> On Mon, Feb 06, 2017 at 08:48:20PM +0000, Eric Wong wrote:
> 
> > I haven't hit insurmountable performance problems, even on
> > low-end hardware; especially since I started storing blob ids in
> > Xapian itself, avoiding the expensive tree lookup via git.
> 
> The painful thing is traversing the object graph for clones and fetches.
> Bitmaps help, but you still have to generate them.

Yep.  "public-inbox-init" defaults to enabling bitmaps in the
config for this reason.

> > The main problem seems to be tree size.  Deepening (2/2/36 vs
> > 2/38) might be an option (I think Peff brought that up); but it
> > might be easier to switch to YYYYMM refs (working like
> > logrotate) and rely on Xapian to tie the entire thing together.
> 
> Yes, the hashing is definitely one issue. Some numbers here:
> 
>   http://public-inbox.org/git/20160805092805.w3nwv2l6jkbuwlzf@sigill.intra.peff.net/
> 
> If you have C commits on a tree with T entries, you have to do C*T hash
> lookups for a flat tree (for each commit, you have to see "yup, already
> saw that object"). Sharding that across H entries at the top level drops
> the tree cost from T to H + T/H (actually, it's a bit worse because we
> have to read the secondary tree, too). Sharding again (at H') gets you
> H + H' + T/H/H'.
> 
> Let's imagine you do one message per commit, so C=T. At 400K messages,
> that's about 160 billion hash lookups flat. At H=256, it's about 700
> million. If you shard again with H'=256, it's 200 million. After that,
> the additive terms start to dominate, and it's not worth going any
> further (and also, we're ignoring the extra-tree cost to each level).

Just to make sure I'm following, here; the entire formulas are:

	C * (H + H' + (T / H / H'))   # 2/2/36
	C * (H + (T / H))             # 2/38 (current)

Right?

> At that point you're better off to start having fewer commits. I know
> that the schema you use does put useful information into the commit
> message, but it's also redundant with what's in the messages themselves.
> And it sounds like you push most of that out to Xapian anyway.

Yeah, there's no benefit to Xapian users for having any info in
the commit.  However, keeping commit-per-message is still
important to me for better robustness against hardware and
network failures.

But yes, historical stuff could be squashed into a single commit
(much like how linux.git started with v2.6.12-rc2 without
history).  Perhaps some folks will care about NNTP article
numbering being non-chronological...

> Imagine your repo had one commit with 400K historical messages, and then
> grouped the new messages so that on average we got about 10 messages per
> commit (this doesn't seem unrealistic for something that commits every
> few minutes; the messages tend to be bunched in time; I ran some
> numbers against a 10-minute mark in the earlier message).
> 
> Then after another 100K messages, we'd have C=10,001 and T=500K. With
> two levels of hashing at 256 each, that's ~5 million hash lookups to
> walk the graph. And those numbers would be reasonable for a hosting site
> like GitHub.
> 
> I don't know what C is for the kernel repo, but I suspect with the right
> tuning it could be made into large-but-reasonable.

LKML probably has an upper bound of 30K messages per month;
so it could hit 100K in less than 4 months.  Worst case might
be 360K messages a year:

	360000 * (256 + 256 + ((360000 + old) / 256 / 256))

That's still at least 180 million hash lookups after a year or
so of real-time updates; right?  (But probably closer to 240
million if there's 10 million old messages in there.)

Instead, I think I will add an option to support logrotate-style
monthly heads (YYYYMM); keeping 2/38 and C == T:

	30000 * (256 + (30000 / 256))               => 11 million
	30000 * (256 + 256 + (30000 / 256 / 256))   => 15 million

The monthly heads would each be discontiguous history-wise;
so Xapian would become a requirement for users of this option
for Message-ID lookups, but histories would still be readable
with "git log".

One good side-effect of using monthly heads is --single-branch
clones may be used if someone lacks the bandwidth or space to do
a full mirror.  I'm not sure if the server-side (pack reuse,
bitmaps) will benefit otherwise, aside from bandwidth reductions,
though.
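
A rough sketch of the naming side of such monthly heads, assuming each message is filed under a branch named after the YYYYMM of its date. The `refs/heads/YYYYMM` layout and the helper `month_head` are assumptions for illustration, not an existing public-inbox feature:

```shell
#!/bin/sh
# month_head DATE: map an ISO 8601 date (YYYY-MM-DD...) to the name of a
# hypothetical logrotate-style monthly head.
month_head() {
	case "$1" in
	[0-9][0-9][0-9][0-9]-[0-9][0-9]-*)
		# keep characters 1-4 (year) and 6-7 (month), dropping the dash
		printf 'refs/heads/%s\n' "$(printf '%s' "$1" | cut -c1-4,6-7)"
		;;
	*)
		echo "month_head: expected YYYY-MM-DD, got '$1'" >&2
		return 1
		;;
	esac
}

# A partial mirror could then be fetched with something like
#   git clone --single-branch --branch 201702 <url>
```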


A (far-fetched) option I've considered would be to store entire
messages in the commit and have no trees or blobs at all.  But
that would require a significant rework, and would also make
Xapian a hard requirement for even checking if a message is
deleted or not.


* Re: Cross-referencing the Git mailing list archive with their corresponding commits in `pu`
  2017-02-06 19:10 ` Junio C Hamano
@ 2017-02-09 14:11   ` Lars Schneider
  2017-02-09 21:53     ` Johannes Schindelin
  0 siblings, 1 reply; 11+ messages in thread
From: Lars Schneider @ 2017-02-09 14:11 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Schindelin, Josh Triplett, git


> On 06 Feb 2017, at 20:10, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
>> So I thought maybe the From: line (from the body, if available, otherwise
>> from the header) in conjunction with the "Date:" header would work.
> 
> FYI, I use a post-applypatch hook to populate refs/notes/amlog notes
> tree when I queue a new patch; I am not sure how well the notes in
> it are preserved across rebases, but it could be a good starting
> point.  The notes tree is mirrored at git://github.com/git/gitster
> repository.
> 
> E.g.
> 
> $ git show --notes=amlog --stat

That's super useful! Thanks for the pointer!
Wouldn't it make sense to push these notes to github.com/git/git ?

- Lars


* Re: Cross-referencing the Git mailing list archive with their corresponding commits in `pu`
  2017-02-09 14:11   ` Lars Schneider
@ 2017-02-09 21:53     ` Johannes Schindelin
  2017-02-09 22:18       ` Junio C Hamano
  0 siblings, 1 reply; 11+ messages in thread
From: Johannes Schindelin @ 2017-02-09 21:53 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Junio C Hamano, Josh Triplett, git

Hi Lars,

On Thu, 9 Feb 2017, Lars Schneider wrote:

> > On 06 Feb 2017, at 20:10, Junio C Hamano <gitster@pobox.com> wrote:
> > 
> > Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> > 
> >> So I thought maybe the From: line (from the body, if available,
> >> otherwise from the header) in conjunction with the "Date:" header
> >> would work.
> > 
> > FYI, I use a post-applypatch hook to populate refs/notes/amlog notes
> > tree when I queue a new patch; I am not sure how well the notes in it
> > are preserved across rebases, but it could be a good starting point.
> > The notes tree is mirrored at git://github.com/git/gitster repository.
> > 
> > E.g.
> > 
> > $ git show --notes=amlog --stat
> 
> That's super useful! Thanks for the pointer!
> Wouldn't it make sense to push these notes to github.com/git/git ?

I am not quite sure about that. It is in a different namespace than what
is usually cloned, and it currently adds 8MB to the download (there are
"amlog" and "commits", the latter clearly being a sandbox).

While I am thankful that there is at least some information available for
patches integrated into `pu` since Nov 1 2016, the format is probably not
stable (we are talking about free-form notes, after all), and it still
does not help with catching the case where new patch series iterations (or
in some case, new patch series, period) are missed.

Make no mistake, it will be a huge undertaking to develop a tool that
helps with the management of patch series on top of the mailing list
driven patch review process. And even in the best case, it may be simply
too hard for an automated tool to figure things out e.g. when Peff or
Junio paste a tangentially related diff into a thread.

In the end, what I *really* would love to have is a system where you can
easily query "which reviewer comments on *any* of my patch series are new,
or still unaddressed?", and "in what way was my patch modified relative to
the latest version I submitted?". It may actually be impossible to create
such a tool, as it cannot invent information/cross-references that it does
not have nor can deduce from available data.

Ciao,
Johannes


* Re: Cross-referencing the Git mailing list archive with their corresponding commits in `pu`
  2017-02-09 21:53     ` Johannes Schindelin
@ 2017-02-09 22:18       ` Junio C Hamano
  0 siblings, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2017-02-09 22:18 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Lars Schneider, Josh Triplett, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> > E.g.
>> > 
>> > $ git show --notes=amlog --stat
>> 
>> That's super useful! Thanks for the pointer!
>> Wouldn't it make sense to push these notes to github.com/git/git ?
>
> I am not quite sure about that. It is in a different namespace than what
> is usually cloned, and it currently adds 8MB to the download (there are
> "amlog" and "commits", the latter clearly being a sandbox).

I do not think the public mirrors of the primary repository should
get amlog, either.  It is more suited for those who are interested
in broken-out topics, i.e. git://github.com/git/gitster.


* Re: Cross-referencing the Git mailing list archive with their corresponding commits in `pu`
  2017-02-06 15:34 Cross-referencing the Git mailing list archive with their corresponding commits in `pu` Johannes Schindelin
  2017-02-06 19:10 ` Junio C Hamano
  2017-02-06 20:48 ` Eric Wong
@ 2017-02-17 17:50 ` Johannes Schindelin
  2017-02-20 19:33   ` Junio C Hamano
  2 siblings, 1 reply; 11+ messages in thread
From: Johannes Schindelin @ 2017-02-17 17:50 UTC (permalink / raw)
  To: Josh Triplett; +Cc: git

Hi Josh,

On Mon, 6 Feb 2017, Johannes Schindelin wrote:

> as discussed at the GitMerge, I am trying to come up with tooling that
> will allow for substantially less tedious navigation between the local
> repository, the mailing list, and what ends up in the `pu` branch.

I found a little bit more time last Friday to play with the
cross-correlation between commits in `pu` and mails in
public-inbox/git.git and it is worse than I previously assumed.

Just as a reminder: my plan was to start developing tools that will
ultimately help me as well as other contributors with the arcane mailing
list model of patch submission. And my first target was the seemingly
simple task of figuring out the mail corresponding to any given commit in
`pu` (i.e. the mail that contained the patch, and whose mail thread is
hence expected to have the entire patch review, and to which I would be
expected to respond if I find a problem with that commit).

And since it is all-too-common that the oneline is adjusted before
applying the patch, the Subject:/oneline pair is not a good candidate to
find matches.

My next best guess was that the author date would not be touched, so the
pair of Date: and authordate should make a good candidate.

My initial finding was that this is not without problems, as some mails
were sent with identical Date: lines (most likely due to bugs in the
tools, e.g. the well-known and already fixed bug in git-am, and hence
git-rebase, where it would apply all patches using the first patch's
author date), and worse: some of those mails contained actual patch series
that actually made it into Git's commit history.

But those are not the only problems.

For starters, I tried to cross-correlate *just* the commits that entered
`pu` since one week ago (git rev-list --since=1.week.ago upstream/pu) with
mails of the past month in the mailing list archive.

One obvious caveat is that RFC 2822 is ambiguous when it comes to the date
format. While it seems nice that you *can* write single-digit day numbers
as a single digit if you want, or with a leading zero, or with a leading
space, it makes it impossible to get away with exact matching. I did not
really want to complicate my research by parsing the dates and normalizing
them to epoch + timezone, also because I wanted results quickly, so I simply
normalized the dates to have leading zeroes for single-digit day numbers
(that seems to work for the moment).
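
The normalization described here can be sketched as a small filter. It mirrors the sed expressions used in the scripts at the end of this mail, but operates on bare date strings; the name `normalize_rfc2822_day` is a hypothetical helper, and dates without a leading day-of-week are not handled:

```shell
#!/bin/sh
# normalize_rfc2822_day: read "Day, D Mon YYYY ..." dates on stdin and
# rewrite the day number to two digits, with exactly one space after
# the comma. Lines that do not match are passed through unchanged.
normalize_rfc2822_day() {
	sed -e 's/^\([^,]*,\) *\([1-9]\) /\1 0\2 /' \
	    -e 's/^\([^,]*,\) *\([0-9][0-9]\) /\1 \2 /'
}
```

For example, `echo "Thu,  9 Feb 2017 21:53:52 +0100" | normalize_rfc2822_day` yields the same string that Git's `%aD` output has after the fix-up in match-pu.sh.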

The first category of problematic commits comes as no surprise: merges. We
do not even have a way to represent them as mails. I simply excluded them
from the remainder of this study.

The second category should not be all that surprising, either: Junio often
adjusts the release notes without sending those patches out for review.
Those commits are:

363588f (### match next, Junio C Hamano 2017-02-17)
2076907 (Git 2.12-rc2, Junio C Hamano 2017-02-17)
076c053 (Hopefully the final batch of mini-topics before the final,
	Junio C Hamano 2017-02-16)
ae86372 (Revert "reset: add an example of how to split a commit into two",
	Junio C Hamano 2017-02-16)
d09b692 (A bit more for -rc2, Junio C Hamano 2017-02-15)

There is a third category, and this one *does* come as a surprise to me.
It appears that at least *some* patches' Date: lines are either ignored or
overridden or changed on their way from the mailing list into Git's commit
history. There was only one commit in that commit range:

3c0cb0c (read_loose_refs(): read refs using resolve_ref_recursively(),
	Michael Haggerty 2017-02-09)

This one was committed with an author date "Thu, 09 Feb 2017 21:53:52
+0100" but it appears that there was no mail sent to the Git mailing list
with that particular Date: header and the *actual* mail containing the
patch was sent with a Date: header "Fri, 10 Feb 2017 12:16:19 +0100"
(Message-ID:
d8e906d969700acbca8dc717673d0a9cdc910f62.1486724698.git.mhagger@alum.mit.edu).

It is labor-intensive, but possible to find the correlation manually in
this case because the Subject: line has been left intact.

However, this points to a serious problem with my approach: I try to
re-create information that is actually not available (which Message-ID
corresponds to a given commit name). Since that information is not
available, it is quite possible that this information cannot be retrieved
accurately (and Michael's commit demonstrates that this is not merely a
theoretical consideration). I do not know that I can fix this on my side.

> P.S.: I used public-inbox.org links instead of commit references to the
> Git repository containing the mailing list archive, because the format
> of said Git repository is so unfavorable that it was determined very
> quickly in a discussion between Patrick Reynolds (GitHub) and myself
> that it would put totally undue burden on GitHub to mirror it there
> (compare also Carlos Nieto's talk at GitMerge titled "Top Ten Worst
> Repositories to host on GitHub").

Since the main problem was the unfavorable commit history structure, I
*think* that it may be possible to auto-process public-inbox.org/git.git
into a frequently-rewritten branch that squashes all commits from past
years into single, per-year commits (and the same for recent months, the
past days, and a single commit accumulating the current day's commits) and
that that may solve the problematic structure. The blob names would remain
identical to what is on public-inbox, of course.

Ciao,
Johannes

P.S.: The *mini* scripts I used were

cat >generate-date-index.sh <<\EOF
#! /bin/sh

cd public-inbox-git

since_commit="$1"
test -n "$since_commit" ||
since_commit=$(git rev-list --since=1.month.ago master --reverse | head -n 1)
for sha1 in $(git diff --raw --no-abbrev $since_commit..master | cut -f 4 -d \ )
do
	printf '%s\t%s\n' \
		"$(git cat-file blob $sha1 |
		sed -n \
			-e 's/^Date:[ 	]*\([^,]*,\) *\([1-9] .*\)/\1 0\2/p' \
			-e 's/^Date:[ 	]*\([^,]*,\) *\([0-9][0-9] .*\)/\1 \2/p' \
			-e '/^$/q')" \
		$sha1
done | less -S
EOF

to generate a file date-index.txt containing "date\tblob" pairs where the blob
refers to the SHA-1 of the mail in public-inbox/git.git, and

cat >match-pu.sh <<\EOF
#! /bin/sh

for commit in $(git rev-list --since=1.week.ago --no-merges upstream/pu)
do
	date="$(git show -s --format=%aD $commit |
		sed 's/, \([1-9]\) /, 0\1 /')" # fix up Git's idea of RFC 2822
	mail_id=$(grep "^$date" date-index.txt | sed 's/.*	//')
	case "$mail_id" in
	'')
		echo "ERROR: no mail found for $commit (date $date)" >&2
		git show -s --pretty='tformat:%h (%s, %an %ad)' --date=short \
			$commit >&2
		;;
	*' '*)
		echo "ERROR: multiple candidates found for $commit ($mail_id)" >&2
		;;
	*)
		echo "$date $mail_id"
		;;
	esac
done
EOF

to try to match the author dates with the ones in date-index.txt. The
obvious next improvement is to list also Message-ID in date-index.txt.
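
A possible sketch of that improvement: a filter that emits "date\tmessage-id" for one mail, so that the index can map author dates straight to Message-IDs. It reads only the header block, does not handle folded headers, and leaves the date un-normalized (both are simplifying assumptions); `index_line` is a hypothetical name:

```shell
#!/bin/sh
# index_line: read one mail on stdin and print "<Date><TAB><Message-ID>"
# taken from its headers. Header names are matched case-insensitively and
# only the first occurrence of each is used; the body is never inspected.
index_line() {
	awk '
		/^$/ { exit }   # blank line ends the headers (END still runs)
		tolower($0) ~ /^date:/ && date == "" {
			sub(/^[^:]*:[ \t]*/, ""); date = $0; next
		}
		tolower($0) ~ /^message-id:/ && id == "" {
			sub(/^[^:]*:[ \t]*/, ""); id = $0; next
		}
		END { printf "%s\t%s\n", date, id }
	'
}

# Possible use inside the loop of generate-date-index.sh:
#   git cat-file blob $sha1 | index_line
```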


* Re: Cross-referencing the Git mailing list archive with their corresponding commits in `pu`
  2017-02-17 17:50 ` Johannes Schindelin
@ 2017-02-20 19:33   ` Junio C Hamano
  2017-02-20 20:06     ` Junio C Hamano
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2017-02-20 19:33 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Josh Triplett, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> There is a third category, and this one *does* come as a surprise to me.
> It appears that at least *some* patches' Date: lines are either ignored or
> overridden or changed on their way from the mailing list into Git's commit
> history. There was only one commit in that commit range:
>
> 3c0cb0c (read_loose_refs(): read refs using resolve_ref_recursively(),
> 	Michael Haggerty 2017-02-09)
>
> This one was committed with an author date "Thu, 09 Feb 2017 21:53:52
> +0100" but it appears that there was no mail sent to the Git mailing list

I think this is this one:

    <ff0b0df6-9aed-9417-d9d4-1234d53f05c3@alum.mit.edu>

Recent "What's cooking" lists the topic this one is part of with this
comment:

 The tip one is newer than the one posted to the list but was sent
 privately by the author via his GitHub repository.



* Re: Cross-referencing the Git mailing list archive with their corresponding commits in `pu`
  2017-02-20 19:33   ` Junio C Hamano
@ 2017-02-20 20:06     ` Junio C Hamano
  0 siblings, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2017-02-20 20:06 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Josh Triplett, git

Junio C Hamano <gitster@pobox.com> writes:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
>> There is a third category, and this one *does* come as a surprise to me.
>> It appears that at least *some* patches' Date: lines are either ignored or
>> overridden or changed on their way from the mailing list into Git's commit
>> history. There was only one commit in that commit range:
>>
>> 3c0cb0c (read_loose_refs(): read refs using resolve_ref_recursively(),
>> 	Michael Haggerty 2017-02-09)
>>
>> This one was committed with an author date "Thu, 09 Feb 2017 21:53:52
>> +0100" but it appears that there was no mail sent to the Git mailing list
>
> I think this is this one:
>
>     <ff0b0df6-9aed-9417-d9d4-1234d53f05c3@alum.mit.edu>
>
> Recent "What's cooking" lists the topic this one is part of with this
> comment:
>
>  The tip one is newer than the one posted to the list but was sent
>  privately by the author via his GitHub repository.

We didn't have any pull from sub-maintainers during the period you
checked, but when we do, those could also fall into the category.
Even though I see some l10n patches Cc'ed to the list, I won't be
surprised if not everything that is sent to Jiang Xin (i18n/l10n
coordinator) is, for example.  It also is OK for sub-maintainers to
make their own commits to describe or otherwise improve their area
without sending a patch before doing so, if they deem it
appropriate [*1*].

I actually think automation like yours would help another category:
There is a newer version of the series or an entirely new series on
the list, but the project's tree has not picked them up (yet).

I from time to time sweep my inbox in an attempt to find and pick up
leftover bits.  Sometimes the authors remind me by pinging [*2*],
which greatly helps.  But another set of eyeballs that may be
enhanced with a mechanised filter that catches "messages without
corresponding commits", which is the opposite of this "third"
category, would be of great help, too [*3*].


[Footnote]

*1* ... like trivial fixes, for example, at their discretion.  After
    all we entrusted their own area and we should give them the
    flexibility they can exercise with good taste ;-).

*2* e.g. <2f67fc21-92f9-a03e-1b09-a237af6dbc46@alum.mit.edu>

*3* ... even if a mechanised filter alone might strike too many
    false positives.
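
One way such a "messages without corresponding commits" filter could start is as a plain set difference, assuming one file with the Message-IDs of patch mails on the list and another with the Message-IDs recorded for commits (for instance from amlog notes). Both input files and the name `unmatched_mails` are assumptions for illustration:

```shell
#!/bin/sh
# unmatched_mails MAIL_IDS COMMIT_IDS: print every Message-ID that is
# listed in MAIL_IDS but not in COMMIT_IDS (one ID per line in each file).
unmatched_mails() {
	mail_sorted=$(mktemp) && commit_sorted=$(mktemp) || return 1
	sort -u "$1" >"$mail_sorted"
	sort -u "$2" >"$commit_sorted"
	# comm -23 keeps lines unique to the first (sorted) input
	comm -23 "$mail_sorted" "$commit_sorted"
	rm -f "$mail_sorted" "$commit_sorted"
}
```

As footnote 3 warns, a filter this simple will flag every mail that never was meant to be applied, so its output is a list of candidates to look at rather than a verdict.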


Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git
