git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jacob Keller <jacob.e.keller@intel.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
	Jacob Keller <jacob.keller@gmail.com>
Subject: [PATCH] mailsplit add option to include sanitized subject in filename
Date: Mon,  8 Apr 2024 17:05:46 -0700	[thread overview]
Message-ID: <20240409000546.3628898-1-jacob.e.keller@intel.com> (raw)

From: Jacob Keller <jacob.keller@gmail.com>

git-am makes use of git-mailsplit to split an mbox into individual files
before attempting to apply the patches. In most cases this works fine,
but it can fail to apply patches in cases where the mbox file is not
properly sorted. This can sometimes happen due to clock skew, or other
issues with the software which saved the mbox.

For example, if you download a t.mbox.gz from a public inbox server such
as lore.kernel.org it may sort the messages in the thread by arrival
time to the list. Due to clock skew or other issues this may not be the
correct order of the patches to apply.

A savvy user may then attempt to directly use git mailsplit to split the
mailbox, only to find that the files are unhelpfully named "0001",
"0002", etc. It requires further digging to figure out which message is
which patch.

Git has a format_sanitized_subject() function which is used by code to
generate a suitable filename from a subject. Add a new --name-by-subject
option to git mailsplit. If enabled, scan for lines beginning with the
"Subject:" header when splitting mail. If found, extract the subject and
pass it to format_sanitized_subject(). Use this to create a new filename
which appends the sanitized subject to the standard sequence number. A
savvy user can invoke git mailsplit with --name-by-subject to help
analyze why the mailbox was not split the intended way.

I originally wanted to avoid the need for an option, but git-am
currently depends on the strict sequence number filenames. It is unclear
how difficult it would be to refactor git-am to work with names that
include the extra subject data.

Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Jacob Keller <jacob.keller@gmail.com>
---
 Documentation/git-mailsplit.txt |  5 +++++
 builtin/mailsplit.c             | 25 ++++++++++++++++++++++++-
 t/t5100-mailinfo.sh             | 25 +++++++++++++++++++++++++
 3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-mailsplit.txt b/Documentation/git-mailsplit.txt
index 3f0a6662c81e..2e5ba45e1988 100644
--- a/Documentation/git-mailsplit.txt
+++ b/Documentation/git-mailsplit.txt
@@ -9,6 +9,7 @@ SYNOPSIS
 --------
 [verse]
 'git mailsplit' [-b] [-f<nn>] [-d<prec>] [--keep-cr] [--mboxrd]
+		[--name-by-subject]
 		-o<directory> [--] [(<mbox>|<Maildir>)...]
 
 DESCRIPTION
@@ -52,6 +53,10 @@ OPTIONS
 	Input is of the "mboxrd" format and "^>+From " line escaping is
 	reversed.
 
+--name-by-subject::
+	Include the sanitized subject in the generated filenames, in
+	addition to the sequence number.
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/mailsplit.c b/builtin/mailsplit.c
index 3af9ddb8ae5c..df81782d05b3 100644
--- a/builtin/mailsplit.c
+++ b/builtin/mailsplit.c
@@ -8,9 +8,10 @@
 #include "gettext.h"
 #include "string-list.h"
 #include "strbuf.h"
+#include "pretty.h"
 
 static const char git_mailsplit_usage[] =
-"git mailsplit [-d<prec>] [-f<n>] [-b] [--keep-cr] -o<directory> [(<mbox>|<Maildir>)...]";
+"git mailsplit [-d<prec>] [-f<n>] [-b] [--keep-cr] [--name-by-subject] -o<directory> [(<mbox>|<Maildir>)...]";
 
 static int is_from_line(const char *line, int len)
 {
@@ -46,6 +47,7 @@ static int is_from_line(const char *line, int len)
 static struct strbuf buf = STRBUF_INIT;
 static int keep_cr;
 static int mboxrd;
+static int name_by_subject;
 
 static int is_gtfrom(const struct strbuf *buf)
 {
@@ -66,6 +68,9 @@ static int is_gtfrom(const struct strbuf *buf)
  */
 static int split_one(FILE *mbox, const char *name, int allow_bare)
 {
+	struct strbuf sanitized_filename = STRBUF_INIT;
+	const char *subject_start;
+	size_t subject_len;
 	FILE *output;
 	int fd;
 	int status = 0;
@@ -101,10 +106,26 @@ static int split_one(FILE *mbox, const char *name, int allow_bare)
 			}
 			die_errno("cannot read mbox");
 		}
+
+		/* Get a sanitized filename from the subject */
+		if (name_by_subject && !sanitized_filename.len &&
+		    skip_prefix_mem(buf.buf, buf.len, "Subject:",
+				    &subject_start, &subject_len)) {
+			strbuf_addf(&sanitized_filename, "%s-", name);
+			format_sanitized_subject(&sanitized_filename,
+						 subject_start,
+						 subject_len);
+		}
+
 		if (!is_bare && is_from_line(buf.buf, buf.len))
 			break; /* done with one message */
 	}
 	fclose(output);
+
+	if (name_by_subject && sanitized_filename.len)
+		rename(name, sanitized_filename.buf);
+	strbuf_release(&sanitized_filename);
+
 	return status;
 }
 
@@ -296,6 +317,8 @@ int cmd_mailsplit(int argc, const char **argv, const char *prefix)
 			usage(git_mailsplit_usage);
 		} else if ( arg[1] == 'b' && !arg[2] ) {
 			allow_bare = 1;
+		} else if (!strcmp(arg, "--name-by-subject")) {
+			name_by_subject = 1;
 		} else if (!strcmp(arg, "--keep-cr")) {
 			keep_cr = 1;
 		} else if ( arg[1] == 'o' && arg[2] ) {
diff --git a/t/t5100-mailinfo.sh b/t/t5100-mailinfo.sh
index c8d06554541c..4826735c6033 100755
--- a/t/t5100-mailinfo.sh
+++ b/t/t5100-mailinfo.sh
@@ -44,6 +44,31 @@ do
 	'
 done
 
+test_expect_success 'split sample box with --name-by-subject' '
+	mkdir name-by-subject &&
+	git mailsplit --name-by-subject -oname-by-subject "$DATA/sample.mbox" >last &&
+	last=$(cat last) &&
+	echo total is $last &&
+	test $(cat last) = 18
+'
+
+check_mailinfo_name_by_subject () {
+	mail=$1
+	mo="$(basename "$mail" | cut -c1-4)"
+	echo "$(basename "$mail")" >"sanitized$mo" &&
+	git mailinfo -u "msg$mo" "patch$mo" <"$mail" >"info$mo" &&
+	test_cmp "$DATA/msg$mo" "msg$mo" &&
+	test_cmp "$DATA/patch$mo" "patch$mo" &&
+	test_cmp "$DATA/info$mo" "info$mo" &&
+	test_cmp "$DATA/sanitized$mo" "sanitized$mo"
+}
+
+for mail in name-by-subject/00*
+do
+	test_expect_success "check --name-by-subject $mail" '
+		check_mailinfo_name_by_subject "$mail"
+	'
+done
 
 test_expect_success 'split box with rfc2047 samples' \
 	'mkdir rfc2047 &&
-- 
2.44.0.53.g0f9d4d28b7e6



             reply	other threads:[~2024-04-09  0:06 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-09  0:05 Jacob Keller [this message]
2024-04-09  1:55 ` [PATCH] mailsplit add option to include sanitized subject in filename Junio C Hamano
2024-04-11  3:22   ` Jacob Keller
2024-04-11  5:29     ` Junio C Hamano
2024-04-11  5:55     ` Junio C Hamano
2024-04-11 18:52       ` Jacob Keller
2024-04-11 21:25         ` Junio C Hamano
2024-04-13  0:07           ` Jacob Keller

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240409000546.3628898-1-jacob.e.keller@intel.com \
    --to=jacob.e.keller@intel.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jacob.keller@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).