git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Victoria Dye via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: peff@peff.net, derrickstolee@github.com, gitster@pobox.com,
	Phillip Wood <phillip.wood123@gmail.com>,
	Victoria Dye <vdye@github.com>, Victoria Dye <vdye@github.com>
Subject: [PATCH v2] read-cache: avoid misaligned reads in index v4
Date: Wed, 28 Sep 2022 17:19:00 +0000	[thread overview]
Message-ID: <pull.1366.v2.git.1664385541084.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.1366.git.1663962236069.gitgitgadget@gmail.com>

From: Victoria Dye <vdye@github.com>

The process for reading the index into memory from disk is to first read its
contents into a single memory-mapped file buffer (type 'char *'), then
sequentially convert each on-disk index entry into a corresponding incore
'cache_entry'. To access the contents of the on-disk entry for processing, a
moving pointer within the memory-mapped file is cast to type 'struct
ondisk_cache_entry *'.

In index v4, the entries in the on-disk index file are written *without*
aligning their first byte to a 4-byte boundary; entries are a variable
length (depending on the entry name and whether or not extended flags are
used). As a result, casting the 'char *' buffer pointer to 'struct
ondisk_cache_entry *' then accessing its contents in a 'SANITIZE=undefined'
build can trigger the following error:

  read-cache.c:1886:46: runtime error: member access within misaligned
  address <address> for type 'struct ondisk_cache_entry', which requires 4
  byte alignment

Avoid this error by reading fields directly from the 'char *' buffer, using
the 'offsetof' individual fields in 'struct ondisk_cache_entry'.
Additionally, add documentation describing why the new approach avoids the
misaligned address error, as well as advice on how to improve the
implementation in the future.

Reported-by: Jeff King <peff@peff.net>
Signed-off-by: Victoria Dye <vdye@github.com>
---
    read-cache: avoid misaligned reads in index v4
    
    This fixes the bug reported in [1], where unaligned index entries in the
    memory-mapped index file triggered a 'SANITIZE=undefined' error due to
    casting to & accessing unaligned data a 4 byte-aligned 'struct
    ondisk_cache_entry *' type.
    
    In addition to the originally-reported 't9210-scalar.sh' now passing in
    a 'SANITIZE=undefined' build, I did some light testing by first writing
    a v4 index with a released version of Git (v2.37), then running some
    index-modifying operations ('git status', 'git add') with this patch's
    changes, then again running 'git status' with the stable version. I
    didn't see anything out of the ordinary but, considering how critical
    "reading the index" is, I'd very much appreciate some extra-thorough
    reviews on this patch. :)
    
    
    Changes since V1
    ================
    
     * Added a comment explaining the somewhat unintuitive use of the
       'ondisk' buffer in 'create_from_disk()', including why the 'get_be*'
       functions do not run into the same misaligned address error this
       patch is fixing.
     * Added a 'NEEDSWORK' comment recommending a future refactor to avoid
       the need for 'struct ondisk_cache_entry' entirely.
    
    Thanks!
    
     * Victoria
    
    [1]
    https://lore.kernel.org/git/YywzNTzd72tox8Z+@coredump.intra.peff.net/

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1366%2Fvdye%2Fbugfix%2Findex-v4-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1366/vdye/bugfix/index-v4-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1366

Range-diff vs v1:

 1:  be9039c810b ! 1:  0f0fe7cabfd read-cache: avoid misaligned reads in index v4
     @@ Commit message
      
          Avoid this error by reading fields directly from the 'char *' buffer, using
          the 'offsetof' individual fields in 'struct ondisk_cache_entry'.
     +    Additionally, add documentation describing why the new approach avoids the
     +    misaligned address error, as well as advice on how to improve the
     +    implementation in the future.
      
          Reported-by: Jeff King <peff@peff.net>
          Signed-off-by: Victoria Dye <vdye@github.com>
      
       ## read-cache.c ##
      @@ read-cache.c: static int read_index_extension(struct index_state *istate,
     + 	return 0;
     + }
       
     ++/*
     ++ * Parses the contents of the cache entry contained within the 'ondisk' buffer
     ++ * into a new incore 'cache_entry'.
     ++ *
     ++ * Note that 'char *ondisk' may not be aligned to a 4-byte address interval in
     ++ * index v4, so we cannot cast it to 'struct ondisk_cache_entry *' and access
     ++ * its members. Instead, we use the byte offsets of members within the struct to
     ++ * identify where 'get_be16()', 'get_be32()', and 'oidread()' (which can all
     ++ * read from an unaligned memory buffer) should read from the 'ondisk' buffer
     ++ * into the corresponding incore 'cache_entry' members.
     ++ */
       static struct cache_entry *create_from_disk(struct mem_pool *ce_mem_pool,
       					    unsigned int version,
      -					    struct ondisk_cache_entry *ondisk,
     @@ read-cache.c: static struct cache_entry *create_from_disk(struct mem_pool *ce_me
      -	ce->ce_stat_data.sd_uid   = get_be32(&ondisk->uid);
      -	ce->ce_stat_data.sd_gid   = get_be32(&ondisk->gid);
      -	ce->ce_stat_data.sd_size  = get_be32(&ondisk->size);
     ++	/*
     ++	 * NEEDSWORK: using 'offsetof()' is cumbersome and should be replaced
     ++	 * with something more akin to 'load_bitmap_entries_v1()'s use of
     ++	 * 'read_be16'/'read_be32'. For consistency with the corresponding
     ++	 * ondisk entry write function ('copy_cache_entry_to_ondisk()'), this
     ++	 * should be done at the same time as removing references to
     ++	 * 'ondisk_cache_entry' there.
     ++	 */
      +	ce->ce_stat_data.sd_ctime.sec = get_be32(ondisk + offsetof(struct ondisk_cache_entry, ctime)
      +							+ offsetof(struct cache_time, sec));
      +	ce->ce_stat_data.sd_mtime.sec = get_be32(ondisk + offsetof(struct ondisk_cache_entry, mtime)


 read-cache.c | 61 ++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 42 insertions(+), 19 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index b09128b1884..32024029274 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1873,9 +1873,20 @@ static int read_index_extension(struct index_state *istate,
 	return 0;
 }
 
+/*
+ * Parses the contents of the cache entry contained within the 'ondisk' buffer
+ * into a new incore 'cache_entry'.
+ *
+ * Note that 'char *ondisk' may not be aligned to a 4-byte address interval in
+ * index v4, so we cannot cast it to 'struct ondisk_cache_entry *' and access
+ * its members. Instead, we use the byte offsets of members within the struct to
+ * identify where 'get_be16()', 'get_be32()', and 'oidread()' (which can all
+ * read from an unaligned memory buffer) should read from the 'ondisk' buffer
+ * into the corresponding incore 'cache_entry' members.
+ */
 static struct cache_entry *create_from_disk(struct mem_pool *ce_mem_pool,
 					    unsigned int version,
-					    struct ondisk_cache_entry *ondisk,
+					    const char *ondisk,
 					    unsigned long *ent_size,
 					    const struct cache_entry *previous_ce)
 {
@@ -1883,7 +1894,7 @@ static struct cache_entry *create_from_disk(struct mem_pool *ce_mem_pool,
 	size_t len;
 	const char *name;
 	const unsigned hashsz = the_hash_algo->rawsz;
-	const uint16_t *flagsp = (const uint16_t *)(ondisk->data + hashsz);
+	const char *flagsp = ondisk + offsetof(struct ondisk_cache_entry, data) + hashsz;
 	unsigned int flags;
 	size_t copy_len = 0;
 	/*
@@ -1901,15 +1912,15 @@ static struct cache_entry *create_from_disk(struct mem_pool *ce_mem_pool,
 
 	if (flags & CE_EXTENDED) {
 		int extended_flags;
-		extended_flags = get_be16(flagsp + 1) << 16;
+		extended_flags = get_be16(flagsp + sizeof(uint16_t)) << 16;
 		/* We do not yet understand any bit out of CE_EXTENDED_FLAGS */
 		if (extended_flags & ~CE_EXTENDED_FLAGS)
 			die(_("unknown index entry format 0x%08x"), extended_flags);
 		flags |= extended_flags;
-		name = (const char *)(flagsp + 2);
+		name = (const char *)(flagsp + 2 * sizeof(uint16_t));
 	}
 	else
-		name = (const char *)(flagsp + 1);
+		name = (const char *)(flagsp + sizeof(uint16_t));
 
 	if (expand_name_field) {
 		const unsigned char *cp = (const unsigned char *)name;
@@ -1935,20 +1946,32 @@ static struct cache_entry *create_from_disk(struct mem_pool *ce_mem_pool,
 
 	ce = mem_pool__ce_alloc(ce_mem_pool, len);
 
-	ce->ce_stat_data.sd_ctime.sec = get_be32(&ondisk->ctime.sec);
-	ce->ce_stat_data.sd_mtime.sec = get_be32(&ondisk->mtime.sec);
-	ce->ce_stat_data.sd_ctime.nsec = get_be32(&ondisk->ctime.nsec);
-	ce->ce_stat_data.sd_mtime.nsec = get_be32(&ondisk->mtime.nsec);
-	ce->ce_stat_data.sd_dev   = get_be32(&ondisk->dev);
-	ce->ce_stat_data.sd_ino   = get_be32(&ondisk->ino);
-	ce->ce_mode  = get_be32(&ondisk->mode);
-	ce->ce_stat_data.sd_uid   = get_be32(&ondisk->uid);
-	ce->ce_stat_data.sd_gid   = get_be32(&ondisk->gid);
-	ce->ce_stat_data.sd_size  = get_be32(&ondisk->size);
+	/*
+	 * NEEDSWORK: using 'offsetof()' is cumbersome and should be replaced
+	 * with something more akin to 'load_bitmap_entries_v1()'s use of
+	 * 'read_be16'/'read_be32'. For consistency with the corresponding
+	 * ondisk entry write function ('copy_cache_entry_to_ondisk()'), this
+	 * should be done at the same time as removing references to
+	 * 'ondisk_cache_entry' there.
+	 */
+	ce->ce_stat_data.sd_ctime.sec = get_be32(ondisk + offsetof(struct ondisk_cache_entry, ctime)
+							+ offsetof(struct cache_time, sec));
+	ce->ce_stat_data.sd_mtime.sec = get_be32(ondisk + offsetof(struct ondisk_cache_entry, mtime)
+							+ offsetof(struct cache_time, sec));
+	ce->ce_stat_data.sd_ctime.nsec = get_be32(ondisk + offsetof(struct ondisk_cache_entry, ctime)
+							 + offsetof(struct cache_time, nsec));
+	ce->ce_stat_data.sd_mtime.nsec = get_be32(ondisk + offsetof(struct ondisk_cache_entry, mtime)
+							 + offsetof(struct cache_time, nsec));
+	ce->ce_stat_data.sd_dev   = get_be32(ondisk + offsetof(struct ondisk_cache_entry, dev));
+	ce->ce_stat_data.sd_ino   = get_be32(ondisk + offsetof(struct ondisk_cache_entry, ino));
+	ce->ce_mode  = get_be32(ondisk + offsetof(struct ondisk_cache_entry, mode));
+	ce->ce_stat_data.sd_uid   = get_be32(ondisk + offsetof(struct ondisk_cache_entry, uid));
+	ce->ce_stat_data.sd_gid   = get_be32(ondisk + offsetof(struct ondisk_cache_entry, gid));
+	ce->ce_stat_data.sd_size  = get_be32(ondisk + offsetof(struct ondisk_cache_entry, size));
 	ce->ce_flags = flags & ~CE_NAMEMASK;
 	ce->ce_namelen = len;
 	ce->index = 0;
-	oidread(&ce->oid, ondisk->data);
+	oidread(&ce->oid, (const unsigned char *)ondisk + offsetof(struct ondisk_cache_entry, data));
 
 	if (expand_name_field) {
 		if (copy_len)
@@ -2117,12 +2140,12 @@ static unsigned long load_cache_entry_block(struct index_state *istate,
 	unsigned long src_offset = start_offset;
 
 	for (i = offset; i < offset + nr; i++) {
-		struct ondisk_cache_entry *disk_ce;
 		struct cache_entry *ce;
 		unsigned long consumed;
 
-		disk_ce = (struct ondisk_cache_entry *)(mmap + src_offset);
-		ce = create_from_disk(ce_mem_pool, istate->version, disk_ce, &consumed, previous_ce);
+		ce = create_from_disk(ce_mem_pool, istate->version,
+				      mmap + src_offset,
+				      &consumed, previous_ce);
 		set_index_entry(istate, i, ce);
 
 		src_offset += consumed;

base-commit: 1b3d6e17fe83eb6f79ffbac2f2c61bbf1eaef5f8
-- 
gitgitgadget

  parent reply	other threads:[~2022-09-28 17:19 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-23 19:43 [PATCH] read-cache: avoid misaligned reads in index v4 Victoria Dye via GitGitGadget
2022-09-23 21:39 ` Jeff King
2022-09-23 22:04   ` Junio C Hamano
2022-09-26 15:39   ` Victoria Dye
2022-09-26 17:35     ` Jeff King
2022-09-26 19:08   ` Jeff King
2022-09-26 19:31     ` Jeff King
2022-09-26 23:02       ` Junio C Hamano
2022-09-25  8:25 ` Phillip Wood
2022-09-26 17:54   ` Junio C Hamano
2022-09-28 17:19 ` Victoria Dye via GitGitGadget [this message]
2022-09-28 17:34   ` [PATCH v2] " Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=pull.1366.v2.git.1664385541084.gitgitgadget@gmail.com \
    --to=gitgitgadget@gmail.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=phillip.wood123@gmail.com \
    --cc=vdye@github.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).