git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: David Turner <dturner@twopensource.com>
Cc: "René Scharfe" <l.s.r@web.de>,
	git@vger.kernel.org, "Keith McGuigan" <kmcguigan@twitter.com>
Subject: Re: [PATCH v3] merge: fix cache_entry use-after-free
Date: Mon, 19 Oct 2015 19:13:51 -0700	[thread overview]
Message-ID: <xmqqh9lm9ti8.fsf@gitster.mtv.corp.google.com> (raw)
In-Reply-To: <1445293649.3418.16.camel@twopensource.com> (David Turner's message of "Mon, 19 Oct 2015 18:27:29 -0400")

David Turner <dturner@twopensource.com> writes:

> The problem is not the name_hash, but with the dir_hash.  The dir_hash
> is only even populated on case-insensitive filesystems (which is why you
> and Linus don't see this bug).  The dir_hash is not designed to catch
> d/f conflicts, but rather case conflicts (one side of a merge has
> foo/bar; the other has FOO/bar).  For each directory, it maintains a
> pointer to a cache_entry.  Then add_to_index uses that cache_entry to 
> rewrite incoming entries' parent directories to the expected case.
> Weirdly, that part of add_to_index is, so far as I can tell, never
> called during a merge.  So where's are we using the freed value?
> Probably in dir_entry_cmp, while adding another entry to the hashmap;
> that's where our crash seems to have happened.  And our failure depended
> on the details of the hash function as well; if a certain collision did
> not happen, we would not see it.
>
> There is, I think, another way to handle this: we could *copy* the path
> into the struct dir_entry instead of pointing to a cache_entry.  But
> this seems like a bunch of extra work; the refcounting is simpler.  

This codepath is a mess.  Whoever wrote hash_dir_entry() seems to
have already known that the code is buggy by leaving this comment
there:

 * Note that the cache_entry stored with the dir_entry merely
 * supplies the name of the directory (up to dir_entry.namelen). We
 * track the number of 'active' files in a directory in dir_entry.nr,
 * so we can tell if the directory is still relevant, e.g. for git
 * status. However, if cache_entries are removed, we cannot pinpoint
 * an exact cache_entry that's still active. It is very possible that
 * multiple dir_entries point to the same cache_entry.

If you add "a/frotz" to the index, the code will somehow create a
dir_entry to represent that the canonical case for directory "a/" is
such, so that when you later try to add "A/nitfol", the code can
grab the canonical path "a/" from dir_entry hash and turn it into
"a/nitfol".  That is what happens in add_to_index in read-cache.c

But what happens if, after adding "a/frotz" and "a/nitfol" to the
index like that, you remove "a/frotz"?  I do not see any code that
notices the poitner to "a/frotz" will become stale and replace it
with a pointer to "a/nitfol" that is still in the system.  The next
time you try to add "a/xyzzy", find_dir_entry() will try to see if
there is an existing entry that matches "a/xyzzy"'s directory, and
one of the dir_entries has a stale pointer to ce for "a/frotz" that
is already gone.

I think the root cause of the bug is the external interface to the
index_dir_exists() function.  For the above processing, there is no
reason for dir_entry() to hold a pointer to ce for "a/frotz" or
"a/nitfol".  All it needs to have is an embedded string in the
structure itself, i.e.

        struct dir_entry {
                struct hashmap_entry ent;
                struct dir_entry *parent;
                int nr;
                unsigned int namelen;
                char name[FLEX_ARRAY];
        };

The caller of index_dir_exists() in add_to_index() can be adjusted
to work with a new function in name-hash.c that does this part:

	const char *startPtr = ce->name;
	const char *ptr = startPtr;
	while (*ptr) {
		while (*ptr && *ptr != '/')
			++ptr;
		if (*ptr == '/') {
			struct cache_entry *foundce;
			++ptr;
			foundce = index_dir_exists(istate, ce->name, ptr - ce->name - 1);
			if (foundce) {
				memcpy((void *)startPtr, foundce->name + (startPtr - ce->name), ptr - startPtr);
				startPtr = ptr;
			}
		}
	}

Perhaps name it "adjust_dirname_case(istate, ce->name)" and have it
do the whole while loop above, all inside name-hash.c.  That would
make this caller a lot easier to read and understand what is going
on.

The other one, directory_exists_in_index_icase() in dir.c, expects
that a ce is returned, but the way it uses the returned value does
not really require the function to return a ce.  It does look at the
ce->ce_mode but that is only because the way index_dir_exists()
tells its caller if there is a directory (hence some files inside
it) or if there is a submodule (hence there is nothing below it) by
making a fallback call internally to index_file_exists() and returns
the ce for gitlink only when (1) there is no directory in dir_hash
and (2) file_hash as a submodule in that path.

A more cleaner helper function to give what this caller really wants
to know that name-hash.c can provide would be a function that takes
pathname and len and returns an enum: "there is nothing there",
"there is a directory", or "there is a submodule".  For completeness
of the API, even though this sole caller does not need it, we could
throw in "there is a regular file" there, too, if we wanted to.

If I were to clean this up, I would probably use the above updated
definition of struct dir_entry and:

 * drop the fallback "return gitlink if there is one" from
   index_dir_exists(), and make index_dir_exists() return 0 (false)
   or 1 (true).

 * have directory_exists_in_index() first call the updated
   index_dir_exists() via the cache_dir_exists() convenience macro.
   If it returned true, it should return index_direcgtory.  If it
   returned false, make a call to cache_file_exists() to see if
   there is a gitlink and rturn index_gitdir when that is the case.
   When all of the above fails, return index_nonexistent.

That way, we do not have to worry about "this ce no longer exists in
the main index but we need to keep it around", saving 8-bytes from
each ce and refcounting mess in the code.

      reply	other threads:[~2015-10-20  2:14 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-10-14 22:07 [PATCH v3] merge: fix cache_entry use-after-free David Turner
2015-10-15  3:35 ` René Scharfe
2015-10-15 19:02   ` David Turner
2015-10-15 20:38     ` René Scharfe
2015-10-15 20:51     ` Junio C Hamano
2015-10-16  7:05       ` David Turner
2015-10-16 16:04         ` Junio C Hamano
2015-10-19 22:27           ` David Turner
2015-10-20  2:13             ` Junio C Hamano [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqh9lm9ti8.fsf@gitster.mtv.corp.google.com \
    --to=gitster@pobox.com \
    --cc=dturner@twopensource.com \
    --cc=git@vger.kernel.org \
    --cc=kmcguigan@twitter.com \
    --cc=l.s.r@web.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).