git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: git@vger.kernel.org
Cc: "Shawn O. Pearce" <spearce@spearce.org>,
	Junio C Hamano <gitster@pobox.com>,
	Nicolas Pitre <nico@fluxnic.net>
Subject: duplicate objects in packfile
Date: Wed, 14 Aug 2013 14:17:18 -0400	[thread overview]
Message-ID: <20130814181718.GA7911@sigill.intra.peff.net> (raw)

I'm tracking down a rather odd problem in a packfile I found on GitHub.
This particular packfile contains the same object at various offsets
within the file. In fact there are several packfiles that exhibit this
behavior, all in the same repository, and in each one there are several
duplicated objects (some of which are present 3 or even 4 times).

index-pack is happy to index these packfiles, and just creates multiple
entries for the object. The usual binary search in find_pack_entry_one
will find one of them (though of course which depends on the exact
layout of the index). But curiously, the experimental sha1_entry_pos
lookup does not.  It hits an assert() that can only be triggered in the
face of duplicate objects. For example:

  $ git cat-file -t 4ea4acbcb0930ac42acc87a0d203864dec1a9697
  commit

  $ GIT_USE_LOOKUP=1 git cat-file -t 4ea4acbcb0930ac42acc87a0d203864dec1a9697
  git: sha1-lookup.c:222: sha1_entry_pos: Assertion `lov < hiv' failed.
  Aborted

It's easy enough to fix with a repack, but I am curious:

  1. Is sha1_entry_pos wrong to barf on duplicate items in the index? If
     so, do we want to fix it, or simply retire GIT_USE_LOOKUP?

     Related, should we consider duplicate items in a packfile to be a
     bogus packfile (and consequently notice and complain during
     indexing)? I don't think it _hurts_ anything (aside from the assert
     above), though it is of course wasteful.

     I didn't go into detail on how the assertion above is triggered,
     but I can break it down line by line if anybody cares; the short of
     it is that it can only be caused by an unsorted index or a
     duplicate entry.

  2. How can duplicate entries get into a packfile?

     Git itself should not generate duplicate entries (pack-objects is
     careful to remove duplicates). Since these packs almost certainly
     were pushed by a client, I wondered if "index-pack --fix-thin"
     might accidentally add multiple copies of an object when it is the
     preferred base for multiple objects, but it specifically avoids
     doing so.

     The packs in question were received by us in 2010. Did we ever have
     bugs in this area? I don't recall any, nor could I find any in the
     history.

     That makes me suspect the user may have been using an alternate git
     implementation. libgit2 did not know how to pack in 2010, and Grit
     still doesn't.  JGit did, and I don't know offhand about Dulwich.
     Does anyone know of bugs in those implementations that might have
     caused this?

The packs in question are public, so I can share them if anybody is
curious to look (but the whole repository is on the order of 700M).

Given the dates on the packs and how rare this is, I'm pretty much
willing to chalk it up to a random bug (in git or otherwise) that does
not any longer exist. But I was curious if anybody else knew anything,
and we may want to fix sha1_entry_pos to behave more like the regular
binary search.

-Peff

             reply	other threads:[~2013-08-14 18:17 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-08-14 18:17 Jeff King [this message]
2013-08-14 18:29 ` duplicate objects in packfile Junio C Hamano
2013-08-14 18:39   ` Junio C Hamano
2013-08-14 18:54     ` Nicolas Pitre
2013-08-16 15:01     ` Jeff King
2013-08-21 20:49       ` [RFC/PATCH 0/4] duplicate objects in packfiles Jeff King
2013-08-21 20:51         ` [PATCH 1/4] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
2013-08-21 20:52         ` [PATCH 2/4] index-pack: optionally reject packs with duplicate objects Jeff King
2013-08-22 13:45           ` Duy Nguyen
2013-08-22 14:43             ` Jeff King
2013-08-22 23:12               ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Jeff King
2013-08-22 23:12                 ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King
2013-08-22 23:14                 ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
2013-08-23 16:41                   ` Junio C Hamano
2013-08-23 18:24                     ` Jeff King
2013-08-23 18:54                       ` Nicolas Pitre
2013-08-23 18:56                         ` Jeff King
2013-08-24  0:01                       ` [PATCHv3 0/6] duplicate objects and delta cycles Jeff King
2013-08-24  0:01                         ` [PATCH 1/6] test-sha1: add a binary output mode Jeff King
2013-08-24  0:02                         ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Jeff King
2013-08-24  0:02                         ` [PATCH 3/6] add tests for indexing packs with delta cycles Jeff King
2013-08-24  0:02                         ` [PATCH 4/6] test index-pack on packs with recoverable " Jeff King
2013-08-24  0:02                         ` [PATCH 5/6] index-pack: optionally reject packs with duplicate objects Jeff King
2013-08-24  0:02                         ` [PATCH 6/6] default pack.indexDuplicates to false Jeff King
2013-08-23 19:41                   ` [PATCH 2/6] sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP Johannes Sixt
2013-08-23 23:37                     ` Jeff King
2013-08-22 23:14                 ` [PATCH 3/6] add tests for indexing packs with delta cycles Jeff King
2013-08-22 23:15                 ` [PATCH 4/6] test index-pack on packs with recoverable " Jeff King
2013-08-22 23:15                 ` [PATCH 5/6] index-pack: optionally reject packs with duplicate objects Jeff King
2013-08-22 23:16                 ` [PATCH 6/6] default pack.indexDuplicates to false Jeff King
2013-08-22 23:35                 ` [PATCHv2 0/6] duplicate objects and delta cycles, oh my! Nicolas Pitre
2013-08-21 20:53         ` [PATCH 3/4] reject duplicates when indexing a packfile we created Jeff King
2013-08-21 20:55         ` [DO NOT APPLY PATCH 4/4] index-pack: optionally skip duplicate packfile entries Jeff King
2013-08-21 23:20           ` Junio C Hamano
2013-08-22  0:47             ` Jeff King
2013-08-21 22:17         ` [RFC/PATCH 0/4] duplicate objects in packfiles Junio C Hamano
2013-08-14 18:50 ` duplicate objects in packfile Nicolas Pitre

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130814181718.GA7911@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=nico@fluxnic.net \
    --cc=spearce@spearce.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).