From: Thomas Gummerer <t.gummerer@gmail.com>
To: git@vger.kernel.org
Cc: Robin Rosenberg <robin.rosenberg@dewire.com>,
Junio C Hamano <gitster@pobox.com>,
mhagger@alum.mit.edu, pclouds@gmail.com, trast@student.ethz.ch,
Johannes Sixt <j.sixt@viscovery.net>
Subject: [GSoC] Designing a faster index format - Progress report week 15
Date: Mon, 30 Jul 2012 22:20:11 +0200
Message-ID: <20120730202011.GC1006@tgummerer>
== Work done in the previous 14 weeks ==
- Definition of a tentative index file v5 format [1]. This differs
from the proposal in making it possible to bisect the directory
entries and file entries, i.e. to find them with a binary search.
The exact bits for each section were also defined. To compress the
index further, in addition to prefix compression, the stat data is
hashed: it is only ever compared for equality, and the plain values
are never used.
Thanks to Michael Haggerty, Nguyen Thai Ngoc Duy, Thomas Rast
and Robin Rosenberg for feedback.
- Prototype of a converter from the index format v2/v3 to the index
format v5. [2] The converter reads the index from a git repository,
can output parts of the index (header, index entries as in
git ls-files --debug, cache tree as in test-dump-cache-tree, or
the reuc data). Then it writes the v5 index file format to
.git/index-v5. Thanks to Michael Haggerty for the code review.
- Prototype of a reader for the new index file format. [3] The
reader's main purpose is to show the algorithm used to read the
index in lexicographic order of the full path name, which is
required by the current in-memory format. Big thanks to Michael
Haggerty for reviewing this code and giving me advice on
refactoring.
- Read the on-disk index file format and translate it to the current
in-memory format. This doesn't include reading any of the current
extensions, which are now part of the main index. The code is again
on GitHub. [4] Thanks to Thomas Rast for reviewing the first steps.
- Read the cache-tree data (formerly an extension, now integrated
with the rest of the directory data) from the new on-disk format.
There are still a few optimizations to do in this algorithm.
- Started implementing the API (suggested by Duy). There is one
commit for this on GitHub [1], but it's still a very early work in
progress.
- Started implementing the writer, which extracts the directories from
the in-memory format, and writes the header and the directories to
disk. The algorithm uses a hash-table instead of a simple list,
to avoid many corner cases.
- Implemented writing the file block to disk. Basic tests from the
test suite pass, except for those that require conflict data or the
cache-tree, neither of which is implemented yet.
- Started implementing a patch to introduce a ce_namelen field in
struct cache_entry and drop the name length from the flags. [5]
Thanks to Junio, Duy and Thomas for reviews and suggestions for
improving it.
- Implemented the cache-tree and conflict data writing to the
index-v5 file.
- Implemented the rest of the index-v5 code, so that it passes the
test suite.
- Added a hack for partial loading of index-v5 for git ls-files and
git grep. For performance results of this hack see: [6]
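
As an illustration of the stat data hashing mentioned in the first
item above, here is a minimal sketch in Python (the language of the
prototypes). It only shows the idea that a checksum is enough when
the data is compared for equality and never read back. The choice of
fields, their packing order and the function names are assumptions
made for this example; the actual layout is the one defined in [1].

```python
import os
import struct
import zlib


def stat_crc(size: int, ino: int, uid: int, gid: int) -> int:
    """Collapse a few stat fields into a single 32-bit CRC.

    Field selection and packing order are assumptions for this sketch,
    not the layout defined by the v5 format.
    """
    mask = 0xFFFFFFFFFFFFFFFF  # keep every value inside an unsigned 64-bit slot
    packed = struct.pack(">QQQQ", size & mask, ino & mask, uid & mask, gid & mask)
    return zlib.crc32(packed)


def stat_crc_of_path(path: str) -> int:
    """Compute the checksum for a file in the working tree."""
    st = os.stat(path)
    return stat_crc(st.st_size, st.st_ino, st.st_uid, st.st_gid)


def stat_unchanged(stored_crc: int, path: str) -> bool:
    """Equality is the only question asked, so the plain values are not needed."""
    return stored_crc == stat_crc_of_path(path)
```

With this, an index entry would store one 32-bit value instead of
several stat fields, at the cost of a (small) chance of a collision
hiding a change.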
== Work done in the last week ==
- Lots of refactoring of the index-v5 code
- Some slight optimizations of the code
- Brought the Python reader up to date, and added the possibility to
update a single index entry, to test the re-reading code when a
single index entry is updated. (Updating a single index entry is
not implemented in C yet.)
- Made the reader re-read a single index entry if its CRC does not
match.
- Implemented the new racy code for git, along the lines of what Thomas
posted at [7]. The code also addresses the concerns of Johannes and
Junio by using the timestamp of the index that has already been
written, instead of the time of the index file that is about to be
written. Checking whether an index entry really changed is left to
the reader. If anyone is interested, the code is at [8].
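
The re-reading of a single entry on a CRC mismatch, mentioned above,
could look roughly like the following Python sketch. It assumes a
simplified record layout (payload followed by a big-endian CRC32 of
that payload) and a fixed number of retries; both are assumptions for
illustration, and pack_entry, read_entry and EntryCorrupt are
hypothetical names, not part of the actual code.

```python
import io
import struct
import zlib


class EntryCorrupt(Exception):
    """Raised when an entry still fails its CRC check after all retries."""


def pack_entry(payload: bytes) -> bytes:
    # Assumed layout for this sketch: payload followed by its CRC32.
    return payload + struct.pack(">I", zlib.crc32(payload))


def read_entry(f, offset: int, length: int, retries: int = 3) -> bytes:
    """Read one entry and re-read it if the checksum does not match.

    A mismatch may only mean that another process was updating this
    entry while it was being read, so the read is tried again before
    giving up.
    """
    for _ in range(retries):
        f.seek(offset)
        record = f.read(length + 4)
        if len(record) != length + 4:
            continue  # short read, try again
        payload = record[:length]
        (stored,) = struct.unpack(">I", record[length:])
        if zlib.crc32(payload) == stored:
            return payload
    raise EntryCorrupt(f"entry at offset {offset} failed its CRC check")
```

For example, read_entry(io.BytesIO(pack_entry(b"README")), 0, 6)
returns the payload unchanged, while a record with a flipped bit
raises EntryCorrupt once the retries are used up.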
== Outlook for the next week ==
- Fix the remaining minor nits in the code.
- Bring the history into a releasable form, and send the patches to
the list.
[1] https://github.com/tgummerer/git/wiki/Index-file-format-v5
[2] https://github.com/tgummerer/git/blob/pythonprototype/git-convert-index.py
[3] https://github.com/tgummerer/git/blob/pythonprototype/git-read-index-v5.py
[4] https://github.com/tgummerer/git/tree/index-v5
[5] http://thread.gmane.org/gmane.comp.version-control.git/200997
[6] http://thread.gmane.org/gmane.comp.version-control.git/201964
[7] http://thread.gmane.org/gmane.comp.version-control.git/199309
[8] https://github.com/tgummerer/git/tree/racy-WIP