Possible improvement in DB structure

git@vger.kernel.org mailing list mirror (one of many)

* Possible improvement in DB structure
@ 2019-12-23 13:00 Arnaud Bertrand
  2019-12-23 19:09 ` brian m. carlson
  0 siblings, 1 reply; 4+ messages in thread
From: Arnaud Bertrand @ 2019-12-23 13:00 UTC (permalink / raw)
  To: git

Hello,

According to my understanding, git has only 3 kinds of objects:
(excluding the packed version)
- the blobs
- the trees
- the commits

Today to parse all objects of the same type, it is necessary to parse
all the objects and test them one by one.

It would be simple to organize objects in
.git/objects/blobs
.git/objects/trees
.git/objects/commits

Maybe because of my limited knowledge of git, I don't see any advantage
in putting everything together.
By splitting the objects directory, the gain in performance could be
significant, the scripts simpler, and the representation clearer.

To stay backward compatible, we can imagine a get-object() function that
searches
.git/objects/blobs
.git/objects/trees
.git/objects/commits
and, when the object is not found there,
.git/objects

A get-tree() function would first search
.git/objects/trees
and, when the object is not found there,
.git/objects

Likewise for get-blob() and get-commit().
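
The lookup order described above could be sketched as follows. This is
only an illustration of the proposal: the blobs/, trees/ and commits/
subdirectories do not exist in Git, and the helper names, as well as the
assumption that each subdirectory keeps Git's two-hex-digit fan-out, are
hypothetical.

```python
from pathlib import Path
from typing import Optional

# Hypothetical per-type subdirectories from the proposal; Git does not use them.
TYPE_DIRS = ("blobs", "trees", "commits")


def loose_path(base: Path, oid: str) -> Path:
    """Return base/<first two hex digits>/<remaining digits> for an object ID."""
    return base / oid[:2] / oid[2:]


def get_object_path(objects_dir: Path, oid: str) -> Optional[Path]:
    """Search the proposed per-type directories, then fall back to the current layout."""
    for type_dir in TYPE_DIRS:
        candidate = loose_path(objects_dir / type_dir, oid)
        if candidate.is_file():
            return candidate
    # Backward-compatible fallback: the layout Git uses today.
    legacy = loose_path(objects_dir, oid)
    return legacy if legacy.is_file() else None
```

In the worst case this checks one path per type directory plus the
fallback before it can report that no loose object exists.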

Is there a reason I am missing behind the decision to put
everything together?

Best regards,

Arnaud Bertrand


* Re: Possible improvement in DB structure
  2019-12-23 13:00 Possible improvement in DB structure Arnaud Bertrand
@ 2019-12-23 19:09 ` brian m. carlson
  2019-12-23 20:46   ` Arnaud Bertrand
  0 siblings, 1 reply; 4+ messages in thread
From: brian m. carlson @ 2019-12-23 19:09 UTC (permalink / raw)
  To: Arnaud Bertrand; +Cc: git

On 2019-12-23 at 13:00:46, Arnaud Bertrand wrote:
> Hello,
> 
> According to my understanding, git has only 3 kinds of objects:
> (excluding the packed version)
> - the blobs
> - the trees
> - the commits

There are also tags.

> Today to parse all objects of the same type, it is necessary to parse
> all the objects and test them one by one.

This isn't a behavior we often want.  Can you say more about why you
want to do this?

> May be due to my limited knowledge of git, I don't see any advantage
> to put everything together.
> By splitting the objects directory, the gain in performance could be
> important, the scripts simplified, the representation more clear.

Oftentimes, we want to look up an item that we would refer to as a
tree-ish.  That means that any tag, commit, or tree can be used in this
case and it will automatically be resolved to the appropriate tree.

Currently, we can look for any loose object, and then look for any
packed object, which is a limited number of lookups (at most, the number
of packs plus one).  Your proposal would have us look up at most the
number of packs plus six.

In addition, we sometimes know that we need to look up an object, but
don't know its type.  We would incur additional costs in this case as
well.

I'm not sure that we would gain a lot other than conceptual tidiness,
but we would incur additional performance costs.  We can currently
tell the types of all of these objects apart by simply reading the
object header, which on a 64-bit system cannot exceed 28 bytes; we do
this in some cases, such as `git cat-file --batch`.
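
As a rough illustration of reading just the header: a loose object is a
zlib stream whose inflated form starts with "<type> <size>" followed by
a NUL byte.  The sketch below inflates only the first few bytes; the
function name and the 32-byte cap are assumptions chosen for the
example, not Git's own code.

```python
import zlib
from typing import Tuple


def read_loose_header(compressed: bytes, max_header: int = 32) -> Tuple[str, int]:
    """Return (type, size) from a zlib-compressed loose object.

    Only the first max_header inflated bytes are produced, so the object
    body is never fully decompressed.
    """
    inflater = zlib.decompressobj()
    head = inflater.decompress(compressed, max_header)
    # The header ends at the first NUL byte: b"<type> <size>\x00".
    header, sep, _ = head.partition(b"\x00")
    if not sep:
        raise ValueError("no NUL terminator found in the first bytes of the object")
    kind, _, size = header.partition(b" ")
    return kind.decode("ascii"), int(size)
```

Given the raw bytes of a file under .git/objects/, the function returns
a pair such as the type name and the content length in bytes.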
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204



* Re: Possible improvement in DB structure
  2019-12-23 19:09 ` brian m. carlson
@ 2019-12-23 20:46   ` Arnaud Bertrand
  2019-12-23 21:41     ` Jonathan Nieder
  0 siblings, 1 reply; 4+ messages in thread
From: Arnaud Bertrand @ 2019-12-23 20:46 UTC (permalink / raw)
  To: brian m. carlson, Arnaud Bertrand, git

Hello Brian,

I think that today tags are not located in the objects directory but in
refs/tags, which is a good idea. ;-)

What prompted my thinking was that I wanted to find an old file.

I knew that earlier in my project's history we had started to write a
driver for a device and then abandoned it. I wanted to find this file.
I knew a "key line" to search for and I knew the file was a .c file,
but I didn't know the exact name.

So the goal was to walk the whole database, find all the different .c
files, and grep them to find the driver.

And there the problems began... I had a huge database and I wrote
a script that had to:
1. Identify all the trees (straightforward if all trees are in objects/trees)
2. In each tree, identify all the different *.c files
3. grep for the "key line" in them

Well, as I said, I had a huge database and it took a long time to get
the information.

If the objects had been separated by type from the start, it would have
been much simpler.

This is just one example. In the end, I wrote a cron job that unpacks
everything and saves the SHA of every tree in a file that scripts can
parse.

So a small change in the DB structure could be very helpful for this
kind of need.
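
For comparison, here is a sketch of collecting object IDs by type
without unpacking anything. It assumes a Git recent enough to support
`git cat-file --batch-all-objects --batch-check`, whose default output
is one "<oid> <type> <size>" line per object. The function names are
made up for the example.

```python
import subprocess
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_type(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group object IDs by type from '<oid> <type> <size>' lines."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        oid, kind, _size = fields
        groups[kind].append(oid)
    return dict(groups)


def list_objects_by_type(repo: str = ".") -> Dict[str, List[str]]:
    """Ask Git for every object, loose or packed, and group the IDs by type."""
    result = subprocess.run(
        ["git", "-C", repo, "cat-file", "--batch-all-objects", "--batch-check"],
        check=True,
        capture_output=True,
        text=True,
    )
    return group_by_type(result.stdout.splitlines())
```

Calling list_objects_by_type() inside a repository and taking the "tree"
entry of the result gives the list of tree IDs that the cron job writes
to a file.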

As for searching for an arbitrary object taking more time: it is very
rare to look for an object without knowing its type, and searching 3
subdirectories instead of one is not very time-consuming compared with
the operation described above.

Arnaud Bertrand, Belgium



* Re: Possible improvement in DB structure
  2019-12-23 20:46   ` Arnaud Bertrand
@ 2019-12-23 21:41     ` Jonathan Nieder
  0 siblings, 0 replies; 4+ messages in thread
From: Jonathan Nieder @ 2019-12-23 21:41 UTC (permalink / raw)
  To: Arnaud Bertrand; +Cc: brian m. carlson, git

Hi Arnaud,

Arnaud Bertrand wrote:

> Today, I think that tags are not located in objects directory but in
> refs/tags which is a good idea.;-)

Not precisely.  See "git help repository-layout" for more details, or
https://www.kernel.org/pub/software/scm/git/docs/user-manual.html#hacking-git
or the "git internals" chapter of https://git-scm.com/book/.

> The origin of my reflection was that I wanted to find an old file.
>
> I knew that in the past of my project, we had started to write a
> driver for a device and it was abandoned. I wanted to find this file.
> I knew a "key line" to search for and I knew the file was a .c file
> but I didn't know the exact name.

Thanks for this context!  It's very helpful.

> So, the goal was to parse all the database, find all the different .c
> files and grep it to find the the driver.

Git intends to make this kind of history mining not too difficult.
You can run a command like

	git log --all -S'the key line' -- '*.c'

and it should do the right thing.  Or you can do something more
complex using something like "git rev-list --all | git diff-tree
--stdin --name-only --diff-filter=D" (to show deleted files).
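
A minimal sketch of driving the first command from a script, assuming
`git` is on the PATH; the function names are hypothetical. Passing the
arguments as a list avoids shell quoting problems with the search
string.

```python
import subprocess
from typing import List


def pickaxe_command(needle: str, pathspec: str = "*.c") -> List[str]:
    """Build the argv for: git log --all -S<needle> -- <pathspec>."""
    return ["git", "log", "--all", f"-S{needle}", "--", pathspec]


def search_history(needle: str, pathspec: str = "*.c", repo: str = ".") -> str:
    """Run the pickaxe search in repo and return the log output as text."""
    result = subprocess.run(
        pickaxe_command(needle, pathspec),
        cwd=repo,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

The log lists the commits in which the number of occurrences of the
string changed in a matching file, which includes the commit that
deleted the abandoned driver.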

Is the problem that that command is too slow?

Hope that helps,
Jonathan

