* Question: How range-diff lapjv algorithm work
From: ZheNing Hu @ 2023-03-08  6:37 UTC (permalink / raw)
  To: Git List; +Cc: Johannes Schindelin, Junio C Hamano, t.gummerer

I was recently reading the source code of git-range-diff.

Its Steps:
1. Use git log <range> to generate two collections of patches for
ranges a and b.
2. Use a hash table over both sets to find commits whose patches are
byte-for-byte identical ("exact matches").
3. Generate a cost matrix whose side length is a.nr + b.nr (see the
sketch right after this list). The cost of an exactly matched pair is
0, and the cost between two non-identical patches is the diffsize(a, b)
of their patches.
4. Use the LAPJV algorithm to find the best match between multiple
patches of the two sets.
5. Output results.
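
For context, here is my rough reading of steps 3 and 4 as a simplified
C sketch. This is only a paraphrase of how I think range-diff.c fills
the matrix, not the real code; a_match/b_match, size_a/size_b and
diff[] below are made-up stand-ins for the real data structures:

#include <stdlib.h>

#define COST_MAX (1 << 16)	/* stand-in for the constant in linear-assignment.h */

/*
 * a_match[i] / b_match[j] are -1 unless step 2 found an exact match,
 * size_a[i] / size_b[j] are the sizes of the individual patches, and
 * diff[i * b_nr + j] is the size of the diff between patch a[i] and
 * patch b[j].
 */
static int *build_cost_matrix(int a_nr, int b_nr,
                              const int *a_match, const int *b_match,
                              const int *size_a, const int *size_b,
                              const int *diff, int creation_factor)
{
    int n = a_nr + b_nr;
    int *cost = malloc((size_t)n * n * sizeof(int));
    int i, j;

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            int c;

            if (i < a_nr && j < b_nr) {
                if (a_match[i] == j)
                    c = 0;                  /* exact match found in step 2 */
                else if (a_match[i] < 0 && b_match[j] < 0)
                    c = diff[i * b_nr + j]; /* size of the diff between the two patches */
                else
                    c = COST_MAX;           /* one side already has an exact match */
            } else if (i < a_nr) {
                /* dummy column: "a[i] was dropped" */
                c = a_match[i] < 0 ?
                    size_a[i] * creation_factor / 100 : COST_MAX;
            } else if (j < b_nr) {
                /* dummy row: "b[j] was newly added" */
                c = b_match[j] < 0 ?
                    size_b[j] * creation_factor / 100 : COST_MAX;
            } else {
                c = 0;                      /* dummy paired with dummy costs nothing */
            }
            cost[i * n + j] = c;
        }

    return cost;    /* step 4 feeds this n x n matrix to the LAPJV solver */
}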

My question is:
1. In step 3, why is the matrix size (a.nr + b.nr) * (a.nr + b.nr)
instead of a.nr * b.nr?

2. Why is the cost(x, y) for x ∈ [a.nr, a.nr + b.nr), y ∈ [0, b.nr) or
x ∈ [0, a.nr), y ∈ [b.nr, b.nr + a.nr) set to
"diffsize(a or b) * creation_factor / 100 : COST_MAX"? What is the
role of creation_factor? [1]

3. How does LAPJV work here? [2]

[1]: https://github.com/git/git/blob/725f57037d81e24eacfda6e59a19c60c0b4c8062/range-diff.c#L310
[2]: https://github.com/git/git/blob/725f57037d81e24eacfda6e59a19c60c0b4c8062/linear-assignment.c#L15

Thanks.
--
ZheNing Hu


* Re: Question: How range-diff lapjv algorithm work
From: Martin Ågren @ 2023-03-08  7:47 UTC (permalink / raw)
  To: ZheNing Hu; +Cc: Git List, Johannes Schindelin, Junio C Hamano, t.gummerer

On Wed, 8 Mar 2023 at 07:50, ZheNing Hu <adlternative@gmail.com> wrote:
>
> My question is:
> 1. In step 3, why is the matrix size (a.nr + b.nr) * (a.nr + b.nr)
> instead of a.nr * b.nr?

There's some explanation of that in the man page for `git range-diff`,
under "ALGORITHM". Look for "To explain also new commits, we introduce
dummy nodes on both sides:".

> 2. Why is the cost(x, y) for x ∈ [a.nr, a.nr + b.nr), y ∈ [0, b.nr)
> or x ∈ [0, a.nr), y ∈ [b.nr, b.nr + a.nr) set to
> "diffsize(a or b) * creation_factor / 100 : COST_MAX"? What is the
> role of creation_factor? [1]

The `--creation-factor` command line option is also described in the
manpage.  There was a thread on the mailing list with various
discussions around this creation factor a while back. Maybe there's
something interesting there [1].

> 3. How does LAPJV work here? [2]
>
> [1]: https://github.com/git/git/blob/725f57037d81e24eacfda6e59a19c60c0b4c8062/range-diff.c#L310
> [2]: https://github.com/git/git/blob/725f57037d81e24eacfda6e59a19c60c0b4c8062/linear-assignment.c#L15

This appears to be based on work by Jonker and Volgenant. Maybe
searching for those names online could find something. Maybe not
git-specific, but even if it's just the general algorithm as such, it
might be possible to find different explanations of the algorithm.

I haven't really studied this algorithm, but since it's faster than the
Hungarian algorithm, I could imagine that either

  * it's super-useful to first understand the Hungarian algorithm, then
    understand why and how the Jonker-Volgenant algorithm does better,
    or,

  * it's completely useless to first understand the Hungarian algorithm,
    since they're so different.

:-)

[1] https://lore.kernel.org/git/1196830250.20220726145447@yandex.ru/

Martin


* Re: Question: How range-diff lapjv algorithm work
From: ZheNing Hu @ 2023-03-09 10:38 UTC (permalink / raw)
  To: Martin Ågren
  Cc: Git List, Johannes Schindelin, Junio C Hamano, t.gummerer

Martin Ågren <martin.agren@gmail.com> wrote on Wed, 8 Mar 2023 at 15:47:


>
> On Wed, 8 Mar 2023 at 07:50, ZheNing Hu <adlternative@gmail.com> wrote:
> >
> > My question is:
> > 1. In step 3, why is the matrix size (a.nr + b.nr) * (a.nr + b.nr)
> > instead of a.nr * b.nr?
>
> There's some explanation of that in the man page for `git range-diff`,
> under "ALGORITHM". Look for "To explain also new commits, we introduce
> dummy nodes on both sides:".
>

Thanks, I understand now why the side length of the matrix is "a.nr + b.nr".
A patch in one collection may have no matching patch in the other
collection, and this unmatched case ("o--C" in the documentation) must
also be assigned a cost.

> > 2. Why is the cost(x, y) for x ∈ [a.nr, a.nr + b.nr), y ∈ [0, b.nr)
> > or x ∈ [0, a.nr), y ∈ [b.nr, b.nr + a.nr) set to
> > "diffsize(a or b) * creation_factor / 100 : COST_MAX"? What is the
> > role of creation_factor? [1]
>
> The `--creation-factor` command line option is also described in the
> manpage.  There was a thread on the mailing list with various
> discussions around this creation factor a while back. Maybe there's
> something interesting there [1].
>

I understand it now. The cost of the unmatched assignment "o--C",
"1--o" is generally greater than the cost of matching two completely
different patches, "1--C", "o--o". The creation factor reduces the
cost of "o--C" and "1--o" so that "o--C", "1--o" becomes the chosen
matching.

> > 3. How does LAPJV work here? [2]
> >
> > [1]: https://github.com/git/git/blob/725f57037d81e24eacfda6e59a19c60c0b4c8062/range-diff.c#L310
> > [2]: https://github.com/git/git/blob/725f57037d81e24eacfda6e59a19c60c0b4c8062/linear-assignment.c#L15
>
> This appears to be based on work by Jonker and Volgenant. Maybe
> searching for those names online could find something. Maybe not
> git-specific, but even if it's just the general algorithm as such, it
> might be possible to find different explanations of the algorithm.
>
> I haven't really studied this algorithm, but since it's faster than the
> Hungarian algorithm, I could imagine that either
>
>   * it's super-useful to first understand the Hungarian algorithm, then
>     understand why and how the Jonker-Volgenant algorithm does better,
>     or,
>
>   * it's completely useless to first understand the Hungarian algorithm,
>     since they're so different.
>
> :-)
>

Ah, I had a look at the Hungarian algorithm earlier; since it is the
most typical algorithm for the linear assignment problem, it can still
be understood. I have not read the paper by Jonker and Volgenant yet,
but I will try to read it later.

> [1] https://lore.kernel.org/git/1196830250.20220726145447@yandex.ru/
>
> Martin

Thanks for the answer!

--
ZheNing Hu


* Re: Question: How range-diff lapjv algorithm work
From: Johannes Schindelin @ 2023-03-28 13:22 UTC (permalink / raw)
  To: ZheNing Hu; +Cc: Martin Ågren, Git List, Junio C Hamano, t.gummerer


Hi,

On Thu, 9 Mar 2023, ZheNing Hu wrote:

> Martin Ågren <martin.agren@gmail.com> wrote on Wed, 8 Mar 2023 at 15:47:
>
> > On Wed, 8 Mar 2023 at 07:50, ZheNing Hu <adlternative@gmail.com> wrote:
> > >
> > > 3. How does LAPJV work here? [2]
> > >
> > > [1]: https://github.com/git/git/blob/725f57037d81e24eacfda6e59a19c60c0b4c8062/range-diff.c#L310
> > > [2]: https://github.com/git/git/blob/725f57037d81e24eacfda6e59a19c60c0b4c8062/linear-assignment.c#L15
> >
> > This appears to be based on work by Jonker and Volgenant. Maybe
> > searching for those names online could find something. Maybe not
> > git-specific, but even if it's just the general algorithm as such, it
> > might be possible to find different explanations of the algorithm.
> >
> > I haven't really studied this algorithm, but since it's faster than the
> > Hungarian algorithm, I could imagine that either
> >
> >   * it's super-useful to first understand the Hungarian algorithm, then
> >     understand why and how the Jonker-Volgenant algorithm does better,
> >     or,
> >
> >   * it's completely useless to first understand the Hungarian algorithm,
> >     since they're so different.
> >
> > :-)
> >
>
> Ah, I had a look at the Hungarian algorithm earlier; since it is the
> most typical algorithm for the linear assignment problem, it can still
> be understood. I have not read the paper by Jonker and Volgenant yet,
> but I will try to read it later.

Oh wow, what a blast from the past.

As I had briefly mentioned in
https://lore.kernel.org/git/nycvar.QRO.7.76.6.1805032227520.77@tvgsbejvaqbjf.bet/,
the code I contributed in
https://lore.kernel.org/git/3f51970cbc44bfe34133c48c0844ed3723e83808.1525361419.git.johannes.schindelin@gmx.de/
is a port from some Java code I had written earlier in my life which was
based on the Pascal code provided in the cited paper.

My recollection has faded somewhat in the past 9 years. But maybe the
following helps, or provides at least some entertainment.

My first version of a "linear assignment problem" solver _did_ implement
the Hungarian algorithm, and I cannot even find it anymore; it must have
been in 2010 or 2011. I only remember that the O(n^4) runtime was
prohibitively expensive: I wanted to automatically track cells in
microscope recordings as they migrated over a plate of agar. I _seem_ to
remember that it was kind of okay for up to a couple dozen, but we wanted
to track hundreds, even thousands of cells, for extended amounts of time.
We _could_ do it, kind of, on a ridiculously expensive machine with 128GB
(I _think_), but naturally time on that machine was limited.

So I set out to implement the Munkres & Kuhn algorithm (see
https://github.com/fiji/fiji/blob/7d683d2ad46d22ec2e0bdffe1aa02072ab62e359/src-plugins/TrackMate_/fiji/plugin/trackmate/tracking/hungarian/MunkresKuhnAlgorithm.java)
because it promised O(n^3) complexity, and it did allow us to track the
number of cells we were studying at the time.

It took a _week_ to wrap my head around that paper, even though being
somewhat familiar with the Hungarian algorithm probably helped. You will see
extensive notes in the initial code comment that I wrote for my future
self if I ever needed to understand the algorithm _again_. Reading over
those notes after more than a decade, I realize that while they are
helpful for _me_, they probably help others only in conjunction with some
quality time studying the paper.

However, later on, our tests revealed that the Munkres & Kuhn algorithm,
while cubic in complexity, was still much slower than the Jonker &
Volgenant algorithm. The latter is also cubic in complexity, but with a
lower constant factor, and I could implement it much more quickly than
the Munkres & Kuhn algorithm because I could essentially cheat a bit by
studying the paper while taking frequent peeks at the code. That also
explains why I did not take extensive notes this time and only added a
terse code comment; the commit message is not much better, either:
https://github.com/trackmate-sc/TrackMate/commit/e306a02f76a6b6e383c0d32efaba565aca01a773

My tests must have been quite convincing back then because I already
switched to Jonker & Volgenant on the same day that I finished the
implementation:
https://github.com/trackmate-sc/TrackMate/commit/ad5012827417a40673901a37af44904a2de73095

_Years_ later, I wanted to run `git tbdiff` and couldn't, because it
uses not only Python but also `numpy`. And `numpy` is largely native
code, not Python, so I had the choice of either getting `numpy` to build
on Windows or reimplementing `tbdiff` in pure C as a new Git built-in; I
chose the latter (because it seemed much easier). And since `tbdiff` used
the `hungarian` package (https://pypi.org/project/hungarian/) that
implements Munkres' algorithm (which I assumed was a slight variation of
Munkres & Kuhn, with the same performance characteristics), I undusted my
Java code and ported it. When I say "undusted", I lie, because it is still
in use today. But it basically went unchanged for years because it is bug
free ;-)

So now that you know about the provenance of the code we're discussing, I
have to admit that I can contribute only very little to understanding it,
other than providing a couple of pointers:

https://en.wikipedia.org/wiki/Hungarian_algorithm has a pretty good, if
quite mathematical, description of the algorithm. (And now that I read it,
I realize that I probably inadvertently attributed Tomizawa's improvements
to Munkres & Kuhn.)

It is amazingly hard to find good explanations on the 'net that aren't
using hard-core mathematical symbols, but I think I found one that
illustrates how the algorithm works:
https://medium.com/@rajneeshtiwari_22870/linear-assignment-problem-in-metric-learning-for-computer-vision-eba7d637c5d4
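
If it helps as a starting point: the first phase of the algorithm, the
"column reduction", is simple enough to read in isolation. Here is a
paraphrase (my own variable names, not the exact code in
linear-assignment.c, and only the first of several phases):

static void column_reduction(int n, const int *cost,
                             int *row2col, int *col2row, int *v)
{
    int i, j;

    for (i = 0; i < n; i++)
        row2col[i] = -1;
    for (j = 0; j < n; j++)
        col2row[j] = -1;

    /* For every column, remember the cheapest row's cost as the
     * column's dual value v[j], and greedily assign the column to that
     * row if the row is still free. The later phases (reduction
     * transfer, augmenting row reduction, shortest augmenting paths)
     * clean up whatever this greedy pass could not assign. */
    for (j = n - 1; j >= 0; j--) {
        int best = 0;

        for (i = 1; i < n; i++)
            if (cost[i * n + j] < cost[best * n + j])
                best = i;
        v[j] = cost[best * n + j];

        if (row2col[best] < 0) {
            row2col[best] = j;
            col2row[j] = best;
        }
    }
}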

Hopefully that helps.

Ciao,
Johannes

