git@vger.kernel.org mailing list mirror (one of many)
* Comments on Presentation Notes Request.
@ 2009-01-06 22:33 Tim Visher
  2009-01-07  6:36 ` Jeff King
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Tim Visher @ 2009-01-06 22:33 UTC
  To: git

Hello Everyone,

I'm putting together a little 15 minute presentation for my company
regarding SCMSes in an attempt to convince them to at the very least
use a Distributed SCMS and at best to use git.  I put together all my
notes, although I didn't put together the actual presentation yet.  I
figured I'd post them here and maybe get some feedback about it.  Let
me know what you think.

Thanks in advance!

Notes
---------

SCM: Distributed, Centralized, and Everything in Between.

* What is SCM and Why is it Useful?

** Definition

SCM is the practice of committing the revisions of your source code to
a system that can faithfully reproduce historical snapshots of your
source code.

** Advantages of SCM

*** One Source to Rule Them All.

Instead of having a bunch of source files spread across multiple
developers' machines, with multiple versions on each machine that may or
may not be labeled correctly, you have one repository containing every
artifact that your project includes.

*** Unlimited Undo/Redo.

Not only is it unlimited, but it's random access.  If you changed a
function a week ago, continued to work, and then decide that you want
the function back the way it was, it's as simple as pulling the
function back out of the SCMS.
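
In Git, for example, this is a couple of commands (the path and
<commit> are placeholders):

$ git log -- src/foo.c                  # find the commit from a week ago
$ git checkout <commit> -- src/foo.c    # pull that version of the file back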

*** Safe Concurrent Editing.

Many people can edit the same code base at the same time and know,
without a doubt, that when they pull all those changes together, the
system will merge the content intelligently or inform them of the
conflict and let them merge it.  You don't need to lock files.
Obviously, if there is bad coordination then the possibility of
conflicts rises, but this should not happen regularly.

*** Diff Debugging

You can find where a bug was introduced by learning how to reproduce
the bug and then doing a binary chop search back through the history
to come to the exact commit that introduced the bug.
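
In Git, the binary chop is automated by `git bisect`.  A sketch, where
<good-commit> is any revision you know worked:

$ git bisect start
$ git bisect bad                  # the current revision exhibits the bug
$ git bisect good <good-commit>   # this older one did not
# git checks out the midpoint; build, test, and answer...
$ git bisect good                 # ...or 'git bisect bad', repeating until
                                  # git names the commit that broke it
$ git bisect reset                # return to where you started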

* SCM Best Practices

** Commit Early, Commit Often

The more you commit, the more fine-grained control you have over the
undo feature of SCM.  Most documents that I have read suggested a TDD
approach wherein you commit whenever you have written just enough code
for your test to pass. But...

** Don't Commit Broken Code (To the Public Tree)

Of primary concern is the fact that your central HEAD should _always_
build.  This is why practices like Continuous Integration and TDD are
so important.  TDD gives you the freedom to be sure that a change you
made hasn't broken anything you weren't expecting it to break.
Continuous Integration allows you to be sure that your whole system
will build every time.  Thus, you should _never_ commit broken code to
the (public) tree.

Of course, in a centralized system, committing is intrinsically
public.  Even on branches, every time you commit any sort of change,
everyone is able to see it and so you could be breaking the build for
someone (even if it's just yourself and the build system).  One of the
nice features of a distributed system is that your public/private
ontology is much richer and thus allows you to have broken code in
your SCMS.

** Whole Hog

You should put everything necessary to build your system into SCM.
This includes user documentation, requirements documentation, software
tools, build tools, etc.  The only artifacts that don't need to be
managed are auto-generated artifacts such as javadocs, jar files, exe
files, etc.  This is so that you can reproduce entire releases using
only a simple checkout.
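
In Git, for instance, the generated artifacts are kept out of the
repository with ignore patterns (an illustrative .gitignore):

$ cat .gitignore      # generated artifacts are rebuilt, not versioned
*.jar
*.exe
doc/javadoc/
build/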

** Perform Updates and Commits on the Whole Tree

Updates and Commits should always be done on the whole tree so that
you're sure you have the latest source.  Never assume that nothing has
changed elsewhere.

** Allow and Encourage Customer Participation

Most shops seem to attempt to funnel customer participation through
the developers.  This is a cache miss for many operations such as
developing the user manual by a design team external to the
development team.  Basic operations such as commit and update are
fairly simple to grasp and can even be simplified further through
scripts and other such tools that non-developers can quickly be taught
to use.

Of note is the Tortoise family of tools which integrate directly into
Windows Explorer.  This makes it fairly easy for anyone who is
familiar with Windows Explorer to get into using any of the tools that
there is a Tortoise implementation for.

* The Centralized Model

** We Know About This One

This is traditional, plain vanilla, ubiquitous SCM.

The great majority of the SCMSes out there are centralized.

Closely resembles the Client/Server system model.

** Work Flow

<http://whygitisbetterthanx.com/#any-workflow>

*** 2 basic models: 'Lock, Modify, Unlock' and 'Copy, Modify, Merge'.

Older systems were primarily Lock, Modify, Unlock implementations.
You would checkout a file that you intended to work on, and no one
else would be able to check it out until you unlocked it, signaling
that you were done editing it.  This is inherently inefficient as on a
team of developers, the chances that two are working on the exact same
part of a system without knowing it and coordinating are fairly low.
Also, any disparate features that still touch the same files in the
system cannot be worked on simultaneously.

The answer to this is Copy, Modify, Merge.  In this system, every
developer gets a complete copy of the HEAD.  Everyone changes the HEAD
concurrently.  When commits happen, the system attempts to
intelligently merge them.  If it fails (usually doesn't happen unless
there is bad coordination), then it asks you to merge them.  This has
been proven to work well.

** Key Properties

*** Only One Repository

In centralized systems, there is only one global, public repository.
This has certain significant effects, such as an intrinsically global
name space for branches and tags, a restrictive public/private concept
(no such thing as committed but private), need for a backup process
aware of the possibility of in-progress commits, etc.

Since the repository only exists in a single location, the developers
only have copies of a specific revision and any uncommitted changes
they've made to that copy.

*** All Committed Changes Are Public

This includes regular commits (what we'd typically think of as
commits), branches, and tags.

As previously mentioned, in centralized systems, all committed changes
are public.  Even if you are working on a private branch (which you
typically wouldn't be because branches are expensive in centralized
systems), the changes you are making are still visible publicly
because your branch exists in the global, public repository.

*** Intrinsically Uses the Network.

Because you must have a single repository that all developers are
accessing, you must use the network for many common operations.
Commits must be made to the central repository, logs live centrally,
branches live centrally, diffing between revisions is a network
operation, blaming is a network operation, etc.

*** Backup Becomes A Separate Process

Because there is only a single repository, you need a back-up strategy
or else you are exposing yourself to a single point of failure.
Unfortunately, this is not as simple as it sounds.  The global, public
nature of the repository makes the chances of creating a corrupt backup
very high.  Because of this, tools have grown up around and in many
centralized systems that automate the process of backing it up while
remaining aware of the problems that can arise.  However, the point
remains that there is no intrinsic backup of a centralized system.

*** Need A Repository Admin.

Because the system is centralized, you need a repository
administrator.  This is true in most modern centralized systems where
new repositories are created on a per-project basis (as in, not VSS).
In other words, when you want a new repository, you need to go through
some sort of admin interface or through the administrator of the
repository server to make it happen.

* The Distributed Model

** This One's New

At least new as in unfamiliar.  The concept is over a decade old.

There are a few different popular distributed SCMSes (Git, Mercurial
(hg), Bazaar (bzr), Bitkeeper)

Very closely resembles a peer-to-peer network and the organic
relationships that evolve in that space.

In a distributed system, there is no one point where all development
comes together for any reason other than policy.  Everyone who is
working on a system intrinsically has their own copy of the entire
repository.  All of the history, all of the source code, all of the
public branches, all of the public tags, etc.  Because of this,
developers can also have private branches, private tags, private
commits, private history.  The distinction between public and private
is very important in this context.  This has several distinct features
which I'll go into now.

** Work Flow (Pick Your Poison)

<http://whygitisbetterthanx.com/#any-workflow>

** Key Properties

*** Private/Public Concept

Distributed SCMSes' private/public ontology is __much__ richer.
Whereas in a central system, private means only what you have yet to
commit or what you are leaving untracked, in a distributed system,
private means anything that you have not yet _chosen_ to make public.
In other words, you can have private branches, private tags, private
committed changes to your copy of the head, etc.  Anything that you do
not specifically publish to a location that others can access is
intrinsically private.

In other words, you can finally SCM your sandbox!  You can commit as
many broken things as you want to a private repository, giving you the
ability to have a nearly infinite set of undoable and recoverable
changes, without breaking anyone else's build.  Or, you can just as
easily ignore TDD, never commit anything for 3 weeks and then do a
big, massive commit and as long as your final product is tested and
merges with the rest of the tree, you're good to go and no one cares.

Because you have a rich ontology for private/public data, you can also
do crazy things like rewriting your local history before anyone else
sees it.  Because your repository is the only one that has to know
about the history as long as you're dealing with private data, this is
a completely safe (although policy debatable) operation.  Of course,
once data has been published, you really shouldn't mess with its
history anymore.
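
In Git terms, a sketch of what that looks like (safe only while the
commits involved are still unpublished):

$ git rebase -i HEAD~4    # reorder, squash, or reword the last 4 commits
$ git commit --amend      # or just fold a fix-up into the newest commit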

*** Network(less)

In distributed systems, networks are optional for almost every
operation (and indeed, every operation prior to publishing).  Of
course, you could put your repository on a network drive and then
you'd be doing everything over the network like you would in a
centralized system, but if you put your repository clone on your local
system, then everything you do in that repository is local.  Viewing
your history, committing, branching, merging, everything.

Once you've published, however, not much changes.  Almost everything
except updating and publishing (_not_ committing) remains local.
Remember that committing no longer means publicly publishing.  You can
commit many revisions, even to the master HEAD and nothing at all has
been published until you push those changes to your public HEAD.
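
A sketch of the split, in Git commands (branch and remote names are
just examples):

# All local -- these work on a plane:
$ git log
$ git diff HEAD~3
$ git commit -a -m 'work in progress'
$ git branch experiment
$ git merge experiment
# Only publishing touches the network:
$ git push origin master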

*** Natural Backup

Because every developer has a copy of the repository, every developer
you add adds an extra failure point.  The more developers you have,
the more backups you have of the repository.

*** Must Learn New Work Flows.

In order to fully experience the advantages of distributed systems,
new work flows must be learned.  In other words, it's possible to use
distributed systems nearly the exact same way as you use a centralized
system (you just need to learn new commands), but you don't get many
of the benefits except the speed improvements.  The real game change
happens when you realize that you can keep things private until they're
finished.  Once you realize that, new branching patterns emerge, new
work flows happen, you commit more often, and have the ability to
become much looser and freer in your development process.

*** Impossible To Completely Enforce A Single, Canonical
Representation of the Code Base.

By nature, a distributed system cannot enforce a single canonical
representation of the code base except by policy, and policies can
always be broken.  Also, any intentionally private data is not backed
up because it is not shared.  However, backup becomes much simpler
because you know that no one else is committing to your repository.

This bears some explanation.  Within a distributed system, you can
have a single official release point that everyone has blessed (or the
company has blessed, or the original developer has blessed, or
whatever).  However, you cannot _stop_ someone else from making a
release point because their repository is just as valid as yours.  You
cannot _stop_ developers from sharing code between themselves without
going out to the official central location.  All you can do is ask
them not to.

* Why Git is the Best Choice

** Fast

Git's implementation just happens to be wickedly fast.  It's faster
than mercurial, it's faster than bazaar, etc.  Everything, committing,
merging, viewing history, branching, and even updating and pushing
are all faster.

** Tracks Content, not Files

Git tracks content, not files, and it's the only SCMS at the moment
that does this.  This has many effects internally, but the most
apparent effect I know of is that for the first time Git can easily
tell you the history of even a function in a file because Git can tell
you which files that function existed (or does exist) in over the
course of development.

** Extremely Efficient.

Because Git tracks content, it can also be extremely efficient
space-wise, simplifying the files to be nothing but pointers to a set
of objects in Git's internal file system.  Thus, if you have
duplicated hunks, git uses a single object to represent them.  Git has
been proven to be more efficient space-wise than any other system out
there.

** (Un)Staged Changes

Git employs the concept of the Index or Cache or Commit Stage.  This
is also unique to Git, and it's pretty strange for developers coming
from a system without it.

Basically, there are 4 states that any content can be in under Git.

1. Untracked: This is content that Git is completely unaware of.
2. Tracked but Unstaged: This is content that has changed that Git is
aware of but will not commit on the next commit command.
3. Tracked and Staged: This is the same as unstaged except that this
content will be committed on the next commit.
4. Tracked and Committed:  This is content that has not changed since
the previous commit that Git is aware of.

This is very powerful yet somewhat awkward to grasp.  Basically, the
upshot of this feature is that you can manually build commits if you
want to.  Say you were working on feature foo and then made some other
changes because you came across feature bar and thought it would be
quick to do.  In any other system, the only way you could commit parts
of what you'd changed is if you were lucky enough for the disparate
changes to be in different files.  In that case, you could commit only
the files that you wanted to change for the different features.
However, if you made disparate changes to the same file, you were
stuck.  In Git, you can stage only parts of the files to an extreme
degree.  This allows you to create as many commits as you want out of
a single change set until the whole change set is committed.

I've found this to be particularly useful when working with an
existing code base that was not properly formatted.  Often, I'll come
to a file that has a bunch of wonky white space choices and improperly
indented logical constructs and I'll just quickly run through it
correcting that stuff before continuing with the feature I was working
on.  Afterwards, I'll stage the formatting and commit it, and then
stage the feature I was working on and commit that.  You may not want
that kind of control (and if you don't, you don't need to use it), but
I like it.
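
In Git, that workflow looks something like this (the file name is
invented for the example):

$ git add -p foo.c    # interactively stage only the whitespace hunks
$ git commit -m 'Clean up whitespace and indentation'
$ git add foo.c       # now stage the remaining (feature) hunks
$ git commit -m 'Implement the feature'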

** Excellent Merge algorithms

Git has excellent merge algorithms.  This is widely attributed and
doesn't require much explanation.  It was one of Git's original design
goals, and it has been proven by Git's implementation.  Merging in Git
is _much_ less painful than in other systems.

** Has powerful 'maintainer tools'

Beyond the basics of committing, updating, pushing, viewing logs,
etc., Git is known to have very powerful maintainer-level tools.  You
can modify your history, you can automatically perform binary searches
to locate errors, you can communicate via patches, it's highly
customizable, has the concept of submodules (projects within
projects), etc.  It gets complicated, but at this level of SCM it is
complicated.
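
A small taste of those tools (illustrative; the patch file name is just
what format-patch would produce):

$ git format-patch origin/master   # one mailable patch file per commit
$ git am 0001-some-fix.patch       # apply a mailed patch, authorship intact
$ git submodule status             # check on projects within the project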

** Cryptographically Guarantees Content

One of the most surprising things I learned as I was researching this
was that most SCMSes do not guarantee that your content does not get
corrupted.  In other words, if the repository's disk doesn't fail but
instead just gets corrupted, you'll never know unless you actually
notice the corruption in the files.  If you have memory corruption
locally and commit your changes, you just won't know.

Git guarantees absolutely that if corruption happens, you will know
about it.  It does this by creating SHA-1 hashes of your content and
then checking to make sure that the SHA-1 hash does not change for an
object.  The details of this aren't as important as the fact that Git
is one of the very few systems that do this and it's obviously
desirable.
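
The gist, sketched: an object's name _is_ the SHA-1 of its content, so
corrupted content no longer matches its own name, and verification
catches it.

$ echo 'hello' | git hash-object --stdin   # the name is computed from content
ce013625030ba8dba906f756967f9e9ca394464a
$ git fsck --full                          # re-hash and verify every object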

* References

- <http://git-scm.com/> - The Git homepage
- <http://whygitisbetterthanx.com/> - An excellent resource explaining
the benefits of using git in relation to other common SCMSes
- <http://www.youtube.com/watch?v=4XpnKHJAok8> - Linus Torvalds's
Google Talk on Git.  Covers mainly what Git is Not and Why
Distribution is the model that works.
- <http://www.youtube.com/watch?v=8dhZ9BXQgc4> - Randal Schwartz's
Google Talk on Git.  Covers what Git is, including some implementation
details, some use case scenarios, and the like.
- <http://book.git-scm.com/> - A community written book on Git with
video tutorials about many of Git's features.
- <http://subversion.tigris.org/> - Subversion's homepage.  An
extremely popular open source centralized system.
- <http://svnbook.red-bean.com/> - Rolling publish book on Subversion.
 Chapter 1 is a good introduction to general centralized SCM concepts
and principles.
- <http://www.perforce.com/perforce/bestpractices.html> - An excellent
set of best practices from the Perforce team.  Some of it (especially
the branches) has a distinct centralized lean, but most of it is quite
good.
- <http://www.bobev.com/PresentationsAndPapers/Common%20SCM%20Patterns.pdf>
- Interesting presentation by Pretzel Logic from 2001 attempting to
outline some common SCM best practices as Patterns.

---------------
End Notes

-- 

In Christ,

Timmy V.

http://burningones.com/
http://five.sentenc.es/ - Spend less time on e-mail


* Re: Comments on Presentation Notes Request.
  2009-01-06 22:33 Comments on Presentation Notes Request Tim Visher
@ 2009-01-07  6:36 ` Jeff King
  2009-01-07 22:30   ` Daniel Barkalow
  2009-01-07  8:33 ` david
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2009-01-07  6:36 UTC
  To: Tim Visher; +Cc: git

On Tue, Jan 06, 2009 at 05:33:02PM -0500, Tim Visher wrote:

> ** Advantages of SCM
> *** One Source to Rule Them All.
> *** Unlimited Undo/Redo.
> *** Safe Concurrent Editing.
> *** Diff Debugging

I would add to this metadata and "software archeology": finding the
author of a change or piece of code, the motivation behind it, related
changes (by position within history, by content, or by commit message),
etc.

I think people who have not used an SCM before, and people coming from
SCMs where it is painful to look at history (like CVS) undervalue this
because it's not part of their workflow.  But having used git for a few
years now, it is an integral part of how I develop (especially when
doing maintenance or bugfixes).

You touch on this in "Diff Debugging", but I think bisection is just a
part of it.

> * SCM Best Practices
>
> ** Commit Early, Commit Often
> ** Don't Commit Broken Code (To the Public Tree)

People talk a lot about using their SCM on a plane, but I think these
two seemingly opposite commands highlight the _real_ useful thing about
a distributed system for most people: commit and publish are two
separate actions.

So I think it might be better to say "Commit Early, Commit Often" but
"Don't _Publish_ Broken Code". Which is what you end up saying in the
discussion, but I think using that terminology makes clear the important
distinction between two actions that are conflated in centralized
systems.

> *** Backup Becomes A Separate Process
> Because there is only a single repository, you need a back-up strategy
> or else you are exposing yourself to a single point of failure.
> [...]
> *** Natural Backup
> Because every developer has a copy of the repository, every developer
> you add adds an extra failure point.  The more developers you have,
> the more backups you have of the repository.

The "natural backup" thing gets brought out a lot for DVCS. And it is
sort of true: instead of each developer having a backup of the latest
version (or some recent version which they checked out), they have a
backup of the whole history. But they still might not have everything.
Developers might not clone all branches. They might not be up to date
with some "master" repository. Useful work might be unpublished in the
master repo (e.g., I am working on feature X which is 99% complete, but
not ready for me to merge into master and push).

So yes, you are much more likely to salvage useful (if not all) data
from developer repositories in the event of a crash. But I still think
it's crazy not to have a backup strategy for your DVCS repo.

> ** Fast
> 
> Git's implementation just happens to be wickedly fast.  It's faster
> than mercurial, it's faster than bazaar, etc.  Everything, committing,
> merging, viewing history, branching, and even updating and pushing
> are all faster.

A lot of people say "So what? System X is fast enough for me already."
And I used to be one of them. But one point I have made in similar talks
is that it isn't just about shaving a few seconds off your task. It's
about being able to ask fundamentally different questions because they
can be answered in seconds, not minutes or hours. I haven't benchmarked,
but I shudder at the thought of pickaxe (git log -S), code movement in
blame, or bisecting in CVS.
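
For instance (the function name is picked just for illustration):

$ git log -S'parse_object'     # every commit whose diff adds or removes
                               # that string, across all of history
$ git blame -M -C revision.c   # line-by-line authorship that follows
                               # code moved from somewhere else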

> ** Excellent Merge algorithms
> 
> Git has excellent merge algorithms.  This is widely attributed and
> doesn't require much explanation.  It was one of Git's original design
> goals, and it has been proven by Git's implementation.  Merging in Git
> is _much_ less painful than in other systems.

Actually, git has a really _stupid_ merge algorithm that has been around
forever: the 3-way merge. And by stupid I don't mean bad, but just
simple and predictable. I think the git philosophy is more about making
it easy to merge often, and about making sure conflicts are simple to
understand and fix, than it is about being clever.

Which isn't to say there aren't systems with less clever merge
algorithms. CVS doesn't even do a 3-way merge, since it doesn't bother
to remember where the last branch intersection was.
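
The three inputs are just ours, theirs, and the common ancestor, which
git digs up itself.  A sketch (branch names invented):

$ git merge-base master topic   # the base revision, found automatically
$ git merge topic
# Any hunk changed on only one side relative to the base is taken
# as-is; hunks changed on both sides become conflicts to hand back.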

BTW, I think Junio's 2006 OLS talk has some nice pictures of a 3-way
merge which help to explain it (see slides 23-32):

  http://members.cox.net/junkio/200607-ols.pdf


That's just my two cents from skimming over your notes. Hope it helps.

-Peff


* Re: Comments on Presentation Notes Request.
  2009-01-06 22:33 Comments on Presentation Notes Request Tim Visher
  2009-01-07  6:36 ` Jeff King
@ 2009-01-07  8:33 ` david
  2009-01-07 16:11   ` Tim Visher
  2009-01-08  0:14 ` Daniel Barkalow
  2009-01-09 13:50 ` Jakub Narebski
  3 siblings, 1 reply; 10+ messages in thread
From: david @ 2009-01-07  8:33 UTC
  To: Tim Visher; +Cc: git

On Tue, 6 Jan 2009, Tim Visher wrote:

> *** Natural Backup
>
> Because every developer has a copy of the repository, every developer
> you add adds an extra failure point.  The more developers you have,
> the more backups you have of the repository.

this needs to be re-worded. 'extra failure point' can be read to mean
redundancy in what would otherwise be a single point of failure, but it
can also mean another point where things can fail.

something like 'every developer adds an extra layer of redundancy' would 
be much less ambiguous.

David Lang


* Re: Comments on Presentation Notes Request.
  2009-01-07  8:33 ` david
@ 2009-01-07 16:11   ` Tim Visher
  0 siblings, 0 replies; 10+ messages in thread
From: Tim Visher @ 2009-01-07 16:11 UTC
  To: david; +Cc: git


Thanks for the suggestions so far.  I've updated the notes.

@Peff: Thanks especially for pointing me towards Junio's
presentation.  That's an excellent source.

Here's the patch for your suggestions:

diff --git a/scmOutline.txt b/scmOutline.txt
index 1791fa0..d25198c 100644
--- a/scmOutline.txt
+++ b/scmOutline.txt
@@ -1,4 +1,4 @@
-SCM: Distributed, Centralized, and Everything in Between.
+SCM: Centralized, Distributed, and Everything in Between.
 
 * What is SCM and Why is it Useful?
 
@@ -20,7 +20,11 @@ Not only is it unlimited, but it's random access.  If you changed a function a w
 
 Many people can edit the same code base at the same time and know, without a doubt, that when they pull all those changes together, the system will merge the content intelligently or inform them of the conflict and let them merge it.  You don't need to lock files.  Obviously, if there is bad coordination then the possibility of conflicts rises, but this should not happen regularly.
 
-*** Diff Debugging
+*** Software Archeology
+
+With a proper SCMS, it becomes a somewhat trivial operation to discover the author and reasons for a given change.  This is because of the rich metadata associated with commits (author, date, complete change set, diffs, and commentary).  So rather than wandering around asking if anyone remembers doing something and why, you simply commit that information into the system and then refer to it when you need to.
+
+**** Diff Debugging
 
 You can find where a bug was introduced by learning how to reproduce the bug and then doing a binary chop search back through the history to come to the exact commit that introduced the bug.
 
@@ -30,11 +34,11 @@ You can find where a bug was introduced by learning how to reproduce the bug and
 
 The more you commit, the more fine-grained control you have over the undo feature of SCM.  Most documents that I have read suggested a TDD approach wherein you commit whenever you have written just enough code for your test to pass. But...
 
-** Don't Commit Broken Code (To the Public Tree)
+** Don't _Publish_ Broken Code
 
 Of primary concern is the fact that your central HEAD should _always_ build.  This is why practices like Continuous Integration and TDD are so important.  TDD gives you the freedom to be sure that a change you made hasn't broken anything you weren't expecting it to break.  Continuous Integration allows you to be sure that your whole system will build every time.  Thus, you should _never_ commit broken code to the (public) tree.
 
-Of course, in a centralized system, committing is intrinsically public.  Even on branches, every time you commit any sort of change, everyone is able to see it and so you could be breaking the build for someone (even if it's just yourself and the build system).  One of the nice features of a distributed system is that your public/private ontology is much richer and thus allows you to have broken code in your SCMS.
+Of course, in a centralized system, committing is intrinsically public.  Even on branches, every time you commit any sort of change, everyone is able to see it and so you could be breaking the build for someone (even if it's just yourself and the build system).  One of the nice features of a distributed system is that your public/private ontology is much richer and thus allows you to have broken code in your SCMS, so long as you haven't published it, at no penalty to anyone but yourself.
 
 ** Whole Hog
 
@@ -130,7 +134,9 @@ Once you've published, however, not much changes.  Almost everything except upda
 
 *** Natural Backup
 
-Because every developer has a copy of the repository, every developer you add adds an extra failure point.  The more developers you have, the more backups you have of the repository.
+Because every developer has a copy of the repository, every developer you add adds an extra layer of redundancy.  The more developers you have, the more backups you have of the repository.
+
+An important point to make clear here is that you are only backing up what everyone is duplicating.  If you have 10 unpublished branches that no one else has cloned, then those are obviously not backed up.  However, the idea here would be that anything that is being developed actively by multiple people is backed up by as many developers.  Other than that, your private data must be backed up by you (which is what you do anyway, right? ;).
 
 *** Must Learn New Work Flows.
 
@@ -148,6 +154,8 @@ This bears some explanation.  Within a distributed system, you can have a single
 
 Git's implementation just happens to be wickedly fast.  It's faster than mercurial, it's faster than bazaar, etc.  Everything, committing, merging, viewing history, branching, and even updating and pushing are all faster.
 
+This is much more important than just shaving a few seconds off the operations.  Because Git is so much faster, you begin to do things differently.  Git's blazing fast branching and merging wouldn't matter at all if you never branched and merged (which is possible), but because they're blazing fast you _should_ begin to branch and merge much more often, which __does__ fundamentally change the way you develop your code (hopefully for the better).
+
 ** Tracks Content, not Files
 
 Git tracks content, not files, and it's the only SCMS at the moment that does this.  This has many effects internally, but the most apparent effect I know of is that for the first time Git can easily tell you the history of even a function in a file because Git can tell you which files that function existed (or does exist) in over the course of development.
@@ -171,9 +179,9 @@ This is very powerful yet somewhat awkward to grasp.  Basically, the upshot of t
 
 I've found this to be particularly useful when working with an existing code base that was not properly formatted.  Often, I'll come to a file that has a bunch of wonky white space choices and improperly indented logical constructs and I'll just quickly run through it correcting that stuff before continuing with the feature I was working on.  Afterwards, I'll stage the formatting and commit it, and then stage the feature I was working on and commit that.  You may not want that kind of control (and if you don't, you don't need to use it), but I like it.
 
-** Excellent Merge algorithms
+** Stupid but _Fast_ Merge Algorithms
 
-Git has excellent merge algorithms.  This is widely attributed and doesn't require much explanation.  It was one of Git's original design goals, and it has been proven by Git's implementation.  Merging in Git is _much_ less painful than in other systems.
+Merging in Git is _much_ less painful than in other systems.  This is mainly because of how fast it is and how much data it remembers when it does a merge.  As opposed to CVS, which can't merge a branch twice because it doesn't remember where the last merge happened, Git keeps track of that information so you can merge between branches as much as you want.  Git's philosophy is to make merging as fast and painless as possible so that you merge early and often enough to not develop really bad conflicts that are nearly impossible to resolve.
 
 ** Has powerful 'maintainer tools'
 
@@ -196,3 +204,4 @@ Git guarantees absolutely that if corruption happens, you will know about it.  I
 - <http://svnbook.red-bean.com/> - Rolling publish book on Subversion.  Chapter 1 is a good introduction to general centralized SCM concepts and principles.
 - <http://www.perforce.com/perforce/bestpractices.html> - An excellent set of best practices from the Perforce team.  Some of it (especially the branches) has a distinct centralized lean, but most of it is quite good.
 - <http://www.bobev.com/PresentationsAndPapers/Common%20SCM%20Patterns.pdf> - Interesting presentation by Pretzel Logic from 2001 attempting to outline some common SCM best practices as Patterns.
+- <http://members.cox.net/junkio/200607-ols.pdf> - A presentation by Junio Hamano (the Git maintainer) at a Linux symposium on what Git is, with some tutorials.


I've also attached it as a file.  It was generated by `git diff -p`.

I'm also looking for anyplace where I'm technically inaccurate.
Unfortunately, I've written a lot of this from things that I've either
read or heard.  I'm mainly experienced with VSS and Subversion (and
both of those to a very small degree), and making a lot of progress
with Git.  I've kind of been swept away by all the energy surrounding
git right now, though, so I'm sure my judgement is somewhat clouded.

Thanks again for your help!

-- 

In Christ,

Timmy V.

http://burningones.com/
http://five.sentenc.es/ - Spend less time on e-mail



* Re: Comments on Presentation Notes Request.
  2009-01-07  6:36 ` Jeff King
@ 2009-01-07 22:30   ` Daniel Barkalow
  2009-01-07 22:40     ` Boyd Stephen Smith Jr.
  2009-01-08  9:56     ` Jeff King
  0 siblings, 2 replies; 10+ messages in thread
From: Daniel Barkalow @ 2009-01-07 22:30 UTC
  To: Jeff King; +Cc: Tim Visher, git

On Wed, 7 Jan 2009, Jeff King wrote:

> On Tue, Jan 06, 2009 at 05:33:02PM -0500, Tim Visher wrote:
> 
> > ** Advantages of SCM
> > *** One Source to Rule Them All.
> > *** Unlimited Undo/Redo.
> > *** Safe Concurrent Editing.
> > *** Diff Debugging
> 
> I would add to this metadata and "software archeology": finding the
> author of a change or piece of code, the motivation behind it, related
> changes (by position within history, by content, or by commit message),
> etc.

If you look at the git source code, the comments in the code are almost 
never sufficient to really understand the code, because a full 
line-by-line explanation would make it hard to find the code under the 
comments. On the other hand, if you take "git blame" in one window and a 
series of "git show"s in another window, and look at the commit messages 
for the commits that introduced each of those lines, you get really 
detailed and in-depth documentation of the subtle changes.
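
Concretely, something like this (file name picked for illustration):

$ git blame revision.c   # window 1: which commit introduced each line?
$ git show <commit-id>   # window 2: that commit's full message and diff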

> I think people who have not used an SCM before, and people coming from
> SCMs where it is painful to look at history (like CVS) undervalue this
> because it's not part of their workflow.  But having used git for a few
> years now, it is an integral part of how I develop (especially when
> doing maintenance or bugfixes).
> 
> You touch on this in "Diff Debugging", but I think bisection is just a
> part of it.
> 
> > * SCM Best Practices
> >
> > ** Commit Early, Commit Often
> > ** Don't Commit Broken Code (To the Public Tree)
> 
> People talk a lot about using their SCM on a plane, but I think these
> two seemingly opposite commands highlight the _real_ useful thing about
> a distributed system for most people: commit and publish are two
> separate actions.
> 
> So I think it might be better to say "Commit Early, Commit Often" but
> "Don't _Publish_ Broken Code". Which is what you end up saying in the
> discussion, but I think using that terminology makes clear the important
> distinction between two actions that are conflated in centralized
> systems.
> 
> > *** Backup Becomes A Separate Process
> > Because there is only a single repository, you need a back-up strategy
> > or else you are exposing yourself to a single point of failure.
> > [...]
> > *** Natural Backup
> > Because every developer has a copy of the repository, every developer
> > you add adds an extra failure point.  The more developers you have,
> > the more backups you have of the repository.
> 
> The "natural backup" thing gets brought out a lot for DVCS. And it is
> sort of true: instead of each developer having a backup of the latest
> version (or some recent version which they checked out), they have a
> backup of the whole history. But they still might not have everything.
> Developers might not clone all branches. They might not be up to date
> with some "master" repository. Useful work might be unpublished in the
> master repo (e.g., I am working on feature X which is 99% complete, but
> not ready for me to merge into master and push).

It is the case that everything in the central repo (including speculative 
stuff) will also be on its author's machine, with the metadata needed to 
identify that it's not in the main history and how everything is supposed 
to be arranged. This is likely to be particularly helpful for the work 
that everybody did between the last backup and the crash.

> So yes, you are much more likely to salvage useful (if not all) data
> from developer repositories in the event of a crash. But I still think
> it's crazy not to have a backup strategy for your DVCS repo.

I think it's very important to have a backup strategy, but it's nice that 
the developers can get work done while the server is still down.

> > ** Excellent Merge algorithms
> > 
> > Git has excellent merge algorithms.  This is widely attributed and
> > doesn't require much explanation.  It was one of Git's original design
> > goals, and it has been proven by Git's implementation.  Merging in Git
> > is _much_ less painful than in other systems.
> 
> Actually, git has a really _stupid_ merge algorithm that has been around
> forever: the 3-way merge. And by stupid I don't mean bad, but just
> simple and predictable. I think the git philosophy is more about making
> it easy to merge often, and about making sure conflicts are simple to
> understand and fix, than it is about being clever.

Git is clever about finding the 3 inputs to the 3-way merge, particularly 
the common ancestor of commits that don't have a common ancestor. I think 
merge-recursive is novel to git, and may not be available anywhere else.

> Which isn't to say there aren't systems with less clever merge
> algorithms. CVS doesn't even do a 3-way merge, since it doesn't bother
> to remember where the last branch intersection was.

CVS did do 3-way merge, but only between your uncommitted changes, the 
latest commit, and the common ancestor (the commit that you started 
changing). IIRC, arch actually didn't support 3-way merge at all.

	-Daniel
*This .sig left intentionally blank*


* Re: Comments on Presentation Notes Request.
  2009-01-07 22:30   ` Daniel Barkalow
@ 2009-01-07 22:40     ` Boyd Stephen Smith Jr.
  2009-01-08  0:28       ` Daniel Barkalow
  2009-01-08  9:56     ` Jeff King
  1 sibling, 1 reply; 10+ messages in thread
From: Boyd Stephen Smith Jr. @ 2009-01-07 22:40 UTC
  To: Daniel Barkalow; +Cc: git


On Wednesday 2009 January 07 16:30:04 Daniel Barkalow wrote:
> Git is clever about finding [...]
> the common ancestor of commits that don't have a common ancestor.

*confused*

Please elaborate.
-- 
Boyd Stephen Smith Jr.                     ,= ,-_-. =. 
bss@iguanasuicide.net                     ((_/)o o(\_))
ICQ: 514984 YM/AIM: DaTwinkDaddy           `-'(. .)`-' 
http://iguanasuicide.net/                      \_/     



* Re: Comments on Presentation Notes Request.
  2009-01-06 22:33 Comments on Presentation Notes Request Tim Visher
  2009-01-07  6:36 ` Jeff King
  2009-01-07  8:33 ` david
@ 2009-01-08  0:14 ` Daniel Barkalow
  2009-01-09 13:50 ` Jakub Narebski
  3 siblings, 0 replies; 10+ messages in thread
From: Daniel Barkalow @ 2009-01-08  0:14 UTC
  To: Tim Visher; +Cc: git

On Tue, 6 Jan 2009, Tim Visher wrote:

> Hello Everyone,
> 
> I'm putting together a little 15 minute presentation for my company
> regarding SCMSes in an attempt to convince them to at the very least
> use a Distributed SCMS and at best to use git.  I put together all my
> notes, although I didn't put together the actual presentation yet.  I
> figured I'd post them here and maybe get some feedback about it.  Let
> me know what you think.
> 
> Thanks in advance!
> 
> Notes
> ---------
> 
> SCM: Distributed, Centralized, and Everything in Between.
> 
> * SCM Best Practices
> 
> ** Allow and Encourage Customer Participation
> 
> Most shops seem to attempt to funnel customer participation through
> the developers.  This is a cache miss for many operations such as
> developing the user manual by a design team external to the
> development team.  Basic operations such as commit and update are
> fairly simple to grasp and can even be simplified further through
> scripts and other such tools that non-developers can quickly be taught
> to use.
> 
> Of note is the Tortoise family of tools which integrate directly into
> Windows Explorer.  This makes it fairly easy for anyone who is
> familiar with Windows Explorer to get into using any of the tools that
> there is a Tortoise implementation for.

I still want an office software package with "commit" instead of "save" 
(when in a repository), and a mail program with "push" instead of "attach" 
and "fetch" instead of "open". (See below)

I think that the sales department should be using distributed version 
control, neatly packaged up.

> * The Centralized Model
> 
> ** We Know About This One
> 
> This is traditional, plain vanilla, ubiquitous SCM.
> 
> The great majority of the SCMSes out there are centralized.
> 
> Closely resembles the Client/Server system model.
> 
> ** Work Flow
> 
> <http://whygitisbetterthanx.com/#any-workflow>
> 
> *** 2 basic models: 'Lock, Modify, Unlock' and 'Copy, Modify, Merge'.
> 
> Older systems were primarily Lock, Modify, Unlock implementations.
> You would checkout a file that you intended to work on, and no one
> else would be able to check it out until you unlocked it, signaling
> that you were done editing it.  This is inherently inefficient as on a
> team of developers, the chances that two are working on the exact same
> part of a system without knowing it and coordinating are fairly low.
> Also, any disparate features that still touch the same files in the
> system cannot be worked on simultaneously.
> 
> The answer to this is Copy, Modify, Merge.  In this system, every
> developer gets a complete copy of the HEAD.  Everyone changes the HEAD
> concurrently.  When commits happen, the system attempts to
> intelligently merge them.  If it fails (usually doesn't happen unless
> there is bad coordination), then it asks you to merge them.  This has
> been proven to work well.

Git is almost unique in that, at the point where the user is asked to do a 
merge, the user's work is already preserved.

That is, most systems are: Copy, Modify, Merge, Commit. Git is: Copy, 
Modify, Commit, Merge.
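
In command form, the difference is just the ordering (a sketch):

# Git: the work is safely committed before the merge ever starts.
$ git commit -a -m 'my changes'
$ git pull origin master   # fetch + merge; a botched merge can't eat the commit
# Most centralized systems force the opposite: update (merging into
# your uncommitted working copy), and only then commit.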

> * The Distributed Model
> 
> ** This One's New
> 
> At least new as in unfamiliar.  The concept is over a decade old.

In some fundamental ways, this actually resembles the "broadcast email" 
collaboration method. That is, a group is writing a document. Someone 
writes a skeleton, and emails it to everybody else. They make changes to 
different sections. When each person has changed something, they email the 
full document to everybody else. Before people send out their 
versions, they check their email and (painfully) merge the changes into 
what they've done.

This evolved into having a certain location to avoid the painful merge, 
and then to version control. Distributed systems go back to this model, 
except without the "(painfully)" and with all the other benefits of 
version control.

> There are a few different popular distributed SCMSes (Git, Mercurial
> (hg), Bazaar (bzr), Bitkeeper)
> 
> Very closely resembles a peer-to-peer network and the organic
> relationships that evolve in that space.
> 
> In a distributed system, there is no one point where all development
> comes together for any reason other than policy.  Everyone who is
> working on a system intrinsically has their own copy of the entire
> repository.  All of the history, all of the source code, all of the
> public branches, all of the public tags, etc.  Because of this,
> developers can also have private branches, private tags, private
> commits, private history.  The distinction between public and private
> is very important in this context.  This has several distinct features
> which I'll go into now.
> 
> ** Work Flow (Pick Your Poison)
> 
> <http://whygitisbetterthanx.com/#any-workflow>
> 
> ** Key Properties
> 
> *** Private/Public Concept
> 
> Distributed SCMSes' private/public ontology is __much__ richer.
> Whereas in a central system, private means only what you have yet to
> commit or what you are leaving untracked, in a distributed system,
> private means anything that you have not yet _chosen_ to make public.
> In other words, you can have private branches, private tags, private
> committed changes to your copy of the head, etc.  Anything that you do
> not specifically publish to a location that others can access is
> intrinsically private.
> 
> In other words, you can finally SCM your sandbox!  You can commit as
> many broken things as you want to a private repository, giving you the
> ability to have a nearly infinite set of undoable and recoverable
> changes, without breaking anyone else's build.  Or, you can just as
> easily ignore TDD, never commit anything for 3 weeks and then do a
> big, massive commit and as long as your final product is tested and
> merges with the rest of the tree, you're good to go and no one cares.

Although you'll be really sad if you accidentally wipe out your work after 
2 1/2 weeks...

> Because you have a rich ontology for private/public data, you can also
> do crazy things like rewriting your local history before anyone else
> sees it.  Because your repository is the only one that has to know
> about the history as long as you're dealing with private data, this is
> a completely safe (although policy debatable) operation.  Of course,
> once data has been published, you really shouldn't mess with its
> history anymore.

You can also see this as writing a new history. If you knew starting out 
everything that you knew when you finished, you might do things 
differently, and the results would likely be more useful. Writing a new 
history lets you start over from where you started, while being able to 
refer to the final working state that you came up with.

> *** Must Learn New Work Flows.
> 
> In order to fully experience the advantages of distributed systems,
> new work flows must be learned.  In other words, it's possible to use
> distributed systems nearly the exact same way as you use a centralized
> system (you just need to learn new commands), but you don't get many
> of the benefits except the speed improvements.  The real game change
> happens when you realize that you can keep things private until they're
> finished.  Once you realize that, new branching patterns emerge, new
> work flows happen, you commit more often, and have the ability to
> become much looser and freer in your development process.

My experience bringing git to a small company is that people don't need to 
learn new workflows. They can go on with their old workflows and develop 
new ones as they streamline their work. The one exception is really that 
they have to be told that, in git, you commit before merging instead of 
merging before committing.

> *** Impossible To Completely Enforce A Single, Canonical
> Representation of the Code Base.
> 
> By nature, a distributed system cannot enforce a single canonical
> representation of the code base except by policy, and policies can
> always be broken.  Also, any intentionally private data is not backed
> up because it is not shared.  However, backup becomes much simpler
> because you know that no one else is committing to your repository.
> 
> This bears some explanation.  Within a distributed system, you can
> have a single official release point that everyone has blessed (or the
> company has blessed, or the original developer has blessed, or
> whatever).  However, you cannot _stop_ someone else from making a
> release point because their repository is just as valid as yours.  You
> cannot _stop_ developers from sharing code between themselves without
> going out to the official central location.  All you can do is ask
> them not to.

And you might not want to ask them not to. It's really nice to be able to 
reassign a developer to a different task and pass that developer's 
incomplete and not-ready-for-prime-time work to somebody else.
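
That hand-off doesn't need the central server at all.  A sketch, with
hypothetical machine and branch names:

$ git fetch ssh://alice-box/home/alice/project wip-parser
$ git checkout -b wip-parser FETCH_HEAD    # continue Alice's unfinished work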

> * Why Git is the Best Choice
> 
> ** (Un)Staged Changes
> 
> Git employs the concept of the Index or Cache or Commit Stage.  This
> is also unique to Git, and it's pretty strange for developers coming
> from a system without it.
> 
> Basically, there are 4 states that any content can be in under Git.
> 
> 1. Untracked: This is content that Git is completely unaware of.
> 2. Tracked but Unstaged: This is changed content that Git is aware of
> but will not commit on the next commit command.
> 3. Tracked and Staged: This is the same as unstaged except that this
> content will be committed on the next commit.
> 4. Tracked and Committed:  This is content that Git is aware of and
> that has not changed since the previous commit.

1, 4, and something in between are normal; the only extra is 
distinguishing 2 and 3.
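
A single file can even be in states 2 and 3 at once.  A short sketch:

$ git add file.c                 # current contents staged (state 3)
$ echo "/* more */" >> file.c    # newer edits are unstaged (state 2)
$ git diff --cached              # what the next commit will contain
$ git diff                       # what it will not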

> This is very powerful yet somewhat awkward to grasp.  Basically, the
> upshot of this feature is that you can manually build commits if you
> want to.  Say you were working on feature foo and then made some other
> changes because you came across feature bar and thought it would be
> quick to do.  In any other system, the only way you could commit parts
> of what you'd changed is if you were lucky enough for the disparate
> changes to be in different files.  In that case, you could commit only
> the files that you wanted to change for the different features.
> However, if you made disparate changes to the same file, you were
> stuck.  In Git, you can stage only parts of the files to an extreme
> degree.  This allows you to create as many commits as you want out of
> a single change set until the whole change set is committed.

It's pretty common for a system to support:

$ (sys) commit <filenames...>

At its core, the index just lets you tell git about those files on 
multiple command lines instead of just one. And it lets you make 
unincluded changes after you give it a file but before you commit. And it 
lets you fabricate the contents that you're putting in. But really, it's 
about being able to list the things to include one-by-one. (Well, really, 
it's about being able to make 100 commits of a 30000-file project in under 
a second, but that's just the original inspiration.)
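
You don't even have to split by whole file; "git add -p" walks through a
file hunk by hunk and asks which ones to stage.  A sketch of the split
described in the next paragraph:

$ git add -p file.c                 # stage only the formatting hunks
$ git commit -m "whitespace fixes"
$ git add file.c                    # stage what's left
$ git commit -m "feature work"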

> I've found this to be particularly useful when working with an
> existing code base that was not properly formatted.  Often, I'll come
> to a file that has a bunch of wonky white space choices and improperly
> indented logical constructs and I'll just quickly run through it
> correcting that stuff before continuing with the feature I was working
> on.  Afterwards, I'll stage the formatting and commit it, and then
> stage the feature I was working on and commit that.  You may not want
> that kind of control (and if you don't, you don't need to use it), but
> I like it.
> 
> ** Cryptographically Guarantees Content
> 
> One of the most surprising things I learned as I was researching this
> was that most SCMSes do not guarantee that your content does not get
> corrupted.  In other words, if the repository's disk doesn't fail but
> instead just gets corrupted, you'll never know unless you actually
> notice the corruption in the files.  If you have memory corruption
> locally and commit your changes, you just won't know.
>
> Git guarantees absolutely that if corruption happens, you will know
> about it.  It does this by creating SHA-1 hashes of your content and
> then checking to make sure that the SHA-1 hash does not change for an
> object.  The details of this aren't as important as the fact that Git
> is one of the very few systems that do this, and it's obviously
> desirable.

You can still get a situation where the content gets corrupted before it 
gets into git, and git happily tracks your corrupt content. But that's 
pretty obvious.
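
The checking is also available on demand; an object's name is a pure
function of its content:

$ echo hello | git hash-object --stdin    # prints the SHA-1 of that content
$ git fsck --full                         # re-hashes and verifies every object

Any bit flip in the object store makes the recomputed hash disagree with
the object's name, and fsck reports it.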

* Re: Comments on Presentation Notes Request.
  2009-01-07 22:40     ` Boyd Stephen Smith Jr.
@ 2009-01-08  0:28       ` Daniel Barkalow
  0 siblings, 0 replies; 10+ messages in thread
From: Daniel Barkalow @ 2009-01-08  0:28 UTC (permalink / raw
  To: Boyd Stephen Smith Jr.; +Cc: git

On Wed, 7 Jan 2009, Boyd Stephen Smith Jr. wrote:

> On Wednesday 2009 January 07 16:30:04 Daniel Barkalow wrote:
> > Git is clever about finding [...]
> > the common ancestor of commits that don't have a common ancestor.
> 
> *confused*
> 
> Please elaborate.

I meant to say "a *unique* closest common ancestor". The clever trick is 
that, if there are multiple common ancestors which aren't closer than each 
other, you can merge those ancestors (based, recursively, on their common 
ancestors) to generate a new commit with merge conflicts in it. You then 
pretend that this commit is the unique common ancestor for 3-way merge. 
This works because the merge conflicts in the commit all seem to have been 
replaced in each branch, and the conflict region is just some arbitrary 
chunk of text in between other context. The 3-way merge output doesn't show 
the original text (which would be weird junk in this case: a merge conflict 
that didn't really happen, in the middle of other merge conflicts), only 
the text from the two sides being merged, so it's not necessary to resolve 
the old merge that didn't happen.
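
You can see the ingredients of this directly.  Assuming a criss-crossed
pair of branches:

$ git merge-base --all master topic    # may print several candidate ancestors
$ git merge -s recursive topic         # the default strategy: merge the
                                       # candidates first, then use the result

That intermediate merge of the candidates is why the strategy is called
"recursive".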

I think all of the other systems, if you have crossing history such that 
there isn't a unique common ancestor, do one of: (a) give up, (b) generate 
conflicts between your change as it stayed in your branch and the same 
change as it went out and came back, or (c) mishandle some cases involving 
reverts.

	-Daniel
*This .sig left intentionally blank*

* Re: Comments on Presentation Notes Request.
  2009-01-07 22:30   ` Daniel Barkalow
  2009-01-07 22:40     ` Boyd Stephen Smith Jr.
@ 2009-01-08  9:56     ` Jeff King
  1 sibling, 0 replies; 10+ messages in thread
From: Jeff King @ 2009-01-08  9:56 UTC (permalink / raw
  To: Daniel Barkalow; +Cc: Tim Visher, git

On Wed, Jan 07, 2009 at 05:30:04PM -0500, Daniel Barkalow wrote:

> > So yes, you are much more likely to salvage useful (if not all) data
> > from developer repositories in the event of a crash. But I still think
> > it's crazy not to have a backup strategy for your DVCS repo.
> 
> I think it's very important to have a backup strategy, but it's nice that 
> the developers can get work done while the server is still down.

I think everything you said in your email was correct, and I agree with
it, but I just wanted to clarify one thing about what I said.

I really _do_ think you are better off in a disaster or backup situation
with a DVCS. Both this past year and in 2007, Junio dropped off the face of
the git planet for a few weeks, and everyone seamlessly switched to
Shawn as maintainer. So I think of the DVCS model almost more as "high
availability": even if you model your workflow around a central server,
it's easy to route around the failure.
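
Routing around the failure can be a one-line configuration change.
Assuming a hypothetical fallback mirror:

$ git config remote.origin.url git://mirror.example.org/project.git
$ git fetch origin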

It's just that I don't think these features totally _replace_ backups as
a concept. And I feel like that notion creeps up now and again in the
centralized versus distributed holy wars.

So I think we agree; I just wasn't sure if I gave the wrong impression
from my first email.

-Peff

* Re: Comments on Presentation Notes Request.
  2009-01-06 22:33 Comments on Presentation Notes Request Tim Visher
                   ` (2 preceding siblings ...)
  2009-01-08  0:14 ` Daniel Barkalow
@ 2009-01-09 13:50 ` Jakub Narebski
  3 siblings, 0 replies; 10+ messages in thread
From: Jakub Narebski @ 2009-01-09 13:50 UTC (permalink / raw
  To: Tim Visher; +Cc: git

"Tim Visher" <tim.visher@gmail.com> writes:

> Hello Everyone,
> 
> I'm putting together a little 15 minute presentation for my company
> regarding SCMSes in an attempt to convince them to at the very least
> use a Distributed SCMS and at best to use git.  I put together all my
> notes, although I didn't put together the actual presentation yet.  I
> figured I'd post them here and maybe get some feedback about it.  Let
> me know what you think.
> 
> Thanks in advance!

Take a look at the following links:
 * "Understanding Version-Control Systems (DRAFT)" by Eric Raymond
   http://www.catb.org/esr/writings/version-control/version-control.html
 * "Version Control Habits of Effective Developers" at The Daily Build
   http://blog.bstpierre.org/version-control-habits

Note that the first one is DRAFT; on the other hand it explains
lock-edit, merge-then-commit, and commit-then-merge workflows quite
well, and has a host of links.
   
-- 
Jakub Narebski
Poland
ShadeHawk on #git
