Part 2: Checkouts, Branching, & Merging¶
This section deals with navigating repository history, creating & merging branches, and understanding conflicts
Creating A Branch¶
The Hangar workflow is intended to mimic common git workflows in which small incremental changes are made and committed on dedicated topic branches. Once a topic has been adequately developed, the topic branch is merged into a separate branch (commonly referred to as master, though it need not be the actual branch named "master"), where well-vetted and more permanent changes are kept.
Create Branch -> Checkout Branch -> Make Changes -> Commit
Let’s initialize a new repository and see how branching works in Hangar
from hangar import Repository
import numpy as np
repo = Repository(path='/Users/rick/projects/tensorwerk/hangar/dev/mnist/')
repo.init(user_name='Rick Izzo', user_email='rick@tensorwerk.com', remove_old=True)
Hangar Repo initialized at: /Users/rick/projects/tensorwerk/hangar/dev/mnist/__hangar
'/Users/rick/projects/tensorwerk/hangar/dev/mnist/__hangar'
When a repository is first initialized, it has no history, no commits.
repo.log() # -> returns None
Though the repository is essentially empty at this point in time, there is one thing which is present: a branch with the name "master".
repo.list_branch_names()
['master']
This "master" branch is the branch we make our first commit on; until we do, the repository is in a semi-unstable state and will generally refuse to perform otherwise standard operations.
Since the only option available at this point in time is to create a write-enabled checkout of this "master" branch so we can add data and make a commit, let’s do that now.
co = repo.checkout(write=True)
As expected, there are no datasets or metadata samples recorded in the checkout
print(f'number of metadata keys: {len(co.metadata)}')
print(f'number of datasets: {len(co.datasets)}')
number of metadata keys: 0
number of datasets: 0
Let’s add a dummy array just to put something in the repository, and we will commit & close the checkout.
dummy = np.arange(10, dtype=np.uint16)
dset = co.datasets.init_dataset(name='dummy_dataset', prototype=dummy)
dset['0'] = dummy
initialCommitHash = co.commit('first commit with a single sample added to a dummy dataset')
co.close()
Dataset Specification:: Name: dummy_dataset, Initialization style: prototype, Shape: (10,), DType: uint16, Samples Named: True, Variable Shape: False, Max Shape: (10,)
Dataset Initialized: dummy_dataset
Commit operation requested with message: first commit with a single sample added to a dummy dataset
(288, 222, 288)
removing all stage hash records
Commit completed. Commit hash: b21ebbeeece723bf7aa2157eb2e8742a043df7d0
writer checkout of master closed
If we check the history now, we can see our first commit hash, and that it is labeled with the branch name "master":
repo.log()
* b21ebbeeece723bf7aa2157eb2e8742a043df7d0 (master) : first commit with a single sample added to a dummy dataset
So now our repository contains:
- A commit: a fully independent description of the entire repository state as it existed at some point in time. A commit is identified by a commit_hash.
- A branch: a label pointing to a particular commit / commit_hash.
Once committed, it is not possible to remove, modify, or otherwise tamper with the contents of a commit in any way. It is a permanent record, which Hangar has no method to change once written to disk.
In addition, because a commit_hash is calculated not only from the commit’s contents but also from the commit_hash of its parents (more on this to follow), knowing a single top-level commit_hash allows us to verify the integrity of the entire repository history. This fundamental behavior holds even in cases of disk corruption or malicious use.
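To make the chaining idea concrete, here is a minimal sketch in which each commit hash is a SHA-1 digest of the commit's contents together with the hashes of its parents. The function name `toy_commit_hash` and the record layout are hypothetical and are not Hangar's actual hashing scheme; the sketch only illustrates why altering any ancestor changes every descendant hash.

```python
import hashlib
import json


def toy_commit_hash(contents, parent_hashes):
    """Digest of a commit's contents plus its parents' hashes (illustrative only)."""
    payload = json.dumps(
        {"contents": contents, "parents": sorted(parent_hashes)},
        sort_keys=True,
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# Build a two-commit history.
child_contents = {"dummy_dataset": {"0": [0, 1, 2], "1": [1, 2, 3]}}
root = toy_commit_hash({"dummy_dataset": {"0": [0, 1, 2]}}, [])
child = toy_commit_hash(child_contents, [root])

# Tamper with the root contents, then recompute the chain.
tampered_root = toy_commit_hash({"dummy_dataset": {"0": [9, 9, 9]}}, [])
tampered_child = toy_commit_hash(child_contents, [tampered_root])

# The child's own contents are unchanged, yet its hash no longer matches.
history_is_intact = (child == tampered_child)
```

Because the parent hash is part of the digest input, comparing a single trusted top-level hash against a recomputed one is enough to detect a change anywhere earlier in the chain.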
All about Checkouts¶
Checking out a branch/commit for reading is the process of retrieving records describing repository state at some point in time, and setting up access to the referenced data.
Any number of read checkout processes can operate on a repository (on any number of commits) at the same time.
Checking out a branch for writing is the process of setting up a (mutable) staging area to temporarily gather record references / data before all changes have been made and the contents of the staging area are committed in a new commit.
Only one write-enabled checkout can ever be operating in a repository at a time
When initially creating the checkout, the staging area is not actually “empty”. Instead, it has the full contents of the last commit referenced by a branch’s HEAD. These records can be removed, mutated, or added to in any way to form the next commit. The new commit retains a permanent reference identifying the previous HEAD commit that was used as its base staging area.
On commit, the branch which was checked out has its HEAD pointer value updated to the new commit’s commit_hash. A write-enabled checkout starting from the same branch will now use that commit’s record content as the base for its staging area.
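The relationship between a branch HEAD, the staging area, and a commit can be sketched with a toy in-memory model. Everything here (the `ToyRepo` class, its method names, and the `c0`, `c1` commit ids) is hypothetical and is not how Hangar stores records; it only mirrors the behavior described above.

```python
import copy


class ToyRepo:
    """Toy model: commits are frozen snapshots, branches map a name to a commit id."""

    def __init__(self):
        self.commits = {}   # commit id -> {"records": ..., "parent": ...}
        self.branches = {}  # branch name -> commit id (the branch HEAD)
        self._next_id = 0

    def checkout_for_write(self, branch):
        """Return a mutable staging area seeded from the branch HEAD's records."""
        head = self.branches.get(branch)
        if head is None:
            return {}
        return copy.deepcopy(self.commits[head]["records"])

    def commit(self, branch, staging):
        """Freeze the staging area as a new commit and move the branch HEAD to it."""
        commit_id = f"c{self._next_id}"
        self._next_id += 1
        self.commits[commit_id] = {
            "records": copy.deepcopy(staging),
            "parent": self.branches.get(branch),  # previous HEAD used as the base
        }
        self.branches[branch] = commit_id
        return commit_id


toy = ToyRepo()
stage = toy.checkout_for_write("master")   # empty: no commits yet
stage["0"] = "first sample"
first = toy.commit("master", stage)

stage = toy.checkout_for_write("master")   # starts with the records of `first`
stage["1"] = "second sample"
second = toy.commit("master", stage)
```

The second staging area begins with the records of the first commit, and the second commit records the first as its parent, which is the same bookkeeping the paragraphs above describe.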
Creating a branch¶
A branch must always have a name and a base_commit. However, if no base_commit is specified, the current writer branch HEAD commit is used as the base_commit hash for the branch.
branch_1 = repo.create_branch(branch_name='testbranch')
branch_1
'testbranch'
Viewing the log, we see that a new branch named testbranch is pointing to our initial commit:
print(f'branch names: {repo.list_branch_names()} \n')
repo.log()
branch names: ['master', 'testbranch']
* b21ebbeeece723bf7aa2157eb2e8742a043df7d0 (master) (testbranch) : first commit with a single sample added to a dummy dataset
If instead we specify the base commit explicitly (with a different branch name), we get a third branch pointing to the same commit as "master" and "testbranch":
branch_2 = repo.create_branch(branch_name='new', base_commit=initialCommitHash)
branch_2
'new'
repo.log()
* b21ebbeeece723bf7aa2157eb2e8742a043df7d0 (master) (new) (testbranch) : first commit with a single sample added to a dummy dataset
Making changes on a branch¶
Let’s make some changes on the "new" branch to see how things might change.
co = repo.checkout(write=True, branch_name='new')
We can see that the data we added previously is still here (dummy dataset containing one sample labeled 0).
co.datasets
Hangar Datasets
Writeable: True
Dataset Names:
- dummy_dataset
co.datasets['dummy_dataset']
Hangar DatasetDataWriter
Dataset Name : dummy_dataset
Schema UUID : d82cddc07e0211e9a08a8c859047adef
Schema Hash : 43edf7aa314c
Variable Shape : False
(max) Shape : (10,)
Datatype : <class 'numpy.uint16'>
Named Samples : True
Access Mode : a
Num Samples : 1
co.datasets['dummy_dataset']['0']
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint16)
Let’s add another sample to the dummy_dataset called 1:
arr = np.arange(10, dtype=np.uint16)
# let's increment values so that `0` and `1` aren't set to the same thing
arr += 1
co.datasets['dummy_dataset']['1'] = arr
We can see that in this checkout there are indeed two samples in the dummy_dataset:
len(co.datasets['dummy_dataset'])
2
That’s all, let’s commit this and be done with this branch
co.commit('commit on `new` branch adding a sample to dummy_dataset')
co.close()
Commit operation requested with message: commit on new branch adding a sample to dummy_dataset
(350, 255, 350)
removing all stage hash records
Commit completed. Commit hash: 0cdd8c833f654d18ddc2b089fabee93c32c9c155
writer checkout of new closed
How do changes appear when made on a branch?¶
If we look at the log, we see that the branch we were on (new) is a commit ahead of master and testbranch:
repo.log()
* 0cdd8c833f654d18ddc2b089fabee93c32c9c155 (new) : commit on new branch adding a sample to dummy_dataset
* b21ebbeeece723bf7aa2157eb2e8742a043df7d0 (master) (testbranch) : first commit with a single sample added to a dummy dataset
The meaning is exactly what one would intuit. We made some changes, and they were reflected on the new branch, but the master and testbranch branches were not impacted at all, nor were any of the commits!
Merging (Part 1) Fast-Forward Merges¶
Say we like the changes we made on the new branch so much that we want them to be included in our master branch! How do we make this happen in this scenario?
Well, the history between the HEAD of the "new" branch and the HEAD of the "master" branch is perfectly linear. In fact, when we began making changes on "new", our staging area was identical to what the "master" HEAD commit references right now!
If you remember that a branch is just a pointer which assigns some name to a commit_hash, it becomes apparent that a merge in this case really doesn’t involve any work at all. With a linear history between "master" and "new", the commits existing along the path between the HEAD of "new" and the HEAD of "master" are the only changes which are introduced, and we can be sure that this is the only view of the data records which can exist!
What this means in practice is that for this type of merge, we can just update the HEAD of "master" to point to the HEAD of "new", and the merge is complete.
This situation is referred to as a Fast Forward (FF) Merge. A FF merge is safe to perform any time a linear history lies between the HEAD of some topic and base branch, regardless of how many commits or changes were introduced.
For other situations, a more complicated Three Way Merge is required. This merge method will be explained a bit more later in this tutorial.
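The fast-forward condition can be sketched as a simple ancestry check: if the base branch HEAD is reachable from the topic branch HEAD by following parent links, the merge is only a pointer update. The helper names `ancestors` and `can_fast_forward`, the `parents` mapping, and the shortened hashes used as labels are all illustrative assumptions, not part of Hangar's API.

```python
def ancestors(parents, commit):
    """Return `commit` plus every commit reachable from it via parent links."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return seen


def can_fast_forward(parents, base_head, topic_head):
    """A fast-forward is possible when the base HEAD is an ancestor of the topic HEAD."""
    return base_head in ancestors(parents, topic_head)


# History from the tutorial so far, with shortened hashes as labels.
parents = {"b21ebbe": [], "0cdd8c8": ["b21ebbe"]}
branches = {"master": "b21ebbe", "new": "0cdd8c8", "testbranch": "b21ebbe"}

if can_fast_forward(parents, branches["master"], branches["new"]):
    # The merge is just a pointer update; no new commit is created.
    branches["master"] = branches["new"]
```

Note that the check is not symmetric: "new" cannot be fast-forwarded to the old "master" HEAD, because that would discard a commit rather than add one.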
co = repo.checkout(write=True, branch_name='master')
Performing the Merge¶
In practice, you’ll never need to know the details of the merge theory explained above (or even remember it exists). Hangar automatically figures out which merge algorithm should be used and then performs whatever calculations are needed to compute the results.
As a user, merging in Hangar is a one-liner!
co.merge(message='message for commit (not used for FF merge)', dev_branch='new')
Selected Fast-Forward Merge Stratagy
removing all stage hash records
'0cdd8c833f654d18ddc2b089fabee93c32c9c155'
Let’s check the log!¶
repo.log()
* 0cdd8c833f654d18ddc2b089fabee93c32c9c155 (master) (new) : commit on new branch adding a sample to dummy_dataset
* b21ebbeeece723bf7aa2157eb2e8742a043df7d0 (testbranch) : first commit with a single sample added to a dummy dataset
co.branch_name
'master'
co.commit_hash
'0cdd8c833f654d18ddc2b089fabee93c32c9c155'
co.datasets['dummy_dataset']
Hangar DatasetDataWriter
Dataset Name : dummy_dataset
Schema UUID : d82cddc07e0211e9a08a8c859047adef
Schema Hash : 43edf7aa314c
Variable Shape : False
(max) Shape : (10,)
Datatype : <class 'numpy.uint16'>
Named Samples : True
Access Mode : a
Num Samples : 2
Making changes to introduce diverged histories¶
Let’s now go back to our "testbranch" branch and make some changes there so we can see what happens when changes don’t follow a linear history.
co = repo.checkout(write=True, branch_name='testbranch')
co.datasets
Hangar Datasets
Writeable: True
Dataset Names:
- dummy_dataset
co.datasets['dummy_dataset']
Hangar DatasetDataWriter
Dataset Name : dummy_dataset
Schema UUID : d82cddc07e0211e9a08a8c859047adef
Schema Hash : 43edf7aa314c
Variable Shape : False
(max) Shape : (10,)
Datatype : <class 'numpy.uint16'>
Named Samples : True
Access Mode : a
Num Samples : 1
We will start by mutating sample 0 in dummy_dataset to a different value:
dummy_dset = co.datasets['dummy_dataset']
old_arr = dummy_dset['0']
new_arr = old_arr + 50
new_arr
array([50, 51, 52, 53, 54, 55, 56, 57, 58, 59], dtype=uint16)
dummy_dset['0'] = new_arr
Let’s make a commit here, then add some metadata and make a new commit (all on the testbranch branch).
co.commit('mutated sample `0` of `dummy_dataset` to new value')
Commit operation requested with message: mutated sample 0 of dummy_dataset to new value
(288, 222, 288)
removing all stage hash records
Commit completed. Commit hash: 4fdb96afed4ec62e9fc80328abccae6bf6774fea
'4fdb96afed4ec62e9fc80328abccae6bf6774fea'
repo.log()
* 4fdb96afed4ec62e9fc80328abccae6bf6774fea (testbranch) : mutated sample 0 of dummy_dataset to new value
* b21ebbeeece723bf7aa2157eb2e8742a043df7d0 : first commit with a single sample added to a dummy dataset
co.metadata['hello'] = 'world'
co.commit('added hellow world metadata')
Commit operation requested with message: added hellow world metadata
(348, 260, 348)
removing all stage hash records
Commit completed. Commit hash: ce8a9198d638b8fd89a175486d21d2bb2efabc91
'ce8a9198d638b8fd89a175486d21d2bb2efabc91'
co.close()
writer checkout of testbranch closed
Looking at our history now, we see that none of the original branches reference our first commit anymore:
repo.log()
* ce8a9198d638b8fd89a175486d21d2bb2efabc91 (testbranch) : added hellow world metadata
* 4fdb96afed4ec62e9fc80328abccae6bf6774fea : mutated sample 0 of dummy_dataset to new value
* b21ebbeeece723bf7aa2157eb2e8742a043df7d0 : first commit with a single sample added to a dummy dataset
We can check the history of the "master" branch by specifying it as an argument to the log() method:
repo.log('master')
* 0cdd8c833f654d18ddc2b089fabee93c32c9c155 (master) (new) : commit on new branch adding a sample to dummy_dataset
* b21ebbeeece723bf7aa2157eb2e8742a043df7d0 : first commit with a single sample added to a dummy dataset
Merging (Part 2) Three Way Merge¶
If we now want to merge the changes on "testbranch" into "master", we can’t just follow a simple linear history; the branches have diverged.
For this case, Hangar implements a Three Way Merge algorithm which does the following:
- Find the most recent common ancestor commit present in both the "testbranch" and "master" branches
- Compute what changed between the common ancestor and each branch’s HEAD commit
- Check if any of the changes conflict with each other (more on this in a later tutorial)
- If no conflicts are present, compute the results of the merge between the two sets of changes
- Create a new commit containing the merge results, reference both branch HEADs as parents of the new commit, and update the base branch HEAD to that new commit’s commit_hash
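The core of these steps can be sketched on plain dictionaries, treating each commit as a flat mapping of keys to values. This is a simplified illustration under that assumption; `three_way_merge` is a hypothetical helper and does not reflect Hangar's internal record format or merge code.

```python
def three_way_merge(ancestor, master, dev):
    """Merge two flat key -> value mappings relative to their common ancestor.

    Raises ValueError if both sides changed the same key in different ways.
    """
    merged = {}
    conflicts = []
    missing = object()  # sentinel meaning "key not present"
    for key in sorted(set(ancestor) | set(master) | set(dev)):
        a = ancestor.get(key, missing)
        m = master.get(key, missing)
        d = dev.get(key, missing)
        if m == d:
            result = m             # both sides agree (including both removed)
        elif m == a:
            result = d             # only dev changed this key
        elif d == a:
            result = m             # only master changed this key
        else:
            conflicts.append(key)  # both sides changed it, and differently
            continue
        if result is not missing:
            merged[key] = result
    if conflicts:
        raise ValueError(f"merge aborted with conflicts on keys: {conflicts}")
    return merged


# Shapes loosely modeled on this tutorial: master added key "1",
# dev mutated key "0" and added a "hello" entry.
ancestor = {"0": (0, 1, 2)}
master = {"0": (0, 1, 2), "1": (1, 2, 3)}
dev = {"0": (50, 51, 52), "hello": "world"}
merged = three_way_merge(ancestor, master, dev)
```

A real implementation would also need to locate the common ancestor in the commit graph and write the merge commit with two parents; this sketch covers only the comparison and combination of changes.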
co = repo.checkout(write=True, branch_name='master')
Once again, as a user, the details are completely irrelevant, and the operation occurs from the same one-liner call we used before for the FF Merge.
co.merge(message='merge of testbranch into master', dev_branch='testbranch')
Selected 3-Way Merge Strategy
(410, 293, 410)
removing all stage hash records
'dea1aa627933b3efffa03c743c201ee1b41142c8'
If we now look at the log, we see that this has a much different look than before. The three way merge results in a history which references changes made in both diverged branches, and unifies them in a single commit.
repo.log()
*   dea1aa627933b3efffa03c743c201ee1b41142c8 (master) : merge of testbranch into master
|\
| * ce8a9198d638b8fd89a175486d21d2bb2efabc91 (testbranch) : added hellow world metadata
| * 4fdb96afed4ec62e9fc80328abccae6bf6774fea : mutated sample 0 of dummy_dataset to new value
* | 0cdd8c833f654d18ddc2b089fabee93c32c9c155 (new) : commit on new branch adding a sample to dummy_dataset
|/
* b21ebbeeece723bf7aa2157eb2e8742a043df7d0 : first commit with a single sample added to a dummy dataset
Manually inspecting the merge result to verify it matches our expectations¶
dummy_dataset should contain two arrays: key 1 was set in the previous commit originally made in "new" and merged into "master". Key 0 was mutated in "testbranch" and unchanged in "master", so the update from "testbranch" is kept.
There should be one metadata sample with the key "hello" and the value "world".
co.datasets
Hangar Datasets
Writeable: True
Dataset Names:
- dummy_dataset
co.datasets['dummy_dataset']
Hangar DatasetDataWriter
Dataset Name : dummy_dataset
Schema UUID : d82cddc07e0211e9a08a8c859047adef
Schema Hash : 43edf7aa314c
Variable Shape : False
(max) Shape : (10,)
Datatype : <class 'numpy.uint16'>
Named Samples : True
Access Mode : a
Num Samples : 2
co.datasets['dummy_dataset']['0']
array([50, 51, 52, 53, 54, 55, 56, 57, 58, 59], dtype=uint16)
co.datasets['dummy_dataset']['1']
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=uint16)
co.metadata
Hangar Metadata
Writeable: True
Number of Keys: 1
co.metadata['hello']
'world'
Conflicts¶
Now that we’ve seen merging in action, the next step is to talk about conflicts.
How Are Conflicts Detected?¶
Any merge conflicts can be identified and addressed ahead of running a merge command by using the built-in diff tools. When diffing commits, Hangar will provide a list of conflicts which it identifies. In general these fall into 4 categories:
- Additions in both branches which created new keys (samples / datasets / metadata) with non-compatible values. For samples & metadata, the hash of the data is compared; for datasets, the schema specification is checked for compatibility in a method custom to the internal workings of Hangar.
- Removal in Master Commit/Branch & Mutation in Dev Commit/Branch. Applies for samples, datasets, and metadata identically.
- Mutation in Master Commit/Branch & Removal in Dev Commit/Branch. Applies for samples, datasets, and metadata identically.
- Mutations on keys in both branches to non-compatible values. For samples & metadata, the hash of the data is compared; for datasets, the schema specification is checked for compatibility in a method custom to the internal workings of Hangar.
Let’s make a merge conflict¶
To force a conflict, we are going to check out the "new" branch and set the metadata key "hello" to the value "foo conflict... BOO!". If we then try to merge "testbranch" (which set "hello" to a value of "world") into it, we will see how Hangar identifies the conflict and halts without making any changes.
Automated conflict resolution will be introduced in a future version of Hangar. For now, it is up to the user to manually resolve conflicts by making any necessary changes in each branch before reattempting a merge operation.
co = repo.checkout(write=True, branch_name='new')
co.metadata['hello'] = 'foo conflict... BOO!'
co.commit('commit on new branch to hello metadata key so we can demonstrate a conflict')
Commit operation requested with message: commit on new branch to hello metadata key so we can demonstrate a conflict
(410, 294, 410)
removing all stage hash records
Commit completed. Commit hash: 5e76faba059c156bc9ed181446e104765cb471c3
'5e76faba059c156bc9ed181446e104765cb471c3'
repo.log()
* 5e76faba059c156bc9ed181446e104765cb471c3 (new) : commit on new branch to hello metadata key so we can demonstrate a conflict
* 0cdd8c833f654d18ddc2b089fabee93c32c9c155 : commit on new branch adding a sample to dummy_dataset
* b21ebbeeece723bf7aa2157eb2e8742a043df7d0 : first commit with a single sample added to a dummy dataset
When we attempt the merge, an exception is thrown telling us there is a conflict¶
co.merge(message='this merge should not happen', dev_branch='testbranch')
Selected 3-Way Merge Strategy
HANGAR VALUE ERROR:: Merge ABORTED with conflict: {'dset': ConflictRecords(t1=(), t21=(), t22=(), t3=(), conflict=False), 'meta': ConflictRecords(t1=('hello',), t21=(), t22=(), t3=(), conflict=True), 'sample': {'dummy_dataset': ConflictRecords(t1=(), t21=(), t22=(), t3=(), conflict=False)}, 'conflict_found': True}
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-59-1a98dce1852b> in <module>
----> 1 co.merge(message='this merge should not happen', dev_branch='testbranch')
~/projects/tensorwerk/hangar/hangar-py/src/hangar/checkout.py in merge(self, message, dev_branch)
392 dev_branch_name=dev_branch,
393 repo_path=self._repo_path,
--> 394 writer_uuid=self._writer_lock)
395
396 for dsetHandle in self._datasets.values():
~/projects/tensorwerk/hangar/hangar-py/src/hangar/merger.py in select_merge_algorithm(message, branchenv, stageenv, refenv, stagehashenv, master_branch_name, dev_branch_name, repo_path, writer_uuid)
125
126 except ValueError as e:
--> 127 raise e from None
128
129 finally:
~/projects/tensorwerk/hangar/hangar-py/src/hangar/merger.py in select_merge_algorithm(message, branchenv, stageenv, refenv, stagehashenv, master_branch_name, dev_branch_name, repo_path, writer_uuid)
122 refenv=refenv,
123 stagehashenv=stagehashenv,
--> 124 repo_path=repo_path)
125
126 except ValueError as e:
~/projects/tensorwerk/hangar/hangar-py/src/hangar/merger.py in _three_way_merge(message, master_branch_name, masterHEAD, dev_branch_name, devHEAD, ancestorHEAD, branchenv, stageenv, refenv, stagehashenv, repo_path)
239 except ValueError as e:
240 logger.error(e, exc_info=False)
--> 241 raise e from None
242
243 fmtCont = _merge_dict_to_lmdb_tuples(patchedRecs=mergeContents)
~/projects/tensorwerk/hangar/hangar-py/src/hangar/merger.py in _three_way_merge(message, master_branch_name, masterHEAD, dev_branch_name, devHEAD, ancestorHEAD, branchenv, stageenv, refenv, stagehashenv, repo_path)
236
237 try:
--> 238 mergeContents = _compute_merge_results(a_cont=aCont, m_cont=mCont, d_cont=dCont)
239 except ValueError as e:
240 logger.error(e, exc_info=False)
~/projects/tensorwerk/hangar/hangar-py/src/hangar/merger.py in _compute_merge_results(a_cont, m_cont, d_cont)
333 if confs['conflict_found'] is True:
334 msg = f'HANGAR VALUE ERROR:: Merge ABORTED with conflict: {confs}'
--> 335 raise ValueError(msg) from None
336
337 # merging: dataset schemas
ValueError: HANGAR VALUE ERROR:: Merge ABORTED with conflict: {'dset': ConflictRecords(t1=(), t21=(), t22=(), t3=(), conflict=False), 'meta': ConflictRecords(t1=('hello',), t21=(), t22=(), t3=(), conflict=True), 'sample': {'dummy_dataset': ConflictRecords(t1=(), t21=(), t22=(), t3=(), conflict=False)}, 'conflict_found': True}
Alternatively, use the diff methods on a checkout to test for conflicts¶
merge_results, conflicts_found = co.diff.branch('testbranch')
conflicts_found
{'dset': ConflictRecords(t1=(), t21=(), t22=(), t3=(), conflict=False),
'meta': ConflictRecords(t1=('hello',), t21=(), t22=(), t3=(), conflict=True),
'sample': {'dummy_dataset': ConflictRecords(t1=(), t21=(), t22=(), t3=(), conflict=False)},
'conflict_found': True}
conflicts_found['meta']
ConflictRecords(t1=('hello',), t21=(), t22=(), t3=(), conflict=True)
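Since the diff call reports conflicts without raising, it can be used as a guard before attempting a merge. The sketch below assumes only the shapes shown above: that `co.diff.branch(name)` returns a pair whose second element is a dict with a `'conflict_found'` key, and that `co.merge` accepts `message` and `dev_branch`. The helper `merge_if_clean` and the stub classes are hypothetical; the stubs exist only so the example runs without a repository.

```python
def merge_if_clean(co, dev_branch, message):
    """Merge `dev_branch` into the checkout only if the diff reports no conflicts.

    Returns whatever `co.merge` returns, or None if conflicts were found.
    """
    _, conflicts = co.diff.branch(dev_branch)
    if conflicts["conflict_found"]:
        return None
    return co.merge(message=message, dev_branch=dev_branch)


class _StubDiff:
    """Stand-in for the checkout's diff object (not part of Hangar)."""

    def __init__(self, conflict_found):
        self._conflict_found = conflict_found

    def branch(self, name):
        return {}, {"conflict_found": self._conflict_found}


class _StubCheckout:
    """Stand-in for a write-enabled checkout (not part of Hangar)."""

    def __init__(self, conflict_found):
        self.diff = _StubDiff(conflict_found)

    def merge(self, message, dev_branch):
        return f"merged {dev_branch}"


blocked = merge_if_clean(_StubCheckout(True), "testbranch", "should not happen")
merged = merge_if_clean(_StubCheckout(False), "testbranch", "clean merge")
```

With a real checkout, the same helper would be called as `merge_if_clean(co, 'testbranch', 'some message')`, avoiding the ValueError shown in the traceback above.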
The type codes for a ConflictRecords namedtuple such as the one we saw:
ConflictRecords(t1=('hello',), t21=(), t22=(), t3=(), conflict=True)
are as follows:
- t1: Addition of key in master AND dev with different values.
- t21: Removed key in master, mutated value in dev.
- t22: Removed key in dev, mutated value in master.
- t3: Mutated key in both master AND dev to different values.
- conflict: Bool indicating if any type of conflict is present.
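The type codes can be illustrated by classifying keys across three plain dictionaries (common ancestor, master, dev). This is a sketch under simplifying assumptions: values are compared directly rather than by hash, and `ToyConflictRecords` and `find_conflicts` are hypothetical names that only borrow the field names printed above.

```python
from collections import namedtuple

# Local stand-in using the same field names as the record printed above.
ToyConflictRecords = namedtuple(
    "ToyConflictRecords", ["t1", "t21", "t22", "t3", "conflict"]
)


def find_conflicts(ancestor, master, dev):
    """Sort conflicting keys into the t1 / t21 / t22 / t3 categories."""
    t1, t21, t22, t3 = [], [], [], []
    for key in sorted(set(ancestor) | set(master) | set(dev)):
        in_a, in_m, in_d = key in ancestor, key in master, key in dev
        if not in_a:
            # t1: key added on both sides with different values.
            if in_m and in_d and master[key] != dev[key]:
                t1.append(key)
        elif not in_m and in_d:
            # t21: removed in master, mutated in dev.
            if dev[key] != ancestor[key]:
                t21.append(key)
        elif in_m and not in_d:
            # t22: removed in dev, mutated in master.
            if master[key] != ancestor[key]:
                t22.append(key)
        elif in_m and in_d:
            # t3: mutated on both sides to different values.
            changed_m = master[key] != ancestor[key]
            changed_d = dev[key] != ancestor[key]
            if changed_m and changed_d and master[key] != dev[key]:
                t3.append(key)
    found = bool(t1 or t21 or t22 or t3)
    return ToyConflictRecords(tuple(t1), tuple(t21), tuple(t22), tuple(t3), found)


# The metadata situation from this tutorial: "hello" did not exist in the
# common ancestor and was added with different values on each branch.
meta = find_conflicts(
    ancestor={},
    master={"hello": "foo conflict... BOO!"},
    dev={"hello": "world"},
)
```

Here `meta.t1` holds the single key `'hello'`, which corresponds to the `t1` entry in the `'meta'` record reported by the diff.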
To resolve, remove the conflict¶
del co.metadata['hello']
co.metadata['resolved'] = 'conflict by removing hello key'
co.commit('commit which removes conflicting metadata key')
Commit operation requested with message: commit which removes conflicting metadata key
(413, 296, 413)
removing all stage hash records
Commit completed. Commit hash: 4f312b10775c2b0ac51b5f284d2f94e9a8548868
'4f312b10775c2b0ac51b5f284d2f94e9a8548868'
co.merge(message='this merge succeeds as it no longer has a conflict', dev_branch='testbranch')
Selected 3-Way Merge Strategy
(465, 331, 465)
removing all stage hash records
'3550984bd91afe39d9462f7299c2542e7d45444d'
repo.log()
*   3550984bd91afe39d9462f7299c2542e7d45444d (new) : this merge succeeds as it no longer has a conflict
|\
* | 4f312b10775c2b0ac51b5f284d2f94e9a8548868 : commit which removes conflicting metadata key
* | 5e76faba059c156bc9ed181446e104765cb471c3 : commit on new branch to hello metadata key so we can demonstrate a conflict
| * ce8a9198d638b8fd89a175486d21d2bb2efabc91 (testbranch) : added hellow world metadata
| * 4fdb96afed4ec62e9fc80328abccae6bf6774fea : mutated sample 0 of dummy_dataset to new value
* | 0cdd8c833f654d18ddc2b089fabee93c32c9c155 : commit on new branch adding a sample to dummy_dataset
|/
* b21ebbeeece723bf7aa2157eb2e8742a043df7d0 : first commit with a single sample added to a dummy dataset