There's no love and care put into crafting our git repositories nowadays.
Let's change that.
I'm going to talk about how to handmake your git repositories without using these silly git commands.
You might also learn a bit more about how git works under the hood during the process, or whatever.
If you're so inclined, you might also take it as an opportunity to appreciate how the power of git comes not from the complexity of its code but from the simplicity and elegance of its design. I mean, if that's your thing.
Git refers to the user-friendly commands as "porcelain" and to the internals as "plumbing", so you can think of this as an introductory lesson in git plumbing. In fact, git has "plumbing" commands which we're not even using, so this is more like an introductory course in git fluid dynamics. In other words, it's pretty silly.
Pre-requisites
I'm going to assume you are familiar with git and are comfortable in a shell environment, otherwise this probably won't make much sense to you.
Let's get started
The first thing we'd normally do is run git init, but where's the care and attention in that? We're going to do it the old-fashioned way, like the pilgrims of yore would've done.
$ mkdir artisanal-git
$ cd artisanal-git

# This is where git stores all the information for a repository, from branches to
# commits to objects.
$ mkdir .git

# Git expects these folders to exist, but we don't need to add anything into them yet.
$ mkdir -p .git/hooks .git/info .git/objects/info .git/objects/pack .git/refs/heads .git/refs/remotes .git/refs/tags .git/logs

# This is just a standard repo-specific config file; these are the default values on my
# machine.
$ cat <<EOF > .git/config
[core]
    repositoryformatversion = 0
    filemode = true
    bare = false
    logallrefupdates = true
    ignorecase = true
    precomposeunicode = true
EOF
Now we only have to add one more file before we've got a valid git repository, and that's the HEAD. You might've come across this before: HEAD just means "what is your repository pointing to right now?". It's a text file and it's either got a reference in it (the normal state of a repository) or just a plain commit hash, which is what's referred to as a "detached head" (sounds painful, I know).
We're going to point git to our default branch, called main.
$ echo "ref: refs/heads/main" > .git/HEAD
You might be thinking: wait a minute, where is this main branch? I've not created any branches yet. And you haven't, but that's fine because we don't have any commits yet. We can run git status to check whether git is happy with our handiwork so far.
$ git status
On branch main

No commits yet

nothing to commit (create/copy files and use "git add" to track)
If you get the error "fatal: This operation must be run in a work tree", it's probably just because you're running it from inside the .git folder, which won't work. Just cd one level up and you should be good to go.
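If you want to see the two flavours of HEAD side by side, here's a small sketch. The describe_head function and the throwaway files are made up for illustration (they aren't part of git), and the hash is just an example value. It only looks at the text in the file, which is all HEAD really is.

```shell
# Hypothetical helper (not a git command): classify the contents of a HEAD file.
describe_head() {
    head_contents="$(cat "$1")"
    case "$head_contents" in
        "ref: "*) echo "on ref ${head_contents#ref: }" ;;
        *)        echo "detached at $head_contents" ;;
    esac
}

# Throwaway files standing in for .git/HEAD in its two possible states.
demo_dir="$(mktemp -d)"
echo "ref: refs/heads/main" > "$demo_dir/HEAD-attached"
echo "84eae65b6780129486768e6497736c38bfdf9b3d" > "$demo_dir/HEAD-detached"

describe_head "$demo_dir/HEAD-attached"   # on ref refs/heads/main
describe_head "$demo_dir/HEAD-detached"   # detached at 84eae65b6780129486768e6497736c38bfdf9b3d
```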
Nice! Now, let's get committing.
Content Addressable Storage
Before we start hand-crafting our commits, we need to spend a bit of time refining our craftwork by studying the lore.
First, we need to understand what an "object" is in git and how git stores them.
Git needs to store lots of things. If you create a commit in git, git then needs to store everything about the commit itself, including the commit description, which files it includes, who committed it, the timestamp, and the contents of the files.
All of these things are stored as "objects". But what do these look like? Let's take another repository which already has a bunch of data in it.
$ git log -1
commit 84eae65b6780129486768e6497736c38bfdf9b3d (HEAD -> main, origin/main, origin/HEAD)
Author: Drew Silcock <redacted>
Date:   Mon Sep 30 21:39:55 2024 +0100

    Improve py 3.13 graphs and add extra section on scaling.
This is the last commit in the repo for this blog, at the time of writing. We can see that the commit is identified by its SHA-1 hash, 84eae65b (I can't be bothered to type out the whole thing).
This is where things start to get clever. Git does not store all these things like commits, files, etc. [1] under some location separate from the data itself, like a filename or a key (which would require knowing this filename/key outside of the data itself, e.g. by querying a central listing). Instead, it stores them based on the contents of the thing itself: do a SHA-1 hash of the object and that is the "key" that you use to find the object on disk.
This is called Content Addressable Storage (CAS) or sometimes fixed-content storage, and once you recognise it, you start seeing it all over the place: CAS and content addressing in general are used by plenty of other tools too.
A side bonus of using content addressable storage is that if you ever have duplicate files, you don't need to store them twice: they'll have the same hash so they'll be stored in the same location.
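To make the deduplication point concrete, here's a rough sketch using only sha1sum. The object_id helper is my own name for illustration, and it assumes the "type, size, null byte, contents" layout that's covered in the next sections. The two files are throwaway examples.

```shell
# Hypothetical helper: the SHA-1 git would use as the key for a file's contents,
# assuming the "<type> <size>\0<contents>" layout described later in this post.
object_id() {
    obj_type="$1"
    obj_file="$2"
    obj_size="$(wc -c < "$obj_file" | tr -d ' ')"
    { printf '%s %s\0' "$obj_type" "$obj_size"; cat "$obj_file"; } | sha1sum | cut -d ' ' -f 1
}

demo_dir="$(mktemp -d)"
printf 'hello\n' > "$demo_dir/original.txt"
printf 'hello\n' > "$demo_dir/duplicate.txt"

# Same contents, same key, so only one object would ever need to be stored.
object_id blob "$demo_dir/original.txt"    # ce013625030ba8dba906f756967f9e9ca394464a
object_id blob "$demo_dir/duplicate.txt"   # ce013625030ba8dba906f756967f9e9ca394464a
```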
Commit objects
To find commit 84eae65b we need to look in .git/objects/84/eae65b6780129486768e6497736c38bfdf9b3d. The data is compressed using zlib so we have to pipe it into a decompressor (I'm using pigz here):
$ cat .git/objects/84/eae65b6780129486768e6497736c38bfdf9b3d | pigz -d
commit 1136tree 5a0be7720e65417e08034a64bc257bc56a60b4b3
parent e345662c7d53408eb2638cf0fdbae442fe6b68f4
author Drew Silcock <redacted> 1727728795 +0100
committer Drew Silcock <redacted> 1727728795 +0100
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEaZwozZ5d++BpkqZmtEW8+mMmNyAFAmb7DJsACgkQtEW8+mMm
 NyAZeBAAs2I1rodxTBpOnFUgNnl5Slf2o03VZlc7kvbw2miCUP5CkO40REHzGXXE
 K3sJSUhObttTrKr0GjUChcvzBZBKoigawP+h3IeY07whhhTcnNaBXjQqzpcl+G5A
 ryEVkQXdCqVRWAk3I/6Z3hFlfUogzbxihGoEKvjyMZtmfy0di0WAOJ+PLlTIEwJJ
 SQYcUaA7l01ocIWy85MezGJHZEpurcBjzu5nkYCMGRw85u9tXXqjzaYh6Fu7WVEH
 rHmBO8tEFF/WcQC1FonVggrOQOAsssuaMxwxKV/p4HRxP9lHGmzFCGfbKAY1bQ82
 2dWgROwMAp6jtvLSX6OLu6i0O3+m6NAwTtKcOFDU+Jae4h2m1GC3/8qDukhK7o+e
 5LJCLAZPtTvpai43COLRnF9iteV15H267WOxpIvXqbMBwIFcaaHepFMLA0Y39Kr3
 FHd1JAaaE6fiUe4rjNP5Wx6ZVLKdEYznjbxgxiRkr9dcemR5SUQtreHjaaLTo0E9
 m6bEE1huZp+gu/dy9e7hgNORiwmkUP49r4/WPbNwwKrMxr5lD1ZwQk6DKEi6jAyq
 BduJ4fdtamFlngnbJtoW0LHsdxROMwHkqs1Pz4zxpmeOZZEv0p0pzFhM30ta+YjQ
 ogBAyoRGHAZG2cze5uI8Cg7fr1A+uTqGmBAXexYN+/ok4+Bf/5g=
 =gvSk
 -----END PGP SIGNATURE-----

Improve py 3.13 graphs and add extra section on scaling.
If we pipe this into hexyl we can see the hidden null byte:
$ cat .git/objects/84/eae65b6780129486768e6497736c38bfdf9b3d | pigz -d | hexyl
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 63 6f 6d 6d 69 74 20 31 ┊ 31 33 36 00 74 72 65 65 │commit 1┊136⋄tree│
│00000010│ 20 35 61 30 62 65 37 37 ┊ 32 30 65 36 35 34 31 37 │ 5a0be77┊20e65417│
│00000020│ 65 30 38 30 33 34 61 36 ┊ 34 62 63 32 35 37 62 63 │e08034a6┊4bc257bc│
│00000030│ 35 36 61 36 30 62 34 62 ┊ 33 0a 70 61 72 65 6e 74 │56a60b4b┊3_parent│
│00000040│ 20 65 33 34 35 36 36 32 ┊ 63 37 64 35 33 34 30 38 │ e345662┊c7d53408│
│00000050│ 65 62 32 36 33 38 63 66 ┊ 30 66 64 62 61 65 34 34 │eb2638cf┊0fdbae44│
...
So this follows the format commit (content length)\x00(commit contents), which is the format used by all the other types of objects too (with their own type name in place of commit).
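Both halves of that lookup (where the file lives, and what the header says) are easy to sketch with standard tools. The two helper names below are mine, not git's, and the sample data fed to object_header is a hand-typed stand-in for already-decompressed object data.

```shell
# Hypothetical helper: where a loose object with this ID would live, relative to the repo root.
loose_object_path() {
    id="$1"
    dir="$(printf '%s' "$id" | cut -c 1-2)"
    rest="$(printf '%s' "$id" | cut -c 3-)"
    printf '.git/objects/%s/%s\n' "$dir" "$rest"
}

# Hypothetical helper: print only the "<type> <length>" header of decompressed
# object data arriving on stdin, i.e. everything before the first null byte.
object_header() {
    tr '\0' '\n' | head -n 1
}

loose_object_path 84eae65b6780129486768e6497736c38bfdf9b3d
# .git/objects/84/eae65b6780129486768e6497736c38bfdf9b3d

printf 'commit 1136\0tree 5a0be7720e65417e08034a64bc257bc56a60b4b3\n' | object_header
# commit 1136
```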
Tree objects
We can see the hash of another object here by looking at the first info after the null byte: tree 5a0be7720e65417e08034a64bc257bc56a60b4b3. This tells us that the "tree" object with hash 5a0be772 contains the files included in the commit.
1$ cat .git/objects/5a/0be7720e65417e08034a64bc257bc56a60b4b3 | pigz -d2tree 678100644 .gitignoreοΏ½ H@οΏ½LοΏ½οΏ½οΏ½οΏ½οΏ½οΏ½οΏ½BOοΏ½οΏ½pοΏ½100644 .prettierignoreοΏ½YοΏ½οΏ½Vt$οΏ½οΏ½n!]- ]οΏ½100644 .prettierrc.mjsοΏ½2iYοΏ½οΏ½nοΏ½$3Q%οΏ½wοΏ½οΏ½40000 .vscode3 3οΏ½οΏ½οΏ½i(οΏ½[οΏ½ΖοΏ½οΏ½DοΏ½οΏ½100644 LICENCEοΏ½οΏ½οΏ½+οΏ½οΏ½οΏ½οΏ½οΏ½rW4οΏ½^οΏ½;οΏ½οΏ½r100644 README.md]%οΏ½οΏ½hοΏ½4~nοΏ½*QοΏ½w5 οΏ½+οΏ½100644 TODO.txtοΏ½οΏ½E+1οΏ½nοΏ½[οΏ½οΏ½ οΏ½οΏ½ΪοΏ½100644 astro.config.mjs(6 οΏ½οΏ½cοΏ½XpοΏ½οΏ½οΏ½οΏ½jlοΏ½5A40000 fontsοΏ½0οΏ½AοΏ½οΏ½οΏ½οΏ½οΏ½'οΏ½οΏ½οΏ½yοΏ½}40000 imagesοΏ½οΏ½[zοΏ½οΏ½CYοΏ½οΏ½9οΏ½οΏ½οΏ½οΏ½οΏ½οΏ½100644 logo.svgοΏ½οΏ½οΏ½7 i:LοΏ½οΏ½6οΏ½οΏ½8100644 package-lock.jsonοΏ½οΏ½οΏ½οΏ½οΏ½,!οΏ½οΏ½οΏ½Τ₯οΏ½οΏ½1οΏ½οΏ½100644 package.jsonHοΏ½jCiοΏ½οΏ½οΏ½ iοΏ½zaοΏ½c;οΏ½bοΏ½40000 publicοΏ½DοΏ½PD0en/οΏ½%1οΏ½οΏ½0Ζ°οΏ½οΏ½40000 srcοΏ½OF;οΏ½0οΏ½ οΏ½οΏ½οΏ½)9οΏ½aοΏ½100644 tsconfig.jso?οΏ½sοΏ½οΏ½οΏ½οΏ½VοΏ½v 4ice>yβ
We can clearly see that this contains some useful information like filenames and some 644s which smell of unix permissions/file modes, but it looks like a mixed binary-text format. If we look at the hex, we can see a bit more clearly what's going on:
$ cat .git/objects/5a/0be7720e65417e08034a64bc257bc56a60b4b3 | pigz -d | hexyl
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 74 72 65 65 20 36 37 38 ┊ 00 31 30 30 36 34 34 20 │tree 678┊⋄100644 │
│00000010│ 2e 67 69 74 69 67 6e 6f ┊ 72 65 00 d6 20 48 19 40 │.gitigno┊re⋄× H•@│
│00000020│ d7 4c de f6 ff ba be c0 ┊ d2 42 4f a7 b5 70 f6 31 │×L××××××┊×BO××p×1│
│00000030│ 30 30 36 34 34 20 2e 70 ┊ 72 65 74 74 69 65 72 69 │00644 .p┊rettieri│
│00000040│ 67 6e 6f 72 65 00 dd 59 ┊ c0 f4 8e 56 74 24 ef 9b │gnore⋄×Y┊×××Vt$××│
│00000050│ e7 6e 21 1d 5d 2d 20 5d ┊ e2 80 31 30 30 36 34 34 │×n!•]- ]┊××100644│
│00000060│ 20 2e 70 72 65 74 74 69 ┊ 65 72 72 63 2e 6d 6a 73 │ .pretti┊errc.mjs│
│00000070│ 00 b8 32 69 59 18 96 10 ┊ b9 6e eb 24 33 51 25 89 │⋄×2iY•×•┊×n×$3Q%×│
│00000080│ 77 04 02 b4 eb 34 30 30 ┊ 30 30 20 2e 76 73 63 6f │w••××400┊00 .vsco│
│00000090│ 64 65 00 0c 33 c7 c3 c0 ┊ 69 28 c9 5b e8 c6 8f 9f │de⋄_3×××┊i(×[××××│
│000000a0│ 15 a5 08 c8 44 a3 dc 31 ┊ 30 30 36 34 34 20 4c 49 │•×•×D××1┊00644 LI│
│000000b0│ 43 45 4e 43 45 00 e2 18 ┊ cd e0 2b a8 bf d9 ea c7 │CENCE⋄×•┊××+×××××│
│000000c0│ 72 57 34 f4 5e 95 3b b7 ┊ 87 72 31 30 30 36 34 34 │rW4×^×;×┊×r100644│
│000000d0│ 20 52 45 41 44 4d 45 2e ┊ 6d 64 00 5d 25 dd e9 68 │ README.┊md⋄]%××h│
│000000e0│ b1 0a 7e 6e a3 2a 51 a7 ┊ 14 77 0c c8 2b 06 b0 31 │×_~n×*Q×┊•w_×+•×1│
│000000f0│ 30 30 36 34 34 20 54 4f ┊ 44 4f 2e 74 78 74 00 be │00644 TO┊DO.txt⋄×│
│00000100│ 1c e0 45 2b 31 bd 6e fd ┊ 5b a2 e3 9f 14 09 e9 e8 │•×E+1×n×┊[×××•_××│
│00000110│ da 8f a0 31 30 30 36 34 ┊ 34 20 61 73 74 72 6f 2e │×××10064┊4 astro.│
│00000120│ 63 6f 6e 66 69 67 2e 6d ┊ 6a 73 00 28 0c 93 c5 63 │config.m┊js⋄(_××c│
│00000130│ bd 58 70 16 16 b0 e6 1e ┊ f9 ab 6a 6c ef 35 41 34 │×Xp••××•┊××jl×5A4│
│00000140│ 30 30 30 30 20 66 6f 6e ┊ 74 73 00 ef 30 1e e8 41 │0000 fon┊ts⋄×0•×A│
│00000150│ 8d c7 18 d7 c1 16 ed 27 ┊ 1c 8d b6 a4 79 d0 7d 34 │××•××•×'┊•×××y×}4│
│00000160│ 30 30 30 30 20 69 6d 61 ┊ 67 65 73 00 1c 10 b7 b5 │0000 ima┊ges⋄••××│
│00000170│ 5b 7a 99 9c 43 59 18 e7 ┊ d9 39 fa 8f c8 ff d2 ca │[z××CY•×┊×9××××××│
│00000180│ 31 30 30 36 34 34 20 6c ┊ 6f 67 6f 2e 73 76 67 00 │100644 l┊ogo.svg⋄│
│00000190│ 95 07 b5 d8 0b 69 3a 4c ┊ 96 eb 06 36 b6 a7 0b be │×•××•i:L┊××•6××•×│
│000001a0│ a6 83 29 0d 31 30 30 36 ┊ 34 34 20 70 61 63 6b 61 │××)_1006┊44 packa│
│000001b0│ 67 65 2d 6c 6f 63 6b 2e ┊ 6a 73 6f 6e 00 9e 8f bc │ge-lock.┊json⋄×××│
│000001c0│ b0 d9 7f 11 2c 21 f7 a9 ┊ 8e d4 a5 be d8 31 10 86 │××••,!××┊×××××1•×│
│000001d0│ e4 31 30 30 36 34 34 20 ┊ 70 61 63 6b 61 67 65 2e │×100644 ┊package.│
│000001e0│ 6a 73 6f 6e 00 48 97 6a ┊ 43 69 f1 f4 e4 20 69 91 │json⋄H×j┊Ci××× i×│
│000001f0│ 02 7a 61 ec 63 3b 80 62 ┊ b9 34 30 30 30 30 20 70 │•za×c;×b┊×40000 p│
│00000200│ 75 62 6c 69 63 00 b0 44 ┊ d0 50 44 17 30 65 6e 2f │ublic⋄×D┊×PD•0en/│
│00000210│ 90 25 31 97 d1 30 c6 b0 ┊ 90 a8 34 30 30 30 30 20 │×%1××0××┊××40000 │
│00000220│ 73 72 63 00 cb 4f 46 3b ┊ b9 02 30 0e 8d 0a aa d6 │src⋄×OF;┊×•0•×_××│
│00000230│ 84 03 44 c6 40 2d b9 5d ┊ 31 30 30 36 34 34 20 74 │×•D×@-×]┊100644 t│
│00000240│ 61 69 6c 77 69 6e 64 2e ┊ 63 6f 6e 66 69 67 2e 6d │ailwind.┊config.m│
│00000250│ 6a 73 00 6b 3c 62 76 4e ┊ 97 2a f4 50 e1 e4 3d d8 │js⋄k<bvN┊×*×P××=×│
│00000260│ d6 b9 35 5f b8 07 f2 34 ┊ 30 30 30 30 20 74 69 6e │××5_×•×4┊0000 tin│
│00000270│ 61 00 60 6f b2 72 1f d5 ┊ df 74 c9 81 c1 f6 36 42 │a⋄`o×r•×┊×t××××6B│
│00000280│ 38 23 0d d0 61 ae 31 30 ┊ 30 36 34 34 20 74 73 63 │8#_×a×10┊0644 tsc│
│00000290│ 6f 6e 66 69 67 2e 6a 73 ┊ 6f 6e 00 08 3f fe 73 de │onfig.js┊on⋄•?×s×│
│000002a0│ c8 f0 8b 56 e3 76 20 00 ┊ 05 34 69 63 65 3e 79    │×××V×v ⋄┊•4ice>y │
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘
Ok maybe it's not that much clearer, but it's obvious once I explain it:
(file mode) (file name)\x00(binary hash)(file mode) (file name)\x00(binary hash)...
The file mode is in plaintext, so here we see that the file .gitignore has mode 100644 which (according to Greg Bacon on StackOverflow) means a regular file with permissions 644. (Side note: I don't know what git does on Windows where unix file modes and stat don't exist, leave a comment below if you do!)
This also tells us that the hash of the .gitignore file is d6 20 48 19 40 d7 4c de f6 ff ba be c0 d2 42 4f a7 b5 70 f6. We don't need a null terminator at the end of this because SHA-1 hashes are always precisely 20 bytes, so we can go straight into the file mode of the next file in the tree without any bytes separating the two files in the tree.
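As a sanity check on that layout, we can add up the bytes each entry should take and compare against the size in the tree header. The tree_entry_size helper is a made-up name, and because it counts characters rather than bytes it assumes plain ASCII filenames. Feeding it the 18 entries visible in the hex dump above should land on 678.

```shell
# Hypothetical helper: size in bytes of one tree entry, i.e.
# "<mode> <name>" + a null byte + a 20-byte binary SHA-1 (ASCII names assumed).
tree_entry_size() {
    mode="$1"
    name="$2"
    echo $(( ${#mode} + 1 + ${#name} + 1 + 20 ))
}

total=0
while read -r mode name; do
    total=$(( total + $(tree_entry_size "$mode" "$name") ))
done <<'EOF'
100644 .gitignore
100644 .prettierignore
100644 .prettierrc.mjs
40000 .vscode
100644 LICENCE
100644 README.md
100644 TODO.txt
100644 astro.config.mjs
40000 fonts
40000 images
100644 logo.svg
100644 package-lock.json
100644 package.json
40000 public
40000 src
100644 tailwind.config.mjs
40000 tina
100644 tsconfig.json
EOF

echo "$total"   # 678, matching "tree 678" in the header
```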
Because git uses this content-addressing system, we can then look up the file itself using this hash:
$ cat .git/objects/d6/20481940d74cdef6ffbabec0d2424fa7b570f6 | pigz -d
blob 274.vscode

# build output
dist/

# generated types
.astro/

# dependencies
node_modules/

# logs
npm-debug.log*
yarn-debug.log*
yarn-error.log*
pnpm-debug.log*


# environment variables
.env
.env.production

# macOS-specific files
.DS_Store

# jetbrains setting folder
.idea/
Again, we see the object type, in this case "blob" indicating file contents, followed by the length of the file contents, followed by a null byte (not shown by the terminal here), followed by the file contents (and a very exciting example it is).
There are actually some helper commands, which git calls "plumbing" commands, for printing out object info: git cat-file -p (sha-1 hash) will pretty-print the object contents (the -t flag will display the object type and -s will display the size). We haven't covered annotated tags, which are also objects, but they're quite similar to commits and not massively interesting for our discussion. Lightweight tags (without messages) are just references to commit hashes, not objects, so they are stored in the refs/ folder alongside branches.
Neat!
But wait, there are no diffs/deltas here! I thought commits were showing changes to files?
Commits do not store deltas or differences between files. Each commit points to a complete snapshot of the files, and the "before" is simply the parent commit's snapshot. This was actually one of the big selling points of git compared to competing version control systems, back before git became the de-facto standard [2].
What git does when you do git diff is get the before contents, get the after contents, then run a differ on these two versions to display to the user. You can even change the differ that git uses. Personally, I use delta, but I've heard good things about difftastic and diff-so-fancy.
You might be thinking that this will surely make repositories mahoosive, right? We'll get onto that in just a second.
What's up with the first two letters of the hash being the folder?
Good question. Apparently, there are some filesystems which don't allow you to have more than a certain number of files in any particular directory, and others which use a sequential scan to find a file within a directory, which wouldn't be good if you were trying to commit to the Linux kernel, which has some ~4.5 million objects or so.
SHA-1 hashes are well distributed so you can expect as many hashes starting with 00 as 1e as 8f and every other possible combination. Git takes advantage of this by checking whether there are 27 or more files in any of these subdirectories [3]. If so, it would indicate that there are more than 6,700 objects in total, which tells git that it's time to "pack" those "loose" objects up into packfiles, to save on space.
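The relationship between those two numbers is just division by the 256 possible two-character prefixes. Here's the arithmetic as a sketch. This is only the sums, not git's actual code, and the variable names are mine.

```shell
gc_auto=6700        # the default loose-object limit mentioned above
fanout_dirs=256     # one directory per prefix, 00 through ff

# Round up: how many objects one directory holds when the limit is spread evenly.
per_dir_threshold=$(( (gc_auto + fanout_dirs - 1) / fanout_dirs ))
echo "$per_dir_threshold"                       # 27

# Scaling a single directory's count back up gives the estimated total.
echo $(( per_dir_threshold * fanout_dirs ))     # 6912
```

Because the hashes are evenly spread, counting one directory and multiplying is a cheap stand-in for counting every loose object.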
Time to pack it up
If you look in the .git/objects folder, there are a couple of non-hex folders that you might be wondering about. One is called info/ and one is called pack/.
The "info" folder is super interesting and not something I encountered before researching for this blog post. It turns out multiple git repositories can share a single object database to cut down on storage size, which is pretty neat. I think the "info" folder is only used for very specific, niche use cases to be honest.
The "pack" folder is very commonly used, and it's where git puts its packfiles. What's a packfile, I hear you ask? Well, as I said above, when git decides that there are too many "loose" objects in the .git/objects folder (apparently, 6,700 by default), it pulls multiple objects together into a single packfile. This allows multiple objects to be compressed together within the same file, improving compression ratios.
This does introduce the problem of how to find objects once they've been packed up. Git supports per-pack index files and also multi-pack index files, a more recent introduction that allows using a single index for all your packs. If you're using multiple pack index files, you can try some tactics like looking at the most recently changed packfiles first as potential speedups, but ultimately you do have to search through all the packfile index files to find your object.
At this point you might have a couple of questions:
- How does this stop repositories from getting mahoosive? Sure, it's a bit better compressed, but that won't make that much of a difference.
- If git stores the complete version of every file, why does git say "resolving deltas..." when I clone a repository?
The answer is that git does store deltas in the packfiles. In fact, git uses various strange heuristics to determine how to pack objects as efficiently as possible, both in terms of the deltas and in terms of the compression (which is done after the delta creation).
In particular, git does not store the chronologically "first" version of a file as the base and then add deltas on top of that. It doesn't even care whether two objects represent the same file or not: git looks at which objects are similar and determines which would be the best base for the deltas, regardless of whether the similar objects even represent the same file.
It is this clever but slightly magical heuristic-based packfile creation system that keeps the size of git repositories down. It also helps that git throws out "unreachable" objects so that your repository only has what you need it to have.
Taking the trash out
It's worth talking a bit more about git's garbage collection, i.e. the removal of unreachable objects, because this is a bit of knowledge that may actually save your proverbial bacon one day.
If you accidentally delete a branch that has a bunch of important commits in, you might be thinking "Egad, I've lost all my precious files! Woe!", but don't worry: git does not in fact delete all your commits and files when you delete the branch.
In fact, git keeps a log of all the actions you do on each branch and HEAD in the folder .git/logs/. All the actions you do on HEAD are in .git/logs/HEAD while changes to references like branches are in e.g. .git/logs/refs/heads/main. Each log file contains one line per "action", e.g. making a commit, pulling from a remote, merging a branch, checking out another branch (checkouts will only be in the HEAD log, not individual branches), etc.
Because these logs contain information about actions performed on each reference, they're called reference logs or just reflogs for short.
Reflog entries are kept for 90 days (configurable as gc.reflogExpire) and while a commit is referenced from the reflog, it is still considered "reachable" by git. Even once an item is truly unreachable, i.e. it's been removed from the reflog, git still gives it 2 weeks (configurable as gc.pruneExpire) as a grace period before it garbage collects it. This means that you can often still recover your work using either the manual methods we describe here or by using the git reflog command followed by git checkout or git reset --hard.
For more details on reflog and more, check out the Oh Shit, Git! zine from Julia Evans.
Beware, however! Reflog files are local to your individual repository: if you delete your .git folder or clone afresh, it will not contain your deleted branch.
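Since reflog files are plain text, you can pick them apart with ordinary tools. Each line holds the old hash, the new hash, who did it, a timestamp and timezone, then a tab and a message. The line below is made up for illustration (the name, email and message are placeholders), and the sketch only shows how to pull out the two bits you'd want when rescuing a commit.

```shell
# A made-up reflog line: old hash, new hash, identity, timestamp and timezone
# separated by spaces, then a tab, then the message.
reflog_line="$(printf '%s %s %s %s %s\t%s' \
    0000000000000000000000000000000000000000 \
    d62016426c1b7b4125d47bad267aeaaa78bb817c \
    'Example Person <person@example.com>' \
    1752659127 +0100 \
    'commit (initial): Initial artisanal commit.')"

# The second space-separated field is the hash the ref pointed to after the action.
new_hash="$(printf '%s\n' "$reflog_line" | cut -d ' ' -f 2)"
# Everything after the tab is the message.
message="$(printf '%s\n' "$reflog_line" | cut -f 2)"

echo "$new_hash"   # d62016426c1b7b4125d47bad267aeaaa78bb817c
echo "$message"    # commit (initial): Initial artisanal commit.
```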
Creating our first commit
Now that we understand git objects, we can make our first commit.
$ echo -e "Has spring come indeed?\nOn that nameless mountain lie\nThin layers of mist.\n\n  - Matsuo Bashō" > haiku.txt

$ git status
On branch main

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	haiku.txt

nothing added to commit but untracked files present (use "git add" to track)
Okay, so normally we would need to add the file to our staging area before we can commit it. But we don't really need to do that because we're handcrafting our artisanal repository. Plus, the index is a binary file located in .git/index which makes it a) not massively interesting and b) not particularly easy to manipulate using just command line tools.
If you're interested in learning more about the index file format and enjoy sitting down with a cup of hot mocha and an internal documentation page describing binary file formats, take a look over at the git docs reference page on the index format: https://git-scm.com/docs/index-format [4].
Creating our file blob
Okay, so we need to construct the commit object from scratch. Let's start with the blob, then use that to construct the tree, then we can build the commit object from that tree.
$ cat haiku.txt | wc -c
      94

# Note: SHA-1 is done before applying zlib compression.
$ echo -e "blob 94\x00$(cat haiku.txt)" | sha1sum
e5d59773e77daf9f9b9129781ca77d475a451831  -

$ mkdir -p .git/objects/e5

$ echo -e "blob 94\x00$(cat haiku.txt)" | pigz -z > .git/objects/e5/d59773e77daf9f9b9129781ca77d475a451831

# Check whether we messed it up yet
$ git cat-file -p e5d59773e77daf9f9b9129781ca77d475a451831
Has spring come indeed?
On that nameless mountain lie
Thin layers of mist.

  - Matsuo Bashō

# Nice!
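One thing to watch with the $(cat haiku.txt) trick: command substitution strips every trailing newline and echo then adds exactly one back. That's fine for a file ending in a single newline like ours, but not in general. Here's a sketch comparing that approach with streaming the file's bytes untouched. Both helper names are made up, and the first one uses printf to do the same thing as the echo -e line above.

```shell
# Hypothetical helper: hash a blob the way the walkthrough does. $(cat ...) drops
# every trailing newline and the format string adds exactly one back.
blob_id_via_substitution() {
    blob_size="$(wc -c < "$1" | tr -d ' ')"
    printf 'blob %s\0%s\n' "$blob_size" "$(cat "$1")" | sha1sum | cut -d ' ' -f 1
}

# Hypothetical helper: hash the same header followed by the file's bytes, untouched.
blob_id_via_stream() {
    blob_size="$(wc -c < "$1" | tr -d ' ')"
    { printf 'blob %s\0' "$blob_size"; cat "$1"; } | sha1sum | cut -d ' ' -f 1
}

demo_dir="$(mktemp -d)"
printf 'one trailing newline\n' > "$demo_dir/one.txt"
printf 'two trailing newlines\n\n' > "$demo_dir/two.txt"

# Identical for the common case...
blob_id_via_substitution "$demo_dir/one.txt"
blob_id_via_stream "$demo_dir/one.txt"

# ...but different here, because the first approach hashes fewer bytes than the header claims.
blob_id_via_substitution "$demo_dir/two.txt"
blob_id_via_stream "$demo_dir/two.txt"
```

If git cat-file complains about an object you built by hand, a mismatch like this between the declared size and the actual bytes is a good first suspect.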
Creating our tree object
Phew, so now let's use that file blob object to create a tree object with just that file in it, using file mode 100644 again:
# Let's make a helper function for the hex to bytes conversion.
# Note: I use fish, where this works but you need to add another `\\` before the `x&` in
# the sed command.
$ function hex-to-bytes() { printf "$(printf "$1" | sed 's/../\\x&/g')"; }
Note for Fish shell users: your hex-to-bytes() function will look like this:
function hex-to-bytes; printf "$(printf "$argv[1]" | sed 's/../\\\\x&/g')"; end
Back to creating our tree object:
$ printf "100644 haiku.txt\x00$(hex-to-bytes e5d59773e77daf9f9b9129781ca77d475a451831)" | hexyl
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 31 30 30 36 34 34 20 68 ┊ 61 69 6b 75 2e 74 78 74 │100644 h┊aiku.txt│
│00000010│ 00 e5 d5 97 73 e7 7d af ┊ 9f 9b 91 29 78 1c a7 7d │⋄×××s×}×┊×××)x•×}│
│00000020│ 47 5a 45 18 31 0a       ┊                         │GZE•1_  ┊        │
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘

# Looks good so far.
$ printf "100644 haiku.txt\x00$(hex-to-bytes e5d59773e77daf9f9b9129781ca77d475a451831)" | wc -c
      37

$ printf "tree 37\x00100644 haiku.txt\x00$(hex-to-bytes e5d59773e77daf9f9b9129781ca77d475a451831)" | sha1sum
4aff48f6390a65b88d343ea5d23c03007646b5c2  -

$ mkdir -p .git/objects/4a

$ printf "tree 37\x00100644 haiku.txt\x00$(hex-to-bytes e5d59773e77daf9f9b9129781ca77d475a451831)" | pigz -z > .git/objects/4a/ff48f6390a65b88d343ea5d23c03007646b5c2
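If you want a quick way to convince yourself that the "tree (size)" plus null byte framing is right without trusting my hashes, there's a well-known value to test against: a tree with no entries at all always hashes to the same ID, which is the one git uses for the empty tree. A tiny sketch (the variable name is mine):

```shell
# An empty tree is just the header: "tree 0" followed by a null byte and no entries.
empty_tree_id="$(printf 'tree 0\0' | sha1sum | cut -d ' ' -f 1)"
echo "$empty_tree_id"   # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```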
Creating our commit
Now that we have our tree, we can create the commit! (Finally...)
# This will vary on your machine, obviously
$ date +'%s %z'
1752659127 +0100

# Feel free to replace with your name/email and timestamp/timezone, just bear in mind
# that the SHA-1 hash will be different.
# Note: Commits usually have a "parent" field but as this is our first commit, it has no
# parent.
$ cat <<EOF > /tmp/my-commit
tree 4aff48f6390a65b88d343ea5d23c03007646b5c2
author Drew Silcock <redacted> 1752659127 +0100
committer Drew Silcock <redacted> 1752659127 +0100

Initial artisanal commit.
EOF

# Optional: I like to sign my commits.
$ cat /tmp/my-commit | gpg --armor --detach-sign
-----BEGIN PGP SIGNATURE-----

iQIzBAABCAAdFiEEaZwozZ5d++BpkqZmtEW8+mMmNyAFAmh3diEACgkQtEW8+mMm
NyDomRAAvWYhK9Eg+ZjmChFR2ZX9bB/KZH+H3ksziy2UHp8LiaHgOb3Ira02rpSm
LvVQjxmgurzYBd3nl1e/8E+V3TH1kGOzmvaoCcjJkSUj6togvD7+eImulc1/xkri
q/qqPXxvj2UoRMbSc4cVy/8SZ/MTxNWtCJsuFRe6iKLRiqk67h3PY+gvebCuJteC
TevKxWV/ra+NRX2Q0w52SEUpGTVcnnxYPyMEi28Kmd9VZUsOvuC43RMm/p7u/eiC
kAzJ3GKN4oQvN/3Xz8akb09VX66M/xbMYNv/J0pbSdeIGofMDfLA3oKeZzhrvUVf
zsrpiJ9kq2CTGIuZMJZQvPc8aEEMbr/PAHgSnSTicayon7JLoi5aaoyhZLCD+pgK
Sd6OMMjrKs61UL2qxelVVde2tZumfOL4GmILrhxQgqZbZsdfDUvPMee9yFkEZVam
re8ekkUlYmlmckTqJ0yQ7VTLYhdxPN+0DRynuiKKQaQlsCHhQi2MTYEn9l+mfbrO
7gUAe697+kDwo2VECs4Z7wtPG9F+kGNFpsC0CnGMWjKRR8ZBV9BBiLyc/SZgNd9Q
MZLQWPPvnMw0YlS59rHbUk3VwebxBKx8vX2WBt1NPFtmzkFRr73yL+e69JczspoM
t6FRuXH0bTtF+uVf7qD0saFXCC9lphLYFe5PuyzpIKwWbazGDFA=
=rE6v
-----END PGP SIGNATURE-----

# We need to insert this signature into the commit in the format `gpgsig <signature>`.
# The single space at the start of each line is important.
# NOTE: My auto-formatter is removing the whitespace on the line after "BEGIN PGP
# SIGNATURE"; it should have a single space. You need that single space.
$ cat <<EOF > /tmp/my-commit-signed
tree 4aff48f6390a65b88d343ea5d23c03007646b5c2
author Drew Silcock <redacted> 1752659127 +0100
committer Drew Silcock <redacted> 1752659127 +0100
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEaZwozZ5d++BpkqZmtEW8+mMmNyAFAmh3diEACgkQtEW8+mMm
 NyDomRAAvWYhK9Eg+ZjmChFR2ZX9bB/KZH+H3ksziy2UHp8LiaHgOb3Ira02rpSm
 LvVQjxmgurzYBd3nl1e/8E+V3TH1kGOzmvaoCcjJkSUj6togvD7+eImulc1/xkri
 q/qqPXxvj2UoRMbSc4cVy/8SZ/MTxNWtCJsuFRe6iKLRiqk67h3PY+gvebCuJteC
 TevKxWV/ra+NRX2Q0w52SEUpGTVcnnxYPyMEi28Kmd9VZUsOvuC43RMm/p7u/eiC
 kAzJ3GKN4oQvN/3Xz8akb09VX66M/xbMYNv/J0pbSdeIGofMDfLA3oKeZzhrvUVf
 zsrpiJ9kq2CTGIuZMJZQvPc8aEEMbr/PAHgSnSTicayon7JLoi5aaoyhZLCD+pgK
 Sd6OMMjrKs61UL2qxelVVde2tZumfOL4GmILrhxQgqZbZsdfDUvPMee9yFkEZVam
 re8ekkUlYmlmckTqJ0yQ7VTLYhdxPN+0DRynuiKKQaQlsCHhQi2MTYEn9l+mfbrO
 7gUAe697+kDwo2VECs4Z7wtPG9F+kGNFpsC0CnGMWjKRR8ZBV9BBiLyc/SZgNd9Q
 MZLQWPPvnMw0YlS59rHbUk3VwebxBKx8vX2WBt1NPFtmzkFRr73yL+e69JczspoM
 t6FRuXH0bTtF+uVf7qD0saFXCC9lphLYFe5PuyzpIKwWbazGDFA=
 =rE6v
 -----END PGP SIGNATURE-----

Initial artisanal commit.
EOF

# Now we're ready to add the commit as an object into git's loose object storage.
$ cat /tmp/my-commit-signed | wc -c
    1027

$ echo -e "commit 1027\x00$(cat /tmp/my-commit-signed)" | sha1sum
d62016426c1b7b4125d47bad267aeaaa78bb817c  -

$ mkdir -p .git/objects/d6

$ echo -e "commit 1027\x00$(cat /tmp/my-commit-signed)" | pigz -z > .git/objects/d6/2016426c1b7b4125d47bad267aeaaa78bb817c
Okay, so now we've created our commit, but we haven't told git that our main branch points to this commit yet. That's fine, we just need to create the main branch reference file.
Okay, but what is a reference?
Well, an object is something that has data within it: it has a type and size and gets stored in .git/objects as either loose or packed files. A reference is just a hash, with no contents of its own. References live in .git/refs and there are 3 common types:
- local branches, which live in .git/refs/heads/, e.g. .git/refs/heads/main for the "main" branch
- remote branches, which live in .git/refs/remotes, e.g. .git/refs/remotes/origin/main for the "main" branch in the remote called "origin"
- lightweight tags, which live in .git/refs/tags, e.g. .git/refs/tags/v1.2.3 for the "v1.2.3" tag (as mentioned before, annotated tags are objects because they contain messages).
When you do git fetch, git checks the refs on the server to see whether they match the ones you've got in your .git/refs/remotes folder, and updates your local folder accordingly. When git says "Your branch is up to date with 'origin/main'.", it's just saying that your local branch has the same reference in it as the remote branch, i.e. .git/refs/heads/main is identical to .git/refs/remotes/origin/main.
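That "identical" check is simple enough to sketch by hand. The refs_match helper and the throwaway directory are made up for illustration, and it assumes both refs exist as plain files (a real repository may also keep refs in .git/packed-refs, which this doesn't look at).

```shell
# Hypothetical helper: do two loose ref files under a .git directory hold the same hash?
# Assumes both refs exist as plain files (it does not look at .git/packed-refs).
refs_match() {
    git_dir="$1"
    [ "$(cat "$git_dir/$2")" = "$(cat "$git_dir/$3")" ]
}

# A throwaway directory laid out like .git/refs, with made-up contents.
demo_git_dir="$(mktemp -d)"
mkdir -p "$demo_git_dir/refs/heads" "$demo_git_dir/refs/remotes/origin"
echo "d62016426c1b7b4125d47bad267aeaaa78bb817c" > "$demo_git_dir/refs/heads/main"
echo "d62016426c1b7b4125d47bad267aeaaa78bb817c" > "$demo_git_dir/refs/remotes/origin/main"

if refs_match "$demo_git_dir" refs/heads/main refs/remotes/origin/main; then
    echo "up to date"
else
    echo "local and remote differ"
fi
```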
So what's in one of these reference files? It's just a commit hash! We already know our commit hash so let's make our main branch and point it to our artisanally created commit:
$ echo "d62016426c1b7b4125d47bad267aeaaa78bb817c" > .git/refs/heads/main
Checking our work
We can use our good old porcelain commands to check our handiwork:
$ git status
On branch main
nothing to commit, working tree clean

$ git branch
* main

$ git log --show-signature
commit d62016426c1b7b4125d47bad267aeaaa78bb817c (HEAD -> main)
gpg: Signature made Wed 16 Jul 10:51:29 2025 BST
gpg: using RSA key 699C28CD9E5DFBE06992A666B445BCFA63263720
gpg: Good signature from "Drew Silcock <redacted>" [unknown]
gpg: WARNING: This key is not certified with a trusted signature!
gpg: There is no indication that the signature belongs to the owner.
Primary key fingerprint: 699C 28CD 9E5D FBE0 6992 A666 B445 BCFA 6326 3720
Author: Drew Silcock <redacted>
Date:   Wed Jul 16 10:45:27 2025 +0100

    Initial artisanal commit.
Nice! We did it!
Troubleshooting
If you get stuck with any of these points or get an error like error: bad tree object HEAD, try running git fsck --full and it's likely to tell you what's gone wrong.
Future Topics
It's taken long enough to get to this point, but there's a bunch of really interesting stuff that we didn't have a chance to talk about. Leave a comment, send me an email or shout into the wind which of the following you'd like to hear in a follow-up post:
- Stashes: how do they work? (Spoiler: they're basically just commits)
- Reflog: I want to know more about how I can use the hidden power of the reflog to bring my precious files back from the dead.
- Packfile format and indices: I want to hear more about these packfiles and how git looks through them.
- Index file format: I am deeply upset that you skipped over this and demand a full blog post covering the binary format in full detail bit-by-bit, otherwise I will be seeking legal action.
- Networking: how does git communicate with the server? What's the actual on-the-wire difference between using https://github.com/... vs. [email protected]:/... vs. [email protected]:...? I actually have no idea about this so it'd be interesting to explore.
Conclusion
While it was fun to learn how all these internals work, it's not a good idea to do this in an actual repository: you will break things, and it will make you sad.
Hopefully, you found this interesting. To be honest, if you made it this far, you either a) skipped over the rest to see whether the conclusion said anything interesting (sorry to disappoint) or b) are the kind of person who reads all the way through highly technical articles about artisanal hand-crafted git repositories. Either way, I hope you enjoyed the read and maybe even learned something.
If there's one thing to take away from this, it is the elegance of the design of git, and how it's actually not that complicated once you understand the underlying file formats. Yes, I know there are rebases and reflogs and all these more complicated things but git is not magic, and implementing a git clone (pun intended) from scratch in a language of your choice actually wouldn't be that hard! At least, if you ignore the more complex things.
Updates
- 2025-07-17: Added more details about packfiles, garbage collection and reflog.
Further reading
The main inspiration for this was reading Julia Evans' blog posts about git, so if you found this interesting, check out her posts about git. Then check out all the others too, they're all good.
- Julia Evans, Inside .git: https://jvns.ca/blog/2024/01/26/inside-git
- Julia Evans, In a git repository, where do your files live?: https://jvns.ca/blog/2023/09/14/in-a-git-repository--where-do-your-files-live-/
- Git Reference Documentation, The Git index file has the following format: https://git-scm.com/docs/index-format
- Unpacking Git packfiles: https://codewords.recurse.com/issues/three/unpacking-git-packfiles
- Abin Simon (@meain), What is in that .git directory?: https://blog.meain.io/2023/what-is-in-dot-git/
- Dulwich Project Documentation, Git File format: https://www.samba.org/~jelmer/dulwich/docs/tutorial/file-format.html
Footnotes
1. But not everything.
2. Some people still use Perforce and Mercurial and whatnot, I know. Perforce was still popular in game dev last time I was there (which was a while ago now).
3. Given SHA-1 is well distributed, git just picks one folder to check to determine how many loose objects are in it. Junio Hamano chose folder 17/. Don't ask me why. Maybe it's his favourite hexadecimal number?
4. Full disclosure: I am one of those people.