July 19, 2020

Git LFS Basics

In most cases, storing large files in git is not best practice: git is designed to store source code, not large binaries. Sometimes it's necessary anyway, for example in legacy projects and processes where you don't want to invest a lot of time and effort, in a game project with large assets, in an ML project with a large set of training data, or when you're using Bitbucket Cloud, which has a hard 2 GB repo size limit. Git LFS lets you "store" large files within whatever git workflow you're using, without git's inherent "inefficiencies" with large files.

Beware that if you decide to go with LFS, you'll lose some of git's distributed nature: the content of some files has been moved to the LFS server and is no longer part of the repo, and when you clone the repo, it only pulls down the LFS files needed to check out the master branch. Be sure to have a backup policy in place.

Here's a high level animation of how Git LFS works:

As for what LFS does, the man page explains it pretty well:

$ git lfs
Git LFS is a system for managing and versioning large files in
association with a Git repository.  Instead of storing the large files
within the Git repository as blobs, Git LFS stores special "pointer
files" in the repository, while storing the actual file contents on a
Git LFS server.  The contents of the large file are downloaded
automatically when needed, for example when a Git branch containing
the large file is checked out.
...
...

Let's go over some of the basics of Git LFS by doing the same operation with, and without, LFS.

Installation

If you have an older version of git and can't upgrade for some reason, you'll need to install Git LFS first. If you're on Windows, as of version 2.12.0, Git LFS is already bundled with Git for Windows.

Initialize and Configure Git LFS

Let's start from two brand new repos in the remote (such as Azure Repos, Bitbucket, GitHub), and clone them into local folders. In the first repo, we'll enable LFS. In the second repo, we'll leave it alone so that we can compare normal git with LFS.

In the first local repo, let's initialize and configure Git LFS:

$ git lfs install
Updated git hooks.
Git LFS initialized.

As you can see from the message, it uses git hooks to perform LFS operations. It modifies 4 hook files – post-checkout, post-commit, post-merge, and pre-push. Here's what pre-push looks like:

#!/bin/sh
command -v git-lfs >/dev/null 2>&1 || { echo >&2 "\nThis repository is configured for Git LFS but 'git-lfs' was not found on your path. If you no longer wish to use Git LFS, remove this hook by deleting .git/hooks/pre-push.\n"; exit 2; }
git lfs pre-push "$@"

Next, let's assume we have large DLL files that we need to keep in our repo, so we'll tell LFS to handle, or "track", DLL files.

$ git lfs track '*.dll'
Tracking "*.dll"

(Note that we're using quotes around *.dll to prevent the shell from expanding the pattern into the names of matching files.)

This creates a .gitattributes file with the following content:

$ cat .gitattributes
*.dll filter=lfs diff=lfs merge=lfs -text

According to Git LFS documentation, the pattern follows gitignore rules.
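
To make the .gitattributes line a little more concrete, here is a rough Python sketch that splits it into a pattern and its attributes, then checks whether a path would be routed through the LFS filter. The function names are made up for illustration, and the matching uses fnmatch on the file name only, which is an assumption that approximates simple patterns such as *.dll rather than the full gitignore rules.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def parse_gitattributes_line(line):
    """Split one .gitattributes line into (pattern, attributes)."""
    pattern, *tokens = line.split()
    attrs = {}
    for token in tokens:
        if token.startswith("-"):
            attrs[token[1:]] = False       # "-text" unsets the attribute
        elif "=" in token:
            key, value = token.split("=", 1)
            attrs[key] = value             # "filter=lfs" sets a value
        else:
            attrs[token] = True            # a bare name sets the attribute
    return pattern, attrs


def is_lfs_tracked(path, line):
    """Approximate check: does this line send `path` through the LFS filter?"""
    pattern, attrs = parse_gitattributes_line(line)
    # Assumption: a slash-free pattern is compared against the file name only.
    name = PurePosixPath(path).name
    return attrs.get("filter") == "lfs" and fnmatchcase(name, pattern)


line = "*.dll filter=lfs diff=lfs merge=lfs -text"
print(parse_gitattributes_line(line))
print(is_lfs_tracked("bin/MyLibrary.dll", line))   # True
print(is_lfs_tracked("README.md", line))           # False
```

The filter=lfs attribute is the part that hands the file to LFS; diff, merge, and -text only tell git not to treat the file as ordinary text.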

Commit and push the changes we made above.

Adding a Large File

Now, let's add a DLL to both repos, say a DLL called MyLibrary.dll which is about 1.6 MB. At this point, the two repos are about the same size, except for a few extra bytes to account for files such as .gitattributes.

                With LFS     Without LFS
Folder Size:    ~1.6 MB      ~1.6 MB
count-objects
  count:        6            3
  size:         861 bytes    540 bytes

(The Folder Size is from Windows Explorer, and count-objects is from the git count-objects -vH command.)
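
If you're not on Windows, a small script can stand in for the Explorer folder size. The following is a minimal sketch, not the method used for the numbers in this post: dir_size is a hypothetical helper that sums the sizes of regular files, so its result may differ slightly from what a file manager reports.

```python
import os
import tempfile


def dir_size(root):
    """Total size in bytes of all regular files under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):   # ignore symlinks
                total += os.path.getsize(full)
    return total


# Self-check on a throwaway directory that mimics a repo layout.
with tempfile.TemporaryDirectory() as tmp:
    lfs_dir = os.path.join(tmp, ".git", "lfs", "objects")
    os.makedirs(lfs_dir)
    with open(os.path.join(tmp, "MyLibrary.dll"), "wb") as f:
        f.write(b"\0" * 1000)
    with open(os.path.join(lfs_dir, "blob"), "wb") as f:
        f.write(b"\0" * 1000)
    demo_total = dir_size(tmp)
    demo_lfs = dir_size(os.path.join(tmp, ".git", "lfs"))

print(demo_total, demo_lfs)   # 2000 1000
```

Pointing it at .git/objects and .git/lfs/objects separately shows where the bytes of a repo actually live.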

Stage the DLL file with git add.

                With LFS     Without LFS
Folder Size:    ~3.22 MB     ~2.37 MB
count-objects
  count:        7            4
  size:         986 bytes    774.87 KiB

Now things are getting interesting...

  • So why is the LFS repo bigger? Git compresses objects, but LFS doesn't, so the LFS repo now holds the ~1.6 MB working copy plus an uncompressed ~1.6 MB copy under .git, while the non-LFS repo holds the working copy plus a compressed copy. An issue was opened in 2015 to address this, but it hasn't been implemented yet; it's still on their roadmap. (In a way it makes sense: large files are often compressed already, so compressing them again may waste time and resources and may even yield a bigger file. Perhaps more granular control is needed here.)
  • Git's count-objects size is much smaller in the LFS repo. This is roughly what git itself will push.
  • The 774 KiB for the non-LFS repo is about the size of the DLL compressed with zlib.
  • For LFS, the file object is stored in .git/lfs/objects, not in .git/objects.
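
The point about already-compressed files can be illustrated with a short Python sketch. It uses zlib, the same compression family mentioned above, on two made-up inputs: repetitive bytes standing in for compressible content, and seeded pseudo-random bytes standing in for a file that is already compressed. The inputs and names are assumptions for illustration, not measurements of the DLL from this post.

```python
import random
import zlib

# Highly repetitive bytes stand in for compressible content.
repetitive = b"MyLibrary " * 10_000

# Seeded pseudo-random bytes stand in for an already-compressed file.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(100_000))


def ratio(data):
    """Compressed size divided by original size, at zlib's default level."""
    return len(zlib.compress(data)) / len(data)


ratio_repetitive = ratio(repetitive)
ratio_noise = ratio(noise)
print(f"repetitive: {ratio_repetitive:.3f}")
print(f"noise:      {ratio_noise:.3f}")
```

The repetitive input shrinks to a small fraction of its size, while the noise input stays at roughly its original size, which is the case where compressing again buys nothing.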

Let's commit and push to remote.

With LFS
$ git push
Uploading LFS objects: 100% (1/1), 1.7 MB | 89 KB/s, done.
Enumerating objects: 4, done.
Counting objects: 100% (4/4), done.
Delta compression using up to 4 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 437 bytes | 218.00 KiB/s, done.
Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
To https://bitbucket.org/[organization]/lfs-demo.git
   09e450f..753cc3d  master -> master

Without LFS
$ git push
Enumerating objects: 4, done.
Counting objects: 100% (4/4), done.
Delta compression using up to 4 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 717.60 KiB | 6.71 MiB/s, done.
Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
To https://bitbucket.org/[organization]/without-lfs-demo.git
   23d4243..b21091d  master -> master

Note that there's an extra step for uploading LFS objects in the LFS repo, and the push size to remote is much smaller in the LFS repo as we've seen from count-objects before.

Most git providers such as Azure Repos, Bitbucket, and GitHub have built-in support for LFS. You can also use a separate LFS server.

Cloning LFS Repo

One benefit of using LFS is that you can clone and work with a repo without downloading every historical version of the large files, because LFS downloads file contents only as needed. What git itself stores for an LFS-tracked file is a small reference, or "pointer", file that points to the LFS object. Let's see a bit of how that works.

Let's overwrite the DLL with a much smaller one, say ~49 KB, then commit and push.

With LFS
$ git commit -m "Replaced with smaller DLL."
[master 7c9b2ab] Replaced with smaller DLL.
 1 file changed, 2 insertions(+), 2 deletions(-)

Without LFS
$ git commit -m "Replaced with smaller DLL"
[master 2286636] Replaced with smaller DLL
 1 file changed, 0 insertions(+), 0 deletions(-)
 rewrite MyLibrary.dll (99%)

Note that the messages are a bit different. That's because, in the LFS repo, the file that changed as far as git is concerned is the pointer file. If you run git show, you will see something like the output below, which shows the content of the pointer file and how it changed. The hash (oid) is what links the file in git to the object in LFS storage.

diff --git a/MyLibrary.dll b/MyLibrary.dll
index 91c5966..d8b338d 100644
--- a/MyLibrary.dll
+++ b/MyLibrary.dll
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:3dce36d583ba1c741e95df1a265e47f0de581bef77ab48165dd67266be7a42ef
-size 1677824
+oid sha256:2b615798c36b1996093d44e77eb5306b4db9260546ce5aa2d3f7dde23476586b
+size 49664
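
The three lines of the pointer file in the diff above are simple enough to reproduce by hand. Here is a minimal Python sketch, assuming the oid is the SHA-256 of the file content and the size is its length in bytes, as the diff suggests. make_pointer and parse_pointer are hypothetical helper names, not part of Git LFS.

```python
import hashlib

POINTER_VERSION = "https://git-lfs.github.com/spec/v1"


def make_pointer(content: bytes) -> str:
    """Build pointer-file text for the given file content."""
    oid = hashlib.sha256(content).hexdigest()
    return f"version {POINTER_VERSION}\noid sha256:{oid}\nsize {len(content)}\n"


def parse_pointer(text: str) -> dict:
    """Read the version, oid, and size fields back out of pointer text."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    return {
        "version": fields["version"],
        "oid": fields["oid"].split(":", 1)[1],   # drop the "sha256:" prefix
        "size": int(fields["size"]),
    }


pointer = make_pointer(b"hello")
print(pointer, end="")
info = parse_pointer(pointer)
print(info["size"])   # 5
```

Because the pointer depends only on the content, replacing the DLL changes the oid and size lines, which is the two-line change git reported in the commit.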

Now, let's clone the repositories to new folders, and see what the sizes are.

                With LFS     Without LFS
Folder Size:    ~122 KB      ~872 KB
count-objects
  count:        12           9
  size:         1.68 KiB     801.93 KiB

So there you have it: the LFS clone is much smaller, as it only downloaded the version of the DLL in the latest commit.

At this point, if you look at .git/lfs/objects/, there should be one folder, in my case 2b, and inside that a folder named 61. Inside that folder is a single file, in my case 2b615798c36b1996093d44e77eb5306b4db9260546ce5aa2d3f7dde23476586b, at about 49 KB. This is the actual MyLibrary.dll file stored by LFS. Note how the folder names match the beginning of the file name, which is the hash from the pointer file.
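
The folder layout described above can be written down as a tiny Python sketch. It assumes the layout is exactly what was observed here, with the first two and next two characters of the hash as folder names. lfs_object_path is a hypothetical helper name.

```python
from pathlib import PurePosixPath


def lfs_object_path(oid: str) -> PurePosixPath:
    """Where the object for `oid` sits under .git, per the layout seen above."""
    return PurePosixPath(".git/lfs/objects") / oid[:2] / oid[2:4] / oid


oid = "2b615798c36b1996093d44e77eb5306b4db9260546ce5aa2d3f7dde23476586b"
path = lfs_object_path(oid)
print(path)
# .git/lfs/objects/2b/61/2b615798c36b1996093d44e77eb5306b4db9260546ce5aa2d3f7dde23476586b
```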

To view which hash the file corresponds to, use git lfs ls-files:

$ git lfs ls-files
2b615798c3 * MyLibrary.dll

What would happen if we check out the previous commit, the one that had the larger DLL?

In my case, the previous commit's hash is a4febdc, so if we check out that commit with git checkout a4febdc, the folder size grows to about 3.27 MB. There's a new folder under the .git/lfs/objects folder, storing the DLL of this commit: in my case, .git/lfs/objects/3d/ce/3dce36d5...., at 1.6 MB.

If we go back to the latest commit that has the smaller DLL, will LFS delete the old one from .git/lfs/objects? No, but you can run git lfs prune, which will delete the file from .git/lfs/objects/3d/ce (though it doesn't seem to delete the folders, just the file).

$ git lfs prune
prune: 2 local object(s), 1 retained, done.
prune: Deleting objects: 100% (1/1), done.

Now you may wonder: after cloning a repo that uses LFS, do I need to run git lfs install again? The answer is no, because I'm running git version 2.27. As per this commit, LFS clone support is built into git as of git version 2.15 (released in October 2017), so it will update the hook files and such. Note that the documentation for git lfs clone has not been updated with this information.

Some Concerns

Azure Repos LFS Interface

In Azure Repos, there doesn't seem to be a UI to view and manage LFS objects as of now. There is a suggestion, but it's closed.

Bitbucket Cloud offers a dedicated UI to manage LFS objects, such as deleting them.

Maximum File Size

So LFS can support large files, but there might be a limit on the maximum size of a single file:

  • GitHub enforces a 4 GiB size limit on the Team plan, and 5 GiB on Enterprise.
  • Azure Repos LFS doesn't seem to have a documented limit.
  • Bitbucket doesn't have a limit, as long as you pay for storage capacity ($10 per 100 GB increment). As per their documentation: "Note that there's no limit on the LFS file size you can push to Bitbucket Cloud."
  • Note that in most cases the storage limit is across the organization/account, not per repo.

Push Size Limit

  • Some remotes may also limit the size of a single push. If you encounter errors while pushing to the remote, you may need to change some configuration, such as running git config http.version HTTP/1.1.

Pipelines

  • Some remotes may require special instructions when using LFS from a pipeline or build process, such as in Azure Pipelines.

Deleting LFS Objects from Remote

Since LFS is built to support git's workflow, where all history is stored, it makes sense that you need to be cautious when deleting LFS objects. For GitHub, and also for Azure Repos and Bitbucket, reclaiming LFS storage requires deleting the entire repo. In Bitbucket, as seen above, you can use the UI to delete individual objects, but this will break your repo unless you also clean up the git history accordingly.

Other Limitations

Additional notes can be found in the follow-up post.