August 23, 2020

SVN to Git Migration

Recently, we decided to migrate one of our "legacy" products from an SVN repository to git, to be hosted on Bitbucket. The full history had to be maintained. While researching this topic, I was surprised to find out that git has built-in support for SVN!

Steps

Note that I'm only migrating the trunk in this post, but you can also migrate the full SVN structure including branches and tags.

  1. Get the list of users that have committed to SVN (from PowerShell): PS C:\MySVNRepo> svn log --quiet | ? { $_ -notlike '-*' } | % { "{0} = {0} <{0}>" -f ($_ -split ' \| ')[1] } | Select-Object -Unique | Out-File 'authors-transform.txt' -Encoding utf8
  2. Open the authors-transform.txt file generated above and add the users' email addresses. This step is optional, but git includes email addresses in commits, and having them helps the remote map commits to users after you push.
  3. "Clone" the SVN repo as a git repo using the git svn clone command, e.g. in bash: git svn clone http://<svn repo URL>/trunk --prefix=svn/ --no-metadata --authors-file "authors-transform.txt" /c/repos/migrated-repo --username <svn user name>. (It will prompt for the SVN password.)
  4. Create an empty repository on the git server, such as Azure Repos/Bitbucket/GitHub, and get the clone URL.
  5. Add the clone URL as origin in the git repo: git remote add origin <clone URL>
  6. Push to the git server: git push -u origin master

If you plan to use LFS, you can run git lfs migrate between steps 3 and 4.
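Step 1 above uses PowerShell. If you would rather stay in bash for the whole process, the same author list can be sketched with awk. The function name authors_from_log is made up for this example, and the sketch assumes the usual "r123 | user | date" layout of svn log --quiet output:

```shell
# authors_from_log (hypothetical helper): reads `svn log --quiet` output on
# stdin and prints one "user = user <user>" line per distinct committer.
# Assumes revision lines look like: r123 | alice | 2020-08-01 ...
authors_from_log() {
  awk -F' [|] ' '/^r[0-9]+ [|] / { print $2 " = " $2 " <" $2 ">" }' | sort -u
}

# Typical use, run inside the SVN working copy:
#   svn log --quiet | authors_from_log > authors-transform.txt
```

Unlike the PowerShell version, the output is sorted rather than kept in order of first appearance, which makes no difference for the authors file.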

Migration..?

Now, as you might know already, anything called a "migration" never goes smoothly. Below are some of the issues that I had to deal with.

Git 2.27.0

At the time, I was using git 2.27.0, and of course, git svn is broken in that release. I used the 2.28 release candidate, since 2.28 wasn't officially out yet. I suppose I could've downgraded as well.

Author Not Defined

In the middle of the process, it bailed out with the message Author: [user] not defined in authors-transform.txt file. The [user] was the first user listed in the file. I had used the -Encoding utf8 option in the PowerShell command, which wrote a byte order mark (BOM) at the start of the file, and git didn't like the BOM. I opened the file in Notepad++ and converted it to UTF-8 without BOM, and that did the trick.
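If Notepad++ is not at hand, the BOM can also be removed from bash. This is a minimal sketch: strip_bom is a name invented for this example, and it only assumes that head -c and tail -c are available, as they are in GNU coreutils:

```shell
# strip_bom FILE (hypothetical helper): remove a leading UTF-8 byte order
# mark (bytes EF BB BF) from FILE; files without a BOM are left untouched.
strip_bom() {
  file=$1
  bom=$(printf '\357\273\277')          # the three BOM bytes, in octal
  if [ "$(head -c 3 "$file")" = "$bom" ]; then
    # Keep everything from the 4th byte onward, then replace the original.
    tail -c +4 "$file" > "$file.nobom" && mv "$file.nobom" "$file"
  fi
}

# Typical use, before running git svn clone:
#   strip_bom authors-transform.txt
```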

Hanging, Timeouts and Other Errors

The SVN repo had a lot of revisions, riddled with large binaries. After getting some revisions, it hung: no activity for a while. In the Task Manager, perl was taking up about 50% of the CPU, and the folder was not growing. So after canceling the command, I cd'ed into the folder and ran git svn fetch --authors-file "../authors-transform.txt" as per this article, and that made it continue from where it left off. It detected that the last retrieved revision was not complete, and started over again from that revision. All errors below were resolved the same way:

  • Connection timed out: Connection timed out at C:/Program Files/Git/mingw64/share/perl5/Git/SVN/Ra.pm line 312
  • 1 [main] perl 44975 cygwin_exception: Dumping stack trace to perl.exe.stackdump
  • Failed to commit, invalid old:
  • Name or service not known at C:/Program Files/Git/mingw64/share/perl5/Git/SVN/Ra.pm line 312.
    (This can happen if your DNS goes down, yes, it happened.)
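Since every one of these errors was fixed by simply running git svn fetch again, the re-runs can be wrapped in a small retry loop. The sketch below is not what I ran at the time: retry_until_success is a hypothetical helper, and the attempt count and delay are arbitrary. It only helps with commands that exit with an error; a hang still has to be canceled by hand.

```shell
# retry_until_success MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# (hypothetical helper): re-run COMMAND until it exits with status 0,
# giving up after MAX_ATTEMPTS tries.
retry_until_success() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Typical use, from inside the partially cloned repo:
#   retry_until_success 20 30 git svn fetch --authors-file "../authors-transform.txt"
```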

Checksum Mismatch

This was a tricky one, and it occurred on random files. After many trials, I was able to avoid this error by running the git svn clone on the SVN server machine itself, e.g.:

git svn clone file:///c/SVNData/MyProject/trunk --prefix=svn/ --no-metadata --authors-file "authors-transform.txt" /c/gitrepo/my-project

I really don't know what the root cause is; maybe our internal network is unstable, or the SVN webserver (CollabNet) had some issues. It was a large SVN repo, though, and the clone took about 24 hours to complete.

New Changes in SVN

For this migration, I didn't have to worry about two-way support, since we were planning to retire SVN after migrating to git. But I did have to get some new changes from SVN after the initial migration to git.

I used git svn fetch as above to get the latest SVN changes into git. This is not enough, though: the fetched revisions land on the SVN remote-tracking branch, so I still needed to bring those changes into the master branch. This SO answer seems incorrect, as running git svn rebase -l gave the following error message after running for a while:

$ git svn rebase -l
Unable to determine upstream SVN information from working tree history

Simply running git merge remotes/svn/git-svn worked.

LFS

If you run LFS migrate, you may no longer be able to run git svn fetch again and do a merge to master to bring it up to the latest:

$ git merge remotes/svn/git-svn
fatal: refusing to merge unrelated histories

LFS migrate rewrites commits, so the commits get new hashes, and git will think the master branch and the svn remote branch are completely separate since they no longer share a common ancestor (remember, LFS rewrites the very first commit to add the .gitattributes file). There are ways around it, such as specifying the --allow-unrelated-histories option, but it may get ugly since all commits are technically different, and git will warn you about merging binaries as well:

$ git merge remotes/svn/git-svn --allow-unrelated-histories
warning: Cannot merge binary files: Libraries/MyLibrary.dll (HEAD vs. remotes/svn/git-svn)
warning: Cannot merge binary files: Libraries/MyLibrary.exe (HEAD vs. remotes/svn/git-svn)
...
...

One way to handle this is to make a backup before each step. Then you can go back to the point right before you ran LFS migrate, run git svn fetch, merge to master, and run LFS migrate again.
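A backup here can be as simple as a copy of the whole repo folder, .git included. As a rough sketch (snapshot_repo and the .bak naming are made up for this example):

```shell
# snapshot_repo REPO_DIR LABEL (hypothetical helper): copy the whole repo
# folder, including .git, to REPO_DIR.LABEL.bak and print the copy's path.
# Refuses to overwrite an existing snapshot with the same label.
snapshot_repo() {
  repo=${1%/}                 # drop a trailing slash, if any
  label=$2
  dest="$repo.$label.bak"
  if [ -e "$dest" ]; then
    echo "snapshot $dest already exists" >&2
    return 1
  fi
  cp -a "$repo" "$dest" && printf '%s\n' "$dest"
}

# Typical use, right before rewriting history:
#   snapshot_repo /c/repos/migrated-repo pre-lfs
#   (cd /c/repos/migrated-repo && git lfs migrate import ...)
```

To catch up with SVN later, restore the pre-lfs copy, run git svn fetch and git merge remotes/svn/git-svn there, take a fresh snapshot, and run LFS migrate again.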

August 18, 2020

More on Git LFS

This is a follow-up to the Git LFS Basics post, with some additional notes on LFS.

Partial Clone

If storage space on the server is not a concern (unlike with, e.g., Bitbucket's 2 GB hard limit), then partial clone and sparse checkout may replace the need for LFS. They are still in early stages, though.

Case Sensitivity

The Windows file system is case insensitive, but git is case sensitive, so there may be problems when specifying the tracking pattern. According to this open issue (which was based on an older issue), you can use patterns with character classes, e.g., git lfs migrate import --include="*.[dD][lL][lL], *.[eE][xX][eE], [Bb]in/". Specifying it as "*.dll, *.DLL" doesn't work as expected.
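Typing these patterns by hand gets tedious with more than a couple of extensions. Here is a small sketch that builds them; ci_glob is a hypothetical helper name, not part of git or git-lfs:

```shell
# ci_glob EXT (hypothetical helper): print a case-insensitive pattern for
# a file extension, e.g. "dll" becomes "*.[dD][lL][lL]". Characters with
# no upper/lower case variants (digits etc.) are copied through unchanged.
ci_glob() {
  ext=$1
  pattern='*.'
  while [ -n "$ext" ]; do
    rest=${ext#?}             # everything after the first character
    ch=${ext%"$rest"}         # the first character itself
    lower=$(printf '%s' "$ch" | tr '[:upper:]' '[:lower:]')
    upper=$(printf '%s' "$ch" | tr '[:lower:]' '[:upper:]')
    if [ "$lower" = "$upper" ]; then
      pattern="${pattern}${ch}"
    else
      pattern="${pattern}[${lower}${upper}]"
    fi
    ext=$rest
  done
  printf '%s\n' "$pattern"
}

# Typical use:
#   git lfs migrate import --include="$(ci_glob dll),$(ci_glob exe)"
```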

LFS File Size Report

LFS has a built-in feature that will go through the history and report on the file types and sizes.

$ git lfs migrate info
migrate: Fetching remote refs: ..., done.
migrate: Sorting commits: ..., done.
migrate: Examining commits: 100% (1681/1681), done.
*.dll   5.6 GB  2013/2013 files(s)      100%
*.exe   3.6 GB    546/546 files(s)      100%
*.dat   1.9 GB        2/2 files(s)      100%
*.zip   1.7 GB      16/16 files(s)      100%
*.war   595 MB      17/17 files(s)      100%

The thing is, it defaults to showing only the top five. To show more, use the --top option, e.g., git lfs migrate info --top=100

Create .gitattributes Before or After Migrate?

Per the migrate import documentation, it will create .gitattributes for you (except in certain cases, depending on the options passed in). One thing it doesn't tell you is where the file is added: it rewrites the very first commit in the history to include it (which is not surprising, since rewriting history is one of the main tasks for LFS).

Find Lingering Large Files After the Migration

After migrating to LFS, if you find that the repo is still too large, you may want to run git lfs migrate info again, but it won't show any files. Instead, you can use git ls-tree to find large files in the repository. For example: git ls-tree -r -l --abbrev --full-name HEAD | sort -n -r -k 4 | head -n 10
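The one-liner above prints the full ls-tree line for each file. If you only want size and path, and want paths containing spaces handled predictably, the same idea can be wrapped as follows. This is a sketch: top_files is a made-up name, and it relies on git ls-tree -l separating the path from the other fields with a tab:

```shell
# top_files [N] (hypothetical helper): reads `git ls-tree -r -l` output on
# stdin and prints "size<TAB>path" for the N largest files (default 10).
# Entries without a size (shown as "-", e.g. submodules) are skipped.
top_files() {
  awk -F'\t' '{
    split($1, meta, " ")      # meta = mode, type, hash, size
    if (meta[4] != "-") printf "%d\t%s\n", meta[4], $2
  }' | sort -t "$(printf '\t')" -k1,1nr | head -n "${1:-10}"
}

# Typical use:
#   git ls-tree -r -l HEAD | top_files 10
```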

Viewing Differences

There's no git lfs diff, and git log will show differences in "pointer" files, not the content. But in some cases, it might be useful to be able to view the differences. One way to do it is by using an external diff tool, e.g., git difftool HEAD^ HEAD Document.pdf (assuming you have difftool already configured). Note that in Bitbucket, if you browse to a source file tracked by LFS, it won't let you view the differences in the UI even if it's a "text" file. It only allows you to download the file.