Jessitron

Finding and removing large files in git history

Sometimes it behooves one to erase commits of large binaries from history. Binaries don’t play nicely in git. They slow it down. This post will help you find them and remove them like they were never there.

In svn, people only download the latest version of the repository, so if you commit a large binary, you can delete it in a future commit and the rest of your team is unharmed. In git, everyone who clones a repo gets the entire history, so they’re stuck downloading every version of every binary ever committed. Yuck!

Therefore, the svn-to-git conversion is a good time to delete all the large binaries from history. Do this before anyone has cloned the repository, and before you push all the commits to a shared place like Bitbucket or GitHub.

Caution: Never alter commits that are in your repo and someone else’s, if you ever plan to talk to their repo again.

Step 1: Identify the large files. We need to search through all of the history to find the files that are good candidates for deletion. As far as I can tell, this is nontrivial, so here is a complicated command that lists the sum of the sizes of all revisions of files that are over a million bytes. Run it on a mac.

git rev-list master | while read rev; do git ls-tree -lr $rev | cut -c54- | grep -v '^ '; done | sort -u | perl -e '
  while (<>) {
    chomp;
    @stuff=split("\t");
    $sums{$stuff[1]} += $stuff[0];
  }
  print "$sums{$_} $_\n" for (keys %sums);
' | sort -rn >> large_files.txt

Please replace master with a list of all branches you care about.
This command says: List all commits in the history of these branches. For each one, list all the files; descend into directories recursively; include the size of the file. Cut out everything before the size of the file (which starts at character 54). Anything that starts with space is under a million bytes, so skip it. Now, choose only the unique lines; that’s approximately the unique large revisions. Sum the sizes for each filename, and output these biggest-first. Store the output in a file.

If this works, large_files.txt will look something like mine:

186028032 AccessibilityNative/WindowsAccessibleHandler/WindowsAccessibleHandler.sdf
94973848 quorum/installers/windows/jdk-7u21-windows-x64.exe
93300120 quorum/installers/windows/jdk-7u21-windows-i586.exe
84144520 quorum/installers/windows/jdk-7-windows-x64.exe
83345288 quorum/installers/windows/jdk-7-windows-i586.exe
57712115 quorum/Run/Default.jar

Yeah, let’s not retain multiple versions of the jdk in our repository.
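The step 1 pipeline leans on the size column starting at character 54, which holds when object names are 40-character SHA-1 hashes. As a sketch of another route, the function below asks git for every object reachable from any ref and lets git cat-file report each one’s type and size. The name list_large_blobs and the 1000000 default are made up for illustration. Each blob is counted once, under the first path git finds it at, and the output keeps the same “size path” shape as large_files.txt.

```shell
# Hypothetical helper (not from the original post): sum the sizes of all
# blobs of at least MIN_BYTES, grouped by path, biggest total first.
# Run it from inside the repository.
list_large_blobs() {
  min_bytes="${1:-1000000}"
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk -v min="$min_bytes" '
      $1 == "blob" && $2 >= min {
        size = $2
        sub(/^[^ ]+ [^ ]+ /, "")   # drop type and size, keep the path
        sums[$0] += size
      }
      END { for (path in sums) printf "%.0f %s\n", sums[path], path }
    ' |
    sort -rn
}
```

Something like `list_large_blobs > large_files.txt` would then stand in for the command in step 1, and `list_large_blobs 50000000` would raise the threshold to fifty million bytes.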

Step 2: Decide which large files to keep. For any file you want to keep in the history, delete its line from large_files.txt.

Step 3: Remove them like they were never there. This is the fun part. If large_files.txt is still in the same format as before, do this:

git filter-branch --tree-filter 'rm -rf `cat /full/path/to/large_files.txt | cut -d " " -f 2`' --prune-empty -- master

As in step 1, replace master with the branches you care about. This says: Create an alternate universe with a history that looks like those branches, except for each commit, take its files and remove everything in large_files.txt (which contains the filename in the second space-delimited field). Drop any commits which only affected files that don’t exist anymore. Point the branches at this new version of history.
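A tree filter checks out every commit, which gets slow on a long history, and `cut -f 2` truncates any path that contains a space. Here is a sketch of a variant that swaps in --index-filter with git rm --cached, assuming large_files.txt still has one “size path” entry per line. The function name remove_listed_files is invented for this example. Newer versions of git print a warning that recommends git filter-repo and pause before filter-branch starts.

```shell
# Hypothetical helper (not from the original post): remove every path listed
# in a "<size> <path>" file from the history of the given branches.
# Run it from the top level of the working tree.
remove_listed_files() {
  list_file="$1"
  shift
  [ -s "$list_file" ] || { echo "nothing listed in $list_file" >&2; return 1; }
  paths_file="$(mktemp)"
  # Keep everything after the first space so paths containing spaces survive.
  cut -d ' ' -f 2- "$list_file" > "$paths_file"
  # The filter runs once per commit and edits only the index, not the files.
  git filter-branch --prune-empty --index-filter \
    "tr '\\n' '\\0' < '$paths_file' | xargs -0 git rm -r -q --cached --ignore-unmatch --" \
    -- "$@"
  status=$?
  rm -f "$paths_file"
  return "$status"
}
```

A call would look like `remove_listed_files /full/path/to/large_files.txt master`, with master replaced by your own branches.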

Whew. If this worked, then when you push to a brand-new repository for sharing, those binaries won’t go. Not in the current revision, not in any history. It is like they were never there.
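Before pushing, it can be reassuring to confirm that nothing on the list is left in the rewritten history. This is a minimal sketch, again assuming the “size path” format of large_files.txt; check_paths_gone is a hypothetical name. It asks git rev-list whether any commit on the given branches still touches each path.

```shell
# Hypothetical helper (not from the original post): report every listed path
# that some commit on the given branches (default: HEAD) still touches.
check_paths_gone() {
  list_file="$1"
  shift
  [ "$#" -gt 0 ] || set -- HEAD
  cut -d ' ' -f 2- "$list_file" | while IFS= read -r path; do
    # rev-list prints a commit only if the path appears somewhere in history.
    if [ -n "$(git rev-list -n 1 "$@" -- "$path")" ]; then
      echo "still in history: $path"
    fi
  done
}
```

No output from `check_paths_gone /full/path/to/large_files.txt master` means none of the listed files remain on that branch.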

————————-

OH GOD WHAT DID I DO: If you change your mind or mess up, you can undo this operation.
First, look at the history of where your branch has pointed recently (bbm2 is my branch; use yours):
git reflog bbm2

Here’s my output:

→ git reflog bbm2
e9429a7 bbm2@{0}: filter-branch: rewrite
08d7da5 bbm2@{1}: branch: Created from HEAD

The top line is the filter-branch I just did. The line below it lists the tip of the branch before that crazy filter operation.
I can do git log 08d7da5 to check on it, and git ls-tree 08d7da5 to see what’s in it. (If you want all the files to be listed, then git ls-tree -r 08d7da5.)

When I’m sure I want to undo the filter-branch, then:
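git checkout bbm2
git reset @{1}

will put the branch riiiight back where it was. A plain reset moves the branch and the index; add --hard if you also want the files in your working directory restored. If you don’t like the weird @{1} notation, you can use the specific commit name instead (08d7da5 in my case), and tell the branch exactly where you want it to be.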
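The undo can be wrapped up as a small function. This is a sketch under a couple of assumptions: the name undo_branch_move is invented, and it uses reset --hard, so it throws away any uncommitted changes. It prints the commit it is about to restore so you can compare it against the reflog first.

```shell
# Hypothetical helper (not from the original post): move BRANCH back to where
# it pointed STEPS reflog entries ago (default 1) and restore the working tree.
undo_branch_move() {
  branch="$1"
  steps="${2:-1}"
  target="$(git rev-parse --verify --quiet "$branch@{$steps}")" || {
    echo "no reflog entry $branch@{$steps}" >&2
    return 1
  }
  echo "resetting $branch to $target"
  git checkout -q "$branch" && git reset -q --hard "$target"
}
```

Right after a filter-branch, `undo_branch_move bbm2` would point bbm2 back at the commit shown as bbm2@{1}.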

It’s important to feel safe to experiment. In git, as long as it was committed within the last 30 days (the default reflog expiry for commits no longer reachable from a branch), you won’t lose it.
