Why is Git considered inefficient when handling binaries?

I am planning to migrate our SVN repository to Git, and I have heard a lot about Git being inefficient with binary files. But I don't understand what problems (besides repository size) I might actually run into, and this matters to us because we have a lot of binaries in our repository.

This is our scenario: We have one 800 MB repository that contains 2 directories:

  • src (300 MB)
  • libs (500 MB binaries)

This is the current size, excluding history (assume we are starting the Git repo from scratch, without any history).

The binaries never exceed 25 MB, most are below 10 MB, and they rarely change (2 or 3 times a year).

Can I expect problems with a repository like this when using Git? If the only issue is that the full history is stored in every local repository, then I don't expect it to grow that much, because those files don't change often.

But can Git's performance (when committing or checking the status) be affected by having such binaries in the repository? Can git subtree help here (by making the "libs" directory a subtree of the main repository)?

EDIT: I know I could use something like Maven to store these binaries outside the repository, but we have a constraint that forces us to keep these files together with the sources.

UPDATE: I ran a series of tests and concluded that Git is smart enough to diff the zip content and store a delta: for example, if I add a 20 MB zip file, then change one text file inside the zip, commit the new version, and run 'git gc', the repository size is almost unchanged (still about 20 MB). So I assume Git handles zip files fine. Can someone confirm this?
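Roughly, this is the kind of test I ran (the archive and file names are only illustrative):

    # start a throwaway repository and commit a ~20 MB archive
    git init zip-test && cd zip-test
    cp /path/to/big-lib.zip .
    git add big-lib.zip
    git commit -m "Add big-lib.zip"

    # add/replace one small text file inside the archive, commit again
    echo "small tweak" > note.txt
    zip big-lib.zip note.txt
    git add big-lib.zip
    git commit -m "Update one file inside big-lib.zip"

    # repack and check what the repository actually stores
    git gc
    git count-objects -v    # size-pack is reported in KiB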

+3
2 answers


The main problem you may run into is that every git repository contains the complete history of all files. Even though objects are packed efficiently, there is no easy way to do a "lightweight" checkout of just the subdirectory of source files you need to work on.

If you have 500 MB of binaries that change 2-3 times a year, then after three years you will be hauling around roughly 3+ GB of history (ok, it compresses a bit) whenever you clone the repository or copy it somewhere. This can be a little annoying.
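For what it's worth, you can check how much history a clone is carrying, and if that ever becomes painful, a shallow clone is a partial workaround (the repository URL below is a placeholder):

    # rough size of the history every full clone carries around
    du -sh .git

    # partial workaround: fetch only the latest revision instead of
    # the whole history (repository URL is a placeholder)
    git clone --depth 1 <repo-url>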



In my experience, git submodules aren't a huge help in this regard: you still end up with a git repository full of binaries (i.e. a large and growing repository), and submodules mostly complicate the workflow. A better approach is to avoid committing large binaries at all, for example by storing the sources used to build them (and possibly caching the built artifacts somewhere if building takes too long).
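For completeness, splitting libs out as a submodule would look roughly like this (both URLs are placeholders); it keeps the main repository smaller, but every clone and update gains an extra step:

    # in the main repository: track a separately hosted libs repository
    git submodule add https://example.com/libs.git libs
    git commit -m "Track libs as a submodule"

    # everyone cloning the project now has to pull the submodule too
    git clone --recurse-submodules https://example.com/main.git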

That said, git will certainly cope with your use case, so if you don't mind spending a little disk space, give it a shot.

+2


The main reason you see a size difference between Git and SVN is that they don't store data the same way.

SVN: svn uses deltas to store files. That is, the first time you check in a file, svn stores the whole file, and when you commit a modification, svn only stores the difference between the two versions. If I recall correctly, svn keeps the complete latest version of each file and stores the deltas backwards (reverse deltas). This is pretty fast when you have few revisions or when you only need the HEAD revision, but the more changes there are, the slower it gets to reach a specific older revision, since svn has to rebuild the file by applying deltas.

Git: Git works very differently from svn. At the object level it does not store deltas; it stores blobs (binary large objects), i.e. full snapshots of each file version. When you check in a file, the complete content is saved as a blob referenced from the commit's tree. If you commit without modifying the file, git simply references the existing blob from the previous commit (it is not stored again). If you change the file, git stores a complete new blob (packfiles may later delta-compress blobs against each other, which is what 'git gc' does). The advantage is that access is equally fast for every revision, but the repository can grow quite quickly.
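You can see this blob reuse directly: if a file is unchanged between two commits, both commits resolve to the same blob hash (the path below is just an example):

    # blob id of the same file in the two most recent commits
    git rev-parse HEAD:libs/foo.jar
    git rev-parse HEAD~1:libs/foo.jar
    # identical hashes mean the second commit reuses the existing blob
    # instead of storing the file content again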



I will not go into how to deal with binaries here, because that is already well covered on the web (and I'm sure it's on SO).

I hope this helps.

-1








