Removing files from Git history exclusively between two commits

I am trying to remove several large files from my history using filter-branch. I've used this command successfully before, but I'm now running into trouble with a specific edge case.

The problem is that these large files were never deleted; they were replaced with smaller versions at the same path.

As far as I can tell, this is an unusual problem.

Git Log

To clarify, here is a rudimentary view of my repo:

----- A ------ B ----------- HEAD

      

Where:

A is the commit where the large files were introduced
B is the commit (about 30 later) where the large files were replaced with smaller ones
HEAD is thousands of commits forward of B (~2 years of active development)

      

Git Filtering

In theory, I should be able to do something like this:

git filter-branch --index-filter 'git rm --cached --ignore-unmatch filenames' <parent of A>..B 

      

I believe I need to use <parent of A> because the commit at the start of the range is not itself included by filter-branch. (I'm not sure whether I also need to use the parent of B, but that is the least of my worries right now.)
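The reasoning about the lower bound can be checked with git rev-list, which uses the same range syntax: X..Y leaves out X itself. The following is a minimal sketch in a throwaway repository; the commit messages P, A, M, B and the helper mk are made up for illustration.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q .
git config user.name demo
git config user.email demo@example.invalid

# Hypothetical linear history: P <- A <- M <- B
mk() { git commit -q --allow-empty -m "$1"; }
mk P; mk A; mk M; mk B

a=$(git rev-parse HEAD~2)   # commit A
b=$(git rev-parse HEAD)     # commit B

from_a=$(git rev-list --count "$a..$b")          # M and B only; A is excluded
from_parent=$(git rev-list --count "$a~1..$b")   # A, M and B

echo "A..B covers $from_a commits, A~1..B covers $from_parent"
```

So starting the range at the parent of A is what brings A itself into the set of commits to rewrite.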

Running this gives me an error:

$ git filter-branch --index-filter 'git rm --cached --ignore-unmatch filenames' <parent of A>..B 
Which ref do you want to rewrite?

      

So I added --glob="refs/heads/master*" to the end of the command, which seems to do the trick (source).

Once the run finished, the files had been removed completely; it seems that Git ignored the upper bound I specified.

So I'm wondering whether this approach is possible at all.
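One plausible explanation for the lost upper bound, offered as an interpretation rather than something confirmed here: filter-branch passes its trailing arguments to git rev-list, and a positive ref such as the one matched by --glob adds everything reachable from that ref, so B no longer limits the set. The sketch below uses a throwaway repository with invented commits; the branch is named master only so that it matches the glob.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called master
git config user.name demo
git config user.email demo@example.invalid

# Hypothetical linear history: P <- A <- B <- C <- D (D is the tip of master)
mk() { git commit -q --allow-empty -m "$1"; }
mk P; mk A; mk B; mk C; mk D

p=$(git rev-parse HEAD~4)   # parent of A
b=$(git rev-parse HEAD~2)   # commit B

bounded=$(git rev-list --count "$p..$b")                                 # A and B
with_glob=$(git rev-list --count "$p..$b" --glob='refs/heads/master*')   # A, B, C and D

echo "range alone: $bounded commits, range plus glob: $with_glob commits"
```

In this toy history the glob pulls C and D back into the selection, which matches the files being removed from every commit up to the branch tip.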

Alternative approaches

I thought I should list some other ideas I had so potential answers could be focused on solving the problem.

  • The pragmatic approach would be to commit a filename change at HEAD and then run git filter-branch ... HEAD. However, my repository has several branches in active development, and I find this method very messy.
  • Another way might be something like what is described here. Quote: create a temporary branch to point at HEAD^, filter-branch it, then add a graft to stitch the remaining commit on top of it, then filter-branch HEAD and then remove the graft.

Hopefully someone has encountered this problem before and can provide their experience.

Update

The files I want to delete total ~500MB, so I really do want them gone! They were committed long before I joined the company and are left over from our transition from back-end Mercurial to GitHub (I guess a 500MB push to the back end was less noticeable than one to GitHub...).

Update 2

I am following the second suggestion in twalberg's answer (I think I am using it correctly):

git filter-branch --index-filter '(( $(git rev-list <SHA-of-child-of-B> --not $GIT_COMMIT | wc -l) > 0 )) && git rm --cached --ignore-unmatch <filenames>' 

      

This produces the output I expect:

...
Rewrite dc8a4b29463bfa43c2f3efe0c6e5a29a5cc6e0ef (1071/5680)rm 'file1'
rm 'file2'
rm 'file3'
rm 'file4'
...

      

Until it stops with an error (expected?):

Rewrite e6b712b57257e2edd0bb9fbbac59e4c9d7b5aa79 (1072/5680)index filter failed: (( $(git rev-list e6b712b --not $GIT_COMMIT | wc -l) > 0 )) && git rm -rf --ignore-unmatch <filename>

      

Where e6b712b is the child of B.
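The failure at the first commit past B is consistent with how the shell evaluates such a filter, although this is an interpretation and not something confirmed in the thread: a list of the form test && command returns the failing test's status when the test is false, and filter-branch is documented to abort when a filter exits non-zero. A minimal sketch with a stand-in condition follows; the literal 0 plays the role of the rev-list count, and no Git is involved.

```shell
# Stand-in for the filter once the count drops to 0: the test fails,
# the command after && is skipped, and the whole list exits non-zero.
sh -c '[ 0 -gt 0 ] && echo "git rm would run"' && and_status=0 || and_status=$?

# Same condition wrapped in if: a false condition with no else branch exits 0.
sh -c 'if [ 0 -gt 0 ]; then echo "git rm would run"; fi' && if_status=0 || if_status=$?

echo "&& form: $and_status, if form: $if_status"
```

If that reading is right, the run stopped at commit 1072 without finishing the rewrite, so the error would not mean that the earlier commits were successfully cleaned.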

At this point I assumed everything had worked, so I cloned the repository over the local filesystem to test it:

git clone file://<repo> <new repo>

      

The object count and the pack size have decreased very little, and I'm not sure why. Here is the output of git count-objects -v for the original repository and for the clone of the repository that had filter-branch run against it:

Original repository:

count: 0
size: 0
in-pack: 106640
packs: 1
size-pack: 815512
prune-packable: 0
garbage: 0

      

Filesystem clone of the filter-branch'ed repository:

count: 0
size: 0
in-pack: 96165
packs: 1
size-pack: 793656
prune-packable: 0
garbage: 0

      

I really don't know why this still doesn't work - maybe I'm not following the suggested answer correctly?


2 answers


Unfortunately, if you really want to remove those objects from your repository (as opposed to just removing them from current and future versions), filter-branch is the way to do it. And if you are going to rewrite commit A, every commit on every branch that includes A in its history must also be rewritten, since a commit's hash depends on the hash of each of its parents. If you don't rewrite all the branches that include A, then those objects are still legitimately part of some commit in your reachable history, and they won't be pruned.

For every branch BR that has A in its history, this should work:

git filter-branch --index-filter 'git rm --cached --ignore-unmatch filenames' BR --not A~1

      

which will rewrite from A (by cutting the branch off at A's parent) to the current tip of branch BR. It will remove the files from all of those commits, though, even after they were replaced with the newer, smaller versions. To remove them only up to commit B, you can extend the filter script like this:



... --index-filter '(( $(git rev-list <SHA-of-child-of-B> --not $GIT_COMMIT | wc -l) > 0 )) && git rm ...' ...

      

This uses rev-list to list all revisions after the commit currently being rewritten, up to and including the child of B, counts those lines, and only does the git rm if one or more revisions fall within that range (when $GIT_COMMIT == B, one line is printed, which is why the child of B must be used).
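A sketch of that filter in a throwaway repository, under several assumptions that are not from the answer: the history is linear, the file name big.bin and the commit messages are invented, the condition is written as an if so that the filter exits 0 on commits past B, and rev-list --count stands in for piping to wc -l. It is meant to show the shape of the approach, not a command to run on a real repository.

```shell
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the deprecation pause on newer Git
repo=$(mktemp -d) && cd "$repo"
git init -q .
git config user.name demo
git config user.email demo@example.invalid

# Hypothetical history: base <- A (big file) <- mid <- B (small file) <- child <- tip
mk() { git add -A && git commit -q -m "$1"; }
echo base > readme.txt;                  mk base
echo "pretend this is 500MB" > big.bin;  mk A
echo mid >> readme.txt;                  mk mid
echo small > big.bin;                    mk B
echo child >> readme.txt;                mk child
echo tip >> readme.txt;                  mk tip

big_blob=$(git rev-parse HEAD~4:big.bin)   # blob ID of the large version
child_of_b=$(git rev-parse HEAD~1)
export child_of_b                          # make it visible inside the filter

# The filter runs once per commit. It removes big.bin only while the
# original commit is a strict ancestor of child-of-B, and exits 0 otherwise.
git filter-branch -f --index-filter '
  if [ "$(git rev-list --count "$child_of_b" --not "$GIT_COMMIT")" -gt 0 ]; then
    git rm -q --cached --ignore-unmatch big.bin
  fi
' HEAD >/dev/null

# The large blob should no longer be reachable from the rewritten branch,
# while the small version is still present at the tip.
leftover=$(git rev-list --objects HEAD | grep -c "$big_blob" || true)
tip_content=$(git show HEAD:big.bin)
echo "large blob references left: $leftover; tip has: $tip_content"
```

On a real history the revision arguments from the earlier command (BR --not A~1) would take the place of HEAD. Note that this form also strips the files from B itself; in a linear history, comparing against B instead of its child would leave B's smaller files in place.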

This is a pretty big change even for a single branch, and a lot of work if you have many branches that were created at or after A, so you'll have to decide whether it's worth it in the end, or whether you just need a larger disk (you didn't specify exactly how big these files are).



A     is the commit where the large files were introduced
B     is the commit (about 30 later) where the large files were replaced 
      with smaller ones
HEAD  is thousands of commits forward of B (~2 years of active development)

      

Given that, I strongly recommend against filter-branch, as I believe it will rewrite two years' worth of SHAs. Maybe another solution would be git revert:



git revert SHA_A~1..SHA_B
    Revert the changes done by commits from commit SHA_A (included) to
    SHA_B (included)

      
