Why does a no-op filter-branch diverge, and how do I fix it?

I have a repository into which I merged a couple of years' worth of commits. One of those commits had a commit message that included the Address Sanitizer log associated with the fix.

That doesn't sound too bad, except that Address Sanitizer logs look like this:

==10856==ERROR: AddressSanitizer: heap-buffer-overflow on address
0x62a00000b201 at pc 0x47df61 bp 0x7fffffff2ca0 sp 0x7fffffff2c98
READ of size 1 at 0x62a00000b201 thread T0
#0 0x47df60 in Expand_Series ../src/core/m-series.c:145
#1 0x47e5a7 in Extend_Series ../src/core/m-series.c:187
#2 0x466e0c in Scan_Quote ../src/core/l-scan.c:462
#3 0x46a797 in Scan_Token ../src/core/l-scan.c:918
#4 0x46e263 in Scan_Block ../src/core/l-scan.c:1188
...

      

And in this case, it goes up to #250 or so. GitHub scans for #XXX patterns, and if one matches an issue number, it adds a note to that issue saying the commit mentioned it. So GitHub thinks this commit references every issue and pull request, and will keep thinking so for some time.

I figured I would just git filter-branch it, since I don't mind breaking history (I already had to run a filter-branch to get rid of some things I didn't want). However, I ran that other filter-branch before I did the merge and continued working. Now that I've noticed this showing up on GitHub, I'd like to go back and rewrite the message, and I don't mind if every commit on every branch after that point gets a new hash. That's fine with me.

That rewrite should work, but I can't figure out why it diverges so much. It seems to rewrite things that come before the commit whose message I changed. As a simple test, I tried what I thought should be a no-op:

git filter-branch -f --msg-filter 'sed "s/a/a/g"' -- --all

      

I'm no sed expert, but my understanding is that this goes through all the commit messages and replaces a with a. (Ayn Rand would be pleased.)

It doesn't diverge by as many commits as my actual substitution does ... 600 instead of 1000. But the fact that it diverges at all suggests I have some kind of misunderstanding here. How can I rewrite one commit's message in history without disturbing any commits other than those after it ... and have it affect all branches?


2 answers


There is an extra header field that could be the culprit (and was in my case). Consider:

$ git cat-file -p 20b9cd59c6c6a1a2bccfb2ddb9af68c083a28698
tree dee80bcd856b23aceb8946473bf64d9aef0fe629
parent b12dc8b9388dc0a2ae34563426043a612d296195
author XXX <xxx@example.com> 1355477802 +0200
committer XXX <xxx@example.com> 1355478447 +0200
encoding cp1251

Add (literally) three characters to one file that will
inadvertently create hours of fun for people years later.

      

It is the encoding header, in this case Windows-1251. The person who found it summed it up:



msg-filter receives the raw message, without the encoding meta-information. So even if you use an 8-bit-transparent msg-filter (like plain cat), the re-created commit will not contain this encoding meta-information.

(This is slightly imprecise: the filter can obtain the encoding information, since it could look up the original commit via the GIT_COMMIT environment variable. It is the output side that has no control over the encoding. At least, not that I know of ...)

He fixed the overall mess in our particular situation using a graft point. That is above my current git knowledge, so I won't try to explain it.
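To check whether a repository contains any commits with such a header, something like the following might help. It is only a sketch: the function names has_encoding_header and list_commits_with_encoding are made up for illustration, and it assumes it is run inside the repository in question.

```shell
# Succeeds if the raw commit object on stdin has an "encoding" header.
# Only the header (everything up to the first blank line) is examined,
# so the word "encoding" inside the message body does not count.
has_encoding_header() {
    sed '/^$/q' | grep -q '^encoding '
}

# Print the hash of every commit reachable from any ref that carries
# an encoding header. (Hypothetical helper; run inside the repository.)
list_commits_with_encoding() {
    git rev-list --all | while read -r commit; do
        if git cat-file -p "$commit" | has_encoding_header; then
            echo "$commit"
        fi
    done
}
```

Calling list_commits_with_encoding before running filter-branch would show which commits are at risk of being re-created without their encoding header.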



If an existing message does not end with a newline, sed will add one (at least some versions of sed do, including the one I tested here):

$ printf 'foo\nbar'
foo
bar$ printf 'foo\nbar' | sed 's/a/a/'
foo
bar
$ 

      

which means your test message filter may have changed a message. Based on your results, I would guess that at least one commit, about 600 commits back from the tip of some branch(es), was changed this way. (I've seen this very problem before.)

(Another possibility is some sort of Unicode normalization, although I haven't seen that from sed.)
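One way to test whether a candidate filter is safe is to feed it a message with no trailing newline and compare byte counts. This is a rough sketch: the helper name filter_alters_unterminated is invented here, and it only detects changes in length, not other kinds of rewriting.

```shell
# Succeeds (returns 0) if the given filter command changes the byte
# count of a message that lacks a trailing newline.
# Usage: filter_alters_unterminated <command> [args...]
filter_alters_unterminated() {
    # $(( ... )) strips any padding that some wc implementations emit.
    before=$(( $(printf 'foo\nbar' | wc -c) ))
    after=$(( $(printf 'foo\nbar' | "$@" | wc -c) ))
    [ "$before" -ne "$after" ]
}
```

For example, `filter_alters_unterminated sed 's/a/a/' && echo "not a no-op"` reports whether the local sed appends a newline. The result depends on the sed implementation, whereas plain cat should pass bytes through unchanged.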

Assuming that's the case, the trick is to find a command that doesn't affect the other commits. One good approach is to use the environment variable $GIT_COMMIT to identify the commit to touch, and to make sure you do something that really is a no-op for all other commits (cat as the msg-filter may work better than sed, for example):

... --msg-filter 'if [ "$GIT_COMMIT" = <the one> ]; then fix_msg; else cat; fi' ...

      

As for getting this to affect all branches, your -- --all should already do the trick.
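Put together, the targeted rewrite might look roughly like this. Treat it as a sketch under assumptions: TARGET_COMMIT is a placeholder hash, the sed expression (turning "#12" into "frame 12") is just one possible way to defuse the issue references, and rewrite_one_message is a hypothetical wrapper name.

```shell
# Placeholder: replace with the full hash of the commit to fix.
TARGET_COMMIT=0000000000000000000000000000000000000000

# Filter script run by filter-branch for each commit. Only the target
# commit goes through sed; every other message passes through cat.
# The sed expression is an assumption: it rewrites "#<digit>" as
# "frame <digit>" so GitHub no longer sees issue references.
MSG_FILTER='
if [ "$GIT_COMMIT" = "'"$TARGET_COMMIT"'" ]; then
    sed "s/#\([0-9]\)/frame \1/g"
else
    cat
fi'

# Hypothetical wrapper: rewrites history on all refs. Not called here.
rewrite_one_message() {
    git filter-branch -f --msg-filter "$MSG_FILTER" -- --all
}
```

The filter can be tried on sample text before touching the repository, for example `printf '#0 0x47df60' | GIT_COMMIT=$TARGET_COMMIT sh -c "$MSG_FILTER"`. Note that this does not address the encoding-header issue described in the other answer.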


It looks like you already know why the rest of the commits get new SHA-1s, but for completeness I'll cover that as well. You can skip this part; it is here for other people reading the question.

If a commit is modified, it gets a new SHA-1 (by definition, since the SHA-1 is the checksum of the commit's contents). So far so good, but let's say that in this case there are just five commits (all on master, not that it matters), and we modify the middle one with the filter:

A <- B <- C <- D <- E        [original]

      



(Let's say the actual SHA-1 for C starts with 30001.) Now, let's draw the partial result in the middle of the filter operation:

A <- B <- C'

      

Let's say that, by some strange coincidence, the new SHA-1 starts with 30002: version 2 of commit 3.

Let's take a look at (part of) the original commit D:

$ git cat-file -p HEAD^
tree 954019cba5244a4a135ff62258660b3d2e3a8087
parent 30001...

      

Commit D refers, by hash, to commit C. So filter-branch, even though it changes nothing within D itself, must build a new commit D' that says parent 30002...:

A <- B <- C' <- D'

      

Likewise, filter-branch has to copy the old commit E to a new one, E':

A <- B <- C' <- D' <- E'     [replacement]

      

Hence, any filter-branch that changes some commit also changes all subsequent commits. (This is also true of git rebase. In fact, git rebase and git filter-branch are close cousins. Both simply read existing commits, apply some changes, and write the results as new commits; it's just that filter-branch does it all programmatically, i.e. it has no --interactive mode, has a very wide and complex set of specifications for making the changes, and can apply them to multiple branches rather than a single branch.)
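To see how far a rewrite actually propagated, one can compare the commit hashes before and after. The following is a minimal sketch: count_rewritten is a made-up helper, and the usage in the comments assumes a branch named master whose pre-rewrite tip was saved by filter-branch under refs/original/.

```shell
# Count the hashes listed in file $1 (old history) that do not appear
# in file $2 (new history). Each file holds one hash per line.
count_rewritten() {
    grep -v -x -F -f "$2" "$1" | wc -l
}

# Possible usage after a filter-branch run (assumes branch "master"):
#   git rev-list refs/original/refs/heads/master > old.txt
#   git rev-list master > new.txt
#   count_rewritten old.txt new.txt
```

If only C was meant to change in the five-commit example, the count should be 3 (C, D and E). A much larger number than expected suggests the filter also altered some earlier commit.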
