Emacs using replace-regexp-in-string to match two regexps
I am trying to replace two parts of a string using replace-regexp-in-string,
but I can only get one part to work at a time. Here is an example where I want to remove
both the # characters and spaces from the beginning and a newline from the end of the line. What am I doing wrong when I combine the two patterns in one expression?
;; Test string
(setq inputStr "## Header Stuff
")
;; This doesn't trim the newline
(setq header
      (replace-regexp-in-string "^[#\s]*\\|\n$" "" inputStr))
;; Each match done separately works though
(setq header
      (replace-regexp-in-string "^[#\s]*" "" inputStr))
(setq header
      (replace-regexp-in-string "\n$" "" header))
header
"Header Stuff"
UPDATE: the problem seems to be tied to the first alternative. For example, (replace-regexp-in-string "S\\|\n$" "X" inputStr) correctly replaces both the "S" and the newline with "X".
It looks like replace-regexp-in-string behaves unexpectedly with regexps that can match the empty string. The following regexp does what you expect (note the + quantifier instead of *):
(let ((input-string "## Header Stuff
"))
(replace-regexp-in-string "\\`[#\s]+\\|\n*\\'" "" input-string))
The reason lies in the internal implementation of replace-regexp-in-string, which you can look up with M-x find-function. In pseudocode, it does something like this:
Given a regexp, a replacement and a string:
1. Set l to the length of string and start to 0. Create an empty stack called matches to collect the new string pieces.
2. While start is less than l and regexp matches somewhere inside string, follow these steps:
   1. Extract the part of string that matches the regexp and call it str.
   2. Replace regexp with replacement within the shorter string str (this is the important part).
   3. Push the following two new string pieces onto the stack matches:
      - the unmatched part of string, from start up to the beginning of the match
      - the substring str, in which the match for regexp is now replaced with replacement
   4. Set start to the end of the matched part and repeat.
3. Finally, concatenate the string pieces on the stack matches in reverse order and return the result.
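The steps above can be sketched in Emacs Lisp roughly as follows. This is a simplified illustration, not the actual source: the name my/replace-regexp-sketch is hypothetical, the replacement is always inserted literally, and the guards for empty matches and for a failed re-match are added here so the sketch always terminates.

```elisp
;; Hypothetical helper that mirrors the pseudocode; not the real implementation.
(defun my/replace-regexp-sketch (regexp replacement string)
  "Replace matches of REGEXP in STRING with REPLACEMENT, piece by piece."
  (let ((l (length string))
        (start 0)
        (matches '())
        mb me str)
    (save-match-data
      (while (and (< start l) (string-match regexp string start))
        (setq mb (match-beginning 0)
              me (match-end 0))
        ;; Assumption: advance by one character on an empty match,
        ;; otherwise the loop would never move forward.
        (when (= me mb) (setq me (min l (1+ mb))))
        ;; Extract the matched piece and push the unmatched prefix.
        (setq str (substring string mb me))
        (push (substring string start mb) matches)
        ;; The important part: REGEXP is matched again, this time against
        ;; the isolated piece STR, and the replacement is made there.
        (push (if (string-match regexp str)
                  (replace-match replacement t t str)
                str)
              matches)
        (setq start me)))
    ;; Push the leftover tail, then join the pieces in original order.
    (push (substring string start) matches)
    (apply #'concat (nreverse matches))))
```

With the regexp from the question, (my/replace-regexp-sketch "^[#\s]*\\|\n$" "" "## Header Stuff\n") keeps the trailing newline and returns "Header Stuff\n", while the regexp "\\`[#\s]+\\|\n*\\'" yields "Header Stuff".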
The problem with your original regex happens in the replacement step of the loop, where regexp is matched again against the shorter string str. The regex correctly matches the newline at the end of the full string "## Header Stuff\n", but when it is matched a second time against the one-character string "\n", the first branch of the alternation, which can match the empty string, takes precedence over the second. It replaces the empty string with the empty string and leaves the trailing newline in place.
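The difference between the two matches can be checked directly with string-match. The snippet below is only an illustration of the explanation above; the start index 3 assumes the leading "## " has already been consumed by the first match.

```elisp
(let ((regexp "^[#\s]*\\|\n$"))
  (list
   ;; Against the full string, searching from index 3 finds the
   ;; trailing newline at index 15 via the second branch.
   (string-match regexp "## Header Stuff\n" 3)
   ;; Against the isolated piece "\n", the first branch matches the
   ;; empty string at index 0, so the match is zero characters long.
   (progn (string-match regexp "\n")
          (- (match-end 0) (match-beginning 0)))))
;; evaluates to (15 0)
```

Because the second match has length zero, replacing it with "" changes nothing and the "\n" survives.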
This is possibly a bug in replace-regexp-in-string, but it also shows how tricky regex semantics can be, especially when empty strings are involved. I find the following workaround easier to read and understand:
(let ((input-string "## Header Stuff
"))
(setq input-string (replace-regexp-in-string "\\`[#\s]*" "" input-string))
(setq input-string (replace-regexp-in-string "\n*\\'" "" input-string))
input-string)
If you have a very recent Emacs (the 24.4 pretest or later), you can also use the function string-trim-right from the built-in library subr-x:
(require 'subr-x) ; provides string-trim-right in Emacs 24.4
(let ((input-string "## Header Stuff
"))
  (string-trim-right (replace-regexp-in-string "\\`[#\s]*" "" input-string)))
By the way, I was surprised to learn that \s in an Emacs Lisp string is just another way of writing the space character. If you want the regex to behave like Perl's \s character class, you can use "\\s-" (matches any character with whitespace syntax) or "[[:space:]]".
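A small check of that difference, using a tab as the test character. The with-syntax-table wrapper is an assumption added here so the result does not depend on the current buffer's major mode.

```elisp
(with-syntax-table (standard-syntax-table)
  (list (string= "\s" " ")                    ; "\s" is a literal space
        (string-match "[\s]" "a\tb")          ; space only: no match on a tab
        (string-match "\\s-" "a\tb")          ; whitespace syntax: matches the tab
        (string-match "[[:space:]]" "a\tb"))) ; same, written as a character class
;; evaluates to (t nil 1 1)
```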