Emacs using replace-regexp-in-string to match two regexps

I am trying to replace two parts of a string using replace-regexp-in-string, but I can only get one part to work at a time. Here is an example where I want to remove both "#" characters and spaces from the beginning of the line and a newline from the end. What am I doing wrong when I combine the two regexps in one expression?

;; Test string
(setq inputStr "## Header Stuff
")

;; This doesn't trim the newline
(setq header
      (replace-regexp-in-string "^[#\s]*\\|\n$" "" inputStr) )

;; Each match done separately works though
(setq header
      (replace-regexp-in-string "^[#\s]*" "" inputStr) )
(setq header
      (replace-regexp-in-string "\n$" "" header) )

header
"Header Stuff"

      

UPDATE: The problem seems to be tied to the first alternative. For example, the following replaces both the "S" and the trailing newline with "X":

(replace-regexp-in-string "S\\|\n$" "X" inputStr)



1 answer


It looks like replace-regexp-in-string has some unexpected behavior with regexps that can match the empty string. The following regexp does what you expect (note the + quantifier instead of *):

(let ((input-string "## Header Stuff
"))
  (replace-regexp-in-string "\\`[#\s]+\\|\n*\\'" "" input-string))

      

The reason lies in the internal implementation of replace-regexp-in-string, which you can find with M-x find-function. In pseudocode, it does something like this:

Given a regexp, a replacement and a string:

  • Set l to the length of string and start to 0. Create an empty stack called matches to hold the pieces of the new string.

  • While start is less than l and regexp matches somewhere inside string, do the following:

    • Extract the part of string that matches regexp and call it str.

    • Replace regexp with replacement within the shorter string str (this is the important part).

    • Push the following two pieces of the new string onto the stack matches:

      • the unmatched part of string, from start up to the beginning of the match

      • the substring str, in which the match for regexp has now been replaced with replacement

    • Set start to the end of the matched part and repeat.

  • Finally, concatenate the pieces on the stack matches in reverse order and return the result.
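
The steps above can be sketched in Emacs Lisp as follows. This is only a simplified model for illustration, not the actual source: the name my-sketch-replace-regexp is hypothetical, the optional arguments of the real function are left out, and the real implementation may differ between Emacs versions.

```emacs-lisp
;; Hypothetical helper: a simplified model of the loop described above.
(defun my-sketch-replace-regexp (regexp replacement string)
  "Replace matches of REGEXP in STRING with REPLACEMENT, piece by piece."
  (let ((l (length string))
        (start 0)
        (matches nil))
    (save-match-data
      (while (and (< start l) (string-match regexp string start))
        (let* ((mb (match-beginning 0))
               ;; Advance by at least one character on an empty match.
               (me (if (= mb (match-end 0)) (min l (1+ mb)) (match-end 0)))
               (str (substring string mb me)))
          ;; Match again, but only against the short substring STR.
          ;; Here a different alternative may win than in the full string.
          (string-match regexp str)
          ;; Push the unmatched prefix, then the replaced piece.
          (push (substring string start mb) matches)
          (push (replace-match replacement t t str) matches)
          (setq start me)))
      ;; Keep whatever follows the last match.
      (push (substring string start l) matches)
      (apply #'concat (nreverse matches)))))

;; The regexp from the question leaves the trailing newline in place:
(my-sketch-replace-regexp "^[#\s]*\\|\n$" "" "## Header Stuff\n")

;; The regexp with + and \\` / \\' anchors removes it:
(my-sketch-replace-regexp "\\`[#\s]+\\|\n*\\'" "" "## Header Stuff\n")
```

Because the second string-match sees only the one-character string "\n", the model reproduces the behavior described in the question without relying on any particular Emacs release.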

The problem with your original regexp occurs in the replacement step of the loop (the one marked as important). The regexp correctly matches the newline at the end of the full string "## Header Stuff\n". But when it is matched a second time, against the one-character string "\n", the first branch of the alternation, which can match the empty string, takes precedence over the second. It replaces the empty string with the empty string and never deletes the trailing newline.



This is possibly a bug in replace-regexp-in-string, but it also shows how tricky regexp semantics can be, especially when empty strings are involved. To my mind, the workaround with two separate calls is easier to read and understand anyway:

(let ((input-string "## Header Stuff
"))
  (setq input-string (replace-regexp-in-string "\\`[#\s]*" "" input-string))
  (setq input-string (replace-regexp-in-string "\n*\\'" "" input-string))
  input-string)

      

If you have a very recent Emacs (the 24.4 pretest or later), you can also use the function string-trim-right from the built-in library subr-x:

;; string-trim-right lives in subr-x, so load it first
(require 'subr-x)

(let ((input-string "## Header Stuff
"))
  (string-trim-right (replace-regexp-in-string "\\`[#\s]*" "" input-string)))

      


By the way, I was surprised to learn that \s in an Emacs Lisp string is just another way of writing the space character. If you want the regexp to behave like Perl's \s character class, you can use "\\s-" (which matches any character with whitespace syntax) or "[[:space:]]".
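
As a small illustration of that difference, here is a sketch that strips leading "#" characters and any whitespace, not just spaces. The name my-strip-header-prefix is hypothetical, and the standard syntax table is used explicitly on the assumption that you want the default notion of whitespace regardless of the current buffer's mode.

```emacs-lisp
;; In a Lisp string, "\s" is read as a plain space character.
(string= "\s" " ")   ; non-nil

;; Hypothetical helper for illustration: strip leading "#" and any
;; whitespace (tabs included) using the [:space:] character class.
(defun my-strip-header-prefix (string)
  "Return STRING without leading `#' characters and whitespace."
  (with-syntax-table (standard-syntax-table)
    (replace-regexp-in-string "\\`[#[:space:]]+" "" string)))

(my-strip-header-prefix "##\t Header Stuff")   ; => "Header Stuff"
```

The + quantifier keeps this regexp from matching the empty string, so it avoids the problem discussed above.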
