Building a Markdown parser. Is it possible to detect links without detecting underscores in them?

Question

I am trying to write a basic Markdown parser, and I want to create a regex that can detect links and emphasis.

In Markdown, links look like `[text](URL)`, and emphasis / italics look like `*text*` or `_text_`.

I have no problem detecting emphasis and I have no problem detecting links, but when links have underscores in them, for example `http://example.com/link_to_article`, my parser detects `_to_` as an attempt at emphasis.

How do you stop this?

My first attempt was to make sure there were no characters directly before the first underscore or directly after the second one, but intra-word emphasis is fully valid, as shown here on Stack Overflow, so examples like `intere_stin_g` are completely valid, which takes that idea off the table.

So how would I go about it?

regex

Doug Smith, Dec 15 '14 at 5:01

1 answer

QBot · Answer 1 · 2016-06-18T19:48:47+0000

There are three main ways to do this.

A big, fancy regex that would look something like this:
```
(?<!\(\s*\S+)_([^_]+)_(?!\S+(?:\s+"[^"]*")?\s*\))
```
I highly recommend against this approach, because even this monster does not fully follow the Markdown syntax, and... I mean, who wants to try to decipher this? Even splitting it across multiple lines only makes it a little better. Also, the variable-length lookbehind might not even be accepted, depending on your regex engine.
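To illustrate the engine caveat, here is a small check, assuming Python's built-in `re` module as the engine. It only tests the lookbehind portion of the pattern above; the helper name `compiles` is made up for this sketch.

```python
import re

# Just the lookbehind part of the big regex: "(" followed by a
# variable amount of whitespace and non-whitespace, then "_".
VARIABLE_LOOKBEHIND = r'(?<!\(\s*\S+)_'


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        # Python's re only supports fixed-width lookbehind.
        return False
    return True


print(compiles(VARIABLE_LOOKBEHIND))
print(compiles(r'(?<!\()_'))
```

Python's `re` rejects the first pattern and accepts the second, fixed-width one. Other engines may behave differently, so test against whichever one you actually use.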

Prevent italics in the middle of a word with `_`. This makes your regex much easier: `\b_[^_]+_\b`. Stack Overflow does this.
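A minimal sketch of the word-boundary version, assuming Python's `re` module. The capture group and the names `WORD_BOUNDED_EMPHASIS` and `find_emphasis` are additions for illustration, not part of the regex above.

```python
import re

# "_" is itself a word character, so \b cannot match between a letter
# and "_". That is what keeps "link_to_article" from matching.
WORD_BOUNDED_EMPHASIS = re.compile(r'\b_([^_]+)_\b')


def find_emphasis(text):
    """Return the inner text of every _emphasis_ span not inside a word."""
    return WORD_BOUNDED_EMPHASIS.findall(text)


print(find_emphasis('see _this_ at http://example.com/link_to_article'))
```

The trade-off is that intra-word emphasis such as `intere_stin_g` is no longer recognized at all, which is exactly the behavior this option chooses.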

Orient your entire program around a streaming design, where you tokenize chunks and parse them as you work through the string. This is a bit trickier to code, but basically:
- Find the first thing that looks like italics.
- Find the first thing that looks like a link.
- Format whichever of the two starts first, then continue from the end of that match.
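The steps above could be sketched roughly like this in Python. The two patterns, the HTML output, and the name `render_inline` are assumptions made for the example. It ignores escapes, link titles, nesting, and bare URLs, so treat it as an outline rather than a finished parser.

```python
import html
import re

# Illustrative patterns only (assumptions, deliberately simplified).
LINK = re.compile(r'\[([^\]]+)\]\(([^)\s]+)\)')
EMPHASIS = re.compile(r'_([^_]+)_')


def render_inline(text):
    """Convert [text](url) links and _emphasis_ to HTML, left to right."""
    out = []
    pos = 0
    while pos < len(text):
        link = LINK.search(text, pos)
        emphasis = EMPHASIS.search(text, pos)
        candidates = [m for m in (link, emphasis) if m is not None]
        if not candidates:
            break
        # Whichever construct starts first wins and is consumed whole,
        # so underscores inside a link's URL are never seen as emphasis.
        first = min(candidates, key=lambda m: m.start())
        out.append(html.escape(text[pos:first.start()]))
        if first is link:
            label, url = first.group(1), first.group(2)
            out.append('<a href="%s">%s</a>'
                       % (html.escape(url), html.escape(label)))
        else:
            out.append('<em>%s</em>' % html.escape(first.group(1)))
        pos = first.end()
    out.append(html.escape(text[pos:]))
    return ''.join(out)


print(render_inline('Read [this](http://example.com/link_to_article), _really_.'))
```

Because the link starts before `_to_` does, the whole link is emitted first and scanning resumes after its closing parenthesis. Only `_really_` ends up as emphasis.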

NB: I put `[^_]` in a few places where it's not strictly accurate; more accurate would be `(?:(?<!\\)(\\\\)*\\_|[^_])+`, i.e. a series of escaped `_` characters or non-`_` characters. Alternatively, you can do something like `_.*?(?<!\\)(\\\\)*_`; that is, match from `_` to the very next unescaped `_`.
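Here is a quick way to try the second, escape-aware variant, assuming Python's `re` module. The names `ESCAPE_AWARE_EMPHASIS` and `first_emphasis` are made up for the sketch.

```python
import re

# From "_" up to the next "_" that is preceded by an even number of
# backslashes (zero included), i.e. the next unescaped "_".
ESCAPE_AWARE_EMPHASIS = re.compile(r'_.*?(?<!\\)(\\\\)*_')


def first_emphasis(text):
    """Return the first emphasis span, delimiters included, or None."""
    match = ESCAPE_AWARE_EMPHASIS.search(text)
    return match.group(0) if match else None


print(first_emphasis(r'say _a\_b_ now'))
print(first_emphasis(r'_a\\_b_'))
```

In the first call the `\_` is skipped because the underscore is escaped. In the second, `\\` is an escaped backslash, so the underscore after it closes the span. Note that this sketch does not check whether the opening `_` is itself escaped.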

PS: If you want to learn more about regex, there are many handy tools to help you, such as online regex testers and tutorials.
