Powershell remove HTML tags in inline content

Question

I have a large HTML data string split into small chunks. I'm trying to write a PowerShell script to remove all HTML tags, but I'm having a hard time finding the correct regex pattern.

Example line:

<p>This is an example</br>of various <span style="color: #445444">html content</span>

I tried using:

$string -replace '\<([^\)]+)\>',''

It works on simple examples, but on a line like the one above it matches the entire line.

Any suggestions on the best way to achieve this?

Thank you in advance

+3

string html regex powershell

Arturski Apr 28 '15 at 21:23

2 answers

You can try this:

$string -replace '<.*?>',''

0

Giedrius Apr 28 at 21:27

briantist · Accepted Answer · 2015-04-28T21:27:58+0000

For a pure regex, it should be as simple as <[^>]+>:

$string -replace '<[^>]+>',''
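To see why the pattern from the question swallows the whole line while this one does not, here is a minimal sketch that runs both against the example line. The variable names $greedy and $stripped are only illustrative and do not come from the original posts.

```powershell
# The example line from the question (single-quoted so nothing is expanded).
$string = '<p>This is an example</br>of various <span style="color: #445444">html content</span>'

# Pattern from the question: [^\)]+ excludes only ')', so the match runs
# greedily from the first '<' to the last '>' and removes the whole line.
$greedy = $string -replace '\<([^\)]+)\>', ''

# Excluding '>' instead stops each match at the end of its own tag.
$stripped = $string -replace '<[^>]+>', ''

$greedy     # empty string
$stripped   # This is an exampleof various html content
```

Note that the words on either side of </br> end up joined, because the tag is replaced with nothing; replacing with a space instead may be preferable, depending on the data.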

Note that this can give wrong results with some HTML comments or with the contents of <pre> tags.
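As an illustration of that caveat, the sketch below feeds the same pattern two inputs that contain a literal '>' or '<' outside of a real tag. Both inputs are made up for this example, and the <pre> case is only one plausible reading of the limitation (preformatted text with unescaped angle brackets).

```powershell
$pattern = '<[^>]+>'

# A comment containing '>': the match ends at the first '>', so the tail
# of the comment is left behind in the output.
$comment = '<!-- x > y -->text'
$commentResult = $comment -replace $pattern, ''

# Unescaped '<' and '>' inside <pre>: the text between them looks like a
# tag to the regex and is removed along with the real tags.
$pre = '<pre>1 < 2 and 3 > 2</pre>'
$preResult = $pre -replace $pattern, ''

$commentResult   # ' y -->text'
$preResult       # '1  2' (two spaces)
```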

Alternatively, you can use the HTML Agility Pack, which is intended for use in .NET code; I've used it successfully in PowerShell:

Add-Type -Path 'C:\packages\HtmlAgilityPack.1.4.6\lib\Net40-client\HtmlAgilityPack.dll'

$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.LoadHtml($string)
$doc.DocumentNode.InnerText

HTML Agility Pack works well with imperfect HTML.
