PowerShell: remove HTML tags in inline content
I have a large HTML data string split into small chunks. I'm trying to write a PowerShell script to remove all HTML tags, but I'm having a hard time finding the correct regex pattern.
Example line:
<p>This is an example</br>of various <span style="color: #445444">html content</span>
I tried using:
$string -replace '\<([^\)]+)\>',''
It works with simple examples, but on a line like the one above it matches the entire line.
Any suggestions on the best way to achieve this?
Thank you in advance
2 answers
For a pure regex, it should be as simple as <[^>]+>:
$string -replace '<[^>]+>',''
Note that this may fail on some HTML comments or on the contents of <pre> tags.
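As a rough sketch of the difference, the snippet below runs both patterns against the example line from the question. The variable names and the comment sample are made up for illustration; they are not from the original post.

```powershell
# Example line from the question
$string = '<p>This is an example</br>of various <span style="color: #445444">html content</span>'

# Original pattern: [^\)]+ also matches '>', so the greedy match runs
# from the first '<' to the last '>' and swallows the whole line
$greedy = $string -replace '\<([^\)]+)\>', ''

# Suggested pattern: [^>]+ cannot cross a '>', so each tag is removed on its own
$clean = $string -replace '<[^>]+>', ''

# Hypothetical comment containing '>' to show where the simple pattern falls short:
# the match ends at the first '>', leaving part of the comment behind
$comment = 'before<!-- a > b -->after'
$leftover = $comment -replace '<[^>]+>', ''

$greedy
$clean
$leftover
```

Here $clean keeps only the text between the tags, while $leftover still contains the tail of the comment, which is the kind of case where a real HTML parser is the safer choice.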
You can use the HTML Agility Pack instead, which is meant for use in .NET code; I've used it successfully in PowerShell:
Add-Type -Path 'C:\packages\HtmlAgilityPack.1.4.6\lib\Net40-client\HtmlAgilityPack.dll'
$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.LoadHtml($string)
$doc.DocumentNode.InnerText
HTML Agility Pack works well with imperfect HTML.