Remove parts of a string using Regex.Match
I have some HTML containing tables stored in a string. Most of this HTML came from FrontPage, so it is mostly poorly formatted. Here's a quick example of what it looks like.
<b>Table 1</b>
<table class='class1'>
<tr>
<td>
<p>Procedure Name</td>
<td>
<p>Procedure</td>
</tr>
</table>
<p><b>Table 2</b></p>
<table class='class2'>
<tr>
<td>
<p>Procedure Name</td>
<td>
<p>Procedure</td>
</tr>
</table>
<p> Some text is here</p>
From what I understand, FrontPage automatically adds a <p> to every new cell.
I want to remove the <p> tags that are inside tables, but keep the ones outside of tables. I tried two methods:
Method one
The first method was to use a single regex to capture each <p> tag inside the tables, and then use Regex.Replace() to remove them. However, I was never able to get the regex right. (I know parsing HTML with regex is bad. I thought the data was simple enough to apply a regex to it.)
I can easily get everything in each table with this regex: <table.*?>(.*?)</table>
Then I wanted to grab just the <p> tags, so I wrote this: (?<=<table.*?>)(<p>)(?=</table>). This matches nothing. (Apparently .NET allows quantifiers in lookarounds. At least that's the impression I had when using http://regexhero.net/tester/ )
Is there any way I can change this regex to capture only what I need?
Method two
The second method was to capture only the contents of each table into a string and then use String.Replace() to remove the <p> tags. I am using the following code to capture the matches:
MatchCollection tablematch = Regex.Matches(htmlSource, @"<table.*?>(.*?)</table>", RegexOptions.Singleline);
htmlSource is a string containing the entire HTML page, and this variable is what will be sent back to the client after processing. I only want to remove what I need to remove from htmlSource.
How can I use the MatchCollection to remove the <p> tags and then write the updated tables back into htmlSource?
Thanks.
This answer is based on the second suggested approach. The regex is changed to match each whole table, tags included:
<table.*?table>
Regex.Replace is then called with a MatchEvaluator that performs the desired replacement on each match:
Regex myRegex = new Regex(@"<table.*?table>", RegexOptions.Singleline);
string replaced = myRegex.Replace(htmlSource, m=> m.Value.Replace("<p>",""));
Console.WriteLine(replaced);
Output using the question's input:
<b>Table 1</b>
<table class='class1'>
<tr>
<td>
Procedure Name</td>
<td>
Procedure</td>
</tr>
</table>
<p><b>Table 2</b></p>
<table class='class2'>
<tr>
<td>
Procedure Name</td>
<td>
Procedure</td>
</tr>
</table>
<p> Some text is here</p>
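Since the question asks specifically about MatchCollection, the same result can probably also be reached by walking the matches by hand. The following is only a minimal sketch under that assumption; the class and method names (MatchCollectionDemo, StripInsideTables) are made up for illustration, and nested tables are not handled.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class MatchCollectionDemo
{
    // Hypothetical helper: removes "<p>" inside every <table>...</table> span.
    public static string StripInsideTables(string htmlSource)
    {
        MatchCollection tables = Regex.Matches(
            htmlSource, @"<table.*?</table>", RegexOptions.Singleline);

        var sb = new StringBuilder(htmlSource);

        // Walk the matches from last to first so that the indexes of the
        // earlier matches stay valid while the text shrinks.
        for (int i = tables.Count - 1; i >= 0; i--)
        {
            Match m = tables[i];
            sb.Remove(m.Index, m.Length);
            sb.Insert(m.Index, m.Value.Replace("<p>", ""));
        }
        return sb.ToString();
    }

    static void Main()
    {
        string html = "<p>a</p><table><tr><td><p>b</td></tr></table><p>c</p>";
        Console.WriteLine(StripInsideTables(html));
        // prints: <p>a</p><table><tr><td>b</td></tr></table><p>c</p>
    }
}
```

The MatchEvaluator version is shorter and does the index bookkeeping for you, so the manual loop is mainly useful if you need the match positions for something else.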
I guess it can be done with a delegate (callback).
string html = @"
<b>Table 1</b>
<table class='class1'>
<tr>
<td>
<p>Procedure Name</td>
<td>
<p>Procedure</td>
</tr>
</table>
<p><b>Table 2</b></p>
<table class='class2'>
<tr>
<td>
<p>Procedure Name</td>
<td>
<p>Procedure</td>
</tr>
</table>
<p> Some text is here</p>
";
Regex RxTable = new Regex( @"(?s)(<table[^>]*>)(.+?)(</table\s*>)" );
Regex RxP = new Regex( @"<p>" );
string htmlNew = RxTable.Replace(
html,
delegate(Match match)
{
return match.Groups[1].Value + RxP.Replace(match.Groups[2].Value, "") + match.Groups[3].Value;
}
);
Console.WriteLine( htmlNew );
Output:
<b>Table 1</b>
<table class='class1'>
<tr>
<td>
Procedure Name</td>
<td>
Procedure</td>
</tr>
</table>
<p><b>Table 2</b></p>
<table class='class2'>
<tr>
<td>
Procedure Name</td>
<td>
Procedure</td>
</tr>
</table>
<p> Some text is here</p>
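Note that RxP only matches a bare, lowercase <p>. If the FrontPage markup also contains uppercase tags or tags with attributes (an assumption; the sample in the question shows neither), a slightly broader pattern might be needed. A minimal sketch, with TableCleaner and StripParagraphTags as hypothetical names:

```csharp
using System;
using System.Text.RegularExpressions;

static class TableCleaner
{
    // One table, from its opening tag to the nearest closing tag
    // (nested tables are not handled).
    static readonly Regex TableRx = new Regex(
        @"<table\b.*?</table\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // An opening <p> tag with or without attributes; the \b keeps
    // tags such as <pre> or <param> from matching.
    static readonly Regex OpenPRx = new Regex(
        @"<p\b[^>]*>", RegexOptions.IgnoreCase);

    public static string StripParagraphTags(string html)
    {
        // Only the text of each table match is passed through OpenPRx.
        return TableRx.Replace(html, m => OpenPRx.Replace(m.Value, ""));
    }

    static void Main()
    {
        string html =
            "<p>Intro</p><table><tr><td><P class='x'>Cell</td></tr></table><p>Outro</p>";
        Console.WriteLine(StripParagraphTags(html));
        // prints: <p>Intro</p><table><tr><td>Cell</td></tr></table><p>Outro</p>
    }
}
```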
.NET regex can handle nested structures by means of balancing groups. It is very ugly and you should avoid it, but if you have no other option, you can use it.
static void Main()
{
string s =
@"A()
{
for()
{
}
do
{
}
}
B()
{
for()
{
}
}
C()
{
for()
{
for()
{
}
}
}";
var r = new Regex(@"
{
(
[^{}] # everything except braces { }
|
(?<open> { ) # if { then push
|
(?<-open> } ) # if } then pop
)+
(?(open)(?!)) # true if stack is empty
}
", RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);
int counter = 0;
foreach (Match m in r.Matches(s))
Console.WriteLine("Outer block #{0}\r\n{1}", ++counter, m.Value);
Console.Read();
}
Here the regex "knows" where each block starts and ends, so you can use that information to remove a <p> tag if it doesn't have a matching closing tag.