Remove parts of a string with Regex.Match

So I have some HTML tables in a string. Most of this HTML came from FrontPage, so it is mostly poorly formatted. Here's a quick example of what it looks like.

<b>Table 1</b>
  <table class='class1'>
  <tr>
    <td>
      <p>Procedure Name</td>
    <td>
        <p>Procedure</td>
    </tr>
  </table>
<p><b>Table 2</b></p>
  <table class='class2'>
    <tr>
      <td>
        <p>Procedure Name</td>
        <td>
        <p>Procedure</td>
    </tr>
  </table>
<p> Some text is here</p>

      

From what I understand, FrontPage automatically adds a <p> to every new cell.

I want to remove the <p> tags that are inside tables, but keep the ones outside of tables. I tried 2 methods:

Method one

The first method was to use a single RegEx to capture each <p> tag in the tables, and then use Regex.Replace() to remove them. However, I was never able to get the correct RegEx. (I know parsing HTML with RegEx is bad. I thought the data was simple enough to apply RegEx to it.)

I can easily get everything in each table with this regex: <table.*?>(.*?)</table>

Then I wanted to grab only the <p> tags, so I wrote this: (?<=<table.*?>)(<p>)(?=</table>). This doesn't match anything. (Apparently .NET allows quantifiers in lookarounds. At least that's the impression I had when using http://regexhero.net/tester/ )

Is there any way I can change this RegEx to capture only what I need?

Method two

The second method was to capture only the table contents into a string and then use String.Replace() to remove the <p> tags. I am using the following code to capture the matches:

MatchCollection tablematch = Regex.Matches(htmlSource, @"<table.*?>(.*?)</table>", RegexOptions.Singleline);

      

htmlSource is a string containing the entire HTML page, and this variable is what will be sent back to the client after processing. I only want to remove what I need to remove from htmlSource.

How can I use the MatchCollection to remove the <p> tags and then write the updated tables back to htmlSource?

Thanks

+3




3 answers


This answer is based on the second suggested approach. I changed the Regex to match each whole table, including its opening and closing tags:

<table.*?table>

      

Then Regex.Replace is called with a MatchEvaluator that performs the desired replacement on each match:



Regex myRegex = new Regex(@"<table.*?table>", RegexOptions.Singleline);
string replaced = myRegex.Replace(htmlSource, m=> m.Value.Replace("<p>",""));
Console.WriteLine(replaced);

      

Output using the question's input:

<b>Table 1</b>
    <table class='class1'>
    <tr>
    <td>
        Procedure Name</td>
    <td>
        Procedure</td>
    </tr>
    </table>
<p><b>Table 2</b></p>
    <table class='class2'>
    <tr>
        <td>
        Procedure Name</td>
        <td>
        Procedure</td>
    </tr>
    </table>
<p> Some text is here</p>
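
The String.Replace call above removes only the exact text <p>. FrontPage-era markup may also contain uppercase tags or tags with attributes, so a slightly more tolerant variant could look like the sketch below. The class name TableCleaner, the method name StripParagraphsInTables, and the sample input are made up for illustration, and the sketch assumes that tables are not nested and that closing </p> tags inside cells can stay.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class TableCleaner
{
    // Matches each <table ...>...</table> span lazily (assumes no nested tables).
    private static readonly Regex TableRx = new Regex(
        @"<table\b.*?</table\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Matches an opening <p> tag with or without attributes,
    // but not other tags that start with "p", such as <pre>.
    private static readonly Regex OpenPRx = new Regex(
        @"<p(\s[^>]*)?>",
        RegexOptions.IgnoreCase);

    public static string StripParagraphsInTables(string html)
    {
        // The evaluator runs once per table and rewrites only that match,
        // so <p> tags outside tables are left alone.
        return TableRx.Replace(html, m => OpenPRx.Replace(m.Value, ""));
    }
}

public static class Program
{
    public static void Main()
    {
        string html =
            "<p>keep</p><TABLE><tr><td><P align='left'>cell</td></tr></TABLE><p>keep too</p>";
        Console.WriteLine(TableCleaner.StripParagraphsInTables(html));
        // prints: <p>keep</p><TABLE><tr><td>cell</td></tr></TABLE><p>keep too</p>
    }
}
```

The result can be assigned straight back to htmlSource, which is what the question asks for.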

      

+1




I guess it can be done with a delegate (callback).

string html = @"
<b>Table 1</b>
  <table class='class1'>
  <tr>
    <td>
      <p>Procedure Name</td>
    <td>
        <p>Procedure</td>
    </tr>
  </table>
<p><b>Table 2</b></p>
  <table class='class2'>
    <tr>
      <td>
        <p>Procedure Name</td>
        <td>
        <p>Procedure</td>
    </tr>
  </table>
<p> Some text is here</p>
";

Regex RxTable = new Regex( @"(?s)(<table[^>]*>)(.+?)(</table\s*>)" );
Regex RxP = new Regex( @"<p>" );

string htmlNew = RxTable.Replace( 
    html,
    delegate(Match match)
    {
       return match.Groups[1].Value + RxP.Replace(match.Groups[2].Value, "") + match.Groups[3].Value;
    }
);
Console.WriteLine( htmlNew );

      



Output:

<b>Table 1</b>
  <table class='class1'>
  <tr>
    <td>
      Procedure Name</td>
    <td>
        Procedure</td>
    </tr>
  </table>
<p><b>Table 2</b></p>
  <table class='class2'>
    <tr>
      <td>
        Procedure Name</td>
        <td>
        Procedure</td>
    </tr>
  </table>
<p> Some text is here</p>

      

+1




.NET regex allows you to work with nested structures by using balancing groups. It is very ugly and you should avoid it, but if you have no other option, you can use it.

static void Main()
{
    string s = 
@"A()
{
    for()
    {
    }
    do
    {
    }
}
B()
{
    for()
    {
    }   
}
C()
{
    for()
    {
        for()
        {
        }
    }   
}";

    var r = new Regex(@"  
                      {                       
                          (                 
                              [^{}]           # everything except braces { }   
                              |
                              (?<open>  { )   # if { then push
                              |
                              (?<-open> } )   # if } then pop
                          )+
                          (?(open)(?!))       # true if stack is empty
                      }                                                                  

                    ", RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);

    int counter = 0;

    foreach (Match m in r.Matches(s))
        Console.WriteLine("Outer block #{0}\r\n{1}", ++counter, m.Value);

    Console.Read();
}

      

Here the regex "knows" where each block starts and ends, so you can use this information to remove a <p> tag if it doesn't have a matching closing tag.
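
As a rough illustration of how the same balancing-group idea might carry over to the table problem, the sketch below matches each outermost <table>...</table> span even when tables are nested, and then strips <p> inside the match. Nested tables do not occur in the question's sample, so that part is an assumption. The names NestedTableCleaner and StripParagraphs and the sample input are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper showing balancing groups applied to <table> tags.
public static class NestedTableCleaner
{
    private static readonly Regex OuterTableRx = new Regex(@"
        <table\b[^>]*>                  # outermost opening tag
        (?:
            (?<open> <table\b[^>]*> )   # nested opening tag: push
          |
            (?<-open> </table\s*> )     # closing tag of a nested table: pop
          |
            (?! </?table\b ) .          # any other character
        )*
        (?(open)(?!))                   # fail if a nested table is still open
        </table\s*>                     # outermost closing tag
        ",
        RegexOptions.IgnorePatternWhitespace
        | RegexOptions.Singleline
        | RegexOptions.IgnoreCase);

    public static string StripParagraphs(string html)
    {
        // Each match covers a whole outermost table, including any tables inside it.
        return OuterTableRx.Replace(html, m => m.Value.Replace("<p>", ""));
    }
}

public static class Program
{
    public static void Main()
    {
        string html =
            "<table><tr><td><table><tr><td><p>inner</td></tr></table>" +
            "<p>tail</td></tr></table><p>after</p>";
        Console.WriteLine(NestedTableCleaner.StripParagraphs(html));
        // prints: <table><tr><td><table><tr><td>inner</td></tr></table>tail</td></tr></table><p>after</p>
    }
}
```

A lazy pattern such as <table.*?</table> would stop at the first closing tag and leave the <p> before "tail" untouched. The balanced version treats the outer table as one match.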

0








