Find duplicate id values in HTML using regex

Given an HTML file, how can I find out whether it contains duplicate id values using regex? I need to run the search in Sublime Text.

For example, with id=("[^"]*").*id=\1 I can find duplicate ids on a single line:

<img id="key"><img id="key">

      

But I need to do the same thing across multiple lines and for several different ids. In this example, key and key2 are the repeated ids:

<img id="key">
<img id="key2">
<img id="key">
<img id="key3">
<img id="key2">
<img id="key">

      

Note: I am using the img tag only as an example; the real HTML file is more complex.

+3




4 answers


For some reason, . in Sublime Text doesn't match line breaks, so you need to do something like this: id=("[^"]+")(.|\n)*id=\1
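The same behavior can be reproduced outside the editor. The sketch below uses Python's re module rather than Sublime Text's own regex engine, on the assumption that the two treat . and backreferences alike for this pattern; the variable names are only illustrative.

```python
import re

# Sample from the question: ids repeated across several lines.
html = '''<img id="key">
<img id="key2">
<img id="key">
<img id="key3">
<img id="key2">
<img id="key">
'''

# By default "." stops at the end of a line, so no duplicate is found.
same_line_only = re.search(r'id=("[^"]+").*id=\1', html)

# "(.|\n)*" lets the gap between the two ids span line breaks.
across_lines = re.search(r'id=("[^"]+")(.|\n)*id=\1', html)

print(same_line_only)          # None
print(across_lines.group(1))   # "key"
```

Because the gap is greedy, the single match runs from the first id="key" to the last one, so this only tells you that some duplicate exists.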

To be honest, I would rather use the Unix utilities:



grep -Eo 'id="[^"]+"' filename | sort | uniq -c

  3 id="key"
  2 id="key2"
  1 id="key3"

      

If they are full HTML documents, you can use the W3C HTML validator to catch duplicate ids along with other errors.

+1




If all you are trying to do is find duplicate ids, then here is a little Perl program I threw together that will do it:

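use strict;
use warnings;

# Count every double-quoted id value across all input lines.
my %ids;
while ( <> ) {
    while ( /id="([^"]+)"/g ) {
        ++$ids{$1};
    }
}

# Report only the ids seen more than once, in sorted order.
for my $id ( sort keys %ids ) {
    print "$id shows up $ids{$id} times\n" if $ids{$id} > 1;
}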

      

Name it "dupes.pl". Then call it like this:

perl dupes.pl file.html

      



If I run it on my sample it tells me:

key shows up 3 times
key2 shows up 2 times

      

It has some limitations; for example, it won't find id=foo or id='foo', but it will probably get you going.
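One way to cover the unquoted and single-quoted forms is a pattern with three alternatives. This is a rough Python sketch, not a port of the Perl program above; ID_PATTERN and duplicate_ids are hypothetical names, and a regex like this can still be fooled by id= appearing in comments, scripts, or text content.

```python
import re
from collections import Counter

# Hypothetical pattern: double-quoted, single-quoted, or unquoted id values.
# The lookbehind skips attributes such as data-id.
ID_PATTERN = re.compile(
    r'''(?<![\w-])id\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))''',
    re.IGNORECASE,
)

def duplicate_ids(html):
    """Return {id: count} for every id value seen more than once."""
    counts = Counter()
    for match in ID_PATTERN.finditer(html):
        # Exactly one of the three groups is set for each match.
        value = next(g for g in match.groups() if g is not None)
        counts[value] += 1
    return {value: n for value, n in counts.items() if n > 1}

sample = '''<img id="foo"><div id='foo'></div><p id=foo>text</p><span id="bar"></span>'''
print(duplicate_ids(sample))   # {'foo': 3}
```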

0




By default, . in a Sublime Text regex search won't match a line break. You can use a mode modifier to switch to single-line mode, in which . also matches newlines:

(?s)id=("[^"]+").*id=\1

      

(?s) is the single-line mode modifier.

However, this regex does a poor job of finding all duplicate ids, because in your HTML example it produces only one match, running from the first key to the last key. You will probably need a multi-step process to find all the duplicates, and that can be scripted. As others have shown, you need to (1) pull out all the ids first, then (2) group them and count them to determine which are dupes.
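The two steps might look like this in Python. It is a minimal sketch that assumes double-quoted ids only; find_duplicate_ids is a made-up name, and it records line numbers so the duplicates are easy to jump to in an editor.

```python
import re
from collections import defaultdict

def find_duplicate_ids(html):
    """Map each duplicated id to the 1-based line numbers where it appears."""
    lines_by_id = defaultdict(list)
    # Step 1: pull out every double-quoted id, remembering its line.
    for line_no, line in enumerate(html.splitlines(), start=1):
        for value in re.findall(r'id="([^"]+)"', line):
            lines_by_id[value].append(line_no)
    # Step 2: keep only the ids that occur more than once.
    return {value: nums for value, nums in lines_by_id.items() if len(nums) > 1}

html = '''<img id="key">
<img id="key2">
<img id="key">
<img id="key3">
<img id="key2">
<img id="key">
'''

for value, nums in find_duplicate_ids(html).items():
    print(f"{value}: lines {nums}")
# key: lines [1, 3, 6]
# key2: lines [2, 5]
```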

Alternatively, for a manual approach, change the regex so that it uses a lookahead to check for a later duplicate; you can then step through the matches with Find Next in Sublime Text:

(?s)id=("[^"]+")(?=.*id=\1)

      

With the above pattern and your HTML sample, you will see the following matches:

<img id="key">  <-- highlighted (dupe found on 3rd line)
<img id="key2"> <-- highlighted (dupe found on 5th line)
<img id="key">  <-- highlighted (next dupe found on last line)
<img id="key3">
<img id="key2">
<img id="key">

      

Please note that the lookahead does not highlight the actual dupes later in the file. It stops at the earlier occurrence and indicates that a dupe exists further down.
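To see the difference between the two patterns, here is a small Python check. It uses Python's re module as a stand-in for Sublime Text's regex engine, which is an assumption; both support (?s), backreferences, and lookahead for these patterns.

```python
import re

html = '''<img id="key">
<img id="key2">
<img id="key">
<img id="key3">
<img id="key2">
<img id="key">
'''

# Greedy version: one match that swallows everything from the first
# id="key" to the last one.
greedy = re.findall(r'(?s)id=("[^"]+").*id=\1', html)

# Lookahead version: checks for a later copy of the same id without
# consuming it, so each earlier occurrence is matched separately.
pattern = re.compile(r'(?s)id=("[^"]+")(?=.*id=\1)')
flagged = [html.count("\n", 0, m.start()) + 1 for m in pattern.finditer(html)]

print(len(greedy))   # 1
print(flagged)       # [1, 2, 3]
```

Lines 5 and 6 are not flagged even though they hold duplicates, because nothing identical follows them.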

0




Here is an AWK script to find duplicate img id values:

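awk '{
        # Assumes each line looks like: <img id="value">
        $2 = tolower($2);
        gsub(/^id=|[">]/, "", $2);   # strip the id= prefix, quotes and >
        if (NF == 2)
            imgs[$2]++;
    }

    END {
        # Print only the ids that appear more than once.
        for (img in imgs)
            if (imgs[img] > 1)
                printf "Img ID: %s\t appears %d times\n", img, imgs[img]
    }' file.txt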

      

0

