Simple regex help for C#

I have a damaged binary file with some information that I could recover using a regex. Content:

G $ 12.Angry.Men.1957.720p.HDTV.x264-HDLH Lhttp://site.com/forum/f89/12-angry-men-1957-720p-hdtv-x264-hdl-538403/LI Š M, ABBA.The.Movie.1977.720p.BluRay.DTS.x264-iONN Phttp://site.com/forum/f89/abba-movie-1977-720p-bluray-dts-x264-ion-428687/&

How can I parse it to at least get the links, like this:

http://site.com/forum/f89/abba-movie-1977-720p-bluray-dts-x264-ion-428687/

where 428687 is the numeric id. That way I would have both the full link and the id.

The text that precedes each link is the name of that link:

ABBA.The.Movie.1977.720p.BluRay.DTS.x264-iON

Though I'm not sure whether the names can be parsed. I noticed that every link and every name has one character before it and one after it. Maybe this can narrow down the problem?

Btw, I am ready to give a 500 bounty for the correct answer.

+2




2 answers


Something like the following regex?

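using System.Text.RegularExpressions;

// Lazy \S+? stops at the first "-digits/" that follows each "http://".
MatchCollection matches = Regex.Matches(yourString, @"http://\S+?-(\d+)/");
foreach (Match m in matches)
{
    string id = m.Groups[1].Value;  // first capture group: the numeric id
    string url = m.Value;           // whole match: the full link
}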

      

This grabs links starting with http://, followed by any run of non-whitespace characters (whitespace cannot appear in an HTTP URI), and assumes each link ends with a dash, digits and a trailing slash (this correctly drops the & in your example, or any other trailing text).

EDIT: The whole match is the link and the ID is in the first capture group. I updated the code to show how to get both.

Update: if dash + digits + forward slash can appear more than once within a single URL, a greedy quantifier must be used, but then consecutive links (with no whitespace between them) will be matched as one. If dash + digits + slash occurs only once per URL, a lazy quantifier is preferable. That is what the code above currently uses.
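To see the difference between the lazy and the greedy form, it helps to run both on two links glued together with no whitespace in between. This is only a minimal sketch: the input string and the names QuantifierDemo and FindAll are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Returns the text of every match of the given pattern.
    public static List<string> FindAll(string input, string pattern)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            results.Add(m.Value);
        }
        return results;
    }

    public static void Main()
    {
        // Hypothetical input: two links with no whitespace between them.
        string glued = "http://site.com/a-1/http://site.com/b-2/";

        // Lazy: stops at the first "-digits/", so each link is matched separately.
        List<string> lazy = FindAll(glued, @"http://\S+?-(\d+)/");

        // Greedy: runs to the last "-digits/", so both links end up in one match.
        List<string> greedy = FindAll(glued, @"http://\S+-(\d+)/");

        Console.WriteLine(lazy.Count);   // 2
        Console.WriteLine(greedy.Count); // 1
    }
}
```

With the lazy pattern each link is reported on its own; with the greedy one the two links collapse into a single match whose id is taken from the second link.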



An alternative approach

From the updates and additional information, I understand that there is a lot of garbage in the text. Another approach might be simpler: split everything at the links and look at the results. This avoids the need for a complex look-ahead / look-behind regex and ensures that consecutive links (i.e. with no text in between) are handled correctly:

// Zero-width split: cut right after each link, so every piece is "text + link".
string[] linksWithText = Regex.Split(yourString, @"(?<=http:\S+-\d+/)");
foreach (string link in linksWithText)
{
    Match m = Regex.Match(link, @"(.*)(http:\S+-(\d+)/)$");
    if (m.Success)
    {
        string text = m.Groups[1].Value;
        string url = m.Groups[2].Value;
        string id = m.Groups[3].Value;
    }
}

      

Update: I updated the alternative approach so that the text (name) comes first, then the URL. Note the look-behind expression: it splits at the zero-width position right after the end of each URL, so each piece contains everything before the URL up to the end of the URL.
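The split approach can be tried on a short string before running it on the real file. The following is a sketch under assumptions: the class name SplitDemo, the method name Parse and the test input are hypothetical, and RegexOptions.Singleline is added here only so that the text part may contain line breaks.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Returns (text before the link, link, id) for every link found.
    public static List<(string Text, string Url, string Id)> Parse(string input)
    {
        var results = new List<(string Text, string Url, string Id)>();

        // Split at the zero-width position right after each "http:...-digits/".
        foreach (string piece in Regex.Split(input, @"(?<=http:\S+-\d+/)"))
        {
            // Each piece should be "text + link"; trailing junk will not match.
            Match m = Regex.Match(piece, @"(.*)(http:\S+-(\d+)/)$", RegexOptions.Singleline);
            if (m.Success)
            {
                results.Add((m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value));
            }
        }
        return results;
    }

    public static void Main()
    {
        // Hypothetical input: the second link follows its text with no space.
        string input = "AAA http://site.com/x-1/BBBhttp://site.com/y-22/&";
        foreach (var item in Parse(input))
        {
            Console.WriteLine($"[{item.Text}] {item.Url} id={item.Id}");
        }
    }
}
```

The text part is returned exactly as found, so with real data it will still contain the stray characters around each name and may need trimming afterwards.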

+2




Assuming all URLs end with a hyphen followed by some digits and then a forward slash, this might work:

`http://[^ ]*-(?<id>\d+)/`

      

What do you think?

UPDATE: Try this:

http://(?:(?!http://)[^ ])*-(?<id>\d+)/

I added the negative look-ahead (?!http://) so that a single match cannot span two URLs that are joined by data containing no spaces. The look-ahead has to be checked before every character, not just once after the scheme, for this to work.

You can get the captured group by name. The entire match is the URL and the named group is the id.
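Reading a group by name could look roughly like the sketch below. The class name NamedGroupDemo and the method name Extract are hypothetical, and the pattern is a variant that repeats the (?!http://) look-ahead before every character, which is one way to keep a match from running into the next link.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Every character after the scheme must not start another "http://".
    private const string Pattern = @"http://(?:(?!http://)[^ ])*-(?<id>\d+)/";

    // Returns (link, id) pairs; the id is read from the group named "id".
    public static List<(string Url, string Id)> Extract(string input)
    {
        var results = new List<(string Url, string Id)>();
        foreach (Match m in Regex.Matches(input, Pattern))
        {
            results.Add((m.Value, m.Groups["id"].Value));
        }
        return results;
    }

    public static void Main()
    {
        // Tail of the sample from the question.
        string input = "x264-iONN Phttp://site.com/forum/f89/abba-movie-1977-720p-bluray-dts-x264-ion-428687/&";
        foreach (var item in Extract(input))
        {
            Console.WriteLine($"{item.Url} id={item.Id}");
        }
    }
}
```

Named groups keep the calling code readable if more groups are added to the pattern later, since the index of the id group no longer matters.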

+1

