Regular expression to remove duplicate URLs
I have a list of URLs that looks like this:
- somesite.com/index.php?id=12
- somesite.com/index.php?id=14
- somesite.com/index.php?id=156
- example.com/view.php?image=441
- somesite.com/page.php?id=1
- example.com/view.php?ivideo=4
- somesite.com/page.php?id=56
- example.com/view.php?image=1
They are saved in a list and then displayed after the crawl process finishes. I tried different regex patterns but still couldn't achieve what I needed, because the query string became a problem.
Here's one of the patterns I've tried:
(http://?)(w*)(\.*)(\w*)(\.)(\w*)
Here is how I need the list above to be filtered:
- somesite.com/index.php?id=12
- example.com/view.php?image=441
- somesite.com/page.php?id=1
- example.com/view.php?ivideo=4
As you can see, URLs that point to the same page with the same query parameter but a different value have been removed. This is what I want to achieve. Note that the actual links contain http://, but I have left it out here because Stack Overflow flags them as spam. Could anyone kindly help me with this? Thanks in advance.
Instead of parsing the URL manually, you can use the Uri class and HttpUtility.ParseQueryString to do the parsing. Here is an example that uses the LINQ .GroupBy method to collect similar URLs into groups and then selects the first URL from each group.
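// Requires: using System; using System.Linq; using System.Web;
var distinctUrls = urls.GroupBy(u =>
{
var uri = new Uri(u);
var query = HttpUtility.ParseQueryString(uri.Query);
// Scheme, host and path without the query string
var baseUri = uri.Scheme + "://" + uri.Host + uri.AbsolutePath;
// Group key: base URL plus the sorted parameter names (values are ignored)
return new {
Uri = baseUri,
QueryStringKeys = string.Join("&", query.AllKeys.OrderBy(ak => ak))
};
})
.Select(g => g.First()) // keep the first URL seen in each group
.ToList();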
Example contents of distinctUrls:
http://somesite.com/index.php?id=12
http://example.com/view.php?image=441
http://somesite.com/page.php?id=1
http://example.com/view.php?ivideo=4
This also correctly handles the case where two URLs have the same set of query parameters but in a different order, such as example.com/view.php?image=441&order=asc and example.com/view.php?order=desc&image=441: they are treated as similar, so only the first one is kept.
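For anyone who wants to try this outside a larger program, here is a minimal sketch of the same idea wrapped in a helper. The names UrlDeduper and DistinctByPathAndKeys are made up for illustration, and it assumes every input string is an absolute URL that includes the scheme (new Uri throws otherwise). It differs slightly from the snippet above: it uses a plain string as the group key instead of an anonymous type, and it skips null keys, which ParseQueryString produces for parameters that have no name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;

// Hypothetical helper, not part of any library.
public static class UrlDeduper
{
    // Keeps the first URL for each combination of scheme, host, path
    // and set of query parameter names. Parameter values are ignored.
    public static List<string> DistinctByPathAndKeys(IEnumerable<string> urls)
    {
        return urls
            .GroupBy(u =>
            {
                var uri = new Uri(u); // assumes an absolute URL with a scheme
                var query = HttpUtility.ParseQueryString(uri.Query);

                // Sort the names so that parameter order does not matter;
                // drop null keys (parameters without a name).
                var keys = query.AllKeys
                    .Where(k => k != null)
                    .OrderBy(k => k, StringComparer.Ordinal);

                return uri.Scheme + "://" + uri.Host + uri.AbsolutePath
                       + "?" + string.Join("&", keys);
            })
            .Select(g => g.First()) // GroupBy keeps groups in first-seen order
            .ToList();
    }
}
```

Calling UrlDeduper.DistinctByPathAndKeys on the eight URLs from the question (with http:// prepended) should return the same four URLs listed above, in the same order. A string key is enough here because two URLs belong together exactly when their base URL and sorted parameter names match.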