Google Safe Browsing API URL encoding (canonicalization)
In my application, I check user-entered URLs for malware by submitting them to Google.
To check that I get a "malware detected" response, I used the URL http://malware.testing.google.test/testing/malware
To my surprise, this URL was not flagged as malware.
While investigating, I found that when I add a trailing slash, it is flagged as malware.
The documentation says the URL must be canonicalized.
Does anyone know how to meet this requirement (preferably in C#)?
Using the link ForguesR provided, I created this C# implementation.
It passes 26 of the 33 tests from the Google test suite found at: https://developers.google.com/safe-browsing/developers_guide_v3#Canonicalization
This was considered good enough for production, as it fails only on the more exotic URLs.
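For illustration, here is a minimal sketch of a few of the canonicalization steps described in the guide linked above: stripping tab/CR/LF characters, dropping the fragment, repeatedly percent-unescaping, normalizing the host, collapsing repeated slashes in the path, and re-escaping. The class name `UrlCanonicalizerSketch` and its methods are hypothetical. It is a partial sketch under stated assumptions (no IP address normalization, no `/./` or `/../` resolution, no special handling of ports or user info), so it should not be expected to pass the whole test suite.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: a partial sketch, not a complete canonicalizer.
public static class UrlCanonicalizerSketch
{
    public static string Canonicalize(string url)
    {
        // Remove tab, CR and LF characters, then surrounding whitespace.
        url = url.Replace("\t", "").Replace("\r", "").Replace("\n", "").Trim();

        // Work on UTF-8 bytes mapped one-to-one to chars, so every char is <= 0xFF.
        url = new string(Array.ConvertAll(Encoding.UTF8.GetBytes(url), b => (char)b));

        // Drop the fragment before unescaping, so an escaped '#' survives.
        int hash = url.IndexOf('#');
        if (hash >= 0) url = url.Substring(0, hash);

        url = UnescapeFully(url);

        // Assumption: default to http when no scheme is present.
        string scheme = "http";
        Match m = Regex.Match(url, @"^([A-Za-z][A-Za-z0-9+.\-]*)://");
        if (m.Success)
        {
            scheme = m.Groups[1].Value.ToLowerInvariant();
            url = url.Substring(m.Length);
        }

        // Split off the query (kept with its leading '?'); it is not normalized.
        int q = url.IndexOf('?');
        string query = q >= 0 ? url.Substring(q) : "";
        if (q >= 0) url = url.Substring(0, q);

        // Split host and path; an empty path becomes "/".
        int slash = url.IndexOf('/');
        string host = slash >= 0 ? url.Substring(0, slash) : url;
        string path = slash >= 0 ? url.Substring(slash) : "/";

        // Host: strip leading/trailing dots, collapse runs of dots, lowercase.
        host = Regex.Replace(host.Trim('.'), @"\.{2,}", ".").ToLowerInvariant();

        // Path: collapse runs of slashes ("/./" and "/../" are NOT resolved here).
        path = Regex.Replace(path, "/{2,}", "/");

        return scheme + "://" + Escape(host) + Escape(path) + Escape(query);
    }

    // Percent-unescape repeatedly until the string stops changing.
    private static string UnescapeFully(string s)
    {
        string prev;
        do
        {
            prev = s;
            s = Regex.Replace(s, "%([0-9A-Fa-f]{2})",
                m => ((char)Convert.ToInt32(m.Groups[1].Value, 16)).ToString());
        } while (s != prev);
        return s;
    }

    // Escape chars <= 32, >= 127, '#' and '%' using uppercase hex.
    private static string Escape(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (c <= 32 || c >= 127 || c == '#' || c == '%')
                sb.Append('%').Append(((int)c).ToString("X2"));
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Under these rules, `UrlCanonicalizerSketch.Canonicalize("http://www.GOOgle.com")` yields `http://www.google.com/`. Note that canonicalization does not append a trailing slash to a non-empty path, so `/testing/malware` and `/testing/malware/` remain distinct after this step.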
I am working on the same problem right now, and the only implementation I have found is the Java one in the jGoogleSafeBrowsing library. Unfortunately, it is tied to the v2 API.
Anyway, you can look at the canonicalization code here. Please be aware that:
- this code is released as open source under a Creative Commons NC-SA license;
- this code does not pass Google's canonicalization test suite.