Python Script Proxy
I am writing a simple python script so I can test my sites from a different ip.
The page url is set in the request, the script fetches the page and displays it to the user. The code below is used to rewrite tags containing urls, but I don't think it is completely / completely fixed.
def rel2abs(rel_url, base=loc):
return urlparse.urljoin(base, rel_url)
def is_proxy_else_abs(tag, attr):
if tag in ('a',):
return True
if tag in ('form', 'img', 'link') and attr in ('href', 'src', 'action', 'background'):
return False
def repl(matchobj):
if is_proxy_else_abs(matchobj.group(1).lower(), matchobj.group(3).lower()):
return r'<%s %s %s="http://%s?%s" ' %(proxy_script_url, matchobj.group(1), matchobj.group(2), matchobj.group(3), urllib.urlencode({'loc':rel2abs(matchobj.group(5))}))
else:
return r'<%s %s %s="%s" ' %(matchobj.group(1), matchobj.group(2), matchobj.group(3), rel2abs(matchobj.group(5)))
def fix_urls(page):
get_link_re = re.compile(r"""<(a|form|img|link) ([^>]*?)(href|src|action|background)\s*=\s*("|'?)([^>]*?)\4""", re.I|re.DOTALL)
page = get_link_re.sub(repl, page)
return page
The idea is that the attributes of the h 'a' attribute should be redirected through the proxy script, but css, javascript, images, forms, etc. should not be, so they should be absolute if they are relative in the original pp.
The problem is that the code doesn't always work, css can be written in several ways, etc. Is there a more complete regex I can use?
Read other posts here about parsing HTML. For example Python Regular Expression for HTML parsing (BeautifulSoup) and HTML parser in Python .
Use Beautiful Soup, not regular expressions.
source to share