Python Script Proxy

I am writing a simple python script so I can test my sites from a different ip.

The page url is set in the request, the script fetches the page and displays it to the user. The code below is used to rewrite tags containing urls, but I don't think it is completely / completely fixed.

def rel2abs(rel_url, base=loc):
    return urlparse.urljoin(base, rel_url)

def is_proxy_else_abs(tag, attr):
    if tag in ('a',):
        return True
    if tag in ('form', 'img', 'link') and attr in ('href', 'src', 'action', 'background'):
        return False

def repl(matchobj):
    if is_proxy_else_abs(matchobj.group(1).lower(), matchobj.group(3).lower()):
        return r'<%s %s %s="http://%s?%s" ' %(proxy_script_url, matchobj.group(1), matchobj.group(2), matchobj.group(3), urllib.urlencode({'loc':rel2abs(matchobj.group(5))}))
    else:
        return r'<%s %s %s="%s" ' %(matchobj.group(1), matchobj.group(2), matchobj.group(3), rel2abs(matchobj.group(5)))

def fix_urls(page):
    get_link_re = re.compile(r"""<(a|form|img|link) ([^>]*?)(href|src|action|background)\s*=\s*("|'?)([^>]*?)\4""", re.I|re.DOTALL)
    page = get_link_re.sub(repl, page)
    return page

      

The idea is that the attributes of the h 'a' attribute should be redirected through the proxy script, but css, javascript, images, forms, etc. should not be, so they should be absolute if they are relative in the original pp.

The problem is that the code doesn't always work, css can be written in several ways, etc. Is there a more complete regex I can use?

0


source to share


1 answer


Read other posts here about parsing HTML. For example Python Regular Expression for HTML parsing (BeautifulSoup) and HTML parser in Python .



Use Beautiful Soup, not regular expressions.

+3


source







All Articles