Python Script Proxy

Question

I am writing a simple Python script so I can test my sites from a different IP.

The page URL is passed in the request; the script fetches the page and displays it to the user. The code below rewrites tags that contain URLs, but I don't think it is complete or entirely correct.

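import re
import urllib
import urlparse

# `loc` (URL of the fetched page) and `proxy_script_url` are defined
# elsewhere in the script.

def rel2abs(rel_url, base=loc):
    # Resolve a possibly relative URL against the fetched page's URL.
    return urlparse.urljoin(base, rel_url)

def is_proxy_else_abs(tag, attr):
    # True: send the URL back through the proxy. False: only make it absolute.
    if tag in ('a',):
        return True
    if tag in ('form', 'img', 'link') and attr in ('href', 'src', 'action', 'background'):
        return False
    return False

def repl(matchobj):
    # Groups: 1 = tag name, 2 = attributes before the URL attribute,
    # 3 = URL attribute name, 5 = URL value.
    if is_proxy_else_abs(matchobj.group(1).lower(), matchobj.group(3).lower()):
        # Arguments must follow the order of the placeholders: tag first, proxy host fourth.
        return r'<%s %s %s="http://%s?%s" ' % (matchobj.group(1), matchobj.group(2), matchobj.group(3), proxy_script_url, urllib.urlencode({'loc': rel2abs(matchobj.group(5))}))
    else:
        return r'<%s %s %s="%s" ' % (matchobj.group(1), matchobj.group(2), matchobj.group(3), rel2abs(matchobj.group(5)))

def fix_urls(page):
    get_link_re = re.compile(r"""<(a|form|img|link) ([^>]*?)(href|src|action|background)\s*=\s*("|'?)([^>]*?)\4""", re.I|re.DOTALL)
    page = get_link_re.sub(repl, page)
    return page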

The idea is that the href attribute of 'a' tags should be redirected through the proxy script, but CSS, JavaScript, images, forms, etc. should not be, so their URLs should be made absolute if they are relative in the original page.

The problem is that the code doesn't always work: CSS can be referenced in several ways, for example. Is there a more complete regex I can use?

python proxy

7times9 Dec 29. 09 at 19:19

1 answer

S.Lott · Answer 1 · 2008-12-29T20:09:34+0000

Read other posts here about parsing HTML. For example, "Python Regular Expression for HTML parsing (BeautifulSoup)" and "HTML parser in Python".

Use Beautiful Soup, not regular expressions.
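
As a rough illustration of the parser-based approach, here is a minimal sketch of the rewriting rule from the question. It makes several assumptions: it swaps in the standard library's html.parser instead of Beautiful Soup so that it runs without extra installs, it uses Python 3 module names (urllib.parse) rather than the Python 2 modules in the question, and PROXY_SCRIPT_URL, BASE_URL and the UrlRewriter class are hypothetical names invented for the example. It also rewrites the listed URL attributes on every tag, not only on a, form, img and link. Treat it as a starting point, not a complete proxy.

```python
import html
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

# Hypothetical values for illustration; the question does not give them.
PROXY_SCRIPT_URL = "http://proxy.example/fetch"
BASE_URL = "http://site.example/dir/page.html"

# Attribute names that carry URLs, taken from the question's regex.
URL_ATTRS = {"href", "src", "action", "background"}


class UrlRewriter(HTMLParser):
    """Re-emit the HTML it is fed, rewriting URL attributes on the way."""

    def __init__(self, base, proxy):
        # Keep character references as written so text passes through unchanged.
        super().__init__(convert_charrefs=False)
        self.base = base
        self.proxy = proxy
        self.out = []

    def _rewrite(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # bare attribute such as "disabled"
                parts.append(name)
                continue
            if name in URL_ATTRS:
                value = urljoin(self.base, value)  # relative -> absolute
                if tag == "a" and name == "href":
                    # Links are sent back through the proxy script.
                    value = self.proxy + "?" + urlencode({"loc": value})
            # The parser unescapes attribute values, so escape them again.
            parts.append('%s="%s"' % (name, html.escape(value, quote=True)))
        return " ".join(parts)

    def handle_starttag(self, tag, attrs):
        self.out.append("<%s>" % self._rewrite(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append("<%s />" % self._rewrite(tag, attrs))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def fix_urls(page, base=BASE_URL, proxy=PROXY_SCRIPT_URL):
    rewriter = UrlRewriter(base, proxy)
    rewriter.feed(page)
    rewriter.close()
    return "".join(rewriter.out)
```

Calling fix_urls('<a href="other.html">x</a><img src=/pic.png>') routes the link through the proxy and makes the image URL absolute. Because the parser tokenizes attributes itself, unquoted or single-quoted values are handled the same way as double-quoted ones. URLs inside inline CSS or script text are not touched by this sketch.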
