lxml - how to change img src to an absolute link

Using lxml, how do you globally replace all src attributes with an absolute reference?


After testing Mikko Ohtamaa's answer, here are some notes. It uses lxml and works for many tags, but it does not cover other situations such as background-image: url(xxx). So I just use regular expressions for the replacement. Here's the solution:

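import re
# url is the page address. Relative paths ("img/a.png") get the page's directory as prefix.
content = re.sub(r'(?P<left>("|\'))\s*(?P<url>(\w|\.)+(/.+?)+)\s*(?P<right>("|\'))',
                 r'\g<left>' + url[:url.rfind('/')] + r'/\g<url>\g<right>', content)
# Root-relative paths ("/img/a.png") get scheme and host; find('/', 8) starts past "https://".
content = re.sub(r'(?P<left>("|\'))\s*(?P<url>(/.+?)+)\s*(?P<right>("|\'))',
                 r'\g<left>' + url[:url.find('/', 8)] + r'\g<url>\g<right>', content)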
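
For the background-image: url(xxx) case specifically, a narrower pattern may be easier to reason about than matching every quoted string. The following is only a minimal sketch, not the code from this answer: it assumes Python 3, swaps the string slicing for urllib.parse.urljoin, and the names CSS_URL and absolutize_css_urls are made up for illustration.

```python
import re
from urllib.parse import urljoin

# Matches url(...) with optional single or double quotes around the target.
CSS_URL = re.compile(
    r"""url\(\s*(?P<quote>["']?)(?P<url>[^"')]+?)(?P=quote)\s*\)"""
)


def absolutize_css_urls(content, base_url):
    """Rewrite every CSS url(...) reference in content against base_url."""
    def repl(match):
        # urljoin resolves relative and root-relative paths and leaves
        # URLs that already carry a scheme unchanged.
        absolute = urljoin(base_url, match.group("url").strip())
        return "url({q}{u}{q})".format(q=match.group("quote"), u=absolute)

    return CSS_URL.sub(repl, content)


sample = '<div style="background-image: url(img/bg.png)"></div>'
print(absolutize_css_urls(sample, "http://example.com/site/page.html"))
```

Because the pattern only touches text inside url(...), ordinary quoted strings elsewhere in the page are left alone.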




Here's some sample code that also covers <a href>


from urllib.parse import urljoin

from lxml import etree, html


def fix_links(content, absolute_prefix):
    """
    Rewrite relative links to be absolute links based on certain URL.

    @param content: HTML snippet as a string
    """
    if isinstance(content, bytes):
        content = content.decode("utf-8")

    content = content.strip()

    # create_parent=True wraps the snippet in a <div> so that it has a single root
    tree = html.fragment_fromstring(content, create_parent=True)

    def join(base, url):
        """
        Join relative URL
        """
        if not (url.startswith("/") or "://" in url):
            return urljoin(base, url)
        else:
            # Root-relative or already absolute: leave unchanged
            return url

    for node in tree.xpath('//*[@src]'):
        url = node.get('src')
        url = join(absolute_prefix, url)
        node.set('src', url)
    for node in tree.xpath('//*[@href]'):
        href = node.get('href')
        url = join(absolute_prefix, href)
        node.set('href', url)

    # Serialized as UTF-8 encoded bytes, including the wrapping <div>
    data = etree.tostring(tree, pretty_print=False, encoding="utf-8")

    return data


The full story is available in the Plone developer documentation.
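
As an alternative to walking the src and href attributes by hand, lxml.html elements provide a make_links_absolute() method that resolves the links in a tree against a base URL. The snippet below is a rough sketch under a few assumptions: Python 3, an example base URL, and a helper name (make_absolute) chosen only for illustration. Unlike the join() helper above, it also resolves root-relative links such as /logo.png.

```python
from lxml import html


def make_absolute(content, base_url):
    """Return the HTML snippet with its links resolved against base_url."""
    # create_parent=True wraps the snippet in a <div> so it has one root.
    tree = html.fragment_fromstring(content.strip(), create_parent=True)
    # Rewrites the links lxml knows about (src, href and so on) in place.
    tree.make_links_absolute(base_url)
    # encoding="unicode" returns text rather than encoded bytes.
    return html.tostring(tree, encoding="unicode")


snippet = '<img src="images/logo.png"><a href="about.html">About</a>'
print(make_absolute(snippet, "http://example.com/docs/"))
```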



