lxml - how to change img src to an absolute link

Using lxml, how do you globally replace all src attributes with an absolute reference?

2 answers


After testing Mikko Ohtamaa's answer, here are some notes. It works for many tags and uses lxml, but it does not cover other cases such as CSS references like background-image: url(xxx). So I just use regular expressions for the replacement. Here's the solution:



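import re
# url is the page address; relative paths such as "img/a.png" get its directory prepended.
content = re.sub(r'(?P<left>("|\'))\s*(?P<url>(\w|\.)+(/.+?)+)\s*(?P<right>("|\'))',
                 r'\g<left>' + url[:url.rfind('/')] + r'/\g<url>\g<right>', content)
# Root-relative paths such as "/img/a.png" get only the scheme and host prepended.
content = re.sub(r'(?P<left>("|\'))\s*(?P<url>(/.+?)+)\s*(?P<right>("|\'))',
                 r'\g<left>' + url[:url.find('/', 8)] + r'\g<url>\g<right>', content)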
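Since the CSS case is the one that motivated the regex approach, here is a narrower sketch that only touches url(...) references and lets urljoin do the resolving. The function name absolutize_css_urls and the example.com address are made up for illustration, and the pattern is a simplification that assumes no parentheses or escaped quotes inside the URL.

```python
import re
from urllib.parse import urljoin

# Matches url(...) with optional single or double quotes around the target.
CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def absolutize_css_urls(css, base_url):
    """Resolve every url(...) reference in a CSS string against base_url."""
    def repl(match):
        quote, target = match.group(1), match.group(2)
        # urljoin leaves targets that are already absolute unchanged.
        return "url({0}{1}{0})".format(quote, urljoin(base_url, target))
    return CSS_URL.sub(repl, css)


style = "background-image: url('img/bg.png'); color: red"
print(absolutize_css_urls(style, "http://example.com/blog/post.html"))
# background-image: url('http://example.com/blog/img/bg.png'); color: red
```

Because only the text inside url(...) is rewritten, other quoted strings in the page are left alone.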

      



Here's some sample code that also covers <a href>:

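from lxml import etree, html
import urllib.parse as urlparse  # Python 3 home of the old urlparse module

def fix_links(content, absolute_prefix):
    """
    Rewrite relative links to be absolute links based on a given URL.

    @param content: HTML snippet as a string
    @param absolute_prefix: base URL that relative links are resolved against
    """

    # Accept UTF-8 encoded bytes as well as text
    if isinstance(content, bytes):
        content = content.decode("utf-8")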

    content = content.strip()

    # create_parent=True wraps the snippet in a <div> so it parses as a single tree
    tree = html.fragment_fromstring(content, create_parent=True)

    def join(base, url):
        """
        Join relative URL
        """
        if not (url.startswith("/") or "://" in url):
            return urlparse.urljoin(base, url)
        else:
            # Root-relative ("/...") or already absolute: leave unchanged
            return url

    for node in tree.xpath('//*[@src]'):
        url = node.get('src')
        url = join(absolute_prefix, url)
        node.set('src', url)
    for node in tree.xpath('//*[@href]'):
        href = node.get('href')
        url = join(absolute_prefix, href)
        node.set('href', url)

    # Serialize back to UTF-8 encoded bytes (the wrapper <div> is included)
    data = etree.tostring(tree, pretty_print=False, encoding="utf-8")

    return data
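For reference, a short sketch of how urljoin treats the kinds of URL that the join helper distinguishes. The base URL is a made-up example. Note that urljoin returns absolute URLs unchanged on its own and can also resolve root-relative ones, which fix_links leaves as they are.

```python
from urllib.parse import urljoin

base = "http://example.com/blog/post.html"  # hypothetical page URL

# Document-relative: resolved against the directory of the base URL.
print(urljoin(base, "img/a.png"))                 # http://example.com/blog/img/a.png
print(urljoin(base, "../style.css"))              # http://example.com/style.css
# Root-relative: keeps scheme and host, replaces the path.
print(urljoin(base, "/img/a.png"))                # http://example.com/img/a.png
# Protocol-relative: takes only the scheme from the base.
print(urljoin(base, "//cdn.example.org/a.png"))   # http://cdn.example.org/a.png
# Already absolute: returned unchanged.
print(urljoin(base, "https://example.org/a.png")) # https://example.org/a.png
```

If root-relative links should become absolute as well, dropping the startswith("/") check and calling urljoin unconditionally would be one way to do it.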

      



The full story is available in the Plone developer documentation.
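As an alternative sketch, lxml.html elements also have a make_links_absolute method, which may be enough when no custom filtering is needed. The wrapper name make_absolute and the example URL below are assumptions for illustration. As far as I know from the lxml documentation, the method also rewrites url(...) references in style attributes, but that is worth verifying against your lxml version.

```python
from lxml import html


def make_absolute(content, base_url):
    """Return the HTML snippet with its links resolved against base_url."""
    # create_parent=True wraps the snippet in a <div> so it parses as one tree.
    tree = html.fragment_fromstring(content, create_parent=True)
    # Rewrites src, href and other link-carrying attributes in place.
    tree.make_links_absolute(base_url)
    # html.tostring emits HTML (not XML) syntax; "unicode" returns text.
    return html.tostring(tree, encoding="unicode")


snippet = '<img src="img/a.png"><a href="/about">About</a>'
print(make_absolute(snippet, "http://example.com/blog/post.html"))
```

Unlike the join helper in fix_links, this resolves root-relative links such as /about too.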
