Get full URL from a shortened URL using Python

I have a list of urls like

l=['bit.ly/1bdDlXc','bit.ly/1bdDlXc',.......,'bit.ly/1bdDlXc']


I just want to see the full url from the short one for each item in this list.

Here is my approach,

import urllib2

for i in l:
    print urllib2.urlopen(i).url


But when the list contains thousands of URLs, the program takes a long time.

My question is: is there a way to reduce the execution time, or another approach I should follow?



2 answers


Method one

As suggested, one way to accomplish this would be to use the official bit.ly API, which however has limitations (e.g. no more than 15 shortUrl parameters per request).
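
For completeness, here is a rough sketch of what batching through that API could look like, using the requests package (also used in the next method). It assumes the v3 /expand endpoint and an OAuth access token (the token value below is only a placeholder); check the bit.ly documentation for the exact parameters:

import requests

ACCESS_TOKEN = 'YOUR_BITLY_ACCESS_TOKEN'  # placeholder, not a real token
l = ['bit.ly/1bdDlXc', 'bit.ly/1bdDlXc']

# Send the short URLs in chunks of 15, the documented per-request limit
for chunk in (l[i:i + 15] for i in range(0, len(l), 15)):
    params = [('access_token', ACCESS_TOKEN)]
    params += [('shortUrl', 'http://' + u) for u in chunk]
    resp = requests.get('https://api-ssl.bitly.com/v3/expand', params=params).json()
    for entry in resp['data']['expand']:
        print entry.get('long_url')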

Method two



Alternatively, one could simply avoid fetching the content at all, e.g. by using the HEAD HTTP method instead of GET. Here is some sample code that uses the excellent requests package:

import requests

l=['bit.ly/1bdDlXc','bit.ly/1bdDlXc',.......,'bit.ly/1bdDlXc']

for i in l:
    print requests.head("http://"+i).headers['location']
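
For a list with thousands of entries, the HEAD requests themselves can also be issued concurrently. Here is a minimal sketch using a thread pool; concurrent.futures is in the standard library on Python 3.2+ and available on Python 2 via the futures backport:

import requests
from concurrent.futures import ThreadPoolExecutor  # 'pip install futures' on Python 2

l = ['bit.ly/1bdDlXc', 'bit.ly/1bdDlXc']

def expand(short_url):
    # head() does not follow the redirect, so the Location header
    # contains the expanded URL
    return requests.head("http://" + short_url).headers.get('location')

# Issue up to 10 HEAD requests at a time; results come back in input order
with ThreadPoolExecutor(max_workers=10) as pool:
    for full_url in pool.map(expand, l):
        print full_url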



I would try a Twisted asynchronous web client. Be careful with this, however, as it does no rate limiting at all.



#!/usr/bin/python2.7

from twisted.internet import reactor
from twisted.internet.defer import Deferred, DeferredList, DeferredLock
from twisted.internet.defer import inlineCallbacks
from twisted.web.client import Agent, HTTPConnectionPool
from twisted.web.http_headers import Headers
from pprint import pprint
from collections import defaultdict
from urlparse import urlparse
from random import randrange
import fileinput

pool = HTTPConnectionPool(reactor)
pool.maxPersistentPerHost = 16
agent = Agent(reactor, pool)
locks = defaultdict(DeferredLock)
locations = {}

def getLock(url, simultaneous = 1):
    return locks[urlparse(url).netloc, randrange(simultaneous)]

@inlineCallbacks
def getMapping(url):
    # Limit ourselves to 4 simultaneous connections per host
    # Tweak this as desired, but make sure that it is no larger than
    # pool.maxPersistentPerHost
    lock = getLock(url,4)
    yield lock.acquire()
    try:
        resp = yield agent.request('HEAD', url)
        locations[url] = resp.headers.getRawHeaders('location',[None])[0]
    except Exception as e:
        locations[url] = str(e)
    finally:
        lock.release()


dl = DeferredList(getMapping(url.strip()) for url in fileinput.input())
dl.addCallback(lambda _: reactor.stop())

reactor.run()
pprint(locations)
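
Since the script reads its input with fileinput, you can pass a file of short URLs as an argument or pipe them in on stdin, e.g. python expand.py urls.txt (the file name here is just illustrative). Note that, unlike the requests example above, the Twisted agent expects absolute URLs including the http:// scheme on each input line.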







