Get the full URL from a shortened URL using Python
I have a list of urls like
l=['bit.ly/1bdDlXc','bit.ly/1bdDlXc',.......,'bit.ly/1bdDlXc']
I just want to get the full URL behind the short one, for each item in this list.
Here is my approach:
import urllib2
for i in l:
    print urllib2.urlopen(i).url
But when the list contains thousands of URLs, the program takes a long time.
My question is: is there a way to reduce the execution time, or is there another approach I should follow?
Method one
As suggested, one way to accomplish this would be to use the official bit.ly shortening API, which, however, has limitations (e.g., no more than 15 shortUrl parameters
per request).
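Given that 15-per-request limit, a list of thousands of URLs has to be expanded in batches. Here is a minimal sketch, not the official client: it assumes the v3 `/expand` endpoint and takes a placeholder `access_token` argument; the `chunks` helper is generic batching logic.

```python
def chunks(seq, size):
    """Split seq into consecutive batches of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def expand_all(short_urls, access_token):
    """Resolve short URLs via bit.ly's /v3/expand, 15 per request (sketch)."""
    import requests  # imported lazily; the batching above is stdlib-only
    api = "https://api-ssl.bitly.com/v3/expand"
    mapping = {}
    for batch in chunks(short_urls, 15):  # API limit: 15 shortUrl per call
        params = [("access_token", access_token)]
        params += [("shortUrl", "http://" + u) for u in batch]
        data = requests.get(api, params=params).json()
        for entry in data["data"]["expand"]:
            mapping[entry["short_url"]] = entry.get("long_url")
    return mapping
```

With 15 URLs per call, a list of thousands still needs only a few hundred requests instead of one per URL.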
Method two
Alternatively, one could simply avoid fetching the response body at all, e.g. by using the HEAD
HTTP method instead of GET
. Here is some sample code that uses the excellent requests package:
import requests
l=['bit.ly/1bdDlXc','bit.ly/1bdDlXc',.......,'bit.ly/1bdDlXc']
for i in l:
    print requests.head("http://"+i).headers['location']
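Each HEAD request still blocks on a network round trip, so for thousands of URLs most of the time is spent waiting. Issuing the requests concurrently helps a lot. Below is a hedged sketch using a thread pool; it assumes Python 3 (or the `futures` backport on Python 2), and the `resolver` parameter is a hypothetical hook added here so the HTTP call can be swapped out for testing or for a different client.

```python
from concurrent.futures import ThreadPoolExecutor

def resolve_head(short_url):
    """One HEAD request; returns the Location header (or None)."""
    import requests
    resp = requests.head("http://" + short_url, allow_redirects=False)
    return resp.headers.get("location")

def resolve_all(short_urls, resolver=resolve_head, workers=20):
    """Resolve many short URLs concurrently with a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so zip pairs each URL with its result
        return dict(zip(short_urls, pool.map(resolver, short_urls)))
```

Twenty workers is an arbitrary starting point; raise or lower it depending on how aggressively you want to hit bit.ly.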
I would try Twisted's asynchronous web client. Be careful with this, however: it does not rate-limit at all.
#!/usr/bin/python2.7
from twisted.internet import reactor
from twisted.internet.defer import DeferredList, DeferredLock
from twisted.internet.defer import inlineCallbacks
from twisted.web.client import Agent, HTTPConnectionPool
from pprint import pprint
from collections import defaultdict
from urlparse import urlparse
from random import randrange
import fileinput

pool = HTTPConnectionPool(reactor)
pool.maxPersistentPerHost = 16
agent = Agent(reactor, pool)

locks = defaultdict(DeferredLock)
locations = {}

def getLock(url, simultaneous=1):
    return locks[urlparse(url).netloc, randrange(simultaneous)]

@inlineCallbacks
def getMapping(url):
    # Limit ourselves to 4 simultaneous connections per host.
    # Tweak this as desired, but make sure that it is no larger than
    # pool.maxPersistentPerHost
    lock = getLock(url, 4)
    yield lock.acquire()
    try:
        resp = yield agent.request('HEAD', url)
        locations[url] = resp.headers.getRawHeaders('location', [None])[0]
    except Exception as e:
        locations[url] = str(e)
    finally:
        lock.release()

dl = DeferredList([getMapping(url.strip()) for url in fileinput.input()])
dl.addCallback(lambda _: reactor.stop())

reactor.run()
pprint(locations)