Script to download comics from EXPLOSM.net [Python]

So, I wrote this short script (is that the correct word?) to download the comics from explosm.net, because I found out about it fairly recently and I want to... put it on my iPhone... 3G.

It works fine, and that's about it: urllib2 for fetching the web page HTML and urllib for image.retrieve().

Why I posted this on SO: how do I optimize this code? Would REGEX (regular expressions) make it faster? Is it limited by the internet connection? A bad algorithm...?

Any improvements to the speed or the overall aesthetics of the code would be greatly appreciated answers.

Thanks.

-------------------------------- CODE --------------------------------

import urllib, urllib2

def LinkConvert(string_link):
    # Replace any spaces in the image URL with "%20" so it is a valid URL
    for eachLetter in string_link:
        if eachLetter == " ":
            string_link = string_link[:string_link.find(eachLetter)] + "%20" + string_link[string_link.find(eachLetter)+1:]
    return string_link

start = 82
end = 1506

# Markers used to slice the comic <img> tag out of the page HTML
matchingStart = """<img alt="Cyanide and Happiness, a daily webcomic" src="http://www.explosm.net/db/files/Comics/"""
matchingEnd = """></"""
link = "http://www.explosm.net/comics/"

for pageNum in range(start,start+7):
    # Fetch the comic page HTML
    req = urllib2.Request(link+`pageNum`)
    response = urllib2.urlopen(req)
    page = response.read()

    # Slice out the <img ...> tag for the comic
    istart1 = page.find(matchingStart)
    iend1 = page.find(matchingEnd, istart1)
    newString1 = page[istart1 : iend1]

    # Slice out the value of the src attribute (the image URL)
    istart2 = newString1.find("src=")+4
    iend2 = len(newString1)
    final = newString1[istart2 +1 : iend2 -1]

    final = LinkConvert(final)
    try:
        # Download the image and save it as <pageNum>.jpg
        image = urllib.URLopener()
        image.retrieve(final, `pageNum` + ".jpg")
    except:
        print "Uh-oh! " + `pageNum` + " was not downloaded!"

    print `pageNum` + " completed..."

      

By the way, this is Python 2.5 code, not 3.0, though as you can imagine I have all the Python 3.0 features left to thoroughly explore and play around with before or just after New Year's (after college apps - YAY! ^-^)

+1




5 answers


I would suggest using Scrapy for fetching the pages and Beautiful Soup for the parsing. This would make your code a lot simpler.
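
For example, a minimal sketch of your download loop using urllib2 plus BeautifulSoup might look something like this (untested; the alt text used to locate the tag is taken from the matchingStart string in your code, and the URL scheme is assumed to be unchanged):

import urllib, urllib2
from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3, works on Python 2.5

for pageNum in range(82, 89):
    html = urllib2.urlopen("http://www.explosm.net/comics/" + str(pageNum)).read()
    soup = BeautifulSoup(html)
    # Find the comic <img> tag by its alt text instead of slicing strings
    img = soup.find("img", alt="Cyanide and Happiness, a daily webcomic")
    if img is None:
        print "No comic image found on page " + str(pageNum)
        continue
    # Note: you may still need to escape spaces in the URL (e.g. with urllib.quote)
    urllib.urlretrieve(img["src"], str(pageNum) + ".jpg")
    print str(pageNum) + " completed..."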



Whether you want to change existing code that already works for these alternatives is up to you. If not, then regular expressions would probably simplify your code somewhat. I'm not sure what effect they would have on performance.
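
If you do try regular expressions, a rough sketch of the extraction step could look like this (the pattern is an assumption built from the matchingStart marker in your code, so it may need adjusting to the real markup):

import re

# Assumed pattern: grab the src attribute of the comic <img> tag
comic_re = re.compile(r'<img alt="Cyanide and Happiness, a daily webcomic"\s+src="([^"]+)"')

match = comic_re.search(page)   # 'page' is the HTML string from urllib2
if match:
    final = match.group(1)      # the image URL, no manual slicing needed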

+7




refactormycode might be a more suitable website for this kind of "let's improve this code" discussion.



+3




I suggest using BeautifulSoup to do the parsing; it would simplify your code.

But since you already have it working this way, you may not want to touch it until it breaks (i.e. when the page format changes).

0




urllib2 uses blocking calls, and that is the main reason for the poor performance. You should either use a non-blocking library (like scrapy) or use multiple threads for the fetching. I've never used scrapy (so I can't speak to it), but threading in Python is really easy and straightforward.
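
For example, here is a rough thread-pool sketch using the standard threading and Queue modules; download_comic is a hypothetical helper that would wrap the fetch/parse/retrieve logic from the question:

import threading, Queue

def download_comic(pageNum):
    # ... same urllib2/urllib logic as in the question ...
    pass

def worker(q):
    # Keep pulling comic numbers off the queue until it is empty
    while True:
        try:
            pageNum = q.get_nowait()
        except Queue.Empty:
            return
        download_comic(pageNum)

q = Queue.Queue()
for pageNum in range(82, 89):
    q.put(pageNum)

threads = [threading.Thread(target=worker, args=(q,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Since the queue is filled before the threads start, each worker simply exits once the queue runs empty, and the join() calls wait for all downloads to finish.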

0




I did the same thing today using Bash. It's really basic, but it works fine.

First, I created two directories where I put the files:

mkdir -p html/archived
mkdir png

      

Then I worked in two steps. First, crawl through all the pages:

# Range of comic numbers to fetch
START=15
END=4783
for ((i=START;i<=END;i++)); do
  echo $i
  wget http://explosm.net/comics/$i/ -O html/$i.html
done

# Remove 404s (wget leaves zero-size files for pages that don't exist)
find html -name '*.html' -size 0 -print0 | xargs -0 rm

      

Second, for each page, parse the HTML and extract the image:

#!/bin/bash
for filename in ./html/*.html; do
  # Extract the comic number from the filename (./html/NNNN.html -> NNNN)
  i=`echo $filename | cut -d '"' -f 4 | cut -d '/' -f3 | cut -d '.' -f1`
  echo "$filename => $i"
  # Grab the image URL from the og:image meta tag and download it
  wget -c "$(grep '<meta property="og:image" content=' ${filename} | cut -d '"' -f 4)" -O ./png/${i}.png
  mv $filename ./html/archived/
done

      

The result is here: Cyanide_and_happiness__up_to_2017-11-24.zip

Note that I didn't really bother with potential failures, but with 4606 files downloaded it seems to be mostly OK.

I also saved everything as .png. They are probably .jpg, and I notice 185 files of size 0, but... feel free to take care of that; I just won't :)

0








