How to remove a line from a file after using it

I am trying to create a script that makes requests to random URLs from a txt file, for example:

import urllib2

with open('urls.txt') as urls:
    for url in urls:
        try:
            r = urllib2.urlopen(url)
        except urllib2.URLError as e:
            r = e
        if r.code in (200, 401):
            print '[{}]: '.format(url), "Up!"


But when some URL returns 404 Not Found, I want the line containing that URL to be removed from the file. There is one unique URL per line, so basically the goal is to remove every URL (and its line) that returns 404 Not Found. How can I do this?

+3




2 answers


The easiest way is to read all the lines, try to open each saved URL, and then, once you're done, rewrite the file if any URLs failed.

The way to overwrite a file safely is to write a new file, and then, once the new file has been successfully written and closed, use os.rename() to rename the new file over the old one. This is the safe way to do it: you never destroy the good file until you know the new file was written correctly.
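The rewrite-then-rename pattern described above can be sketched as follows; the filename and the helper name `rewrite_without_bad_lines` are illustrative, not part of the original script:

```python
import os

def rewrite_without_bad_lines(filename, bad_urls):
    """Rewrite filename, dropping any line whose URL is in bad_urls."""
    temp_file = filename + ".tmp"
    with open(filename) as src, open(temp_file, "w") as dst:
        for line in src:
            # keep the line only if its URL is not in the bad set
            if line.strip() not in bad_urls:
                dst.write(line)
    # the original is only replaced once the temp file is complete
    if os.name == "nt":
        os.remove(filename)  # os.rename() won't overwrite on Windows
    os.rename(temp_file, filename)
```

Because os.rename() on POSIX atomically replaces the destination, a crash in the middle of the rewrite leaves the original file untouched.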

I think the easiest way to do this is to collect the good URLs in a list, plus keep a count of bad URLs; if the counter is nonzero, you need to rewrite the text file. Alternatively, you can collect the bad URLs in a second list, which is what this example code does.

import os
import urllib2

input_file = "urls.txt"
debug = True

good_urls = []
bad_urls = []

bad, good = range(2)

def track(url, good_flag, code):
    if good_flag == good:
        good_str = "good"
    elif good_flag == bad:
        good_str = "bad"
    else:
        good_str = "ERROR! (" + repr(good_flag) + ")"
    if debug:
        print("DEBUG: %s: '%s' code %s" % (good_str, url, repr(code)))
    if good_flag == good:
        good_urls.append(url)
    else:
        bad_urls.append(url)

with open(input_file) as f:
    for line in f:
        url = line.strip()
        try:
            r = urllib2.urlopen(url)
            if r.code in (200, 401):
                print '[{0}]: '.format(url), "Up!"
            # urlopen() returns normally only for non-error codes
            track(url, good, r.code)
        except urllib2.HTTPError as e:
            # urllib2 raises HTTPError for 4xx/5xx codes, including 404
            if e.code == 404:
                # URL is bad if it is missing (code 404)
                track(url, bad, e.code)
            else:
                # any code other than 404, assume URL is good
                track(url, good, e.code)
        except urllib2.URLError as e:
            track(url, bad, "exception!")

# if any URLs were bad, rewrite the input file to remove them.
if bad_urls:
    # simple way to get a filename for temp file: append ".tmp" to filename
    temp_file = input_file + ".tmp"
    with open(temp_file, "w") as f:
        for url in good_urls:
            f.write(url + '\n')
    # if we reach this point, temp file is good.  Remove old input file
    os.remove(input_file)  # only needed for Windows
    os.rename(temp_file, input_file)  # replace original input file with temp file




EDIT: In the comments, @abarnert points out that there can be a problem with os.rename() on Windows (at least I think that's what he/she means): it fails if the destination file already exists. If os.rename() doesn't work for you, you can use shutil.move() instead.
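That fallback can be sketched like this; the helper name `replace_file` and the filenames are made up for illustration. It tries os.rename() first and falls back to shutil.move(), which copies over the destination when a plain rename fails:

```python
import os
import shutil

def replace_file(temp_path, final_path):
    """Move temp_path over final_path, overwriting it."""
    try:
        # atomic on POSIX; raises OSError on Windows if the
        # destination already exists
        os.rename(temp_path, final_path)
    except OSError:
        # shutil.move() falls back to copy-and-delete, which
        # overwrites an existing destination file
        shutil.move(temp_path, final_path)
```

(In Python 3 there is also os.replace(), which overwrites the destination on both platforms.)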

EDIT: Rewrote the code to handle errors.

EDIT: Rewrote to add verbose messages while tracking URLs. This should help with debugging. Also, I actually tested this version and it works for me.

+1




You could just save all the URLs that worked and then write them back to the file (note that this rewrites urls.txt in place, without the temp-file safety net described in the other answer):



good_urls = []
with open('urls.txt') as urls:
    for url in urls:
        try:
            r = urllib2.urlopen(url)
        except urllib2.URLError as e:
            r = e
        # a plain URLError (e.g. a connection failure) has no .code
        # attribute, so use getattr() to avoid an AttributeError
        if getattr(r, 'code', None) in (200, 401):
            print '[{}]: '.format(url), "Up!"
            good_urls.append(url)
# each url keeps its trailing newline from the file iteration,
# so a plain join reproduces the original line formatting
with open('urls.txt', 'w') as urls:
    urls.write("".join(good_urls))


+2








