How to remove a line from a file after using it
I am trying to create a script that makes requests to URLs listed in a txt file, for example:
import urllib2

with open('urls.txt') as urls:
    for url in urls:
        try:
            r = urllib2.urlopen(url)
        except urllib2.URLError as e:
            r = e
        if r.code in (200, 401):
            print '[{}]: '.format(url), "Up!"
But when a URL returns 404 Not Found, I want the line containing that URL to be removed from the file. There is one URL per line, so basically the goal is to remove every line whose URL returns 404 Not Found. How can I do this?
The easiest way is to read in all the lines, try to open each saved URL, and then, when you're done, rewrite the file if any URLs failed.

The safe way to overwrite a file is to write a new file, and then, when the new file is successfully written and closed, use os.rename() to rename the new file over the old one. That way you never overwrite the good file until you know the new file was written correctly.
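As a rough sketch of that write-then-rename pattern (Python 3 here; rewrite_lines is a name I made up for illustration, and os.replace() is used because it overwrites the destination atomically on both POSIX and Windows):

```python
import os

def rewrite_lines(path, keep):
    """Rewrite the file at `path`, keeping only lines for which keep(line) is true.

    Writes to a temp file first, then renames it over the original, so the
    original file is never left half-written if something goes wrong.
    """
    tmp_path = path + ".tmp"
    with open(path) as src, open(tmp_path, "w") as dst:
        for line in src:
            if keep(line):
                dst.write(line)
    # os.replace() overwrites the destination, even on Windows
    os.replace(tmp_path, path)
```

If the write fails partway through, only the ".tmp" file is affected; the original stays intact.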
I think the easiest way to do this is to simply build a list where you collect the good URLs, plus keep a count of bad URLs. If the count is not zero, you need to rewrite the text file. Alternatively, you can collect the bad URLs in another list; that is what this example code does. (I haven't tested this code, but I think it should work.)
import os
import urllib2

input_file = "urls.txt"
debug = True

good_urls = []
bad_urls = []

bad, good = range(2)

def track(url, good_flag, code):
    if good_flag == good:
        good_str = "good"
    elif good_flag == bad:
        good_str = "bad"
    else:
        good_str = "ERROR! (" + repr(good_flag) + ")"
    if debug:
        print("DEBUG: %s: '%s' code %s" % (good_str, url, repr(code)))
    if good_flag == good:
        good_urls.append(url)
    else:
        bad_urls.append(url)

with open(input_file) as f:
    for line in f:
        url = line.strip()
        try:
            r = urllib2.urlopen(url)
            if r.code in (200, 401):
                print '[{0}]: '.format(url), "Up!"
            # urlopen() raises HTTPError for a 404, so if we get here
            # the URL is good
            track(url, good, r.code)
        except urllib2.HTTPError as e:
            if e.code == 404:
                # URL is bad if it is missing (code 404)
                track(url, bad, e.code)
            else:
                # any code other than 404, assume URL is good
                track(url, good, e.code)
        except urllib2.URLError:
            track(url, bad, "exception!")

# if any URLs were bad, rewrite the input file to remove them.
if bad_urls:
    # simple way to get a filename for temp file: append ".tmp" to filename
    temp_file = input_file + ".tmp"
    with open(temp_file, "w") as f:
        for url in good_urls:
            f.write(url + '\n')
    # if we reach this point, temp file is good. Remove old input file
    os.remove(input_file)  # only needed for Windows
    os.rename(temp_file, input_file)  # replace original input file with temp file
EDIT: In the comments, @abarnert suggests there might be a problem with using os.rename() on Windows (at least I think that's what they mean). If os.rename() doesn't work, you can use shutil.move() instead.
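For reference, a minimal sketch of that fallback (replace_file is a hypothetical helper name; shutil.move() falls back to copy-and-delete when a plain rename isn't possible, e.g. across filesystems):

```python
import os
import shutil

def replace_file(temp_path, final_path):
    # Remove the destination first so the move succeeds on Windows,
    # where renaming onto an existing file raises an error.
    if os.path.exists(final_path):
        os.remove(final_path)
    shutil.move(temp_path, final_path)
```

Note this loses the atomicity of a pure rename: there is a brief window where final_path does not exist.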
EDIT: Rewrote the code to handle errors.

EDIT: Rewrote to add verbose messages as the URLs are tracked. This should help with debugging. Also, I actually tested this version and it works for me.
You could just save all the urls that worked and then rewrite them to a file:
good_urls = []

with open('urls.txt') as urls:
    for url in urls:
        try:
            r = urllib2.urlopen(url)
        except urllib2.URLError as e:
            r = e
        # URLError has no .code attribute, so use getattr() with a default
        if getattr(r, 'code', None) in (200, 401):
            print '[{}]: '.format(url), "Up!"
            good_urls.append(url)

with open('urls.txt', 'w') as urls:
    urls.write("".join(good_urls))
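In Python 3, urllib2 was split into urllib.request and urllib.error; a rough equivalent of this filter-and-rewrite approach might look like the sketch below. The is_up and prune_urls names are my own, and the status check is passed in as a parameter so it can be swapped out or tested without network access:

```python
import urllib.error
import urllib.request

def is_up(url):
    """True if `url` answers with a status we treat as 'up' (200 or 401)."""
    try:
        with urllib.request.urlopen(url) as r:
            return r.status in (200, 401)
    except urllib.error.HTTPError as e:
        return e.code in (200, 401)
    except urllib.error.URLError:
        return False

def prune_urls(path, check=is_up):
    # Read all lines first, then rewrite the file with only the good ones.
    with open(path) as f:
        good = [line for line in f if check(line.strip())]
    with open(path, "w") as f:
        f.writelines(good)
```

As with the original answer, this rewrites the file in place, so a crash between the read and the write can lose data; the temp-file-plus-rename approach from the first answer avoids that.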