Is appending to a Python list slow?

I need to combine two text files into one new list. The first file contains URLs and the other contains URL paths / folders to be applied to each URL. I'm working with lists and it is very slow because there are roughly 200,000 items.







Later, after the end of the loop, there should be a new list with


Python code:

import re

URLS_TO_CHECK = []  # defined as global, needed later

def generate_list():
  urls = open("urls.txt", "r").read().splitlines()
  paths = open("paths.txt", "r").read().splitlines()
  done = open("done.txt", "r").read().splitlines()  # old done urls

  for i in range(len(urls)):
    for x in range(len(paths)):
        # the call around this pattern was lost in the post; re.search is an assumption
        match = re.search('(http://(.+?)....)', urls[i])  # needed
        url = "%s%s" % (match.group(1), paths[x])
        if url not in URLS_TO_CHECK:
            if url not in done:
                URLS_TO_CHECK.append(url)  ##<<< slow!


I have already read other threads that suggest using the map function or disabling gc, but I cannot use map in my program, and turning off gc did not really help.



3 answers

This approach relies on the following:

  • fast set lookup: O(1) on average instead of O(n)
  • generating values on demand instead of creating the entire list at once
  • reading the file line by line instead of loading all data at once
  • avoiding an unnecessary regex

def yield_urls():
    with open("paths.txt") as f:
        paths = f.read().splitlines()  # iterated for every url, so a list is fine

    with open("done.txt") as f:
        done_urls = set(f.read().splitlines())  # looked up for every candidate; set lookup is O(1) vs O(n) for a list

    # the files are closed when each with block exits

    with open("urls.txt", "r") as f:
        for url in f:  # iterate over the lines directly instead of indexing through range(len(...))
            url = url.rstrip("\n")  # lines read from a file keep their trailing newline
            for subpath in paths:
                # slicing off the leading "http://" avoids the regex; whether join or
                # string formatting is quicker is something to measure
                full_url = ''.join((url[7:], subpath))
                if full_url not in done_urls:  # fast lookup in set
                    yield full_url  # yield instead of appending

# usage
for url in yield_urls():
    pass  # do something with url
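
One thing the generator above does not do is skip duplicates among the newly generated URLs, which the loop in the question handled with its `url not in URLS_TO_CHECK` test. Here is a minimal sketch of how that could be added, assuming the inputs are already in memory as lists. The name `iter_new_urls` and the sample data are made up for illustration.

```python
def iter_new_urls(urls, paths, done):
    """Yield each url + path combination once, skipping anything in done."""
    seen = set(done)  # copy into a set so membership tests are O(1) on average
    for url in urls:
        for path in paths:
            full_url = url + path
            if full_url not in seen:
                seen.add(full_url)  # remember it so a repeat is not yielded again
                yield full_url


# hypothetical sample data; the third entry repeats the first on purpose
urls = ["http://a.example", "http://b.example", "http://a.example"]
paths = ["/admin", "/login"]
done = ["http://a.example/login"]

new_urls = list(iter_new_urls(urls, paths, done))
print(new_urls)
```

Using a single `seen` set for both the finished URLs and the ones already yielded keeps it to one lookup per candidate, at the cost of holding every yielded URL in memory.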




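import re
paths = open("paths.txt", "r").read().splitlines()
URLS_TO_CHECK = set(re.findall("'http://(.+?)....'", open("urls.txt", "r").read()))  # pattern kept as posted
for url in URLS_TO_CHECK:
    for path in paths:
        ...  # per-URL work goes here; the loop body was not included in the answer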


This will probably be much faster, since the set removes duplicate URLs before the loop starts, and I think it is essentially the same.
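
If the order of the result does not matter, the whole computation can also be written as set operations. This is only a sketch of that idea, not the code from the answer above. `build_candidates` and the sample values are hypothetical.

```python
def build_candidates(urls, paths, done):
    """Return every url + path combination that is not already in done."""
    # the set comprehension collapses duplicate combinations automatically
    combined = {url + path for url in urls for path in paths}
    # set difference drops the URLs that were already processed
    return combined - set(done)


candidates = build_candidates(
    ["http://a.example", "http://b.example"],
    ["/admin", "/login"],
    ["http://b.example/admin"],
)
print(sorted(candidates))
```

The trade-off is that the full set of combinations is built in memory before anything is filtered out, which may or may not be acceptable for the input sizes mentioned in the question.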



Dictionary lookups are faster than list lookups; see "Python: List vs Dict for table lookups".

import re

URLS_TO_CHECK = {}  # defined as global, needed later

def generate_list():
  urls = open("urls.txt", "r").read().splitlines()
  paths = open("paths.txt", "r").read().splitlines()
  done = dict([(l, True) for l in open("done.txt", "r").read().splitlines()])  # old done urls

  for i in range(len(urls)):
    for x in range(len(paths)):
      # the call around this pattern was lost in the post; re.search is an assumption
      match = re.search('(http://(.+?)....)', urls[i])  # needed
      url = "%s%s" % (match.group(1), paths[x])
      if url not in URLS_TO_CHECK:
        if url not in done:
          URLS_TO_CHECK[url] = True  # result is in URLS_TO_CHECK.keys()
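
The difference between the two lookups can be measured directly. The following is a rough sketch using the standard `timeit` module. The function name `time_lookups`, the sizes, and the generated URLs are assumptions chosen for illustration, and the absolute numbers will vary by machine.

```python
import timeit


def time_lookups(size=50_000, repeats=20):
    """Return (list_seconds, dict_seconds) for repeated failed membership tests."""
    items = ["http://host.example/%d" % i for i in range(size)]
    as_list = items
    as_dict = dict.fromkeys(items, True)
    missing = "http://host.example/missing"  # not present, so the list is scanned fully

    list_seconds = timeit.timeit(lambda: missing in as_list, number=repeats)
    dict_seconds = timeit.timeit(lambda: missing in as_dict, number=repeats)
    return list_seconds, dict_seconds


list_seconds, dict_seconds = time_lookups()
print("list: %.6f s, dict: %.6f s" % (list_seconds, dict_seconds))
```

The list test has to compare against every element before it can report a miss, while the dict test hashes the key once, which is why the gap grows with the number of items.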



