How can I write the following code in a more efficient and pythonic way?

I have a list of URLs, file_url_list, which prints as:

www.latimes.com, www.facebook.com, affinitweet.com, ...

      

And another list of the top 1M URLs, top_url_list, which prints as:

[1, google.com], [2, www.google.com], [3, microsoft.com], ...

      

I want to find how many URLs in file_url_list are in top_url_list. I wrote the following code, which works, but I know it is neither the fastest way to do it nor the most pythonic one.

# Find the common occurrences
found = []
for file_item in file_url_list:
    for top_item in top_url_list:
        if file_item == top_item[1]:
            # When you find an occurrence, put it in a list
            found.append(top_item)

      

How can I write this in a more efficient and pythonic way?

+3




3 answers


Taking the set intersection should help. In addition, you can use a generator expression to extract only the URL from each entry in top_url_list.

file_url_list = ['www.latimes.com', 'www.facebook.com', 'affinitweet.com']
top_url_list = [[1, 'google.com'], [2, 'www.google.com'], [3, 'microsoft.com']]

common_urls = set(file_url_list) & set(url for (index, url) in top_url_list)

      



or, equivalently, with a set comprehension (thanks to Jean-François Fabre):

common_urls = set(file_url_list) & {url for (index, url) in top_url_list}
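
Note that the intersection yields only the URL strings, whereas the loop in the question collected the full [rank, url] entries. If the rank is needed as well, one possible approach is a dictionary keyed by URL. This is a minimal sketch rather than part of the answer above: rank_by_url is a name introduced here, the sample data is adjusted so that one URL matches, and it assumes each URL appears at most once in top_url_list.

```python
file_url_list = ['www.latimes.com', 'www.google.com', 'affinitweet.com']
top_url_list = [[1, 'google.com'], [2, 'www.google.com'], [3, 'microsoft.com']]

# Hypothetical helper mapping: URL -> rank (assumes URLs are unique in top_url_list)
rank_by_url = {url: rank for rank, url in top_url_list}

# Same shape as the question's `found` list: [rank, url] entries
found = [[rank_by_url[url], url] for url in file_url_list if url in rank_by_url]
print(found)  # [[2, 'www.google.com']]
```

Building the dictionary takes one pass over top_url_list, and each lookup afterwards is a constant-time hash lookup on average.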

      

+7




You say you want to know how many URLs from the file are in the top 1M list, not which ones they are. Build a set from the larger list (I assume that will be the 1M one) and then iterate over the other list, counting each item that is in the set:

top_urls = {url for (index, url) in top_url_list}
total = sum(url in top_urls for url in file_url_list)

      

If the file list is the larger one, build the set from it instead:

file_urls = set(file_url_list)
total = sum(url in file_urls for index, url in top_url_list)

      



sum adds numbers together. url in top_urls evaluates to a bool, either True or False, which is converted to an integer, 1 or 0 respectively. url in top_urls for url in file_url_list effectively generates a sequence of 1s and 0s for sum.
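
The bool-to-int behaviour described above can be checked directly. A small sketch with made-up sample data:

```python
top_urls = {'google.com', 'www.google.com', 'microsoft.com'}
file_url_list = ['www.latimes.com', 'www.google.com', 'microsoft.com']

# Each membership test yields a bool
flags = [url in top_urls for url in file_url_list]
print(flags)       # [False, True, True]

# bool is a subclass of int, so True counts as 1 and False as 0
print(sum(flags))  # 2
```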

Possibly slightly more efficient (I would have to test it): filter first, and sum 1 only if url in top_urls:

total = sum(1 for url in file_url_list if url in top_urls)
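
Whether the filtered variant is actually faster can be measured with the standard timeit module. The sketch below uses synthetic data invented for the benchmark (the site names and sizes are assumptions, not from the question), and the timings will vary by machine and Python version:

```python
import timeit

# Synthetic data (made up for the benchmark): 100,000 "top" URLs,
# and a file list in which half the entries are present in the set
top_urls = {'site%d.com' % i for i in range(100_000)}
file_url_list = ['site%d.com' % i for i in range(0, 200_000, 4)]

def count_bools():
    # Sums True/False values as 1/0
    return sum(url in top_urls for url in file_url_list)

def count_ones():
    # Yields a 1 only for URLs that pass the membership test
    return sum(1 for url in file_url_list if url in top_urls)

# Both variants must agree before timing them
assert count_bools() == count_ones()

for func in (count_bools, count_ones):
    seconds = timeit.timeit(func, number=20)
    print(func.__name__, round(seconds, 3))
```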

      

+2




You can extract the URLs from the second list and then either use set, as Kos showed in his answer, or use filter with a lambda (Python 2 syntax shown):

top_url_list_flat = [item[1] for item in top_url_list]
print filter(lambda url: url in file_url_list, top_url_list_flat)

      

In Python 3, filter returns a lazy iterator rather than a list, so you need to iterate over it (or wrap it in list()):

for common in filter(lambda url: url in file_url_list, top_url_list_flat):
    print(common)
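
One caveat: url in file_url_list scans the whole list for every candidate, which gets slow when top_url_list has a million entries. A hedged variant of the same filter approach, assuming Python 3 and converting the file list to a set first (file_urls is a name introduced here, and the sample data is adjusted so that one URL matches):

```python
file_url_list = ['www.latimes.com', 'www.google.com', 'affinitweet.com']
top_url_list = [[1, 'google.com'], [2, 'www.google.com'], [3, 'microsoft.com']]

# Set membership is a hash lookup instead of a linear scan
file_urls = set(file_url_list)
top_url_list_flat = [item[1] for item in top_url_list]

# list() materialises the lazy filter object
common = list(filter(lambda url: url in file_urls, top_url_list_flat))
print(common)  # ['www.google.com']
```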

      


+1

