How can I write the following code in a more efficient and pythonic way?
I have a list of URLs, file_url_list, which prints like this:
www.latimes.com, www.facebook.com, affinitweet.com, ...
And another list of the Top 1M URLs, top_url_list, which prints like this:
[1, google.com], [2, www.google.com], [3, microsoft.com], ...
I want to find how many URLs in file_url_list are in top_url_list. I wrote the following code, which works, but I know it is neither the fastest way to do it nor the most pythonic one.
# Find the common occurrences
found = []
for file_item in file_url_list:
    for top_item in top_url_list:
        if file_item == top_item[1]:
            # When you find an occurrence, put it in a list
            found.append(top_item)
How can I write this in a more efficient and pythonic way?
Taking a set intersection should help. You can also use a generator expression to extract only the URL from each entry in top_url_list.
file_url_list = ['www.latimes.com', 'www.facebook.com', 'affinitweet.com']
top_url_list = [[1, 'google.com'], [2, 'www.google.com'], [3, 'microsoft.com']]
common_urls = set(file_url_list) & set(url for (index, url) in top_url_list)
or equivalently, with a set comprehension (thanks to Jean-François Fabre):
common_urls = set(file_url_list) & {url for (index, url) in top_url_list}
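The loop in the question collects the whole [rank, url] entry rather than just the URL. If the rank is needed as well, a dictionary keyed by URL keeps each lookup at constant time on average. This is only a minimal sketch, assuming every entry in top_url_list is a [rank, url] pair; the name rank_by_url is illustrative, and the sample data is made up so that there is one overlap:

```python
file_url_list = ['www.latimes.com', 'www.facebook.com', 'microsoft.com']
top_url_list = [[1, 'google.com'], [2, 'www.google.com'], [3, 'microsoft.com']]

# Map each URL to its rank; if a URL appeared twice, the last rank would win.
rank_by_url = {url: rank for rank, url in top_url_list}

# Keep the [rank, url] pair for every file URL that is in the top list.
found = [[rank_by_url[url], url] for url in file_url_list if url in rank_by_url]
print(found)
```

Unlike the set intersection, this keeps the order of file_url_list and repeats an entry if the same URL occurs more than once in the file list.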
You say you want to know how many URLs from the file are in the top 1M list, not which ones they are. Build a set from the larger list (I assume that is the 1M one), then iterate over the other list, counting each URL that is in the set:
top_urls = {url for (index, url) in top_url_list}
total = sum(url in top_urls for url in file_url_list)
If the file list is the larger one, build the set from it instead:
file_urls = set(file_url_list)
total = sum(url in file_urls for index, url in top_url_list)
sum adds numbers together. url in top_urls evaluates to a bool, either True or False, which is treated as the integer 1 or 0 respectively. So url in top_urls for url in file_url_list effectively generates a sequence of 1s and 0s for sum to add up.
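To make the bool-to-integer step concrete, here is a small sketch with made-up sample data (the name flags is only for illustration):

```python
top_urls = {'google.com', 'www.google.com', 'microsoft.com'}
file_url_list = ['www.latimes.com', 'google.com', 'microsoft.com']

# Each membership test yields a bool; bool is a subclass of int in Python.
flags = [url in top_urls for url in file_url_list]
print(flags)       # [False, True, True]
print(sum(flags))  # 2
```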
Possibly a little more efficient (I would have to test it): filter first and sum a 1 only if url in top_urls:
total = sum(1 for url in file_url_list if url in top_urls)
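One way to check which variant is faster is the standard timeit module. The following is a rough sketch on made-up data, where the sizes and the names count_bools and count_filtered are assumptions for illustration; results will vary by machine and Python version:

```python
import timeit

# Made-up data: 10,000 "top" URLs and 1,000 file URLs, half of which match.
top_urls = {'site%d.com' % i for i in range(10000)}
file_url_list = (['site%d.com' % i for i in range(0, 1000, 2)]
                 + ['other%d.org' % i for i in range(500)])

def count_bools():
    # Sum the True/False results of every membership test.
    return sum(url in top_urls for url in file_url_list)

def count_filtered():
    # Sum a 1 only for the URLs that pass the membership test.
    return sum(1 for url in file_url_list if url in top_urls)

# Both variants must agree before their timings are worth comparing.
assert count_bools() == count_filtered()

print('bools:   ', timeit.timeit(count_bools, number=200))
print('filtered:', timeit.timeit(count_filtered, number=200))
```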
You can extract the URLs from the second list and then either use a set, as Kos showed in his answer, or use filter with a lambda.
top_url_list_flat = [item[1] for item in top_url_list]
print filter(lambda url: url in file_url_list, top_url_list_flat)
The line above is Python 2. In Python 3, print is a function and filter returns a lazy iterator rather than a list, so iterate over the result instead:
for common in filter(lambda url: url in file_url_list, top_url_list_flat):
    print(common)
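Note that url in file_url_list scans the whole list on every call, so with a 1M-entry top list this is still a nested loop in disguise. A possible variant, sketched here for Python 3 with made-up sample data, combines filter with a set so that each membership test is constant time on average (the names file_urls and common are illustrative):

```python
file_url_list = ['www.latimes.com', 'www.facebook.com', 'microsoft.com']
top_url_list = [[1, 'google.com'], [2, 'www.google.com'], [3, 'microsoft.com']]

# Build the set once, then test membership against it inside the filter.
file_urls = set(file_url_list)
top_url_list_flat = [item[1] for item in top_url_list]

# list() forces the lazy filter object so the matches can be printed at once.
common = list(filter(lambda url: url in file_urls, top_url_list_flat))
print(common)
```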