Trying to find all unique values in a specific part of a string in Python
I have a list of URLs that I am trying to parse to find the UTM codes in each one. First I want to find the unique values that appear after utm (e.g. utm_source) and create a new column for each of them. The end result I'm looking for is something like:
Source: website
medium: email
campaigns: campaign1
UUID: 999124
languages: en
I now have the following:
import pandas as pd
email_list = pd.read_csv('/Users/rethompsoniii/Documents/Work-Related/Jeb 2016/email_list_20150804.csv', sep=',', header=0, error_bad_lines=False, index_col=False, dtype='unicode')
url = email_list['SourceUrl']
utms = url.split("utm",1)[1]
print(utms)
However, the utms line doesn't work. I'm not looking for someone to give me all the code, just a pointer in the right direction. Much appreciated.
You can use the urlparse library (in Python 3 it is urllib.parse).
First, split the URL into its components using the function urlparse.urlparse().
>>> import urlparse
>>> url = "https://website.com/donate?utm_source=site&utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en"
>>> parsed_url = urlparse.urlparse(url)
>>> parsed_url
ParseResult(scheme='https', netloc='website.com', path='/donate', params='', query='utm_source=site&utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en', fragment='')
>>> parsed_url.query
'utm_source=site&utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en'
You can then parse the query string of the parsed URL with another function, urlparse.parse_qs():
>>> parsed_query = urlparse.parse_qs(parsed_url.query)
>>> parsed_query
{'lang': ['en'], 'utm_campaign': ['campaign1'], 'utm_medium': ['email'], 'uuid': ['999124'], 'utm_source': ['site']}
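On Python 3 the same two functions live in urllib.parse. A minimal sketch of the two steps above, assuming Python 3; the helper name query_params is hypothetical, and it keeps only the first value of each key because parse_qs returns a list per key:

```python
from urllib.parse import urlparse, parse_qs


def query_params(url):
    """Return the query-string parameters of url as a flat dict.

    Hypothetical helper: parse_qs maps each key to a list of values,
    so only the first value of each key is kept here.
    """
    query = urlparse(url).query
    return {key: values[0] for key, values in parse_qs(query).items()}


url = ("https://website.com/donate?utm_source=site&utm_medium=email"
       "&utm_campaign=campaign1&uuid=999124&lang=en")
params = query_params(url)
print(params)
```

A URL without a query string simply yields an empty dict.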
You can use a regular expression.
import re
m = re.findall(r'utm_(\w+)=(\w+)', 'https://website.com/donate?utm_source=site&utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en')
m is now a list of tuples:
[('source', 'site'), ('medium', 'email'), ('campaign', 'campaign1')]
But consider urlparse as Peter Wood mentioned in the comments.
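One reason to prefer the URL parser: \w+ only matches letters, digits and underscores, so a value containing a hyphen or a percent-escape gets cut short. A small sketch of the difference, assuming Python 3 (urllib.parse) and a made-up URL that is not from the question:

```python
import re
from urllib.parse import urlsplit, parse_qsl

# Made-up URL: the campaign value contains a hyphen and an encoded space.
url = "https://website.com/donate?utm_source=site&utm_campaign=spring-sale%202016"

# The regex stops at the first character that is not a word character.
regex_pairs = re.findall(r'utm_(\w+)=(\w+)', url)

# parse_qsl splits on & and =, and decodes percent-escapes.
parsed_pairs = [(key[len('utm_'):], value)
                for key, value in parse_qsl(urlsplit(url).query)
                if key.startswith('utm_')]

print(regex_pairs)   # [('source', 'site'), ('campaign', 'spring')]
print(parsed_pairs)  # [('source', 'site'), ('campaign', 'spring-sale 2016')]
```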
You can use the Python library urlparse.
Example:
import urlparse
url = 'https://website.com/donate?utm_source=site&utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en'
params = dict(urlparse.parse_qsl(urlparse.urlsplit(url).query))
new_params = {key[4:] if key.startswith('utm_') else key:value for key, value in params.iteritems()}
print new_params
Output:
{'lang': 'en', 'source': 'site', 'medium': 'email', 'uuid': '999124', 'campaign': 'campaign1'}
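The example above is written for Python 2 (the urlparse module, iteritems() and the print statement). A rough Python 3 equivalent, assuming urllib.parse; the function name strip_utm_prefix is hypothetical:

```python
from urllib.parse import urlsplit, parse_qsl


def strip_utm_prefix(url):
    """Return the query parameters of url with any 'utm_' prefix removed.

    Hypothetical helper. If a URL carried both 'utm_source' and 'source',
    the two keys would collide and the later one would win.
    """
    params = dict(parse_qsl(urlsplit(url).query))
    prefix = 'utm_'
    return {key[len(prefix):] if key.startswith(prefix) else key: value
            for key, value in params.items()}


url = ('https://website.com/donate?utm_source=site&utm_medium=email'
       '&utm_campaign=campaign1&uuid=999124&lang=en')
new_params = strip_utm_prefix(url)
print(new_params)
```

The resulting dict holds the same key/value pairs as the output shown above.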
You can use the built-in library urlparse.
Parse the URL first:
>>> from urlparse import urlparse, parse_qs
>>> url = ('https://website.com/donate?utm_source=site&'
...        'utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en')
>>> parsed = urlparse(url)
>>> parsed.query
'utm_source=site&utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en'
Then parse the query string using urlparse.parse_qs:
>>> parse_qs(parsed.query)
{'lang': ['en'],
'utm_campaign': ['campaign1'],
'utm_medium': ['email'],
'utm_source': ['site'],
'uuid': ['999124']}
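The question asks for the unique parameter names across many URLs, with one column per name. A sketch of that step using only the standard library, assuming Python 3; the urls list is made-up sample data standing in for the SourceUrl column:

```python
from urllib.parse import urlparse, parse_qs

# Made-up sample standing in for the SourceUrl column from the question.
urls = [
    "https://website.com/donate?utm_source=site&utm_medium=email&uuid=999124",
    "https://website.com/donate?utm_source=newsletter&utm_campaign=campaign1&lang=en",
    "https://website.com/donate",
]

# One dict per URL, keeping the first value of each parameter.
rows = [{key: values[0] for key, values in parse_qs(urlparse(u).query).items()}
        for u in urls]

# Unique parameter names across all URLs; these would become the columns.
columns = sorted({key for row in rows for key in row})

# Fill in None where a URL does not carry a given parameter.
table = [[row.get(col) for col in columns] for row in rows]

print(columns)
for line in table:
    print(line)
```

With pandas, passing rows to pandas.DataFrame should give one column per unique key, with missing values where a URL lacks a parameter. Note also that email_list['SourceUrl'] in the question is a pandas Series rather than a single string, which is likely why calling .split directly on it fails.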