Trying to find all unique values in a specific part of a string in Python

I have a list of URLs that I am trying to parse to find the utm codes in each URL. First I want to find the unique values that appear after utm, e.g. utm_source, and create new columns for each of these values. The end result I'm looking for is something like


source: site

medium: email

campaign: campaign1

uuid: 999124

lang: en

I now have the following:

import pandas as pd

email_list = pd.read_csv('/Users/rethompsoniii/Documents/Work-Related/Jeb 2016/email_list_20150804.csv', sep=',', header=0, error_bad_lines=False, index_col=False, dtype='unicode')

url = email_list['SourceUrl']

utms = url.split("utm",1)[1]



However, the utms line doesn't work. I'm not looking for someone to give me all the code, just a pointer in the right direction. Much appreciated.



4 answers

You can use urlparse.


First, split the URL into its components with the function urlparse.urlparse():


>>> import urlparse
>>> url = ""
>>> parsed_url = urlparse.urlparse(url)
>>> parsed_url
ParseResult(scheme='https', netloc='', path='/donate', params='', query='utm_source=site&utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en', fragment='')
>>> parsed_url.query
'utm_source=site&utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en'


From the parsed URL you can then parse the query string with another function, urlparse.parse_qs():

>>> parsed_query = urlparse.parse_qs(parsed_url.query)
>>> parsed_query
{'lang': ['en'], 'utm_campaign': ['campaign1'], 'utm_medium': ['email'], 'uuid': ['999124'], 'utm_source': ['site']}
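
On Python 3 the same two functions live in urllib.parse, since the urlparse module was merged into it. Here is a minimal sketch of the two steps above. The host example.com is a placeholder I made up; only the path and query string are taken from the output shown.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL: the host is a placeholder, the path and query
# mirror the ParseResult shown above.
url = ("https://example.com/donate"
       "?utm_source=site&utm_medium=email&utm_campaign=campaign1"
       "&uuid=999124&lang=en")

# Step 1: split the URL into scheme, netloc, path, query, ...
parsed_url = urlparse(url)

# Step 2: turn the query string into a dict of key -> list of values.
parsed_query = parse_qs(parsed_url.query)

# Values are lists because a key may legally repeat in a query string,
# so take the first element when you expect a single value.
utm_source = parsed_query["utm_source"][0]
print(parsed_url.path, utm_source)
```

If you would rather have plain strings than one-element lists, parse_qsl returns (key, value) pairs that can be passed straight to dict().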




You can use a regular expression.

import re
m = re.findall(r'utm_(\w+)=(\w+)', '')


m is now a list of tuples:

[('source', 'site'), ('medium', 'email'), ('campaign', 'campaign1')]


But consider urlparse as Peter Wood mentioned in the comments.
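
One reason to prefer a URL parser here: \w+ only matches letters, digits and underscores, so the pattern cuts values short at a hyphen or a percent-escape. A small comparison, written for Python 3 and using a made-up URL (the host and the campaign name are assumptions for illustration):

```python
import re
from urllib.parse import urlsplit, parse_qsl

# Hypothetical URL whose campaign value contains a hyphen and an
# encoded space (%20).
url = "https://example.com/donate?utm_source=site&utm_campaign=summer-sale%202016"

# The regex from the answer above: the value group stops at the first
# character that is not a letter, digit or underscore.
regex_pairs = re.findall(r"utm_(\w+)=(\w+)", url)

# parse_qsl splits on & and =, then decodes percent-escapes.
parsed_pairs = [
    (key, value)
    for key, value in parse_qsl(urlsplit(url).query)
    if key.startswith("utm_")
]

print(regex_pairs)
print(parsed_pairs)
```

The regex yields 'summer' for the campaign, while parse_qsl yields the full decoded value 'summer-sale 2016'.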



You can use the Python library urlparse:



import urlparse
url = ''
params = dict(urlparse.parse_qsl(urlparse.urlsplit(url).query))
new_params = {key[4:] if key.startswith('utm_') else key:value for key, value in params.iteritems()}
print new_params



{'lang': 'en', 'source': 'site', 'medium': 'email', 'uuid': '999124', 'campaign': 'campaign1'}
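
The snippet above is Python 2 (import urlparse, iteritems, print statement). A rough Python 3 equivalent, wrapped in a helper whose name strip_utm_prefix is my own, and again using a placeholder host:

```python
from urllib.parse import urlsplit, parse_qsl


def strip_utm_prefix(url):
    """Return the query parameters of url with any 'utm_' prefix removed.

    Hypothetical helper for illustration. If a key repeats, or if both
    'utm_x' and 'x' are present, the later one overwrites the earlier.
    """
    params = dict(parse_qsl(urlsplit(url).query))
    return {
        (key[4:] if key.startswith("utm_") else key): value
        for key, value in params.items()
    }


# Hypothetical URL: placeholder host, query taken from the output above.
url = ("https://example.com/donate"
       "?utm_source=site&utm_medium=email&utm_campaign=campaign1"
       "&uuid=999124&lang=en")
new_params = strip_utm_prefix(url)
print(new_params)
```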




You can use the standard-library module urlparse.


Parse the URL first:

>>> from urlparse import urlparse, parse_qs
>>> url = ''

>>> parsed = urlparse(url)
>>> parsed.query


Then parse the query string using urlparse.parse_qs


>>> parse_qs(parsed.query)
{'lang': ['en'],
 'utm_campaign': ['campaign1'],
 'utm_medium': ['email'],
 'utm_source': ['site'],
 'uuid': ['999124']}
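
To get from a single URL to the "unique utm keys, one column each" layout the question asks for, the same parsing can be applied to every URL in the list. A sketch in Python 3 using only the standard library; the three URLs are invented for illustration:

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical URLs standing in for the SourceUrl column.
urls = [
    "https://example.com/donate?utm_source=site&utm_medium=email&lang=en",
    "https://example.com/donate?utm_source=newsletter&utm_campaign=campaign1",
    "https://example.com/donate",  # no query string at all
]

# One flat dict of query parameters per URL (empty dict if no query).
rows = [dict(parse_qsl(urlsplit(u).query)) for u in urls]

# Every distinct utm_* key seen anywhere in the list: these become columns.
utm_keys = sorted({key for row in rows for key in row if key.startswith("utm_")})

# Distinct values seen for one particular key.
unique_sources = sorted({row["utm_source"] for row in rows if "utm_source" in row})

# One record per URL with the same keys everywhere; None marks a missing value.
table = [{key: row.get(key) for key in utm_keys} for row in rows]

print(utm_keys)
print(unique_sources)
```

Since rows is a list of dicts, passing it to pd.DataFrame should give one column per key, with missing values where a URL lacks that key. As for the error in the question: url there is a pandas Series, which has no split method of its own; string methods on a Series go through the .str accessor.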



