Trying to find all unique values in a specific part of a string in Python

I have a list of URLs that I am trying to parse to find the utm codes in each one. First I want to find the unique keys that start with utm (e.g. utm_source), then create a new column for each of those keys. The end result I'm looking for is something like

sourceUrl: https://website.com/donate?utm_source=site&utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en

source: site

medium: email

campaign: campaign1

uuid: 999124

lang: en

I now have the following:

import pandas as pd

email_list = pd.read_csv('/Users/rethompsoniii/Documents/Work-Related/Jeb 2016/email_list_20150804.csv', sep=',', header=0, error_bad_lines=False, index_col=False, dtype='unicode')

url = email_list['SourceUrl']

utms = url.split("utm",1)[1]

print(utms)

      

However, the utms line doesn't work. I'm not looking for someone to give me all the code, just to point me in the right direction. Much appreciated.



4 answers


You can use the urlparse library (named urllib.parse in Python 3).

First, split the URL into its components with the function urlparse.urlparse().

>>> import urlparse
>>> url = "https://website.com/donate?utm_source=site&utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en"
>>> parsed_url = urlparse.urlparse(url)
>>> parsed_url
ParseResult(scheme='https', netloc='website.com', path='/donate', params='', query='utm_source=site&utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en', fragment='')
>>> parsed_url.query
'utm_source=site&utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en'

      



From the parsed URL you can then parse the query string with another function, urlparse.parse_qs():

>>> parsed_query = urlparse.parse_qs(parsed_url.query)
>>> parsed_query
{'lang': ['en'], 'utm_campaign': ['campaign1'], 'utm_medium': ['email'], 'uuid': ['999124'], 'utm_source': ['site']}
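
The session above is Python 2. A rough Python 3 equivalent is sketched below. The flat_query name and the choice to keep only the first value per key are my own assumptions, not part of the answer; parse_qs returns lists because a key may appear more than once in a query string.

```python
from urllib.parse import urlparse, parse_qs

url = ("https://website.com/donate?utm_source=site&utm_medium=email"
       "&utm_campaign=campaign1&uuid=999124&lang=en")

# parse_qs maps each key to a list of values, since a key may repeat.
parsed_query = parse_qs(urlparse(url).query)

# Assumption: each key occurs once, so keeping the first value loses nothing.
flat_query = {key: values[0] for key, values in parsed_query.items()}
print(flat_query)
```

If a URL can repeat a key, keep the lists from parse_qs instead of flattening them.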

      



You can use a regular expression.

import re
# Raw string so that \w reaches the regex engine unchanged.
m = re.findall(r'utm_(\w+)=(\w+)', 'https://website.com/donate?utm_source=site&utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en')

      

m is now a list of tuples:



[('source', 'site'), ('medium', 'email'), ('campaign', 'campaign1')]

      

But consider urlparse as Peter Wood mentioned in the comments.
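
One reason a real parser may be the safer choice: \w+ only matches letters, digits and underscores, so a value containing a hyphen or a percent-escape gets cut short. The sketch below shows this with a made-up URL (the news-letter value is hypothetical) and a wider pattern that reads up to the next & or #.

```python
import re

# Hypothetical URL whose utm_source value contains a hyphen.
url = "https://website.com/donate?utm_source=news-letter&utm_medium=email&uuid=999124"

# \w+ stops at the hyphen, so the source value is truncated to "news".
narrow = re.findall(r"utm_(\w+)=(\w+)", url)

# [^&#]+ reads up to the next parameter separator or fragment marker.
wide = re.findall(r"utm_(\w+)=([^&#]+)", url)

# A list of (key, value) tuples converts directly to a dict.
utm_values = dict(wide)
print(narrow)
print(utm_values)
```

Even the wider pattern leaves percent-escapes undecoded, which parse_qs handles for you.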



You can use the Python library urlparse.

Example:

import urlparse
url = 'https://website.com/donate?utm_source=site&utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en'
params = dict(urlparse.parse_qsl(urlparse.urlsplit(url).query))
new_params = {key[4:] if key.startswith('utm_') else key:value for key, value in params.iteritems()}
print new_params

      

Output:

{'lang': 'en', 'source': 'site', 'medium': 'email', 'uuid': '999124', 'campaign': 'campaign1'}
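
The example above relies on Python 2 features (iteritems and the print statement). A possible Python 3 port is sketched here; the strip_utm_prefix helper name is mine, introduced only for illustration.

```python
from urllib.parse import urlsplit, parse_qsl

url = ("https://website.com/donate?utm_source=site&utm_medium=email"
       "&utm_campaign=campaign1&uuid=999124&lang=en")


def strip_utm_prefix(url):
    """Return the query parameters as a dict, dropping a leading 'utm_' from keys."""
    # parse_qsl yields (key, value) pairs; dict() keeps the last value per key.
    params = dict(parse_qsl(urlsplit(url).query))
    return {
        (key[len("utm_"):] if key.startswith("utm_") else key): value
        for key, value in params.items()
    }


new_params = strip_utm_prefix(url)
print(new_params)
```

Note that stripping the prefix can make two keys collide: if a URL carried both utm_source and source, only one of them would survive in the result.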

      



You can use the built-in library urlparse.

Parse the URL first:

>>> from urlparse import urlparse, parse_qs
>>> url = ('https://website.com/donate?utm_source=site&'
           'utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en')

>>> parsed = urlparse(url)
>>> parsed.query
'utm_source=site&utm_medium=email&utm_campaign=campaign1&uuid=999124&lang=en'

      

Then parse the query string using urlparse.parse_qs:

>>> parse_qs(parsed.query)
{'lang': ['en'],
 'utm_campaign': ['campaign1'],
 'utm_medium': ['email'],
 'utm_source': ['site'],
 'uuid': ['999124']}
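
To get from a single URL to the columns the question asks about, one option is to parse every URL and then collect the distinct keys. This is only a sketch in Python 3 using the standard library: the second URL is invented for illustration, and the rows, columns and table names are my own.

```python
from urllib.parse import urlsplit, parse_qsl

urls = [
    "https://website.com/donate?utm_source=site&utm_medium=email"
    "&utm_campaign=campaign1&uuid=999124&lang=en",
    # Made-up second URL with fewer parameters.
    "https://website.com/donate?utm_source=partner&uuid=999125",
]

# One dict of query parameters per URL.
rows = [dict(parse_qsl(urlsplit(u).query)) for u in urls]

# Every distinct parameter name seen in any URL, sorted for a stable order.
columns = sorted({key for row in rows for key in row})

# One list per URL; parameters that a URL lacks become None.
table = [[row.get(col) for col in columns] for row in rows]

print(columns)
for line in table:
    print(line)
```

With pandas, passing rows to pd.DataFrame should build the same table directly from the list of dictionaries, with missing parameters left as NaN.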

      
