How to extract the text after a field label using Python 3?
Here is a sample entry that I have.
Record ID: 9211
User name: Administrator first
User principal name: Administrator@example.com
When created: 1999-12-23 3:8:52
When changed: 2000-06-10 4:8:55
Account expires: Never
I would like to extract the value that follows each label. The output should look like this:
9211
Administrator first
Administrator
first
Administrator@example.com
1999-12-23 3:8:52
2000-06-10 4:8:55
Never
The value Administrator first should be extracted both whole and split into its two words, as shown above.
I tried the following in order to extract the User name value from the sample, but didn't get any output.
re.findall(r'User name: (\w+)', i)
Please let me know how I can achieve this. The output should contain only the extracted data, not the spaces that precede it.
You can use a dict comprehension:
import re
string = """
Record ID: 9211
User name: Administrator first
User principal name: Administrator@example.com
When created: 1999-12-23 3:8:52
When changed: 2000-06-10 4:8:55
Account expires: Never
"""
rx = re.compile(r'^(?P<key>[^:\n]+):\s*(?P<value>.+)', re.MULTILINE)
result = {m.group('key'): m.group('value') for m in rx.finditer(string)}
print(result)
Then just query the dict, e.g. result['User name']. See the demo at ideone.com.
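To get from that pattern to the exact output listed in the question, one option is to collect the values in order and additionally split the User name value on whitespace. This is only a minimal sketch: the helper name extract_values is made up for illustration, and it assumes that only the User name value needs to be split.

```python
import re

sample = """\
Record ID: 9211
User name: Administrator first
User principal name: Administrator@example.com
When created: 1999-12-23 3:8:52
When changed: 2000-06-10 4:8:55
Account expires: Never
"""

# Same pattern as in the answer: the key runs up to the first colon,
# \s* skips the spaces after it, and the value is the rest of the line.
rx = re.compile(r'^(?P<key>[^:\n]+):\s*(?P<value>.+)', re.MULTILINE)


def extract_values(text):
    """Hypothetical helper: return the values in order of appearance.

    The 'User name' value is kept whole and then also added word by word.
    """
    values = []
    for m in rx.finditer(text):
        value = m.group('value').strip()
        values.append(value)
        if m.group('key') == 'User name':
            values.extend(value.split())
    return values


for value in extract_values(sample):
    print(value)
```

Note that a one-word user name would be printed twice with this approach (once whole, once as its only word), so you may want to guard that case.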
If you have multiple occurrences of records, and the records always have the same format (i.e. they start with Record ID and end with Account expires), you can wrap another expression and a class around it, which ends up with a list of dictionaries:
import re
string = """
Record ID: 9211
User name: Administrator first
User principal name: Administrator@example.com
When created: 1999-12-23 3:8:52
When changed: 2000-06-10 4:8:55
Account expires: Never
Record ID: 9390
User name: Administrator first
User principal name: Administrator@example.com
When created: 1999-12-23 3:8:52
When changed: 2000-06-10 4:8:55
Account expires: Never
"""
class Analyzer:
    ''' Parses the input string and returns matched entries '''
    rx_parts = re.compile(r'^Record ID:(?s:.+?)^Account expires:.+', re.MULTILINE)
    rx_entries = re.compile(r'^(?P<key>[^:\n]+):\s*(?P<value>.+)', re.MULTILINE)
    result = list()

    def __init__(self, input_string = None):
        self.result = [{m.group('key'): m.group('value')
                        for m in self.rx_entries.finditer(part.group(0))}
                       for part in self.rx_parts.finditer(input_string)]

    def query(self, key=None, value=None):
        try:
            subset = [item for item in self.result if item[key] == value]
        except KeyError:
            subset = []
        return subset

a = Analyzer(string)
admin = a.query(key = 'Record ID', value='9390')
print(admin)
You can use a naive approach:
text = """Record ID: 9211
User name: Administrator first
User principal name: Administrator@example.com
When created: 1999-12-23 3:8:52
When changed: 2000-06-10 4:8:55
Account expires: Never"""
# cut text at newline chars
for line in text.splitlines():
    # find the first ':'
    idx = line.index(':')
    # remove spaces from the start
    strippedLine = line[idx+1:].lstrip()
    if 'User name' in line:
        print(strippedLine)
Using r'User name:\s*(\w+\s*\w*)' as the regex string works. It looks like the problem was caused by the whitespace between the field name and the value, as well as the space between the first and last words in the value (for values that have them, hence the * quantifiers).
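As a rough illustration of that pattern, the sketch below applies it to a single line and then splits the captured value into words. The function name find_user_name and the sample line with extra spaces are assumptions made for this example, not part of the original data.

```python
import re

# \s* absorbs any run of whitespace after the colon;
# (\w+\s*\w*) captures one word, optionally followed by a second word.
USER_NAME_RX = re.compile(r'User name:\s*(\w+\s*\w*)')


def find_user_name(line):
    """Hypothetical helper: return the captured user name, or None if absent."""
    match = USER_NAME_RX.search(line)
    return match.group(1) if match else None


line = "User name:      Administrator first"
full_name = find_user_name(line)
if full_name is not None:
    print(full_name)
    # Split the captured value into its individual words.
    for word in full_name.split():
        print(word)
```

One caveat: \s also matches newlines, so when the pattern is run over a whole multi-line string, a one-word name could pull in the first word of the following line. Applying it line by line, as here, avoids that.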
What you can do is use the .split() method on each string, which turns it into a list with separate elements. For example, if I were to split the phrase "Good people" by " " (a space), I would get a list with two elements: "Good" at index 0 and "people" at index 1.
I probably explained it badly so you can check other posts on the split method.
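A small sketch of the split idea applied to the sample lines might look like this. The helper name split_line is made up for illustration, and splitting only on the first colon is an assumption chosen so that timestamp values, which contain colons themselves, stay intact.

```python
def split_line(line):
    """Hypothetical helper: split 'label: value' into (label, value)."""
    # maxsplit=1 cuts only at the first colon, so '3:8:52' is not broken up.
    label, value = line.split(":", 1)
    return label.strip(), value.strip()


# Plain split on a space, as described above.
print("Good people".split(" "))

label, value = split_line("User name: Administrator first")
print(value)
# Splitting the value again separates the two words.
for word in value.split():
    print(word)

print(split_line("When created: 1999-12-23 3:8:52"))
```

A line without any colon would raise a ValueError in split_line, so real input may need a check first.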