Extracting information from a text file via regex and/or Python
I am working with a large number of files (~4 GB in total), each containing from 1 to 100 records in the following format (one record is everything between a pair of *** lines):
***
Type:status
Origin: @z_rose yes
Text: yes
URL:
ID: 95482459084427264
Time: Mon Jul 25 08:16:06 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 20776334
Hashtags:
***
***
Type:status
Origin: @aaronesilvers text
Text: text
URL:
ID: 95481610861953024
Time: Mon Jul 25 08:12:44 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 2226621
Hashtags:
***
***
Type:status
Origin: @z_rose text
Text: text and stuff
URL:
ID: 95480980026040320
Time: Mon Jul 25 08:10:14 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 20776334
Hashtags:
***
Now I want to import them into pandas for bulk analysis, but obviously I first have to convert them to a format pandas can read. So I want to write a script that converts the above to .csv, looking something like this (User is the name of the file):
User     Type    Origin               Text  URL   ID                 Time                          RetCount  Favorite  MentionedEntities  Hashtags
4012987  status  @z_rose yes          yes   Null  95482459084427264  Mon Jul 25 08:16:06 CDT 2011  0         false     20776334           Null
4012987  status  @aaronesilvers text  text  Null  95481610861953024  Mon Jul 25 08:12:44 CDT 2011  0         false     2226621            Null
(The formatting isn't perfect, but hopefully you get the idea)
I had some code that worked by relying on the regular 12-line structure of each record, but unfortunately some of the files contain blank lines inside some fields. Basically I want:
fields[] = ['User', 'Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']
starPair = 0;
User = filename;
read(file)
#Determine if the current entry has ended
if(stringRead == "***"){
    if(starPair == 0)
        starPair++;
    else if(starPair == 1){
        row++;
        starPair = 0;
    }
}
#if string read matches column field
if(stringRead == fields[])
    while(stringRead != fields[]) #until next field has been found
        #extract all characters into correct column field
However, a problem arises: some field values may themselves contain words that appear in fields[]. I can check for a preceding \n character first, which will greatly reduce the number of erroneous matches, but not eliminate them.
Can anyone point me in the right direction?
Thanks in advance!
Your code/pseudocode doesn't look like Python, but since you have a python tag, here's how I would go about it. First read the file into a string, then go through each field and use a regex to find the value after it, push the results into a 2D list, and then write that 2D list out as CSV. Also, your CSV looks more like TSV (tab-separated instead of comma-separated).
import re
import csv

filename = '4012987'
User = filename

# read your file into a string
with open(filename, 'r') as myfile:
    data = myfile.read()

fields = ['Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']
csvTemplate = [['User'] + fields]

# for each field use a regex to get the entry
for n, field in enumerate(fields):
    # anchor at the start of a line; [ \t]* instead of \s so that an empty
    # value (e.g. "URL:") cannot swallow the newline and capture the next line
    matches = re.findall(r'^' + re.escape(field) + r':[ \t]*([^\n]*)$', data, re.MULTILINE)
    # this should run only the first time to fill your 2d list with the right amount of lists
    while len(csvTemplate) <= len(matches):
        csvTemplate.append([None] * (len(fields) + 1))  # Null isn't a python reserved word
    for e, m in enumerate(matches):
        if m.strip():
            csvTemplate[e + 1][n + 1] = m.strip()

# set the User column
for i in range(1, len(csvTemplate)):
    csvTemplate[i][0] = User

# output to csv....if you want tsv look at https://stackoverflow.com/a/29896136/3462319
# (Python 3: open in text mode with newline=''; Python 2 used "wb")
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(csvTemplate)
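If tab-separated output is preferred, as the TSV remark suggests, csv.writer accepts a delimiter argument. The following is only a minimal sketch, assuming Python 3; the write_tsv helper and the sample rows are hypothetical stand-ins for csvTemplate and are not part of the code above:

```python
import csv
import io


def write_tsv(rows, stream):
    """Write rows as tab-separated values, replacing None with 'Null'."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow(["Null" if cell is None else cell for cell in row])


# Hypothetical sample rows in the same shape as csvTemplate
rows = [
    ["User", "Type", "Text", "URL"],
    ["4012987", "status", "yes", None],
]

buffer = io.StringIO()
write_tsv(rows, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

To write to disk instead of an in-memory buffer, pass a file opened with open("output.tsv", "w", newline="") as the stream.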
You can use a combination of regex and dict comprehension:
import regex as re, pandas as pd

# one record: everything between a pair of *** lines
rx_parts = re.compile(r'^{}$(?s:.*?)^{}$'.format(re.escape('***'), re.escape('***')), re.MULTILINE)
# one "Key: value" line inside a record
rx_entry = re.compile(r'^(?P<key>\w+):[ ]*(?P<value>.+)$', re.MULTILINE)

# your_string_here is the content of one file as a single string
result = ({m.group('key'): m.group('value')
           for m in rx_entry.finditer(part.group(0))}
          for part in rx_parts.finditer(your_string_here))

df = pd.DataFrame(result)
print(df)
Which gives:
  Favorite Hashtags                 ID MentionedEntities               Origin  \
0    false            95482459084427264          20776334          @z_rose yes
1    false            95481610861953024           2226621  @aaronesilvers text
2    false            95480980026040320          20776334         @z_rose text

  RetCount            Text                          Time    Type URL
0        0             yes  Mon Jul 25 08:16:06 CDT 2011  status
1        0            text  Mon Jul 25 08:12:44 CDT 2011  status
2        0  text and stuff  Mon Jul 25 08:10:14 CDT 2011  status
Explanation:
- Divide the string into separate parts, each surrounded by *** on both sides
- Look for key-value pairs on each line
- Put all pairs in a dict
As a result, we have a generator of dictionaries, which we then pass to pandas.
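The steps above can also be sketched line by line with only the standard library, which makes it easier to tolerate blank lines inside a field. This is a hedged sketch, not the method used in the code above: the parse_records function is hypothetical, and it assumes the fields always appear in the order given in the question, so a line only starts a new field when its name comes later in that order than the field currently being read. That reduces, but does not eliminate, false matches on field names that occur inside the text.

```python
FIELDS = ["Type", "Origin", "Text", "URL", "ID", "Time", "RetCount",
          "Favorite", "MentionedEntities", "Hashtags"]


def parse_records(text):
    """Return a list of dicts, one per ***-delimited record."""
    records = []
    record = None    # dict being filled, or None when outside a record
    current = None   # name of the field currently being read
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "***":
            if record is None:
                record, current = {}, None   # opening delimiter
            else:
                records.append(record)       # closing delimiter
                record = None
            continue
        if record is None:
            continue  # ignore anything outside a record
        name, sep, rest = line.partition(":")
        name = name.strip()
        # Only names that come after the current field may start a new field;
        # any other non-blank line is appended to the current value.
        start = FIELDS.index(current) + 1 if current else 0
        if sep and name in FIELDS[start:]:
            current = name
            record[current] = rest.strip()
        elif current is not None and stripped:
            record[current] = (record[current] + " " + stripped).strip()
    return records


# Made-up sample: a blank line and a "Type:" look-alike inside Text
sample = (
    "***\nType:status\nOrigin: @z_rose text\nText: text and\n\nType: stuff\n"
    "URL:\nID: 95480980026040320\nHashtags:\n***\n"
    "***\nType:status\nText: yes\n***\n"
)
records = parse_records(sample)
print(records)
```

The resulting list of dicts can be passed to pd.DataFrame in the same way as the generator above.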
Tips:
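The code has not been tested with large amounts of data, and certainly not with 4 GB. You also need the newer regex module for the expression to work (the scoped (?s:...) group is also supported by the built-in re module from Python 3.6 onward).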
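Since each file holds at most 100 records, one way to keep memory use bounded across the whole collection is to parse one file at a time and stream the rows into a single CSV. The sketch below is untested on real data and makes several assumptions: files are UTF-8 text, each value fits on one line, and the helper names iter_rows and convert are hypothetical. It uses only the standard re and csv modules.

```python
import csv
import re
from pathlib import Path

FIELDS = ["User", "Type", "Origin", "Text", "URL", "ID", "Time",
          "RetCount", "Favorite", "MentionedEntities", "Hashtags"]

# One record: everything between a pair of *** lines
RX_RECORD = re.compile(r"^\*\*\*$(.*?)^\*\*\*$", re.MULTILINE | re.DOTALL)
# One "Key: value" line; the value may be empty
RX_ENTRY = re.compile(r"^(\w+):[ \t]*(.*)$", re.MULTILINE)


def iter_rows(paths):
    """Yield one dict per record, tagged with the file name as User."""
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")  # assumed encoding
        for record in RX_RECORD.finditer(text):
            row = {"User": Path(path).name}
            for key, value in RX_ENTRY.findall(record.group(1)):
                if key in FIELDS:
                    row[key] = value.strip()
            yield row


def convert(paths, out_path):
    """Stream all records from the given files into a single CSV file."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS, restval="")
        writer.writeheader()
        writer.writerows(iter_rows(paths))
```

A call such as convert(sorted(Path("data").iterdir()), "all_users.csv"), where "data" is a hypothetical directory of input files, would then produce one CSV that pd.read_csv can load; passing dtype={"ID": str} to read_csv keeps the long IDs as strings.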