Extracting information from a text file via regex and/or Python
I am working with a large number of files (~4 GB in total) that each contain from 1 to 100 records in the following format (one record sits between two *** lines):
***
Type:status
Origin: @z_rose yes
Text: yes
URL:
ID: 95482459084427264
Time: Mon Jul 25 08:16:06 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 20776334
Hashtags:
***
***
Type:status
Origin: @aaronesilvers text
Text: text
URL:
ID: 95481610861953024
Time: Mon Jul 25 08:12:44 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 2226621
Hashtags:
***
***
Type:status
Origin: @z_rose text
Text: text and stuff
URL:
ID: 95480980026040320
Time: Mon Jul 25 08:10:14 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 20776334
Hashtags:
***
Now I want to import them into pandas for bulk analysis, so I first have to convert them into a format pandas can read. I want to write a script that converts the above to .csv, looking something like this (User is the name of the file):
User Type Origin Text URL ID Time RetCount Favorite MentionedEntities Hashtags
4012987 status @z_rose yes yes Null 95482459084427264 Mon Jul 25 08:16:06 CDT 2011 0 false 20776334 Null
4012987 status @aaronesilvers text text Null 95481610861953024 Mon Jul 25 08:12:44 CDT 2011 0 false 2226621 Null
(The formatting isn't perfect, but hopefully you get the idea)
I had some code that worked by relying on every record having the same regular 12-line layout, but unfortunately some of the files contain blank lines inside some fields. Basically I want:
fields[] = ['User', 'Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']
starPair = 0;
User = filename;
read(file)
# Determine if the current entry has ended
if(stringRead == "***"){
    if(starPair == 0)
        starPair++;
    if(starPair == 1){
        row = row++;
        starPair = 0;
    }
}
# If the string read matches a column field
if(stringRead == fields[])
    while(strRead != fields[]) # until the next field has been found
        # extract all characters into the correct column field
However, a problem arises because some field values may themselves contain words from the fields[] list. I can check for a preceding \n character first, which greatly reduces the number of erroneous matches but does not eliminate them.
Can anyone point me in the right direction?
Thanks in advance!
Your code/pseudocode doesn't look like Python, but since you used the python tag, here's how I would go about it. First read the whole file in, then for each field use a regex to find the value after it, push the results into a 2D list, and write that list out as CSV. Also, your expected output looks more like TSV (tab-separated rather than comma-separated).
import re
import csv

filename = '4012987'
User = filename

# read your file into a string
with open(filename, 'r') as myfile:
    data = myfile.read()

fields = ['Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount', 'Favorite', 'MentionedEntities', 'Hashtags']
csvTemplate = [['User'] + fields]  # header row

# for each field use regex to get the entry
for n, field in enumerate(fields):
    # anchor at the start of a line and skip only spaces/tabs after the colon,
    # so an empty field stays empty instead of swallowing the next line
    matches = re.findall(r'^' + re.escape(field) + r':[ \t]*([^\n]*)$', data, re.MULTILINE)
    # this should run only the first time to fill your 2d list with the right amount of lists
    while len(csvTemplate) <= len(matches):
        csvTemplate.append([None] * (len(fields) + 1))  # Null isn't a python reserved word
    for e, m in enumerate(matches):
        if m != '':
            csvTemplate[e + 1][n + 1] = m.strip()

# set the User column
for i in range(1, len(csvTemplate)):
    csvTemplate[i][0] = User

# output to csv....if you want tsv look at https://stackoverflow.com/a/29896136/3462319
# (Python 3: open in text mode with newline=''; Python 2 used "wb")
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(csvTemplate)
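For the case the question mentions, where some values contain blank lines, a line-by-line pass may be more forgiving than one regex per field. The following is only a minimal sketch under stated assumptions: a line consisting solely of *** never occurs inside a value, and a line starts a new field only if it begins with a known field name that the current record has not seen yet. The names `parse_records`, `FIELDS`, and `sample` are illustrative and not part of the answer above.

```python
FIELDS = ['Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount',
          'Favorite', 'MentionedEntities', 'Hashtags']


def parse_records(lines):
    """Yield one dict per record from an iterable of text lines."""
    record, current, in_record = {}, None, False
    for raw in lines:
        line = raw.rstrip('\r\n')
        if line.strip() == '***':
            # A *** line either opens a record or closes the open one.
            if in_record and record:
                yield {k: v.strip() for k, v in record.items()}
            record, current, in_record = {}, None, not in_record
            continue
        if not in_record:
            continue
        name, sep, value = line.partition(':')
        if sep and name in FIELDS and name not in record:
            # A known field name not yet seen in this record starts a new field.
            current = name
            record[current] = value.strip()
        elif current is not None:
            # Blank or continuation line: keep it inside the current value.
            record[current] += '\n' + line


sample = """***
Type:status
Origin: @z_rose yes
Text: yes

and a second line
URL:
ID: 95482459084427264
Time: Mon Jul 25 08:16:06 CDT 2011
***
***
Type:status
Origin: @aaronesilvers text
Text: text
URL:
ID: 95481610861953024
***
"""

rows = list(parse_records(sample.splitlines()))
```

Each resulting dict can be written out with `csv.DictWriter` or collected into a list for pandas. A value that itself contains a line such as "URL: ..." before the real URL line would still be misread, so this reduces the problem rather than eliminating it.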
You can use a combination of regex and dict comprehension:
import regex as re, pandas as pd

rx_parts = re.compile(r'^{}$(?s:.*?)^{}$'.format(re.escape('***'), re.escape('***')), re.MULTILINE)
rx_entry = re.compile(r'^(?P<key>\w+):[ ]*(?P<value>.+)$', re.MULTILINE)

result = ({m.group('key'): m.group('value')
           for m in rx_entry.finditer(part.group(0))}
          for part in rx_parts.finditer(your_string_here))

df = pd.DataFrame(result)
print(df)
Which gives:
  Favorite Hashtags                 ID MentionedEntities               Origin  \
0    false           95482459084427264          20776334          @z_rose yes
1    false           95481610861953024           2226621  @aaronesilvers text
2    false           95480980026040320          20776334         @z_rose text

  RetCount            Text                          Time    Type URL
0        0             yes  Mon Jul 25 08:16:06 CDT 2011  status
1        0            text  Mon Jul 25 08:12:44 CDT 2011  status
2        0  text and stuff  Mon Jul 25 08:10:14 CDT 2011  status
Explanation:
- Split the string into parts that are delimited by *** on both sides
- Look for key-value pairs on each line
- Put all pairs in a dict
As a result, we have a generator of dictionaries, which we then pass to pandas.
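The three steps above can also be sketched with the standard-library `re` module alone. This is an illustrative variant, not the code from the answer: it assumes the whole input fits in one string, it uses the `re.DOTALL` flag in place of the scoped `(?s:...)` group, and it uses `.*` rather than `.+` for the value so that empty fields such as `URL:` are kept as empty strings. The names `RX_PART`, `RX_ENTRY`, `parse`, and `sample` are made up for the example.

```python
import re

# Step 1: a record is everything between a pair of lines containing only ***.
RX_PART = re.compile(r'^\*\*\*$(.*?)^\*\*\*$', re.MULTILINE | re.DOTALL)
# Step 2: a key-value pair is "Key: value" on a single line.
RX_ENTRY = re.compile(r'^(?P<key>\w+):[ ]*(?P<value>.*)$', re.MULTILINE)


def parse(text):
    """Step 3: return a list with one dict of key-value pairs per record."""
    return [
        {m.group('key'): m.group('value').strip()
         for m in RX_ENTRY.finditer(part.group(1))}
        for part in RX_PART.finditer(text)
    ]


sample = (
    "***\n"
    "Type:status\n"
    "Origin: @z_rose yes\n"
    "Text: yes\n"
    "URL:\n"
    "ID: 95482459084427264\n"
    "***\n"
    "***\n"
    "Type:status\n"
    "Origin: @aaronesilvers text\n"
    "Text: text\n"
    "URL:\n"
    "ID: 95481610861953024\n"
    "***\n"
)

records = parse(sample)
```

The resulting list of dicts can be handed to `pd.DataFrame(records)` in the same way as the generator in the answer.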
Tips:
The code has not been tested with large amounts of data, and certainly not with 4 GB. You also need the newer regex module for the expression to work.
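Since the question describes many small files rather than one 4 GB file, one way to keep memory use low might be to parse one file at a time and append its rows to a single CSV. The sketch below is untested against the real data and rests on assumptions: every file in the source directory is a record file named after its user, each file is small enough to read whole, and the output CSV is written outside that directory. `iter_rows` and `convert_directory` are hypothetical helper names, and the patterns are a standard-library `re` adaptation of the ones in the answer.

```python
import csv
import re
from pathlib import Path

FIELDS = ['Type', 'Origin', 'Text', 'URL', 'ID', 'Time', 'RetCount',
          'Favorite', 'MentionedEntities', 'Hashtags']
RX_PART = re.compile(r'^\*\*\*$(.*?)^\*\*\*$', re.MULTILINE | re.DOTALL)
RX_ENTRY = re.compile(r'^(?P<key>\w+):[ ]*(?P<value>.*)$', re.MULTILINE)


def iter_rows(path):
    """Yield one row dict per record in a single file; User is the file name."""
    path = Path(path)
    text = path.read_text(encoding='utf-8', errors='replace')
    for part in RX_PART.finditer(text):
        row = {m.group('key'): m.group('value').strip()
               for m in RX_ENTRY.finditer(part.group(1))
               if m.group('key') in FIELDS}
        row['User'] = path.name
        yield row


def convert_directory(src_dir, out_csv):
    """Write the records of every file in src_dir to one CSV, file by file.

    Assumes out_csv lies outside src_dir. Returns the number of rows written.
    """
    count = 0
    with open(out_csv, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['User'] + FIELDS)
        writer.writeheader()
        for path in sorted(Path(src_dir).iterdir()):
            if not path.is_file():
                continue
            for row in iter_rows(path):
                writer.writerow(row)  # missing fields are written as empty
                count += 1
    return count
```

A call such as `convert_directory('records', 'all_users.csv')` (both paths hypothetical) would then produce a file that `pandas.read_csv` can load, optionally with its `chunksize` parameter if the combined CSV is too large to load at once.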