Python: printing specific lines from a file
Background:
Table$Gene=Gene1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 2872 208 0.928 0.00484 0.918 0.937
1 2664 304 0.822 0.00714 0.808 0.836
2 2360 104 0.786 0.00766 0.771 0.801
3 2256 48 0.769 0.00787 0.754 0.784
4 2208 40 0.755 0.00803 0.739 0.771
5 2256 48 0.769 0.00787 0.754 0.784
6 2208 40 0.755 0.00803 0.739 0.771
Table$Gene=Gene2
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 2872 208 0.938 0.00484 0.918 0.937
1 2664 304 0.822 0.00714 0.808 0.836
2 2360 104 0.786 0.00766 0.771 0.801
3 2256 48 0.769 0.00787 0.754 0.784
4 1000 40 0.744 0.00803 0.739 0.774
#There is a new line ("\n") here too, it just doesn't come out in the code.
What I want seems simple. I want to turn the above file into output that looks like this:
Gene1 0.755
Gene2 0.744
i.e. each gene and the last number in the survival column from each section.
I've tried several approaches: using a regular expression, and reading the file as a list and calling .next(). One example of the code I've tried:
fileopen = open(sys.argv[1]).readlines() # Read in the file as a list.
for index,line in enumerate(fileopen): # Enumerate items in list
    if "Table" in line: # Find the items with "Table" (This will have my gene name)
        line2 = line.split("=")[1] # Parse line to get my gene name
        if "\n" in fileopen[index+1]: # This is the problem section.
            print(fileopen[index])
        else:
            fileopen[index+1]
So, as you can see in the problem section, this attempt was meant to say:
if the next item in the list is a newline, print the current item; otherwise the next line becomes the current line (and then I can split that line to pull out the number I want).
If someone can fix the code so I can figure out what I did wrong, I would appreciate it.
A bit of overkill, but instead of manually writing a parser for each piece of data, use an existing package like pandas to read in the CSV file. You just need to write some code to pick out the appropriate lines in the file. Non-optimized code (it reads the file twice):
import pandas as pd

def genetable(gene):
    l = open('gene.txt').readlines()
    l += "\n" # add newline to end of file in case last line is not newline
    lines = len(l)
    skiprows = -1
    for (i, line) in enumerate(l):
        if "Table$Gene=Gene"+str(gene) in line:
            skiprows = i+1
        if skiprows>=0 and line=="\n":
            skipfooter = lines - i - 1
            df = pd.read_csv('gene.txt', sep='\t', engine='python', skiprows=skiprows, skipfooter=skipfooter)
            # assuming tab separated data given your inputs. change as needed
            # assert df.columns.....
            return df
    return "Not Found"
This reads all the relevant data for that gene into a DataFrame. You can then do:
    genetable(2).survival # Series with all survival rates
    genetable(2).survival.iloc[-1] # last item in survival
The advantage of this is that you have access to all elements, and any incorrect file formatting is more likely to be caught, which prevents the use of incorrect values. In my own code I would add assertions on the column names before returning the pandas DataFrame. You want to catch any parsing errors early so they don't get propagated.
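The column-name check mentioned above does not strictly need pandas. Here is a minimal sketch using only the standard library, assuming whitespace-separated columns laid out as in the question. `EXPECTED` and `survival_column` are hypothetical names introduced for illustration:

```python
# Leading column names expected in every section header
# (an assumption based on the data shown in the question).
EXPECTED = ["time", "n.risk", "n.event", "survival", "std.err"]


def survival_column(header_line):
    """Return the index of the survival column, failing loudly on an unexpected header."""
    names = header_line.split()
    if names[:len(EXPECTED)] != EXPECTED:
        raise ValueError("unexpected header: %r" % header_line)
    return names.index("survival")


header = "time n.risk n.event survival std.err lower 95% CI upper 95% CI"
row = "4 1000 40 0.744 0.00803 0.739 0.774"
print(row.split()[survival_column(header)])  # prints 0.744
```

Looking the column up by name rather than by a fixed position means a reordered or truncated header raises an error instead of silently returning the wrong number.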
You can try something like this (I copied your data into foo.dat):
In [1]: with open('foo.dat') as input:
   ...:     lines = input.readlines()
   ...:
Using with makes sure the file is closed after reading.
In [3]: lines = [ln.strip() for ln in lines]
This strips leading and trailing whitespace, including the trailing newline, from each line.
In [5]: startgenes = [n for n, ln in enumerate(lines) if ln.startswith("Table")]
In [6]: startgenes
Out[6]: [0, 10]
In [7]: emptylines = [n for n, ln in enumerate(lines) if len(ln) == 0]
In [8]: emptylines
Out[8]: [9, 17]
Using emptylines relies on the fact that the records are separated by lines containing only whitespace.
In [9]: lastlines = [n-1 for n, ln in enumerate(lines) if len(ln) == 0]
In [10]: for first, last in zip(startgenes, lastlines):
   ....:     gene = lines[first].split("=")[1]
   ....:     num = lines[last].split()[3]  # survival is the fourth column
   ....:     print(gene, num)
   ....:
Gene1 0.755
Gene2 0.744
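If the file might lack the blank separator lines, each record can instead be taken to run up to the next Table line or the end of the input. Here is a minimal sketch of that variant, assuming whitespace-separated columns with survival as the fourth field. `last_survivals` is a hypothetical helper, not part of the session above:

```python
def last_survivals(lines):
    """Return (gene, survival) for the last data row of each Table section."""
    lines = [ln.strip() for ln in lines]
    starts = [n for n, ln in enumerate(lines) if ln.startswith("Table")]
    # Each section ends where the next one starts, or at the end of the input.
    ends = starts[1:] + [len(lines)]
    results = []
    for first, end in zip(starts, ends):
        gene = lines[first].split("=")[1]
        # Skip the Table line and the header line, and ignore blank lines.
        rows = [ln for ln in lines[first + 2:end] if ln]
        if rows:
            results.append((gene, rows[-1].split()[3]))
    return results


# Shortened sample with no blank separator lines at all.
sample = [
    "Table$Gene=Gene1\n",
    "time n.risk n.event survival std.err lower 95% CI upper 95% CI\n",
    "5 2256 48 0.769 0.00787 0.754 0.784\n",
    "6 2208 40 0.755 0.00803 0.739 0.771\n",
    "Table$Gene=Gene2\n",
    "time n.risk n.event survival std.err lower 95% CI upper 95% CI\n",
    "4 1000 40 0.744 0.00803 0.739 0.774\n",
]
for gene, num in last_survivals(sample):
    print(gene, num)
```

Because blank lines are filtered out rather than used as boundaries, the same function should also work on input that does contain them.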
Instead of checking for a newline, print the previous gene's result each time you reach a new Table line, and print the last one when you finish reading the file:
lines = open("testgenes.txt").readlines()
table = ""
finalsurvival = 0.0
for line in lines:
    if "Table" in line:
        if table != "": # print previous survival
            print(table, finalsurvival)
        table = line.strip().split('=')[1]
    else:
        try:
            # survival is the fourth column (index 3); split() handles tabs or spaces
            finalsurvival = line.split()[3]
        except IndexError: # blank line
            continue
print(table, finalsurvival)
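The same single pass can be wrapped in a function that collects the results rather than printing them, which is easier to test and reuse. This is a sketch under the assumption that columns are whitespace-separated, data rows start with an integer time, and survival is the fourth field. `final_survival_by_gene` is a hypothetical name:

```python
def final_survival_by_gene(lines):
    """Map each gene name to the survival value of its last data row."""
    result = {}
    gene = None
    for line in lines:
        if line.startswith("Table"):
            gene = line.strip().split("=")[1]
            continue
        fields = line.split()
        # Data rows start with an integer time; this skips header and blank lines.
        if gene is not None and len(fields) > 3 and fields[0].isdigit():
            result[gene] = fields[3]  # later rows overwrite earlier ones
    return result


text = """Table$Gene=Gene1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
5 2256 48 0.769 0.00787 0.754 0.784
6 2208 40 0.755 0.00803 0.739 0.771

Table$Gene=Gene2
time n.risk n.event survival std.err lower 95% CI upper 95% CI
4 1000 40 0.744 0.00803 0.739 0.774
"""
for gene, survival in final_survival_by_gene(text.splitlines()).items():
    print(gene, survival)
```

Overwriting the dictionary entry on every data row means whatever value remains at the end of a section is the last one, so no look-ahead to the next line is needed.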