Python prints specific lines from a file

Question

Background:

                    Table$Gene=Gene1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.928 0.00484        0.918        0.937
    1   2664     304    0.822 0.00714        0.808        0.836
    2   2360     104    0.786 0.00766        0.771        0.801
    3   2256      48    0.769 0.00787        0.754        0.784
    4   2208      40    0.755 0.00803        0.739        0.771
    5   2256      48    0.769 0.00787        0.754        0.784
    6   2208      40    0.755 0.00803        0.739        0.771

                Table$Gene=Gene2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.938 0.00484        0.918        0.937
    1   2664     304    0.822 0.00714        0.808        0.836
    2   2360     104    0.786 0.00766        0.771        0.801
    3   2256      48    0.769 0.00787        0.754        0.784
    4   1000      40    0.744 0.00803        0.739        0.774
#There is a new line ("\n") here too, it just doesn't come out in the code.

What I want seems simple. I want to turn the above file into output that looks like this:

Gene1  0.755
Gene2  0.744

i.e. each gene and the last number in the survival column from each section.

I've tried several approaches: using a regular expression, and reading the file as a list and calling ".next()". One example of the code I've tried:

fileopen = open(sys.argv[1]).readlines()  # Read in the file as a list.
for index,line in enumerate(fileopen):   # Enumerate items in list
    if "Table" in line:  # Find the items with "Table" (This will have my gene name)
            line2 = line.split("=")[1]  # Parse line to get my gene name
            if "\n" in fileopen[index+1]: # This is the problem section.
                print fileopen[index]
            else:
                fileopen[index+1]

So, as you can see in the problem section, this attempt was trying to say:

If the next item in the list is a newline, print the item; otherwise the next line becomes the current line (and then I can split the line to pull out the number I want).

If someone can fix the code so I can figure out what I did wrong, I would appreciate it.

python parsing

user1288515 08 Aug 14 at 11:01

5 answers

Joop · Answer 1 · 2014-08-08T12:25:51+0000

This is a bit of overkill, but instead of hand-writing a parser for each piece of data, you can use an existing package such as pandas to read the file as CSV. You only need some code to pick out the relevant lines of the file. Non-optimized code (it reads the file twice):

import pandas as pd

def genetable(gene):
    # First pass: read all lines to locate the requested table.
    with open('gene.txt') as f:
        l = f.readlines()
    l.append("\n")  # sentinel blank line in case the file does not end with one
    lines = len(l)
    skiprows = -1
    for (i, line) in enumerate(l):
        if "Table$Gene=Gene" + str(gene) in line:
            skiprows = i + 1  # skip everything up to and including the title line
        if skiprows >= 0 and line == "\n":
            # number of file lines from this blank line to the end of the file
            skipfooter = lines - i - 1
            # Second pass: let pandas parse only the rows of this table.
            df = pd.read_csv('gene.txt', sep='\t', engine='python', skiprows=skiprows, skipfooter=skipfooter)
            #  assuming tab separated data given your inputs. change as needed
            # assert df.columns.....
            return df
    return "Not Found"

This reads all the relevant data for that gene into a DataFrame.

You can then do:

genetable(2).survival  # Series with all survival rates
genetable(2).survival.iloc[-1]  # last item in survival

The advantage is that you have access to all elements, and badly formatted input is more likely to be caught instead of silently yielding wrong values. In my own code I would add assertions on the column names before returning the DataFrame, so that any parsing errors surface early and do not propagate.
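The column-name check mentioned above can also be sketched without pandas. The following is only an illustration, written for Python 3 and assuming whitespace-separated columns; the names `SAMPLE`, `EXPECTED_HEADER` and `last_survival` are made up for this example, and `SAMPLE` is an abbreviated copy of the data in the question.

```python
# Abbreviated copy of the question's data (hypothetical name, for illustration).
SAMPLE = """\
                    Table$Gene=Gene1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.928 0.00484        0.918        0.937
    6   2208      40    0.755 0.00803        0.739        0.771

                Table$Gene=Gene2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.938 0.00484        0.918        0.937
    4   1000      40    0.744 0.00803        0.739        0.774
"""

# Assumption: every table starts with these five column names.
EXPECTED_HEADER = ["time", "n.risk", "n.event", "survival", "std.err"]


def last_survival(lines, gene):
    """Return the last survival value for `gene` as a float, or None if absent."""
    in_section = False
    survival_col = None
    last = None
    for line in lines:
        fields = line.split()
        if line.strip().startswith("Table$Gene="):
            if in_section:
                break  # the next table started, so our section is over
            in_section = line.strip().split("=", 1)[1] == gene
        elif not in_section:
            continue
        elif not fields:
            break  # a blank line ends the section
        elif survival_col is None:
            # The first line after the title is the header: fail early if it changed.
            assert fields[:5] == EXPECTED_HEADER, "unexpected columns: %r" % fields
            survival_col = fields.index("survival")
        else:
            last = float(fields[survival_col])
    return last


print(last_survival(SAMPLE.splitlines(), "Gene2"))
```

If the header ever changes (for example a column is inserted before `survival`), the assertion fails immediately, and the function never returns a value from the wrong column.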

monopole · Answer 2 · 2014-08-08T11:19:48+0000

This worked when I tried:

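import sys
with open(sys.argv[1]) as f:  # assumed: same input file as in the question
    filelines = f.readlines()

gene = 1  # sections are assumed to be numbered Gene1, Gene2, ... in order
for i in range(len(filelines)):
    if filelines[i].strip() == "":  # a blank line ends a section
        print("Gene" + str(gene) + " " + filelines[i-1].split()[3])
        gene += 1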
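As for what went wrong in the original attempt: `readlines()` keeps the trailing newline on each line, so `"\n" in fileopen[index+1]` is true for ordinary data lines too, not just blank ones, and the bare expression `fileopen[index+1]` in the `else` branch does nothing. A minimal sketch of the look-ahead the question describes, written for Python 3, might look like this; `SAMPLE` and `last_rows` are hypothetical names, and the sketch assumes every table has at least one data row.

```python
# Abbreviated copy of the question's data (hypothetical name, for illustration).
SAMPLE = """\
                    Table$Gene=Gene1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.928 0.00484        0.918        0.937
    6   2208      40    0.755 0.00803        0.739        0.771

                Table$Gene=Gene2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.938 0.00484        0.918        0.937
    4   1000      40    0.744 0.00803        0.739        0.774
"""


def last_rows(lines):
    """Yield (gene, survival) pairs using an explicit index, as in the question."""
    index = 0
    while index < len(lines):
        line = lines[index]
        if "Table" in line:
            gene = line.strip().split("=")[1]
            # Walk forward while the NEXT line is a non-blank line of this table.
            while (index + 1 < len(lines)
                   and lines[index + 1].strip()
                   and "Table" not in lines[index + 1]):
                index += 1
            # index now points at the last row of the section.
            yield gene, lines[index].split()[3]
        index += 1


for gene, survival in last_rows(SAMPLE.splitlines()):
    print(gene, survival)
```

Testing with `.strip()` instead of `"\n" in ...` treats a line as blank only when it contains nothing but whitespace, and the end-of-list check means the file does not need to end with a blank line.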

Roland Smith · Answer 3 · 2014-08-08T11:20:03+0000

You can try something like this (I copied your data into foo.dat):

In [1]: with open('foo.dat') as input:
   ...:     lines = input.readlines()
   ...:

Using with ensures that the file is closed after reading.

In [3]: lines = [ln.strip() for ln in lines]

This removes leading and trailing whitespace, including the newline at the end of each line.

In [5]: startgenes = [n for n, ln in enumerate(lines) if ln.startswith("Table")]

In [6]: startgenes
Out[6]: [0, 10]

In [7]: emptylines = [n for n, ln in enumerate(lines) if len(ln) == 0]

In [8]: emptylines
Out[8]: [9, 17]

Using emptylines relies on the records being separated by lines containing only whitespace.

In [9]: lastlines = [n-1 for n, ln in enumerate(lines) if len(ln) == 0]

In [10]: for first, last in zip(startgenes, lastlines):
   ....:     gene = lines[first].split("=")[1]
   ....:     num = lines[last].split()[3]  # survival is the fourth column
   ....:     print gene, num
   ....:     
Gene1 0.755
Gene2 0.744
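The same blank-line separation can also be expressed with `itertools.groupby`, which avoids keeping separate index lists. This is only a sketch for Python 3; `SAMPLE`, `records` and `summarize` are hypothetical names, and it assumes each record ends with a data row and that survival is the fourth whitespace-separated column.

```python
from itertools import groupby

# Abbreviated copy of the question's data (hypothetical name, for illustration).
SAMPLE = """\
                    Table$Gene=Gene1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.928 0.00484        0.918        0.937
    6   2208      40    0.755 0.00803        0.739        0.771

                Table$Gene=Gene2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.938 0.00484        0.918        0.937
    4   1000      40    0.744 0.00803        0.739        0.774
"""


def records(lines):
    """Yield each blank-line-separated record as a list of stripped lines."""
    stripped = (ln.strip() for ln in lines)
    # bool("") is False, so consecutive non-blank lines form one group.
    for has_text, group in groupby(stripped, key=bool):
        if has_text:
            yield list(group)


def summarize(lines):
    """Return a list of (gene, last survival value) pairs."""
    result = []
    for rec in records(lines):
        if rec[0].startswith("Table"):
            gene = rec[0].split("=")[1]
            result.append((gene, rec[-1].split()[3]))  # 4th column = survival
    return result


for gene, num in summarize(SAMPLE.splitlines()):
    print(gene, num)
```

Because the last group simply ends when the input ends, this variant does not need a trailing blank line after the final table.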

fredtantini · Answer 4 · 2014-08-08T11:20:03+0000

Here is my solution:

>>> with open('t.txt','r') as f:
...     for l in f:
...         if "Table" in l:
...             gene = l.split("=")[1][:-1]
...         elif l not in ['\n', '\r\n']:
...             surv = l.split()[3]
...         else:
...             print gene, surv
...
Gene1 0.755
Gene2 0.744

Rhand · Answer 5 · 2014-08-08T11:20:04+0000

Instead of checking for a blank line, print each gene's result when the next table starts, and print the last one when you finish reading the file:

lines = open("testgenes.txt").readlines()
table = ""
finalsurvival = 0.0
for line in lines:
    if "Table" in line:
        if table != "": # print previous survival
            print table, finalsurvival
        table = line.strip().split('=')[1]
    else:
        try:                
            finalsurvival = line.split('\t')[4]
        except IndexError:
            continue
print table, finalsurvival
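The code above assumes tab-separated columns. If the file is separated by runs of spaces, as it appears in the question, a variant of the same "remember the latest value" idea could split on any whitespace. This is a sketch for Python 3 with hypothetical names (`SAMPLE`, `last_survival_by_gene`); it assumes data rows start with an integer time value and that survival is the fourth column.

```python
# Abbreviated copy of the question's data (hypothetical name, for illustration).
SAMPLE = """\
                    Table$Gene=Gene1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.928 0.00484        0.918        0.937
    6   2208      40    0.755 0.00803        0.739        0.771

                Table$Gene=Gene2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.938 0.00484        0.918        0.937
    4   1000      40    0.744 0.00803        0.739        0.774
"""


def last_survival_by_gene(lines):
    """Map each gene name to the survival value of its last data row."""
    result = {}
    gene = None
    for line in lines:
        fields = line.split()  # splits on any run of spaces or tabs
        if line.strip().startswith("Table"):
            gene = line.strip().split("=")[1]
        elif gene is not None and len(fields) >= 4 and fields[0].isdigit():
            # Overwritten on every data row, so the last row of a table wins.
            result[gene] = fields[3]
    return result


for gene, survival in last_survival_by_gene(SAMPLE.splitlines()).items():
    print(gene, survival)
```

The `isdigit()` test skips the header and blank lines without needing a `try`/`except`, at the cost of assuming integer time values.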
