Reading only a selected column from CSV files when all other columns are guaranteed to be identical

I have a bunch of CSV files that I am trying to combine into a single CSV file. The CSV files are delimited by a single space and look like this:

'initial', 'pos', 'orientation', 'ratio'
'chr', '106681', '+', '0.06'
'chr', '106681', '+', '0.88'
'chr', '106681', '+', '0.01'
'chr', '106681', '+', '0.02'

      

As you can see, all values are the same except ratio. The concatenated file I want to create would look like this:

'filename','initial', 'pos', 'orientation', 'ratio1','ratio2','ratio3'
'jon' , 'chr', '106681', '+', '0.06' , '0.88' ,'0.01'

      

So, basically, I need to iterate over every file, keeping only a single value of initial, pos and orientation but all values of ratio, and updating the table in the merged file. This turned out to be much more confusing than I expected. I have the following piece of code for reading the CSV files:

import csv

with open('josh.csv', newline='') as concatenated_file:
    reader = csv.reader(concatenated_file)
    for row in reader:
        print(row)

      

which gives:

['chrom', 'pos', 'strand', 'meth_ratio']
['chr2', '106681786', '+', '0.06']
['chr2', '106681796', '+', '0.88']
['chr2', '106681830', '+', '0.01']
['chr2', '106681842', '+', '0.02']

      

It would be very helpful if someone could show me how to keep only one value of initial, pos and orientation (because they are the same), but all values of ratio.



2 answers


It's a one-liner with pandas.read_csv(). And we can even strip the quoting:

import pandas as pd

csva = pd.read_csv('a.csv', header=0, quotechar="'", delim_whitespace=True)

csva['ratio']
0    0.06
1    0.88
2    0.01
3    0.02
Name: ratio, dtype: float64

      



A few points:

  • actually, your separator is comma + space. In that sense it is not plain vanilla CSV. See "How do I make the delimiter in read_csv more flexible?"
  • note that we stripped the quoting from the numeric fields by setting quotechar="'"
  • if you really insist on saving memory (not needed), you can delete all columns of csva other than "ratio" after the call to read_csv. See the pandas doc.
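The comma-plus-space point above can also be handled without pandas. This is a minimal sketch using the standard csv module, assuming the column is named ratio as in the question's sample; read_ratios is a hypothetical helper name, not part of any library:

```python
import csv
import io


def read_ratios(lines, column='ratio'):
    """Return the values of one column from quoted, comma+space separated lines."""
    # skipinitialspace=True drops the space after each comma, so the
    # single quote that follows is recognised as the start of a quoted field.
    reader = csv.reader(lines, quotechar="'", skipinitialspace=True)
    header = next(reader)              # first line holds the column names
    idx = header.index(column)
    # Blank lines come back as empty lists, so skip them.
    return [float(row[idx]) for row in reader if row]


# Sample data copied from the question (assumed layout).
sample = io.StringIO(
    "'initial', 'pos', 'orientation', 'ratio'\n"
    "'chr', '106681', '+', '0.06'\n"
    "'chr', '106681', '+', '0.88'\n"
)
print(read_ratios(sample))
```

Passing an open file object instead of the StringIO should work the same way, since csv.reader accepts any iterable of lines.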


First, explain it in English.

You have to read all of these fields from somewhere, so it might as well be from the first row.

Then, having done that, you need to read the last column from each subsequent row and pack it onto the end of the new row, ignoring the rest.

So, to turn this into Python:

import csv

# outpath and paths (the list of input files) are assumed to be defined already
with open(outpath, 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for inpath in paths:
        with open(inpath, newline='') as infile:
            reader = csv.reader(infile)

            # Skip the header row
            next(reader)

            # Read all values (including the ratio) from the first data row
            new_row = next(reader)

            # For every subsequent row...
            for row in reader:
                # ... read the ratio, pack it on, ignore the rest
                new_row.append(row[-1])

            writer.writerow(new_row)

      



I'm not sure if the comments really add anything; I think my Python is easier to follow than my English. :)
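If you also want the filename column and the ratio1, ratio2, ... header from the question, the same idea could be extended roughly as follows. This is only a sketch: merge_ratio_files is a hypothetical name, it assumes plain comma-separated input with a header row, and it takes the shared column names from the first file.

```python
import csv
import os
import tempfile


def merge_ratio_files(paths, outpath):
    """Write one row per input file: filename, shared fields, then every ratio."""
    shared_names = None
    merged = []
    max_ratios = 0
    for inpath in paths:
        with open(inpath, newline='') as infile:
            reader = csv.reader(infile)
            header = next(reader)                 # column names of this file
            rows = [row for row in reader if row]
        if not rows:
            continue                              # no data rows to merge
        if shared_names is None:
            shared_names = header[:-1]            # every column except the ratio
        name = os.path.splitext(os.path.basename(inpath))[0]
        ratios = [row[-1] for row in rows]        # last column of every row
        max_ratios = max(max_ratios, len(ratios))
        # Shared values are taken from the first data row only.
        merged.append([name] + rows[0][:-1] + ratios)

    out_header = ['filename'] + (shared_names or [])
    out_header += ['ratio%d' % i for i in range(1, max_ratios + 1)]
    with open(outpath, 'w', newline='') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(out_header)
        writer.writerows(merged)


# Small demonstration in a temporary directory (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'jon.csv')
    with open(src, 'w', newline='') as f:
        f.write("chrom,pos,strand,meth_ratio\n"
                "chr2,106681786,+,0.06\n"
                "chr2,106681786,+,0.88\n")
    dst = os.path.join(tmp, 'merged.csv')
    merge_ratio_files([src], dst)
    with open(dst, newline='') as f:
        for row in csv.reader(f):
            print(row)
```

Files with fewer rows than the widest one simply produce shorter lines here; whether to pad them with empty cells is a choice left open.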


It's worth knowing that what you are trying to do here is called "denormalization". From what I can tell, your data will end up with an arbitrary number of ratio columns for each row, all of which have the same "meaning", so each row no longer holds a single value but a set of values.

Denormalization is generally considered bad, for a variety of reasons. There are cases where denormalized data is easier or faster to work with; as long as you know that you are doing it, and why, it can be a useful thing to do. Wikipedia has a good article on database normalization that explains the problems; it is worth reading so that you understand what you are doing here and can make sure it is the right thing to do.
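For comparison, a normalized ("long") layout keeps one row per measurement and just adds a column saying which file it came from. Here is a minimal sketch, assuming plain comma-separated input with a header row; to_long_rows is a hypothetical helper name:

```python
import csv
import io


def to_long_rows(name, lines):
    """Yield one [name, field, ..., ratio] row per data row, skipping the header."""
    reader = csv.reader(lines)
    next(reader)                  # skip the header row
    for row in reader:
        if row:                   # ignore blank lines
            yield [name] + row


# Sample data in the shape printed in the question (assumed layout).
sample = io.StringIO(
    "chrom,pos,strand,meth_ratio\n"
    "chr2,106681786,+,0.06\n"
    "chr2,106681796,+,0.88\n"
)
for long_row in to_long_rows('jon', sample):
    print(long_row)
```

Every row then has the same fixed set of columns, no matter how many ratios each file holds.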
