Reading only a selected column from CSV files when all other columns are guaranteed to be identical
I have a bunch of CSV files that I am trying to combine into one CSV file. The fields are separated by a comma plus a space, and the files look like this:
```
'initial', 'pos', 'orientation', 'ratio'
'chr', '106681', '+', '0.06'
'chr', '106681', '+', '0.88'
'chr', '106681', '+', '0.01'
'chr', '106681', '+', '0.02'
```
As you can see, all values are the same except ratio. The concatenated file I want to create would look like this:
```
'filename', 'initial', 'pos', 'orientation', 'ratio1', 'ratio2', 'ratio3'
'jon', 'chr', '106681', '+', '0.06', '0.88', '0.01'
```
So I want to iterate over every file, keeping only a single copy of the values that repeat but all of the ratio values, and appending one row per file to the merged table. This turned out to be much more confusing than I expected. I have the following piece of code for reading the CSV files:
```python
import csv

concatenated_file = open('josh.csv', "rb")
reader = csv.reader(concatenated_file)
for row in reader:
    print row
```
This prints:

```
['chrom', 'pos', 'strand', 'meth_ratio']
['chr2', '106681786', '+', '0.06']
['chr2', '106681796', '+', '0.88']
['chr2', '106681830', '+', '0.01']
['chr2', '106681842', '+', '0.02']
```
It would be very helpful if someone could show me how to keep only one copy of the values that are identical, but all of the ratio values.
It's a one-liner with pandas.read_csv(), and we can even strip the quoting:
```python
import pandas as pd

csva = pd.read_csv('a.csv', header=0, quotechar="'", delim_whitespace=True)
csva['ratio']
```

which gives:

```
0    0.06
1    0.88
2    0.01
3    0.02
Name: ratio, dtype: float64
```
A few points:
- note that your separator is actually comma + space, so this is not plain vanilla CSV; see "How do I make the delimiter in read_csv more flexible?"
- note that we stripped the quoting from the numeric fields by setting quotechar="'" (which is why ratio comes back as float64 rather than strings)
- if you really insist on saving memory (probably not needed), you can drop all columns other than "ratio" after read_csv; see the pandas docs
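Pulling those points together, here is a minimal self-contained sketch. The sample data is written inline for illustration; with a real file you would pass its path instead of the StringIO buffer. Note that sep=r'\s+' is equivalent to delim_whitespace=True, which recent pandas versions deprecate:

```python
import io

import pandas as pd

# Inline stand-in for one of the per-sample files described in the question.
sample = (
    "'initial', 'pos', 'orientation', 'ratio'\n"
    "'chr', '106681', '+', '0.06'\n"
    "'chr', '106681', '+', '0.88'\n"
    "'chr', '106681', '+', '0.01'\n"
    "'chr', '106681', '+', '0.02'\n"
)

# Split on whitespace; quotechar="'" strips the single quotes, so the
# last column parses straight to floats.
csva = pd.read_csv(io.StringIO(sample), header=0, quotechar="'", sep=r"\s+")
ratios = csva['ratio'].tolist()
print(ratios)
```

From there, collecting the ratio lists from each file into one output row per file is a plain loop over paths.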
First, put it in English.

You have to read all of those fields from somewhere, so it might as well be from the first line.

Then you need to read the last column from each subsequent line and tack it onto the end of the new row, ignoring the rest.
So, to turn this into Python:
```python
import csv

with open(outpath, 'wb') as outfile:
    writer = csv.writer(outfile)
    for inpath in paths:
        with open(inpath, 'rb') as infile:
            reader = csv.reader(infile)
            # Read all values (including the ratio) from the first row
            new_row = next(reader)
            # For every subsequent row...
            for row in reader:
                # ...read the ratio, tack it on, ignore the rest
                new_row.append(row[-1])
            writer.writerow(new_row)
```
I'm not sure if the comments really add anything; I think my Python is easier to follow than my English. :)
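For what it's worth, in Python 3 only the file handling changes: text mode plus newline='' replaces the 'rb'/'wb' modes. A self-contained sketch, where the input file names and contents are made up for illustration (headerless stand-ins shaped like the question's data):

```python
import csv
import os
import tempfile

# Build two small stand-in input files: identical rows except the last column.
tmpdir = tempfile.mkdtemp()
inputs = {
    'jon.csv': [['chr', '106681', '+', '0.06'],
                ['chr', '106681', '+', '0.88'],
                ['chr', '106681', '+', '0.01']],
    'sam.csv': [['chr', '106681', '+', '0.12'],
                ['chr', '106681', '+', '0.34'],
                ['chr', '106681', '+', '0.56']],
}
paths = []
for name, rows in inputs.items():
    path = os.path.join(tmpdir, name)
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerows(rows)
    paths.append(path)

outpath = os.path.join(tmpdir, 'merged.csv')
# Same loop as above, in Python 3: text mode with newline='' as the
# csv docs require, instead of binary mode.
with open(outpath, 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for inpath in paths:
        with open(inpath, newline='') as infile:
            reader = csv.reader(infile)
            new_row = next(reader)       # keep every field of the first row
            for row in reader:
                new_row.append(row[-1])  # then only the ratio from the rest
            writer.writerow(new_row)

with open(outpath, newline='') as f:
    merged = list(csv.reader(f))
print(merged)
```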
It's worth knowing that what you are trying to do here is called "denormalization". From what I can tell, your merged data will contain an arbitrary number of ratio columns per row, so each row no longer holds a single value per field, but a whole set of values.
Denormalization is generally considered bad for a variety of reasons. There are times when denormalized data is easier or faster to work with - as long as you know you are doing it and why. Wikipedia has a good article on database normalization that explains the problems; it is worth reading so you understand what you are doing here and can make sure it is the right choice.