Reading mixed data types from a text file in Python

I have been given some "reports" from another piece of software that contain data I need to use. The file format is pretty simple: a description line starting with the # character gives the name / description of the variable, and it is followed by comma-separated data on the next line.

For example:

    #wavelength,'<a comment describing the data>'
    400.0,410.0,420.0, <and so on>
    #reflectance,'<a comment describing the data>'
    0.001,0.002,0.002, <and so on>
    #date,'time file was written'
    2012-03-06 13:12:36.694597  < this is the bit that stuffs me up!! >

      

When I first wrote the code, I expected all the data to be floats, but it turns out the files also contain dates and strings. For my purposes, all I care about is the data, which should be arrays of floats. Anything else I read (such as dates) can be treated as a string (even if it is technically, for example, a date).

My first attempt, which worked until I hit non-float entries, skips the #, takes the name that follows it, and creates a dictionary entry keyed by that name. The data line for that key is then split on commas into an array, with the rows stacked for 2D data, as in the code below.

    data = f.read()  # one string, so it can be split on newlines
    dataLines = data.split('\n')

    for i in range(0,len(dataLines)-1):
        if dataLines[i][0] == '#':
            key,comment = dataLines[i].split(',')
            keyList.append(key[1:])
            k+=1
        else: # it must be data
            d+=1
            dataList.append(dataLines[i])

        for j in range(0,len(dataList)):
            tmp = dataList[j]

            x = map(float,tmp.split(','))
            tempData = vstack((tempData,asarray(x)))

    self.__report[keyList[k]] = tempData  

      

When the file contains a non-float entry, the line "x = map(float,tmp.split(','))" fails (there are no commas in that data line). I thought of checking whether the value is a string using isinstance, but the file reader returns everything from the file as a string (of course). So I tried converting the string from the file to a float array and, if that failed, treating it as an array of strings, like this:

     try:
         scipy.array(tmp,dtype=float64)  #try to convert
         x = map(float,tmp.split(','))

     except:# ValueError: # must be a string
         x = zeros((1,1))
         x = asarray([tmp])
         #tempData = vstack((tempData,asarray(x)),dtype=str)
         if 'tempData' in locals():
             pass
         else:
             tempData = zeros((len(x)))

         tempData = vstack((tempData,asarray(x)))

      

This, however, causes EVERYTHING to be read as a character array, so I cannot index the data as a numeric numpy array. All the data is in the dictionary, but with a dtype such as |S8. The try block seems to always hit the exception.

I would appreciate any advice on working with this so that I can differentiate between floats and strings. I don't know the order of the data before I get the report.

Also, large files can take quite a long time to load into memory, and any advice on how to make it more efficient would be appreciated as well.

Thanks.



2 answers


I guess what you ultimately want is x, which should be in the format [400.0, 410.0, 420.0].

One way to handle this is to do the split on commas and the conversion to float as two separate statements, so that you can catch the ValueError raised when an element is a string rather than a float or int.



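    keyList = []
    dataList = []
    with open('sample_data', 'r') as f:
        for line in f:  # iterate over the lines of the file
            line = line.strip()  # drop the trailing newline
            if not line:
                continue  # skip blank lines
            if line.startswith('#'):
                # split once, so commas inside the comment are harmless
                key, comment = line.split(',', 1)
                keyList.append(key[1:])
            else:  # it must be data
                dataList.append(line)

    for data in dataList:
        data_list = data.split(',')
        try:
            # a list comprehension converts eagerly (map() is lazy in
            # Python 3), so a non-numeric element raises ValueError here
            x = [float(item) for item in data_list]
        except ValueError:
            x = data  # not numeric (e.g. a date): keep the original string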

      

Also note the other minor changes I have made to your code that make it more pythonic in nature.
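Putting the pieces together, a minimal sketch of the whole approach might look like the following. The function name parse_report, and the choice to store numeric rows as lists of floats and everything else as the raw string, are assumptions for illustration and not part of the original code.

```python
def parse_report(lines):
    """Map each '#name' header to its data rows (lists of floats or raw strings)."""
    report = {}
    key = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('#'):
            # header line of the form '#name,<comment>': keep only the name
            key = line[1:].split(',', 1)[0].strip()
            report[key] = []
            continue
        if key is None:
            continue  # data before any header: nothing to attach it to
        try:
            row = [float(item) for item in line.split(',')]
        except ValueError:
            row = line  # not numeric (e.g. a date): keep the text as-is
        report[key].append(row)
    return report


# Hypothetical sample mirroring the layout described in the question.
sample = [
    "#wavelength,'a comment'",
    "400.0,410.0,420.0",
    "#date,'time file was written'",
    "2012-03-06 13:12:36.694597",
]
report = parse_report(sample)
```

An open file object can be passed in place of the list, since iterating over a file yields one line at a time and avoids holding the whole file in memory. If NumPy arrays are needed, converting each key's numeric rows once at the end (for example with numpy.asarray) is likely cheaper than calling vstack inside the loop, because vstack copies the accumulated array on every call.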



This might be a silly suggestion, but couldn't you just do an extra check, if ',' in dataLines[i], before adding a row to the data list? Or, failing that, write a regex to check for a comma-separated list of floating point numbers?

    (\d+(\.\d+)?)(,\d+(\.\d+)?)*

might do the trick (it also accepts integers).
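As a rough sketch of the regex idea, the check could be wrapped in a small helper. The name is_float_list and the exact pattern (optional sign, optional whitespace around commas, no exponent notation) are assumptions for illustration.

```python
import re

# Assumed pattern: optionally signed decimal numbers separated by commas.
NUMBER = r'[+-]?\d+(?:\.\d+)?'
FLOAT_LIST = re.compile(NUMBER + r'(?:\s*,\s*' + NUMBER + r')*')


def is_float_list(line):
    """Return True if the whole line is a comma-separated list of numbers."""
    # fullmatch requires the entire line to match, not just a prefix
    return FLOAT_LIST.fullmatch(line.strip()) is not None


numeric_ok = is_float_list('400.0,410.0,420.0')
date_ok = is_float_list('2012-03-06 13:12:36.694597')
```

Using fullmatch matters here: a date line begins with digits, so a prefix match would accept it. Note that this pattern rejects values written in exponent form such as 1e-3; attempting float() and catching ValueError covers those cases without extending the regex.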
