A method for guessing data types currently represented as strings

I am currently parsing CSV tables and need to detect the "datatypes" of the columns. I don't know the exact format of the values. Obviously, everything the CSV parser outputs is a string. The data types I am currently interested in are:

  • integer
  • floating point
  • date
  • boolean
  • string

My current thinking is to test a sample of rows (maybe a few hundred?) in order to determine the types of data present through pattern matching.

I'm particularly concerned about the date data type. Is there a Python module for parsing common date idioms (obviously I won't be able to detect them all)?

How about integers and floating point numbers?

+2




5 answers


Dateutil comes to mind for parsing dates.

For integers and floats, you can always attempt the cast inside a try/except block.



>>> f = "2.5"
>>> i = "9"
>>> ci = int(i)
>>> ci
9
>>> cf = float(f)
>>> cf
2.5
>>> g = "dsa"
>>> cg = float(g)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: 'dsa'
>>> try:
...   cg = float(g)
... except ValueError:
...   print("g is not a float")
...
g is not a float
>>>
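As a rough sketch of how the try/except cast could be applied to a whole column instead of a single value: the names can_cast and guess_column_type are made up for this illustration, and the rule "pick the narrowest type that accepts every sampled value" is an assumption, not something stated in the answer.

```python
def can_cast(cast, value):
    """Return True if cast(value) succeeds, False if it raises ValueError."""
    try:
        cast(value)
        return True
    except ValueError:
        return False


def guess_column_type(values):
    """Guess int, float, or str for a column given as a list of strings."""
    # Try the narrowest type first; an empty sample falls through to str.
    for candidate in (int, float):
        if values and all(can_cast(candidate, v) for v in values):
            return candidate
    return str
```

Note that int() and float() are fairly permissive: they accept surrounding whitespace, and float() also accepts strings such as "nan" and "inf", so a stricter check may be needed for some data.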

      

+3




ast.literal_eval() can handle the easy cases.
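A minimal sketch of what that might look like, assuming only int, float, and bool should be recognized and everything else is treated as a string. The function name literal_type is hypothetical.

```python
import ast


def literal_type(s):
    """Guess a type by evaluating s as a Python literal; fall back to str."""
    try:
        value = ast.literal_eval(s.strip())
    except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
        # Not a valid Python literal, so keep it as a string.
        return str
    # Lists, tuples, None, etc. are valid literals but not wanted here.
    if isinstance(value, (bool, int, float)):
        return type(value)
    return str
```

Because this follows Python literal syntax, "True" is recognized as a boolean but "true" is not, and dates such as "2019-01-01" come back as str.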



+3




"The datatypes I'm currently interested in ..."

These don't exist in a CSV file. The data is just strings. Only. Nothing more.

"check a sample of rows"

That tells you nothing beyond what you saw in the sample. The row after your sample might contain a string that looks completely different from the sampled ones.

The only way to handle CSV files is to write CSV processing applications that assume specific data types and try to convert. You can't "discover" much about a CSV file.

If column 1 is supposed to be a date, you will have to look at the string and work out the format. It could be anything: a number, or a typical Gregorian date in American or European format (there is no way to tell whether 1/1/10 is American or European).

import datetime

try:
    # some_format stands for the strptime pattern you worked out, e.g. "%m/%d/%Y"
    x = datetime.datetime.strptime(row[0], some_format)
except ValueError:
    pass  # column is not valid
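To illustrate the ambiguity mentioned above, here is a small sketch that tries several strptime patterns and reports every one that matches. The candidate list and the name matching_formats are assumptions chosen for the example; a real file may need different patterns.

```python
import datetime

# Assumed candidates: ISO, American (month first), European (day first).
CANDIDATE_FORMATS = ["%Y-%m-%d", "%m/%d/%y", "%d/%m/%y"]


def matching_formats(s):
    """Return {format: parsed date} for every candidate format that accepts s."""
    matches = {}
    for fmt in CANDIDATE_FORMATS:
        try:
            matches[fmt] = datetime.datetime.strptime(s, fmt).date()
        except ValueError:
            pass  # this format does not fit the string
    return matches
```

A value like "1/2/10" matches both the American and the European pattern and yields two different dates, which is exactly why the format cannot be discovered from the data alone. A value like "13/1/10" only fits the day-first pattern.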

      

If column 2 is to be a float, you can only do this.

try:
    y = float(row[1])
except ValueError:
    pass  # column is not valid

      

If column 3 is to be an int, you can only do that.

try:
    z = int(row[2])
except ValueError:
    pass  # column is not valid

      

There is no way to "detect" whether the CSV contains floating-point values except by running float() on every row. If a row fails, then someone prepared the file incorrectly.

Since you have to do the conversion to find out whether the conversion is possible, you may as well simply process the row. It's easier and gives you results in one pass.

Don't waste time analyzing your data. Ask the people who created it what should be there.
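The one-pass approach could be sketched roughly as follows. The schema (date, float, int), the date format, and the names parse_date, CONVERTERS, and convert_rows are all assumptions for illustration; in practice they would come from whoever produced the file.

```python
import csv
import datetime
import io


def parse_date(s):
    # Assumed format; replace with whatever the file's author says it is.
    return datetime.datetime.strptime(s, "%Y-%m-%d").date()


# Hypothetical schema: column 1 is a date, column 2 a float, column 3 an int.
CONVERTERS = [parse_date, float, int]


def convert_rows(lines):
    """Convert every row in one pass, collecting failures instead of stopping."""
    good, bad = [], []
    for rownum, row in enumerate(csv.reader(lines), start=1):
        try:
            if len(row) != len(CONVERTERS):
                raise ValueError("expected %d columns, got %d"
                                 % (len(CONVERTERS), len(row)))
            good.append([conv(cell) for conv, cell in zip(CONVERTERS, row)])
        except ValueError as exc:
            bad.append((rownum, str(exc)))  # remember which row was invalid
    return good, bad


sample = io.StringIO("2019-01-01,2.5,9\n01/01/2019,2.5,9\n2019-01-02,3.5\n")
good, bad = convert_rows(sample)
```

Here only the first sample row converts cleanly; the second has a date in the wrong format and the third is missing a column, so both end up in bad together with their row numbers.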

+2




You may be interested in messytables, a Python library that does exactly this kind of type guessing on general Python data as well as on CSV and XLS files: https://github.com/okfn/messytables

It scales well to very large files, to streaming data from the internet, and more.

There is also an even simpler wrapper library called dataconverters, which includes a command line tool: http://okfnlabs.org/dataconverters/ (and an online service: https://github.com/okfn/dataproxy).

The main algorithm that does type guessing is here: https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164

+2




We tested ast.literal_eval(), but recovering from the error it raises is rather slow. If you want to cast data that you receive entirely as strings, I think a regex will be faster.

Something like the following worked really well for us.

import datetime
import re


def guess_type(s):
    """Guess the appropriate type for a given string."""
    # Raw strings keep \A, \. and \Z from being treated as string escapes.
    if re.match(r"\A[0-9]+\.[0-9]+\Z", s):
        return float
    elif re.match(r"\A[0-9]+\Z", s):
        return int
    # 2019-01-01 or 01/01/2019 or 01/01/19
    elif re.match(r"\A[0-9]{4}-[0-9]{2}-[0-9]{2}\Z", s) or \
         re.match(r"\A[0-9]{2}/[0-9]{2}/([0-9]{2}|[0-9]{4})\Z", s):
        return datetime.date
    elif re.match(r"\A(true|false)\Z", s):
        return bool
    else:
        return str

      

tests:

assert guess_type("this is a string") == str
assert guess_type("0.1") == float
assert guess_type("true") == bool
assert guess_type("1") == int
assert guess_type("2019-01-01") == datetime.date
assert guess_type("01/01/2019") == datetime.date
assert guess_type("01/01/19") == datetime.date
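One caveat worth sketching: the guessed type cannot always be called directly on the string, since bool("false") is True and datetime.date does not accept a string. A possible companion, assuming month-first dates for the slash formats and using the hypothetical names to_bool, to_date, CONVERTERS, and convert:

```python
import datetime


def to_bool(s):
    """Map the literal strings "true" and "false" to booleans."""
    if s == "true":
        return True
    if s == "false":
        return False
    raise ValueError("not a boolean: %r" % s)


def to_date(s):
    """Parse the date layouts matched by the regexes (month first is assumed)."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y"):
        try:
            return datetime.datetime.strptime(s, fmt).date()
        except ValueError:
            continue  # try the next layout
    raise ValueError("not a recognized date: %r" % s)


# One converter per type that the guessing step can return.
CONVERTERS = {int: int, float: float, bool: to_bool,
              datetime.date: to_date, str: str}


def convert(s, guessed_type):
    """Convert s using the converter registered for guessed_type."""
    return CONVERTERS[guessed_type](s)
```

With this in place, a value could be converted with something like convert(s, guess_type(s)).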

      

+1

