A method for guessing data types currently represented as strings
I am currently parsing CSV tables and need to detect the "datatypes" of the columns. I don't know the exact format of the values. Obviously, everything the CSV parser outputs is a string. The data types I am currently interested in are:
- integer
- floating point
- date
- boolean
- string
My current thought is to test a sample of rows (maybe a few hundred?) and determine the types of data present through pattern matching.
I'm particularly concerned about the date data type. Is there a Python module for parsing common date idioms (obviously I won't be able to detect them all)?
How about integers and floating point numbers?
Dateutil comes to mind for parsing dates.
For integers and floats, you can always try a cast inside a try/except clause:
>>> f = "2.5"
>>> i = "9"
>>> ci = int(i)
>>> ci
9
>>> cf = float(f)
>>> cf
2.5
>>> g = "dsa"
>>> cg = float(g)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: 'dsa'
>>> try:
...     cg = float(g)
... except ValueError:
...     print("g is not a float")
...
g is not a float
>>>
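The try/except cast above can be wrapped in a small reusable helper. This is only a minimal sketch: the names `try_cast` and `narrowest_number` are hypothetical and not part of any library.

```python
def try_cast(text, caster):
    """Return caster(text), or None if the conversion raises ValueError."""
    try:
        return caster(text)
    except ValueError:
        return None


def narrowest_number(text):
    """Try int first, then float; fall back to the original string."""
    for caster in (int, float):
        value = try_cast(text, caster)
        if value is not None:
            return value
    return text


print(narrowest_number("9"))    # 9
print(narrowest_number("2.5"))  # 2.5
print(narrowest_number("dsa"))  # dsa
```

Note that float() also accepts strings such as "nan" and "inf", so a successful cast does not always mean the value is an ordinary number.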
"The datatypes I'm currently interested in..."
They don't exist in the CSV file. Data is just strings. Only. Nothing more.
"test a sample of rows"
That tells you nothing except what you saw in the sample. The next row after your sample could be a string that looks completely different from the sampled strings.
The only way to handle CSV files is to write CSV processing applications that assume specific data types and try to convert. You can't "discover" much about a CSV file.
If column 1 is supposed to be a date, you will have to look at the string and work out the format. It could be anything: a number, or a typical Gregorian date in American or European format (there is no way to tell whether 1/1/10 is American or European).
import datetime

try:
    # some_format stands for whatever strptime format you expect in this column
    x = datetime.datetime.strptime(row[0], some_format)
except ValueError:
    pass  # column is not valid
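The ambiguity mentioned above can be made concrete by trying several formats against the same string. This is a sketch under stated assumptions: `CANDIDATE_FORMATS` and `matching_dates` are hypothetical names, and the format list is a guess that should be replaced with whatever the people producing the data actually use.

```python
import datetime

# Assumed candidate formats, for illustration only.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%y", "%d/%m/%y")


def matching_dates(text, formats=CANDIDATE_FORMATS):
    """Return {format: date} for every candidate format that parses text."""
    matches = {}
    for fmt in formats:
        try:
            matches[fmt] = datetime.datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # this format does not fit; try the next one
    return matches


print(matching_dates("2010-01-02"))
print(matching_dates("1/2/10"))   # fits both the month-first and day-first formats
print(matching_dates("13/2/10"))  # only the day-first format fits
```

When more than one format matches, as with "1/2/10", the string alone cannot settle which date was meant.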
If column 2 is supposed to be a float, all you can do is this:
try:
    y = float(row[1])
except ValueError:
    pass  # column is not valid
If column 3 is supposed to be an int, all you can do is this:
try:
    z = int(row[2])
except ValueError:
    pass  # column is not valid
There is no way to "detect" whether the CSV contains floating point values except by calling float on every row. If a row fails, then someone prepared the file incorrectly.
Since you have to do the conversion to see whether the conversion is possible, you might as well just process the row. It's simpler and gives you the results in one pass.
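The one-pass idea can be sketched as follows. Everything here is an illustrative assumption: the sample data, the column layout (date, float, int), and the names `parse_iso_date`, `CONVERTERS`, and `convert_rows` are hypothetical.

```python
import csv
import datetime
import io

# Hypothetical sample data: the second row has a bad float in column 2.
SAMPLE = "2010-01-02,2.5,9\n2010-01-03,oops,10\n"


def parse_iso_date(text):
    """Parse a YYYY-MM-DD string; raises ValueError if it does not fit."""
    return datetime.datetime.strptime(text, "%Y-%m-%d").date()


# One converter per column, in column order (assumed layout).
CONVERTERS = (parse_iso_date, float, int)


def convert_rows(lines, converters):
    """Yield (row_number, values, error); exactly one of values/error is None."""
    for row_number, row in enumerate(csv.reader(lines), start=1):
        try:
            values = [convert(cell) for convert, cell in zip(converters, row)]
        except ValueError as exc:
            yield row_number, None, str(exc)
        else:
            yield row_number, values, None


for row_number, values, error in convert_rows(io.StringIO(SAMPLE), CONVERTERS):
    print(row_number, values if error is None else "invalid row: " + error)
```

Conversion and validation happen in the same loop, so a bad row is reported at the moment it is read rather than in a separate analysis step.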
Don't waste time analyzing your data. Ask the people who created it what should be there.
You may be interested in the Python library messytables, which does exactly this kind of type guessing on both general Python data and CSV and XLS files.
It scales well to very large files, to streaming data off the internet, and so on.
There is also an even simpler wrapper library, called dataconverters, that includes a command line tool: http://okfnlabs.org/dataconverters/ (and an online service: https://github.com/okfn/dataproxy).
The main algorithm that does type guessing is here: https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164
We tested ast.literal_eval(), but recovering from its errors is rather slow. If you want to infer types from data that you receive entirely as strings, I think a regex will be faster.
Something like the following worked really well for us.
import datetime
import re


def guess_type(s):
    """Helper function to detect the appropriate type for a given string."""
    if re.match(r"\A[0-9]+\.[0-9]+\Z", s):
        return float
    elif re.match(r"\A[0-9]+\Z", s):
        return int
    # 2019-01-01 or 01/01/2019 or 01/01/19
    elif re.match(r"\A[0-9]{4}-[0-9]{2}-[0-9]{2}\Z", s) or \
            re.match(r"\A[0-9]{2}/[0-9]{2}/([0-9]{2}|[0-9]{4})\Z", s):
        return datetime.date
    elif re.match(r"\A(true|false)\Z", s):
        return bool
    else:
        return str
Tests:
assert guess_type("this is a string") == str
assert guess_type("0.1") == float
assert guess_type("true") == bool
assert guess_type("1") == int
assert guess_type("2019-01-01") == datetime.date
assert guess_type("01/01/2019") == datetime.date
assert guess_type("01/01/19") == datetime.date
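Per-value guesses like these can be combined into a per-column guess over a sample of rows, which is what the question proposes. The sketch below is an assumption, not part of the approach above: `guess_column_type` is a hypothetical name, and `simple_guess` is a cut-down stand-in so that the example runs on its own (pass in guess_type from above instead).

```python
import re


def simple_guess(s):
    """Stand-in per-value guesser (hypothetical); swap in guess_type from above."""
    if re.fullmatch(r"[0-9]+\.[0-9]+", s):
        return float
    if re.fullmatch(r"[0-9]+", s):
        return int
    return str


def guess_column_type(values, guess=simple_guess):
    """Combine per-value guesses for one column into a single type."""
    kinds = {guess(v) for v in values if v != ""}  # ignore empty cells
    if not kinds:
        return str  # nothing to go on
    if len(kinds) == 1:
        return kinds.pop()
    if kinds == {int, float}:
        return float  # mixed integers and decimals: widen to float
    return str  # any other mixture is treated as plain text


print(guess_column_type(["1", "2", "3"]))
print(guess_column_type(["1", "2.5", ""]))
print(guess_column_type(["1", "n/a"]))
```

The result only describes the sampled values; a later row can still contradict it, so the conversion step should keep handling ValueError.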