A fast, Pythonic way to turn a lot of string lists into float lists while catching ValueErrors
I have about 50 million string lists in Python like this:
["1", "1.0", "", "foobar", "3.0", ...]
And I need to turn them into a list of floats and Nones like this:
[1.0, 1.0, None, None, 3.0, ...]
I am currently using code like:
def to_float_or_None(x):
    try:
        return float(x)
    except ValueError:
        return None

result = []
for record in database:
    result.append(map(to_float_or_None, record))
The to_float_or_None function takes about 750 seconds in total (according to cProfile)... Is there a faster way to do this conversion from a list of strings to a list of floats / Nones?
Update
I have identified the function to_float_or_None as the main bottleneck. I cannot find any significant speed difference between using map and using list comprehensions. I applied Paulo Scardine's advice to validate the input first, and it already saves 1/4 of the time.
def to_float_or_None(x):
    if not (x and x[0] in "0123456789."):
        return None
    try:
        return float(x)
    except:
        return None
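For reference, the saving can be checked with a quick timeit comparison of the two versions (the sample data here is illustrative; note that the guarded version also rejects strings such as "-1" or "inf" that float() would accept):

```python
import timeit

def to_float_or_None_plain(x):
    # original version: rely on float() raising for bad input
    try:
        return float(x)
    except ValueError:
        return None

def to_float_or_None_checked(x):
    # guarded version: cheap first-character check before calling float()
    # (note: this also rejects "-1", "inf", "nan", which float() accepts)
    if not (x and x[0] in "0123456789."):
        return None
    try:
        return float(x)
    except ValueError:
        return None

record = ["1", "1.0", "", "foobar", "3.0"]  # illustrative sample

for fn in (to_float_or_None_plain, to_float_or_None_checked):
    elapsed = timeit.timeit(lambda: [fn(x) for x in record], number=100_000)
    print(fn.__name__, f"{elapsed:.3f}s")
```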
Using generators was new to me, so thanks for the tip from Cpfohl and Lattyware! This does make reading the file even faster, but I was hoping to save some memory by converting the strings to floats / Nones.
The answers given so far do not fully answer the question. try/except versus if/then validation can lead to different performance (see fooobar.com/questions/146237/...). To summarize that answer: it depends on the ratio of failures to successes, and on the measured cost of failure and success in each approach. Basically, we cannot answer this for you, but we can tell you how to find out:
- Look at a few typical cases to get the failure/success ratio.
- Write an if/then version of to_float_or_None that tests the same condition the try/except version handles, then measure how long both versions take to fail 100 times and how long both take to succeed 100 times.
- Do some math to figure out which is faster.
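The measurement described above can be sketched with timeit; the two functions and the test strings here are placeholders for your real data:

```python
import timeit

def with_try(x):
    # exception-based version
    try:
        return float(x)
    except ValueError:
        return None

def with_check(x):
    # validation-based version testing the same condition
    # (assumes strings passing the guard are valid floats)
    if x and x[0] in "0123456789.":
        return float(x)
    return None

cases = {"success": "3.0", "failure": "foobar"}

for fn in (with_try, with_check):
    for label, value in cases.items():
        elapsed = timeit.timeit(lambda: fn(value), number=100)
        print(f"{fn.__name__} {label}: {elapsed:.6f}s")
```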
Side note on the list comprehension:
Depending on whether you need to index the result or just iterate over it, a generator expression may serve you even better than a list comprehension (just replace the [ ] characters with ( )).
There is no need to build the whole list up front, and the actual execution of to_float_or_None (the costly part) is delayed until each result is needed.
This is useful for many reasons, but it won't work if you need to index the result. It does, however, let you zip the original collection with the generator, so you can still access each original string alongside its float-or-None result.
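A minimal sketch of that zip idea (the names here are illustrative):

```python
def to_float_or_None(x):
    try:
        return float(x)
    except ValueError:
        return None

record = ["1", "1.0", "", "foobar", "3.0"]

# generator expression: nothing is converted until we iterate
converted = (to_float_or_None(x) for x in record)

# pair each original string with its converted value, lazily
for original, value in zip(record, converted):
    print(f"{original!r} -> {value}")
```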
Edit: I just realized that I misunderstood the question and we are talking about a list of lists, not just a list. Updated accordingly.
You can use a list comprehension here to create something a little faster and nicer to read:
def to_float_or_None(x):
    try:
        return float(x)
    except ValueError:
        return None

database = [["1", "1.0", "", "foobar", "3.0"], ["1", "1.0", "", "foobar", "3.0"]]
result = [[to_float_or_None(item) for item in record] for record in database]
Giving us:
[[1.0, 1.0, None, None, 3.0], [1.0, 1.0, None, None, 3.0]]
Edit: as noted by Paolo Moretti in the comments, if you want the fastest result, using map might be faster here, since we are not passing it a lambda function:
def to_float_or_None(x):
    try:
        return float(x)
    except ValueError:
        return None

database = [["1", "1.0", "", "foobar", "3.0"], ["1", "1.0", "", "foobar", "3.0"]]
result = [list(map(to_float_or_None, record)) for record in database]
Giving us the same result. However, I would like to point out that premature optimization is bad. If you have identified this as a bottleneck in your application, then fair enough, but if not, stick with the more readable version over the faster one.
We still use a list comprehension for the outer loop, since using map there as well would require a lambda function, given its dependence on record:

result = map(lambda record: map(to_float_or_None, record), database)
Naturally, if you want to evaluate it lazily, you can use generator expressions:
((to_float_or_None(item) for item in record) for record in database)
Or:
(map(to_float_or_None, record) for record in database)
This will be the preferred method if you don't need the entire list at once.
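For example, in Python 3 map is itself lazy, so the nested generator version converts one record at a time as you iterate (the sample database here is illustrative):

```python
def to_float_or_None(x):
    try:
        return float(x)
    except ValueError:
        return None

database = [["1", "1.0", "", "foobar", "3.0"], ["2", "bad"]]  # illustrative

lazy = (map(to_float_or_None, record) for record in database)

# only one record is converted at a time as we pull from the generator
for converted in lazy:
    print(list(converted))
```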
Or, if you really have that much data in your lists, perhaps use something like a pandas Series and apply() with a lambda function to convert:
import re

import pandas

inlist = ["1", "1.0", "", "foobar", "3.0"]  # or however long...
series = pandas.Series(inlist)
series.apply(lambda x: float(x) if re.match(r"^\d+(\.\d+)?$", x) else None)
Out[41]:
0 1
1 1
2 NaN
3 NaN
4 3
There are lots of other benefits too, not least being able to specify later how you want to handle those missing values...
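For instance, assuming a Series like the one produced above, pandas lets you decide later what to do with the missing values (fillna and dropna are standard pandas Series methods):

```python
import pandas

# the float/NaN Series produced by the conversion above
series = pandas.Series([1.0, 1.0, None, None, 3.0])

filled = series.fillna(0.0)   # replace missing values with a default
dropped = series.dropna()     # or drop them entirely

print(filled.tolist())   # [1.0, 1.0, 0.0, 0.0, 3.0]
print(dropped.tolist())  # [1.0, 1.0, 3.0]
```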