A fast, Pythonic way to turn many lists of strings into lists of floats while catching ValueErrors

I have about 50 million string lists in Python like this:

["1", "1.0", "", "foobar", "3.0", ...]


And I need to turn them into a list of floats and Nones like this:

[1.0, 1.0, None, None, 3.0, ...]


I am currently using code like:

def to_float_or_None(x):
    try:
        return float(x)
    except ValueError:
        return None

result = []
for record in database:
    result.append(map(to_float_or_None, record))


The to_float_or_None function takes about 750 seconds in total (according to cProfile) ... Is there a faster way to do this conversion from a list of strings to a list of floats / Nones?

Update
I have identified the function to_float_or_None as the main bottleneck. I cannot find any significant speed difference between using map and using a list comprehension. I applied Paulo Scardine's advice to validate the input first, and it already saves 1/4 of the time:

def to_float_or_None(x):
    # cheap pre-check: skip empty strings and anything that clearly
    # cannot be a (non-negative) number before paying for the exception
    if not (x and x[0] in "0123456789."):
        return None
    try:
        return float(x)
    except ValueError:
        return None


Using generators was new to me, so thanks for the tip from Cpfohl and Lattyware! This does make reading the file even faster, but I was hoping to save some memory by converting the strings to floats / Nones.

+3




4 answers


The answers given so far do not really fully answer the question. try/except versus if/then validation can lead to different performance (see fooobar.com/questions/146237/...). To summarize that answer: it depends on the ratio of failures to successes and on the measured cost of failure and success in both cases. Basically, we cannot answer this for you, but we can tell you how to find out:

  • Look at a few typical cases to get the failure/success ratio.
  • Write an if/then version that tests for the same cases the try/except version handles, then measure how long it takes both versions of to_float_or_None to fail 100 times and how long it takes both versions to succeed 100 times (see the sketch after this list).
  • Do some math to figure out which is faster.
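
A rough timeit sketch of that measurement (the if/then variant and the sample inputs here are illustrative assumptions, not code from the question):

import timeit

def try_version(x):
    try:
        return float(x)
    except ValueError:
        return None

def if_version(x):
    # validate first, in the spirit of the question's updated function
    if x and x[0] in "0123456789.":
        try:
            return float(x)
        except ValueError:
            return None
    return None

good, bad = "3.0", "foobar"  # one typical success, one typical failure

for func in (try_version, if_version):
    t_success = timeit.timeit(lambda: func(good), number=100)
    t_failure = timeit.timeit(lambda: func(bad), number=100)
    print(func.__name__, t_success, t_failure)

Weighting those four timings by the ratio from the first step tells you which version wins on your data.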

Side note on the list comprehension issue:



Depending on whether you want to index the results or just iterate over them, a generator expression can be even better than a list comprehension (just replace the [ ] characters with ( )).

There is no list-creation cost up front, and the actual execution of to_float_or_None (which is the costly part) is delayed until each result is needed.

This is useful for many reasons, but it won't work if you need to index into the results. It does, however, let you zip the original collection with the generator, so that you can still access each original string along with its float-or-None result.
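
A minimal sketch of that zip idea (reusing to_float_or_None from the question, with an illustrative record):

def to_float_or_None(x):
    try:
        return float(x)
    except ValueError:
        return None

record = ["1", "1.0", "", "foobar", "3.0"]

# nothing is converted until the loop below asks for the next value
converted = (to_float_or_None(item) for item in record)

for original, value in zip(record, converted):
    print(original, "->", value)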

+2




Edit: I just realized that I misunderstood the question and we are talking about a list of lists, not just a list. Updated accordingly.

You can use a list comprehension here to create something a little faster and nicer to read:

def to_float_or_None(x):
    try:
        return float(x)
    except ValueError:
        return None

database = [["1", "1.0", "", "foobar", "3.0"], ["1", "1.0", "", "foobar", "3.0"]]

result = [[to_float_or_None(item) for item in record] for record in database]


Giving us:

[[1.0, 1.0, None, None, 3.0], [1.0, 1.0, None, None, 3.0]]


Edit: as noted by Paolo Moretti in the comments, if you want the fastest result then using map might be faster, as we are not using a lambda function:

def to_float_or_None(x):
    try:
        return float(x)
    except ValueError:
        return None

database = [["1", "1.0", "", "foobar", "3.0"], ["1", "1.0", "", "foobar", "3.0"]]

result = [list(map(to_float_or_None, record)) for record in database]


Giving us the same result. However, I would like to point out that premature optimization is bad. If you've identified this as a bottleneck in your application, then that's fair enough, but if not, stick with the more readable option over the faster one.



We are still using a list comprehension for the outer loop, as we would need a lambda function to use map again, given its dependence on record:

result = map(lambda record: map(to_float_or_None, record), database)


Naturally, if you want to evaluate it lazily, you can use generator expressions:

((to_float_or_None(item) for item in record) for record in database)


Or:

(map(to_float_or_None, record) for record in database)


This will be the preferred method if you don't need the entire list at once.
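
For example, a usage sketch under the assumption that you only need one converted record at a time (database and to_float_or_None are as defined above; process_record is a hypothetical stand-in):

def process_record(row):
    # hypothetical stand-in for whatever you do with each converted record
    print(row)

lazy_results = (map(to_float_or_None, record) for record in database)

for converted in lazy_results:
    # only the current record's floats exist in memory at this point
    process_record(list(converted))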

+2




I don't know about the performance aspect, but this should work for your case.

list_floats = [to_float_or_None(item) for item in original_list]


+2




Or, if you really have that much data in your lists, perhaps use something like a pandas Series and apply() with a lambda function to convert:

import pandas
import re

inlist = ["1", "1.0", "", "foobar", "3.0"]  # or however long...
series = pandas.Series(inlist)
# only non-negative decimals match the pattern; everything else becomes None (NaN in the Series)
series.apply(lambda x: float(x) if re.match(r"^\d+(\.\d+)?$", x) else None)

Out[41]: 
0     1
1     1
2   NaN
3   NaN
4     3


Lots of other benefits - not least being able to decide later how you want to handle those missing values...
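
For instance (a sketch reusing series and re from the snippet above; the fill value is purely illustrative):

result = series.apply(lambda x: float(x) if re.match(r"^\d+(\.\d+)?$", x) else None)

print(result.fillna(0.0))  # e.g. replace the NaN entries with a default value
print(result.dropna())     # or drop the unparseable entries entirely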

+2








