Python - calculate average for each column in csv file

I am new to Python and I am trying to get the average for each (column or row) csv file and then select values ​​that are greater than double from the average of its column (o row), My file contains hundreds of columns and has values ​​like this , as:

845.123,452.234,653.23,...
432.123,213.452.421.532,...
743.234,532,432.423,...

      

I tried several changes in my code to get the average for each column (separately), but at the moment my code is similar to this one:

def AverageColumn (c):
    f=open(csv,"r")
    average=0
    Sum=0
    column=len(f)
    for i in range(0,column):
        for n in i.split(','):
            n=float(n)
            Sum += n
        average = Sum / len(column)
    return 'The average is:', average

    f.close()


csv="MDT25.csv"
print AverageColumn(csv)

      

But I always get an error like: "f has no object len ​​()" or "int", is not iterable "...

I would really appreciate if someone shows me how to get the average for each column (or row, as you want) and then select values ​​that are greater than a double from the average of its column (or row), I would preferred not to import modules as csv but as you prefer. Thank!

+3


source to share


5 answers


You can clean up your function here, but it probably doesn't do what you want. It currently gets the average of all values ​​in all columns:

def average_column (csv):
    f = open(csv,"r")
    average = 0
    Sum = 0
    row_count = 0
    for row in f:
        for column in row.split(','):
            n=float(column)
            Sum += n
        row_count += 1
    average = Sum / len(column)
    f.close()
    return 'The average is:', average

      



I would use a module csv

(which makes it easy to parse csv), with an object Counter

to manage the column totals, and a context manager to open the file (no need for close()

):

import csv
from collections import Counter

def average_column (csv_filepath):
    column_totals = Counter()
    with open(csv_filepath,"rb") as f:
        reader = csv.reader(f)
        row_count = 0.0
        for row in reader:
            for column_idx, column_value in enumerate(row):
                try:
                    n = float(column_value)
                    column_totals[column_idx] += n
                except ValueError:
                    print "Error -- ({}) Column({}) could not be converted to float!".format(column_value, column_idx)                    
            row_count += 1.0            

    # row_count is now 1 too many so decrement it back down
    row_count -= 1.0

    # make sure column index keys are in order
    column_indexes = column_totals.keys()
    column_indexes.sort()

    # calculate per column averages using a list comprehension
    averages = [column_totals[idx]/row_count for idx in column_indexes]
    return averages

      

+4


source


First of all, as people say, the CSV format looks simple but can be quite non-trivial, especially after the strings are in play. monkut already gave you two solutions, a cleaned up version of your code and another one that uses the CSV library. I'll give you another option: no libraries, but a lot of idiomatic code to chew on, which gives you averages for all columns at once.

def get_averages(csv):
    column_sums = None
    with open(csv) as file:
        lines = file.readlines()
        rows_of_numbers = [map(float, line.split(',')) for line in lines]
        sums = map(sum, zip(*rows_of_numbers))
        averages = [sum_item / len(lines) for sum_item in sums]
        return averages

      

Things to note: in your code f

, this is a file object. You are trying to close it after you have already returned a value. This code will never be achieved: nothing is executed after processing return

unless you have a construct try...finally

or with

(like I use) that automatically closes the stream).



map(f, l)

or equivalent [f(x) for x in l]

, creates a new list whose elements are obtained by applying a function f

for each element on l

.

f(*l)

will "unpack" the list l

before calling the function, providing the function to f

each element as a separate argument.

+1


source


If you want to do this without stdlib modules for some reason:

with open('path/to/csv') as infile:
    columns = list(map(float,next(infile).split(',')))
    for how_many_entries, line in enumerate(infile,start=2):
        for (idx,running_avg), new_data in zip(enumerate(columns), line.split(',')):
            columns[idx] += (float(new_data) - running_avg)/how_many_entries

      

0


source


I suggest breaking this down into a few small steps:

  • Read the CSV file in 2D list or 2D array.
  • Calculate averages for each column.

Each of these steps can be implemented as two separate functions. (In a realistic situation where the CSV file is large, reading the entire file into memory may be prohibitive due to space constraints. However, for a tutorial exercise, this is a great way to get an idea of ​​how to write your own functions.)

0


source


Hope this helps you ... Some help ... here's what I'll be doing is using numpy:

    # ==========================
    import numpy as np
    import csv as csv

    #  Assume that you have 2 columns and a header-row: The Columns are (1) 
    #  question # ...1; (2) question 2
    # ========================================

    readdata = csv.reader(open('filename.csv', 'r'))  #this is the file you 
    # ....will write your original file to....============
    data = []
    for row in readdata:
    data.append(row)
    Header = data[0]
    data.pop(0)
    q1 = []
    q2 = []
    # ========================================

    for i in range(len(data)):
        q1.append(int(data[i][1]))
        q2.append(int(data[i][2]))
    # ========================================
    # ========================================
    # === Means/Variance - Work-up Section ===
    # ========================================
    print ('Mean - Question-1:            ', (np.mean(q1)))
    print ('Variance,Question-1:          ', (np.var(q1)))
    print ('==============================================')
    print ('Mean - Question-2:            ', (np.mean(q2)))
    print ('Variance,Question-2:          ', (np.var(q2)))

      

0


source







All Articles