Python - calculate average for each column in csv file
I am new to Python and I am trying to get the average for each (column or row) csv file and then select values that are greater than double from the average of its column (o row), My file contains hundreds of columns and has values like this , as:
845.123,452.234,653.23,...
432.123,213.452.421.532,...
743.234,532,432.423,...
I tried several changes in my code to get the average for each column (separately), but at the moment my code is similar to this one:
def AverageColumn (c):
f=open(csv,"r")
average=0
Sum=0
column=len(f)
for i in range(0,column):
for n in i.split(','):
n=float(n)
Sum += n
average = Sum / len(column)
return 'The average is:', average
f.close()
csv="MDT25.csv"
print AverageColumn(csv)
But I always get an error like: "f has no object len ()" or "int", is not iterable "...
I would really appreciate if someone shows me how to get the average for each column (or row, as you want) and then select values that are greater than a double from the average of its column (or row), I would preferred not to import modules as csv but as you prefer. Thank!
source to share
You can clean up your function here, but it probably doesn't do what you want. It currently gets the average of all values in all columns:
def average_column (csv):
f = open(csv,"r")
average = 0
Sum = 0
row_count = 0
for row in f:
for column in row.split(','):
n=float(column)
Sum += n
row_count += 1
average = Sum / len(column)
f.close()
return 'The average is:', average
I would use a module csv
(which makes it easy to parse csv), with an object Counter
to manage the column totals, and a context manager to open the file (no need for close()
):
import csv
from collections import Counter
def average_column (csv_filepath):
column_totals = Counter()
with open(csv_filepath,"rb") as f:
reader = csv.reader(f)
row_count = 0.0
for row in reader:
for column_idx, column_value in enumerate(row):
try:
n = float(column_value)
column_totals[column_idx] += n
except ValueError:
print "Error -- ({}) Column({}) could not be converted to float!".format(column_value, column_idx)
row_count += 1.0
# row_count is now 1 too many so decrement it back down
row_count -= 1.0
# make sure column index keys are in order
column_indexes = column_totals.keys()
column_indexes.sort()
# calculate per column averages using a list comprehension
averages = [column_totals[idx]/row_count for idx in column_indexes]
return averages
source to share
First of all, as people say, the CSV format looks simple but can be quite non-trivial, especially after the strings are in play. monkut already gave you two solutions, a cleaned up version of your code and another one that uses the CSV library. I'll give you another option: no libraries, but a lot of idiomatic code to chew on, which gives you averages for all columns at once.
def get_averages(csv):
column_sums = None
with open(csv) as file:
lines = file.readlines()
rows_of_numbers = [map(float, line.split(',')) for line in lines]
sums = map(sum, zip(*rows_of_numbers))
averages = [sum_item / len(lines) for sum_item in sums]
return averages
Things to note: in your code f
, this is a file object. You are trying to close it after you have already returned a value. This code will never be achieved: nothing is executed after processing return
unless you have a construct try...finally
or with
(like I use) that automatically closes the stream).
map(f, l)
or equivalent [f(x) for x in l]
, creates a new list whose elements are obtained by applying a function f
for each element on l
.
f(*l)
will "unpack" the list l
before calling the function, providing the function to f
each element as a separate argument.
source to share
If you want to do this without stdlib modules for some reason:
with open('path/to/csv') as infile:
columns = list(map(float,next(infile).split(',')))
for how_many_entries, line in enumerate(infile,start=2):
for (idx,running_avg), new_data in zip(enumerate(columns), line.split(',')):
columns[idx] += (float(new_data) - running_avg)/how_many_entries
source to share
I suggest breaking this down into a few small steps:
- Read the CSV file in 2D list or 2D array.
- Calculate averages for each column.
Each of these steps can be implemented as two separate functions. (In a realistic situation where the CSV file is large, reading the entire file into memory may be prohibitive due to space constraints. However, for a tutorial exercise, this is a great way to get an idea of how to write your own functions.)
source to share
Hope this helps you ... Some help ... here's what I'll be doing is using numpy:
# ==========================
import numpy as np
import csv as csv
# Assume that you have 2 columns and a header-row: The Columns are (1)
# question # ...1; (2) question 2
# ========================================
readdata = csv.reader(open('filename.csv', 'r')) #this is the file you
# ....will write your original file to....============
data = []
for row in readdata:
data.append(row)
Header = data[0]
data.pop(0)
q1 = []
q2 = []
# ========================================
for i in range(len(data)):
q1.append(int(data[i][1]))
q2.append(int(data[i][2]))
# ========================================
# ========================================
# === Means/Variance - Work-up Section ===
# ========================================
print ('Mean - Question-1: ', (np.mean(q1)))
print ('Variance,Question-1: ', (np.var(q1)))
print ('==============================================')
print ('Mean - Question-2: ', (np.mean(q2)))
print ('Variance,Question-2: ', (np.var(q2)))
source to share