Python 2.7 Matching exact string in a string of a CSV file
I have a CSV file containing strings in a specific column and other values in the rest. I also have a list of strings. Moving through the lines, I want to check whether a line of the CSV file contains one of the strings in the list as an exact match. If so, that line should be written to a new CSV file.
The csv file is a list of lines like:
22/06/2017 04:00:32 | string1 | value1
22/06/2017 04:00:32 | string11 | value2
22/06/2017 04:00:32 | string2 | value3
22/06/2017 04:00:32 | string3 | value4
I wrote this code; it works, but it does not do exact string matching.
import os, csv

def filter_csv(folderpath):
    list1 = [
        ('name1',1,'string1','value1'),
        ('name2',2,'string2','value2'),
        ('name3',3,'string3','value3'),
        ('name4',4,'string4','value4'),
        # ...
    ]
    def column(matrix, i):
        return [row[i] for row in matrix]
    col = column(list1, 2)
    for file in os.listdir(folderpath):
        if file.endswith(".csv"):
            new_file = 'new_' + file
            filepath = os.path.join(folderpath, file)
            new_filepath = os.path.join(folderpath, new_file)
            with open(filepath) as csvfile:
                lines = csvfile.readlines()
            with open(new_filepath, 'wb') as csvfile2:
                for line in lines:
                    for namevar in col:
                        if namevar in line:
                            csvfile2.write(line)
    return
How can I add exact string matching for a column of a csv file?
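To illustrate the problem with the substring test, note that 'string1' is also a substring of 'string11', so the `in` check matches lines it should not (toy line based on the sample data above):

```python
# Line whose second column is 'string11' - 'string1' should NOT match here
line = '22/06/2017 04:00:32 | string11 | value2'
print('string1' in line)  # True - the substring test also matches 'string11'
```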
Just to present a slightly different solution using a different library; now that I think about it, it might be overkill, but you might like it :) You should be able to adapt this example to fit into your code.
import pandas as pd
# Dummy col
col = ["string1", "string2", "string3"]
# Read in CSV file
df = pd.read_csv("test.csv", header=None, skipinitialspace=True, delimiter="|")
# Strip all strings so that trailing whitespace is ignored.
# The csv library the OP used would also be "fooled" by whitespace.
df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())
# Select only rows with any column that has a value in col
df = df[df.isin(col).any(axis=1)]
# Write out CSV to new file
df.to_csv("test2.csv")
This will filter each CSV row, checking each column for one of the values in col. If any column has a value in col, the row appears in the output CSV file.
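A minimal sketch of what isin() and any() do here, with made-up toy data rather than the OP's files:

```python
import pandas as pd

col = ["string1", "string2"]
# Toy frame mirroring the three-column layout (hypothetical data)
df = pd.DataFrame([
    ['22/06/2017 04:00:32', 'string1', 'value1'],
    ['22/06/2017 04:00:32', 'string11', 'value2'],
])
# isin() tests exact cell equality, so 'string11' does not match 'string1'
filtered = df[df.isin(col).any(axis=1)]
print(len(filtered))  # 1
```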
I found that if the text in the CSV ends with whitespace, the exact match won't work. For example, in the CSV line below, string1 will not match exactly due to the trailing space.
value1, value2, string1 , value3
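A quick illustration of the whitespace problem and the strip() fix, with toy values:

```python
col = ['string1', 'string2']
cell = ' string1 '          # value as read from the CSV, with stray spaces
print(cell in col)          # False - the spaces defeat the exact match
print(cell.strip() in col)  # True once the whitespace is stripped
```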
Hence the extra code to strip all strings. I tested with the csv library and it has the same problem. If you know your CSV values will never have surrounding spaces, you can even remove those two lines. The filtering code for your example in its entirety would then be (borrowing the glob approach from Tomalak's answer):
import pandas as pd
import glob
import os

def filter_csv(folderpath):
    list1 = [
        ('name1',1,'string1','value1'),
        ('name2',2,'string2','value2'),
        ('name3',3,'string3','value3'),
        ('name4',4,'string4','value4')
    ]
    def column(matrix, i):
        return [row[i] for row in matrix]
    col = column(list1, 2)
    for filepath in glob.glob("%s/*.csv" % folderpath):
        filename = os.path.basename(filepath)
        new_filepath = os.path.join(folderpath, 'new_' + filename)
        df = pd.read_csv(filepath, header=None, skipinitialspace=True, delimiter="|")
        df_obj = df.select_dtypes(['object'])
        df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())
        df[df.isin(col).any(axis=1)].to_csv(new_filepath, sep="|", header=False, index=False)
But if whitespace is not a problem, you can drop these two lines from the code above:
df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())
What you want to do is write the filtered list to a file.
The filter condition is "one of some predefined valid values must be in the CSV row", and for that we can use set intersection. We build a set of valid values; when intersecting it with the current input row leaves something over, the current input row passes.
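The intersection test on its own, with toy values:

```python
valid_values = {'string1', 'string2'}
row = ['22/06/2017 04:00:32', 'string1', 'value1']
# A non-empty intersection means the row holds at least one valid value
print(bool(valid_values.intersection(row)))             # True
print(bool(valid_values.intersection(['foo', 'bar'])))  # False
```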
import os, csv
from glob import glob

def filter_csv(folderpath):
    list1 = [
        ('name1',1,'string1','value1'),
        ('name2',2,'string2','value2'),
        ('name3',3,'string3','value3'),
        ('name4',4,'string4','value4'),
        # ...
    ]
    # prepare a set of valid values
    valid_values = {row[2] for row in list1}
    for filepath in glob("%s/*.csv" % folderpath):
        filename = os.path.basename(filepath)
        new_filepath = os.path.join(folderpath, 'new_' + filename)
        # binary modes, because the csv module expects them on Python 2.7
        with open(filepath, 'rb') as infile, \
             open(new_filepath, 'wb') as outfile:
            reader = csv.reader(infile, delimiter='|', skipinitialspace=True)
            writer = csv.writer(outfile, delimiter='|')
            # strip each cell so trailing whitespace cannot defeat the exact match
            filtered_rows = (row for row in reader
                             if valid_values.intersection(cell.strip() for cell in row))
            writer.writerows(filtered_rows)
Notes
- glob() is very useful for finding files by extension
- with can handle multiple resources; you can use \ to spread the statement over multiple lines
- sets are optimized for finding matching values
- { ... } is a set comprehension - it turns the list into a set
- always parse CSV files using the csv module - never use .split() or such things
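To back up that last note, a quoted field shows where .split() goes wrong while csv.reader handles it correctly (toy line):

```python
import csv

line = 'a,"b,c",d'
print(line.split(','))           # ['a', '"b', 'c"', 'd'] - splits inside the quotes
print(next(csv.reader([line])))  # ['a', 'b,c', 'd']
```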