Python 2.7 Matching exact string in a string of a CSV file

I have a CSV file with strings in a specific column and other values in the other columns. I also have a list of strings. Going through the rows, I want to check whether one of the strings in the list appears, as an exact match, in the row of the CSV file. If so, that row should be written to a new CSV file.

The csv file is a list of lines like:

22/06/2017 04:00:32 | string1 | value1   
22/06/2017 04:00:32 | string11 | value2    
22/06/2017 04:00:32 | string2 | value3   
22/06/2017 04:00:32 | string3 | value4
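To see why a plain substring test is not enough here, consider a quick sketch (line values taken from the sample data above):

```python
# A substring test cannot tell "string1" apart from "string11":
line = "22/06/2017 04:00:32 | string11 | value2"
print('string1' in line)    # True -- a false positive

# Exact matching against the parsed, stripped fields avoids this:
fields = [f.strip() for f in line.split('|')]
print('string1' in fields)    # False
print('string11' in fields)   # True
```
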


I wrote this code; it works, but it does not do exact string matching.

import os, csv

def filter_csv(folderpath):

    list1 = [
        ('name1', 1, 'string1', 'value1'),
        ('name2', 2, 'string2', 'value2'),
        ('name3', 3, 'string3', 'value3'),
        ('name4', 4, 'string4', 'value4'),
        # ...
    ]

    def column(matrix, i):
        return [row[i] for row in matrix]

    col = column(list1, 2)

    for file in os.listdir(folderpath):
        if file.endswith(".csv"):
            new_file = 'new_' + file
            filepath = os.path.join(folderpath, file)
            new_filepath = os.path.join(folderpath, new_file)

            with open(filepath) as csvfile:
                lines = csvfile.readlines()

            with open(new_filepath, 'wb') as csvfile2:
                for line in lines:
                    for namevar in col:
                        if namevar in line:  # substring test -- also matches "string11"
                            csvfile2.write(line)


How can I add exact string matching for a column of a csv file?

+3




2 answers


Just to present a slightly different solution using a different library (which, now that I think about it, might be overkill, but you might like it :). You should be able to adapt this example to fit into your code...

import pandas as pd

# Dummy col
col = ["string1", "string2", "string3"]

# Read in the CSV file
df = pd.read_csv("test.csv", header=None, skipinitialspace=True, delimiter="|")

# Strip all strings so that trailing whitespace is ignored.
# The csv library the OP used would also be "fooled" by whitespace.
df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())

# Select only rows where any column has a value in col
df = df[df.isin(col).any(axis=1)]

# Write out the filtered CSV to a new file
df.to_csv("test2.csv")


This filters each CSV row, checking each column for one of the values in col. If any column holds a value from col, the row appears in the output CSV file.
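As a minimal illustration of how isin(...).any(axis=1) selects rows (the values here are made up for the example):

```python
import pandas as pd

col = ["string1", "string2"]
df = pd.DataFrame([
    ["22/06/2017", "string1",  "value1"],
    ["22/06/2017", "string11", "value2"],
    ["22/06/2017", "string2",  "value3"],
])

# isin() marks cells whose value is exactly in col; any(axis=1)
# keeps rows where at least one cell matched -- an exact test,
# so "string11" does not slip through.
mask = df.isin(col).any(axis=1)
print(df[mask])  # keeps rows 0 and 2 only
```
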

I found that if a value in the CSV file has trailing whitespace, the exact match fails. For example, in the CSV line below, string1 will not match exactly because of the trailing spaces:

 value1, value2, string1   , value3




Hence the extra code to trim all strings. I tested with the csv library as well, and it has the same problem. If you know your CSV values will never carry stray spaces, you can remove those two lines. The filtering code for your example in its entirety (borrowing glob from Tomalak) would then be:

import pandas as pd
import glob
import os

def filter_csv(folderpath):

    list1 = [
        ('name1',1,'string1','value1'),
        ('name2',2,'string2','value2'),
        ('name3',3,'string3','value3'),
        ('name4',4,'string4','value4')
    ]

    def column(matrix, i):
        return [row[i] for row in matrix]

    col = column(list1,2)

    for filepath in glob.glob("%s/*.csv" % folderpath):
        filename = os.path.basename(filepath)
        new_filepath = os.path.join(folderpath, 'new_' + filename)
        df = pd.read_csv(filepath, header=None, skipinitialspace=True, delimiter="|")
        df_obj = df.select_dtypes(['object'])
        df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())
        df[df.isin(col).any(axis=1)].to_csv(new_filepath, sep="|", header=False, index=False)


But if whitespace is not a problem, you can trim the following lines from your code:

 df_obj = df.select_dtypes(['object'])
 df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())  

      

0




What you want to do is write the filtered list to a file.

The filter condition is "one of some predefined valid values must appear in the CSV row", and set intersection is a good fit for that. We build a set of valid values; if intersecting it with the current input row leaves anything behind, the row passes the filter.
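A quick sketch of that intersection test on a single parsed row (cell values taken from the sample data):

```python
valid_values = {'string1', 'string2', 'string3', 'string4'}

row = ['22/06/2017 04:00:32 ', ' string1 ', ' value1']
# Raw cells still carry padding, so strip them before testing
cells = [cell.strip() for cell in row]

# A non-empty intersection means the row contains a valid value
print(bool(valid_values.intersection(cells)))              # True
print(bool(valid_values.intersection(['x', 'string11'])))  # False
```
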

import os, csv
from glob import glob

def filter_csv(folderpath):

    list1 = [
        ('name1',1,'string1','value1'),
        ('name2',2,'string2','value2'),
        ('name3',3,'string3','value3'),
        ('name4',4,'string4','value4'),
        # ...
    ]

    # prepare a set of valid values
    valid_values = {row[2] for row in list1}

    for filepath in glob("%s/*.csv" % folderpath):
        filename = os.path.basename(filepath)
        new_filepath = os.path.join(folderpath, 'new_' + filename)

        # Python 2.7: open files in binary mode for the csv module
        with open(filepath, 'rb') as infile, \
                open(new_filepath, 'wb') as outfile:

            reader = csv.reader(infile, delimiter='|', skipinitialspace=True)
            writer = csv.writer(outfile, delimiter='|')

            # strip each cell so padded values like "string1   " still match
            filtered_rows = (row for row in reader
                             if valid_values.intersection(cell.strip() for cell in row))

            writer.writerows(filtered_rows)




Notes

  • glob() is very useful for finding files by extension
  • with can manage multiple resources at once
  • you can use \ to continue a statement across multiple lines
  • sets are optimized for fast membership tests, which is exactly what the filter does
  • { ... } is a set comprehension: it turns the list into a set
  • always parse CSV files with the csv module; never use .split() or similar tricks
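The set comprehension from the notes, spelled out on a trimmed-down list1:

```python
list1 = [
    ('name1', 1, 'string1', 'value1'),
    ('name2', 2, 'string2', 'value2'),
]

# {expr for item in iterable} builds a set directly -- no intermediate list
valid_values = {row[2] for row in list1}
print(sorted(valid_values))        # ['string1', 'string2']

# Sets give O(1) average membership tests, unlike an O(n) scan of a list
print('string1' in valid_values)   # True
```
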
+1








