Python 2.7 Matching exact string in a string of a CSV file

I have a CSV file with strings in a specific column and other values in the other columns. I also have a list of strings. Going through the rows, I want to check whether one of the strings in the list appears, as an exact match, in the row of the CSV file. If so, that row should be written to a new CSV file.

The csv file is a list of lines like:

22/06/2017 04:00:32 | string1 | value1   
22/06/2017 04:00:32 | string11 | value2    
22/06/2017 04:00:32 | string2 | value3   
22/06/2017 04:00:32 | string3 | value4
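To see why a plain substring test is not enough here, consider a quick sketch (line values taken from the sample data above):

```python
# A substring test cannot tell "string1" apart from "string11":
line = "22/06/2017 04:00:32 | string11 | value2"
print('string1' in line)    # True -- a false positive

# Exact matching against the parsed, stripped fields avoids this:
fields = [f.strip() for f in line.split('|')]
print('string1' in fields)    # False
print('string11' in fields)   # True
```
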


I wrote this code; it works, but it does not do exact string matching.

import os, csv

def filter_csv(folderpath):

    list1 = [
        ('name1', 1, 'string1', 'value1'),
        ('name2', 2, 'string2', 'value2'),
        ('name3', 3, 'string3', 'value3'),
        ('name4', 4, 'string4', 'value4'),
        # ...
    ]

    def column(matrix, i):
        return [row[i] for row in matrix]

    col = column(list1, 2)

    for file in os.listdir(folderpath):
        if file.endswith(".csv"):
            new_file = 'new_' + file
            filepath = os.path.join(folderpath, file)
            new_filepath = os.path.join(folderpath, new_file)

            with open(filepath) as csvfile:
                lines = csvfile.readlines()

            with open(new_filepath, 'wb') as csvfile2:
                for line in lines:
                    for namevar in col:
                        if namevar in line:  # substring test -- also matches "string11"
                            csvfile2.write(line)


How can I add exact string matching for a column of a csv file?

+3




2 answers


Just to present a slightly different solution using a different library (which, now that I think about it, might be overkill, but you might like it :). You should be able to adapt this example to fit into your code...

import pandas as pd

# Dummy col
col = ["string1", "string2", "string3"]

# Read in the CSV file
df = pd.read_csv("test.csv", header=None, skipinitialspace=True, delimiter="|")

# Strip all strings so that trailing whitespace is ignored.
# The csv library the OP used would also be "fooled" by whitespace.
df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())

# Select only rows where any column has a value in col
df = df[df.isin(col).any(axis=1)]

# Write out the filtered CSV to a new file
df.to_csv("test2.csv")


This filters each CSV row, checking each column for one of the values in col. If any column holds a value from col, the row appears in the output CSV file.
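As a minimal illustration of how isin(...).any(axis=1) selects rows (the values here are made up for the example):

```python
import pandas as pd

col = ["string1", "string2"]
df = pd.DataFrame([
    ["22/06/2017", "string1",  "value1"],
    ["22/06/2017", "string11", "value2"],
    ["22/06/2017", "string2",  "value3"],
])

# isin() marks cells whose value is exactly in col; any(axis=1)
# keeps rows where at least one cell matched -- an exact test,
# so "string11" does not slip through.
mask = df.isin(col).any(axis=1)
print(df[mask])  # keeps rows 0 and 2 only
```
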

I found that if a value in the CSV file has trailing whitespace, the exact match fails. For example, in the CSV line below, string1 will not match exactly because of the trailing spaces:

 value1, value2, string1   , value3




Hence the extra code to trim all strings. I tested with the csv library as well, and it has the same problem. If you know your CSV values will never carry stray spaces, you can remove those two lines. The filtering code for your example in its entirety (borrowing glob from Tomalak) would then be:

import pandas as pd
import glob
import os

def filter_csv(folderpath):

    list1 = [
        ('name1',1,'string1','value1'),
        ('name2',2,'string2','value2'),
        ('name3',3,'string3','value3'),
        ('name4',4,'string4','value4')
    ]

    def column(matrix, i):
        return [row[i] for row in matrix]

    col = column(list1,2)

    for filepath in glob.glob("%s/*.csv" % folderpath):
        filename = os.path.basename(filepath)
        new_filepath = os.path.join(folderpath, 'new_' + filename)
        df = pd.read_csv(filepath, header=None, skipinitialspace=True, delimiter="|")
        df_obj = df.select_dtypes(['object'])
        df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())
        df[df.isin(col).any(axis=1)].to_csv(new_filepath, sep="|", header=False, index=False)


But if whitespace is not a problem, you can trim the following lines from your code:

 df_obj = df.select_dtypes(['object'])
 df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())  

      

0




What you want to do is write the filtered list to a file.

The filter condition is "one of some predefined valid values must appear in the CSV row", and set intersection is a good fit for that. We build a set of valid values; if intersecting it with the current input row leaves anything behind, the row passes the filter.
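A quick sketch of that intersection test on a single parsed row (cell values taken from the sample data):

```python
valid_values = {'string1', 'string2', 'string3', 'string4'}

row = ['22/06/2017 04:00:32 ', ' string1 ', ' value1']
# Raw cells still carry padding, so strip them before testing
cells = [cell.strip() for cell in row]

# A non-empty intersection means the row contains a valid value
print(bool(valid_values.intersection(cells)))              # True
print(bool(valid_values.intersection(['x', 'string11'])))  # False
```
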

import os, csv
from glob import glob

def filter_csv(folderpath):

    list1 = [
        ('name1',1,'string1','value1'),
        ('name2',2,'string2','value2'),
        ('name3',3,'string3','value3'),
        ('name4',4,'string4','value4'),
        # ...
    ]

    # prepare a set of valid values
    valid_values = {row[2] for row in list1}

    for filepath in glob("%s/*.csv" % folderpath):
        filename = os.path.basename(filepath)
        new_filepath = os.path.join(folderpath, 'new_' + filename)

        # Python 2.7: open files in binary mode for the csv module
        with open(filepath, 'rb') as infile, \
                open(new_filepath, 'wb') as outfile:

            reader = csv.reader(infile, delimiter='|', skipinitialspace=True)
            writer = csv.writer(outfile, delimiter='|')

            # strip each cell so padded values like "string1   " still match
            filtered_rows = (row for row in reader
                             if valid_values.intersection(cell.strip() for cell in row))

            writer.writerows(filtered_rows)




Notes

  • glob() is very useful for finding files by extension
  • with can manage multiple resources at once
  • you can use \ to continue a statement across multiple lines
  • sets are optimized for fast membership tests, which is exactly what the filter does
  • { ... } is a set comprehension: it turns the list into a set
  • always parse CSV files with the csv module; never use .split() or similar tricks
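The set comprehension from the notes, spelled out on a trimmed-down list1:

```python
list1 = [
    ('name1', 1, 'string1', 'value1'),
    ('name2', 2, 'string2', 'value2'),
]

# {expr for item in iterable} builds a set directly -- no intermediate list
valid_values = {row[2] for row in list1}
print(sorted(valid_values))        # ['string1', 'string2']

# Sets give O(1) average membership tests, unlike an O(n) scan of a list
print('string1' in valid_values)   # True
```
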
+1








