How can I find duplicates in a Python list that are adjacent to each other and list them by indices?

I have a program that reads a CSV file, checks for any column length mismatch (comparing it to the header fields), which then returns whatever it finds as a list (and then writes it to the file). What I want to do with this list is to list the results like this:

the line numbers in which the same mismatch was found: the number of columns in that line

e.g.

rows: n-m : y

      

where n and m are the numbers of rows that have the same number of columns that do not match the header.

I've looked into these topics and while the information is helpful, they don't answer the question:

Find and display duplicates in the list?

Define duplicate values in a list in Python

This is where I am now:

import csv

# `data` is the already opened tab-separated input, `file` the already opened output file
r = csv.reader(data, delimiter='\t')
columns = []
for row in r:
    # adds column length to a list
    colm = len(row)
    columns.append(colm)

b = len(columns)
for a in range(b):
    # checks if the current member matches the header length of columns
    if columns[a] != columns[0]:
        # if it doesn't, write the row and the number of columns in that row to a file
        file.write("row  " + str(a + 1) + ": " + str(columns[a]) + " \n")

      

the file output looks like this:

row  7220: 0 
row  7221: 0 
row  7222: 0 
row  7223: 0 
row  7224: 0 
row  7225: 1 
row  7226: 1 

      

when the desired end result is

rows 7220 - 7224 : 0
rows 7225 - 7226 : 1

      

So what I need, as I see it, is a dictionary where the key is the range of rows sharing the duplicate value and the value is the number of columns in that mismatch. What I essentially deemed necessary (in badly written pseudocode that no longer makes sense to me, now that I am reading it years after writing this question) is here:

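def pseudoList(originalList):
    # collect the indices of adjacent equal values into sub-lists
    ListOfLists = []
    if not originalList:
        return ListOfLists
    duplicateList = [0]
    i = 1
    while i < len(originalList):
        if originalList[i] == originalList[i-1]:
            duplicateList.append(i)
        else:
            # the run ended: store it and start a new one
            ListOfLists.append(duplicateList)
            duplicateList = [i]
        i += 1
    ListOfLists.append(duplicateList)
    return ListOfLists


def PseudocreateDict(ListOfLists, originalList):
    pseudoDict = {}
    for x in ListOfLists:
        a = x[0]     # this is the first node in the list created
        b = x[-1]    # this is the last node of the list created
        # key: 1-based row range, value: the column count shared by those rows
        pseudoDict['{} - {}'.format(a + 1, b + 1)] = originalList[a]
    return pseudoDict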

      

This, however, seems to be a very confusing way of doing what I want, so I was wondering if there is a) a better way or b) an easier way to do this?



3 answers


You can use a list comprehension to return the elements of the columns list that differ from the element just before them; these are the starting points of your ranges. Then enumerate those ranges and print or write out the ones that differ from the first (header) element. An extra element holding the length of the list is appended to the list of ranges so that the end of the last range can be looked up without indexing out of range.

columns = [2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1]

# [index, value] for every position where the value differs from the previous one
ranges = [[i+1, v] for i,v in enumerate(columns[1:]) if columns[i] != columns[i+1]]
ranges.append([len(columns),0]) # special case for last element
for i,v in enumerate(ranges[:-1]):
    # skip runs that match the header length
    if v[1] != columns[0]:
        print("rows", v[0]+1, "-", ranges[i+1][0], ":", v[1])

      



output:

rows 2 - 5 : 1
rows 6 - 9 : 0
rows 10 - 11 : 1
rows 13 - 13 : 1
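
If the result should end up in a dictionary, as the question suggests, the same boundary idea can be wrapped in a function. This is only a sketch under assumptions: the function name `mismatch_ranges` and the key format are made up for illustration, and `columns` is assumed to hold the per-row column counts with the header row first.

```python
def mismatch_ranges(columns):
    """Map 'first - last' row ranges (1-based) to the column count of each run
    of adjacent rows whose count differs from the header (columns[0])."""
    if not columns:
        return {}
    # indices where a new run of equal values starts
    starts = [0] + [i for i in range(1, len(columns)) if columns[i] != columns[i - 1]]
    # each run ends right before the next one starts; the last run ends at the list end
    ends = starts[1:] + [len(columns)]
    result = {}
    for start, end in zip(starts, ends):
        if columns[start] != columns[0]:
            result["{} - {}".format(start + 1, end)] = columns[start]
    return result


columns = [2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1]
for rows, count in mismatch_ranges(columns).items():
    print("rows", rows, ":", count)
```

For the sample list this prints the same four ranges as the output above. Using the row range as the key works because ranges never overlap, so every key is unique.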

      



You can also try the following code:



b = len(columns)
check = None   # column count of the current run of mismatches (None = no open run)
start = None   # 1-based row number where the current run started
for a in range(b + 1):
    # the extra iteration (a == b) flushes a run that reaches the end of the list
    current = columns[a] if a < b else None
    if check is not None and current == check:
        continue
    if check is not None:
        # the run ended on row a (1-based), so write it with its own column count
        if start != a:
            file.write("row  " + str(start) + " - " + str(a) + ": " + str(check) + " \n")
        else:
            file.write("row  " + str(start) + ": " + str(check) + " \n")
        check = None
    if current is not None and current != columns[0]:
        # a new run of rows that do not match the header length starts here
        start = a + 1
        check = current

      



What you want to do is a map / reduce operation, but without sorting, which is usually done between mapping and reducing.
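
In Python, grouping without a prior sort is what `itertools.groupby` does: it only merges equal items that sit next to each other. A minimal sketch, assuming the column counts are already in a list with the header row first; the function name `group_adjacent` and the sample data are made up for illustration:

```python
from itertools import groupby


def group_adjacent(columns):
    """Yield (first_row, last_row, count) for each run of adjacent equal counts
    that differs from the header count in columns[0]. Rows are 1-based."""
    if not columns:
        return
    header = columns[0]
    row = 1
    # groupby only merges neighbouring equal items, because the input is not sorted first
    for count, run in groupby(columns):
        length = sum(1 for _ in run)
        if count != header:
            yield row, row + length - 1, count
        row += length


columns = [2, 2, 0, 0, 0, 1, 1, 2, 0]
for first, last, count in group_adjacent(columns):
    print("rows {} - {} : {}".format(first, last, count))
```

Sorting the list first would merge the zero at the end with the zeros in rows 3 to 5, which is exactly what should not happen here.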

If you output

row  7220: 0 
row  7221: 0 
row  7222: 0 
row  7223: 0 

      

to stdout, you can pipe this data to another Python program that generates the groups you want.

The second Python program might look something like this:

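import sys
import re

# read the first "row  N: M" line to open the first group
line = sys.stdin.readline()
first_rowid, last_diff = re.findall(r'(\d+)', line)
prev_rowid = first_rowid

for line in sys.stdin:
    rowid, diff = re.findall(r'(\d+)', line)
    # a group ends when the column count changes or the rows are no longer adjacent
    if diff != last_diff or int(rowid) != int(prev_rowid) + 1:
        print("rows", first_rowid, "-", prev_rowid, ":", last_diff)
        first_rowid, last_diff = rowid, diff
    prev_rowid = rowid

# write out the group that was still open when the input ended
print("rows", first_rowid, "-", prev_rowid, ":", last_diff)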

      

In a Unix environment, you can run them like this to write the output to a file:

python yourprogram.py | python myprogram.py > youroutputfile.dat

      

If you can't run this in a Unix environment, you can still use the algorithm I wrote inside your own program with a few changes.
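
As a rough sketch of what those changes could look like, the loop below applies the same idea directly to the list of column counts instead of reading from stdin. The function name `write_groups` and the sample data are hypothetical, and `out` is assumed to be any object with a `write` method, such as an open file.

```python
import io


def write_groups(columns, out):
    """Write one 'rows first - last : count' line per run of adjacent rows whose
    column count differs from the header count (columns[0])."""
    first_row = last_row = last_diff = None
    for rowid, diff in enumerate(columns, start=1):
        if diff == columns[0]:
            continue  # row matches the header, nothing to report
        if last_diff is not None and (diff != last_diff or rowid != last_row + 1):
            # the previous run ended, so flush it
            out.write("rows {} - {} : {}\n".format(first_row, last_row, last_diff))
            last_diff = None
        if last_diff is None:
            # open a new run at this row
            first_row, last_diff = rowid, diff
        last_row = rowid
    if last_diff is not None:
        # flush the run that was still open at the end of the list
        out.write("rows {} - {} : {}\n".format(first_row, last_row, last_diff))


buf = io.StringIO()
write_groups([3, 3, 0, 0, 1, 3, 1], buf)
print(buf.getvalue(), end="")
```

Passing the open output file instead of the `StringIO` buffer writes the grouped lines straight to disk, so no pipe is needed.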
