How can I find duplicates in a Python list that are adjacent to each other and list them by indices?
I have a program that reads a CSV file, checks each row for a column-count mismatch (comparing it to the header fields), and returns whatever it finds as a list (which is then written to a file). I want to report the results like this:
the line numbers in which the same mismatch was found: the number of columns in those lines
e.g.
rows: n-m : y
where n and m are the first and last row numbers of a run of adjacent rows that share the same column count y, and y does not match the header.
I've looked into these topics and while the information is helpful, they don't answer the question:
Find and display duplicates in the list?
Define duplicate values in a list in Python
This is where I am now:
r = csv.reader(data, delimiter='\t')
columns = []
for row in r:
    # adds column length to a list
    colm = len(row)
    columns.append(colm)
b = len(columns)
for a in range(b):
    # checks if the current member matches the header length of columns
    if columns[a] != columns[0]:
        # if it doesn't, write the row and the amount of columns in that row to a file
        file.write("row " + str(a + 1) + ": " + str(columns[a]) + " \n")
The file output looks like this:
row 7220: 0
row 7221: 0
row 7222: 0
row 7223: 0
row 7224: 0
row 7225: 1
row 7226: 1
when the desired end result is
rows 7220 - 7224 : 0
rows 7225 - 7226 : 1
So what I need, as I see it, is a dictionary where the key is the range of rows that share the same value and the value is the number of columns in that mismatch. What I have in mind, as a rough sketch, is this:
def pseudoList(originalList):
    # group adjacent equal values into runs of (row number, value) pairs
    if not originalList:
        return []
    ListOfLists = []
    duplicateList = [(1, originalList[0])]
    i = 1
    while i < len(originalList):
        if originalList[i] == originalList[i-1]:
            # same value as the previous row: extend the current run
            duplicateList.append((i + 1, originalList[i]))
        else:
            # value changed: close the current run and start a new one
            ListOfLists.append(duplicateList)
            duplicateList = [(i + 1, originalList[i])]
        i += 1
    ListOfLists.append(duplicateList)
    return ListOfLists
def PseudocreateDict(ListOfLists):
    pseudoDict = {}
    for x in ListOfLists:
        a = x[0][0]   # row number of the first node in the run
        b = x[-1][0]  # row number of the last node in the run
        pseudoDict['{} - {}'.format(a, b)] = x[0][1]
    return pseudoDict
This, however, seems to be a very confusing way of doing what I want, so I was wondering if there is a) a better way or b) an easier way to do this?
You can use a list comprehension to build a list of the positions in the columns list where a value differs from the one before it; these positions are the endpoints of your ranges. Then walk through those ranges and print or write the ones whose value differs from the first (header) element. An extra element is appended to the ranges list to mark the end of the list, which avoids out-of-range indexing.
columns = [2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1]
# each entry is [index where a new run starts, value of that run]
ranges = [[i+1, v] for i, v in enumerate(columns[1:]) if columns[i] != columns[i+1]]
ranges.append([len(columns), 0])  # special case for last element
for i, v in enumerate(ranges[:-1]):
    if v[1] != columns[0]:
        print("rows", v[0]+1, "-", ranges[i+1][0], ":", v[1])
output:
rows 2 - 5 : 1
rows 6 - 9 : 0
rows 10 - 11 : 1
rows 13 - 13 : 1
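The same grouping of adjacent equal values can also be sketched with itertools.groupby from the standard library, which groups consecutive equal items without sorting. This is only a minimal sketch: the function name mismatch_ranges is made up for illustration, and it assumes, as in the question, that columns[0] holds the header's column count.

```python
from itertools import groupby


def mismatch_ranges(columns):
    """Return (first_row, last_row, column_count) for each run of adjacent
    rows whose column count differs from the header, columns[0].

    Hypothetical helper; row numbers are 1-based.
    """
    if not columns:
        return []
    header = columns[0]
    result = []
    row = 1  # 1-based row number of the first row in the current run
    for count, run in groupby(columns):
        length = sum(1 for _ in run)  # number of adjacent rows in this run
        if count != header:
            result.append((row, row + length - 1, count))
        row += length
    return result


columns = [2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1]
for first, last, count in mismatch_ranges(columns):
    print("rows", first, "-", last, ":", count)
```

With the sample list this prints the same four ranges as the output shown above.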
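You can also try the following code:
b = len(columns)
check = None  # column count of the current run of mismatches; None means no open run
start = None  # 1-based row number where the current run began
# iterate one step past the end so that the final run is also written
for a in range(b + 1):
    current = columns[a] if a < b else None
    if check is not None and current == check:
        # still inside the same run
        continue
    if check is not None:
        # the run ended at row a (1-based), so write it out
        if start != a:
            file.write("rows " + str(start) + " - " + str(a) + ": " + str(check) + " \n")
        else:
            file.write("row " + str(start) + ": " + str(check) + " \n")
        check = None
    # checks if the current member matches the header length of columns
    if current is not None and current != columns[0]:
        # if it doesn't, start a new run at this row
        start = a + 1
        check = current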
What you want is essentially a map/reduce operation, but without the sort step that usually sits between the map and the reduce.
If you output
row 7220: 0
row 7221: 0
row 7222: 0
row 7223: 0
to stdout, you can pipe this data to another Python program that generates the groups you want.
The second Python program might look something like this:
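import sys
import re
# read the first line to start the first group
line = sys.stdin.readline()
first_rowid, last_diff = re.findall(r'(\d+)', line)
last_rowid = first_rowid
for line in sys.stdin:
    rowid, diff = re.findall(r'(\d+)', line)
    # a group ends when the value changes or the row numbers stop being adjacent
    if diff != last_diff or int(rowid) != int(last_rowid) + 1:
        print("rows", first_rowid, "-", last_rowid, ":", last_diff)
        first_rowid = rowid
        last_diff = diff
    last_rowid = rowid
# print the final group
print("rows", first_rowid, "-", last_rowid, ":", last_diff)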
In a Unix environment, you can run them like this to send the output to a file:
python yourprogram.py | python myprogram.py > youroutputfile.dat
If you can't run this in a unix environment, you can still use the algorithm I wrote in your program with a few changes.
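As a rough sketch of what that in-program version could look like, the grouping step can be written as a generator that takes the mismatches directly instead of parsing them from stdin. The name group_mismatches and the (row_number, column_count) input format are assumptions made for illustration; they are not part of the original program.

```python
def group_mismatches(mismatches):
    """Yield (first_row, last_row, column_count) for each run of adjacent
    rows that share the same column count.

    `mismatches` is assumed to be an iterable of (row_number, column_count)
    pairs in ascending row order, i.e. what the first program would
    otherwise print one per line.
    """
    first = last = count = None
    for row, current in mismatches:
        if count is not None and current == count and row == last + 1:
            last = row  # same value on the next row: extend the current group
            continue
        if count is not None:
            yield first, last, count  # close the previous group
        first = last = row
        count = current
    if count is not None:
        yield first, last, count  # close the final group


sample = [(7220, 0), (7221, 0), (7222, 0), (7223, 0), (7224, 0), (7225, 1), (7226, 1)]
for first, last, count in group_mismatches(sample):
    print("rows", first, "-", last, ":", count)
```

Because it is a generator, the groups can be written to the output file as they are produced, without holding every mismatch in memory first.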