Compare adjacent list items

I am writing a duplicate file detector. To determine whether two files are duplicates, I calculate their CRC32 checksums. Since this can be an expensive operation, I only want to compute checksums for files that have another file of matching size. I have sorted the file list by size and loop through it to compare each item with the ones above and below. Unfortunately, this causes a problem at the beginning and end of the list, as there is no previous or next file, respectively. I can fix this using if statements, but it feels awkward. Here is my code:

    public void GetCRCs(List<DupInfo> dupInfos)
    {
        var crc = new Crc32();
        for (int i = 0; i < dupInfos.Count(); i++)
        {
            if (dupInfos[i].Size == dupInfos[i - 1].Size || dupInfos[i].Size == dupInfos[i + 1].Size)
            {
                dupInfos[i].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i].FullName));
            }
        }
    }


My question is:

  • How can I compare each entry with its neighbors without running off the ends of the list?

  • Should I use a loop for this, or is there a better LINQ method or other feature?

Note: I left out the rest of my code to avoid confusion. If you want to see it, I can post it.
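For what it's worth, the if-statement fix mentioned above can be written with short-circuit guards so the neighbor accesses never go out of range. A minimal self-contained sketch, with a hypothetical list of sizes standing in for `dupInfos[i].Size`:

```csharp
using System;
using System.Collections.Generic;

public class Program
{
    public static void Main()
    {
        // Hypothetical, already-sorted file sizes standing in for dupInfos[i].Size.
        var sizes = new List<long> { 10, 12, 12, 15 };
        for (int i = 0; i < sizes.Count; i++)
        {
            // Short-circuit && guards: the neighbor access only happens
            // when the index is already known to be in range.
            bool matchesPrev = i > 0 && sizes[i] == sizes[i - 1];
            bool matchesNext = i < sizes.Count - 1 && sizes[i] == sizes[i + 1];
            if (matchesPrev || matchesNext)
                Console.WriteLine($"index {i}: compute checksum");
        }
    }
}
```

Only indices 1 and 2 (the two size-12 entries) would get a checksum here; the first and last elements are handled without special-case branches outside the loop.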

+2




4 answers


I have sorted the list of files by size and loop through to compare each item with those above and below.

The next logical step is to actually group your files by size. Comparing only sequential files will not always be sufficient if you have more than two files of the same size; instead, you need to compare each file with every other file of the same size.

I suggest using this approach:

  • Use LINQ's .GroupBy to group the files by size, then .Where to keep only the groups containing more than one file.

  • Within those groups, calculate the CRC32 checksum, add it to a collection of known checksums, and compare it against the previously computed ones. If you need to know which files are duplicates, you can use a dictionary keyed by that checksum (you can achieve this with another GroupBy). Otherwise, a simple list is sufficient to detect any duplicates.

The code might look something like this:



var crc = new Crc32();
var filesSetsWithPossibleDupes = files.GroupBy(f => f.Length)
                                      .Where(group => group.Count() > 1);

foreach (var grp in filesSetsWithPossibleDupes)
{
    var checksums = new List<CRC32CheckSum>(); //or whatever type
    foreach (var file in grp)
    {
        var currentCheckSum = crc.ComputeChecksum(File.ReadAllBytes(file.FullName));
        if (checksums.Contains(currentCheckSum))
        {
            //Found a duplicate
        }
        else
        {
            checksums.Add(currentCheckSum);
        }
    }
}


Or, if you want the specific objects that may be duplicates, the inner foreach loop might look like:

var filesSetsWithPossibleDupes = files.GroupBy(f => f.FileSize)
                                      .Where(grp => grp.Count() > 1);

var masterDuplicateDict = new Dictionary<DupStats, IEnumerable<DupInfo>>();
//A dictionary keyed by the basic duplicate stats
//, and whose value is a collection of the possible duplicates

foreach (var grp in filesSetsWithPossibleDupes)
{
    var likelyDuplicates = grp.GroupBy(dup => dup.CheckSum)
                              .Where(g => g.Count() > 1);
    //Same GroupBy logic, but applied to the checksum (instead of file size)

    foreach(var dupGrp in likelyDuplicates)
    {
        //Create the key for the dictionary (your code is likely different)
        var sample = dupGrp.First();
        var key = new DupStats() { FileSize = sample.FileSize, CheckSum = sample.CheckSum };
        masterDuplicateDict.Add(key, dupGrp);
    }
}


Demonstration of this idea.
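Pieced together, the two grouping steps above can be sketched as one self-contained program. This is an assumption-laden demo: MD5 from System.Security.Cryptography stands in for the Crc32 class, and the input is a set of hypothetical temp files created on the fly:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public class Program
{
    // Stand-in checksum; the original uses a Crc32 class instead of MD5.
    static string ChecksumOf(string path)
    {
        using (var md5 = MD5.Create())
            return BitConverter.ToString(md5.ComputeHash(File.ReadAllBytes(path)));
    }

    public static void Main()
    {
        // Hypothetical input: three temp files, two with identical content.
        var dir = Directory.CreateDirectory(
            Path.Combine(Path.GetTempPath(), Path.GetRandomFileName())).FullName;
        File.WriteAllText(Path.Combine(dir, "a.txt"), "same content");
        File.WriteAllText(Path.Combine(dir, "b.txt"), "same content");
        File.WriteAllText(Path.Combine(dir, "c.txt"), "other");

        var files = new DirectoryInfo(dir).GetFiles();

        // Step 1: group by size; only groups with >1 file can contain duplicates.
        var possibleDupes = files.GroupBy(f => f.Length).Where(g => g.Count() > 1);

        // Step 2: inside each size group, group again by checksum;
        // only checksum groups with >1 file are actual duplicates.
        foreach (var grp in possibleDupes)
            foreach (var dupGrp in grp.GroupBy(f => ChecksumOf(f.FullName))
                                      .Where(g => g.Count() > 1))
                Console.WriteLine("duplicates: " +
                    string.Join(", ", dupGrp.Select(f => f.Name).OrderBy(n => n)));
    }
}
```

Note that the expensive checksum is only ever computed inside a size group that survived the first .Where, which is exactly the property the question asked for.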

+1




First, compute the CRCs:

// It is assumed that DupInfo.CheckSum is nullable
public void GetCRCs(List<DupInfo> dupInfos)
{
    var crc = new Crc32();
    dupInfos[0].CheckSum = null;
    for (int i = 1; i < dupInfos.Count; i++)
    {
        dupInfos[i].CheckSum = null;
        if (dupInfos[i].Size == dupInfos[i - 1].Size)
        {
            // Only compute the previous file's checksum if it hasn't been computed yet
            if (dupInfos[i - 1].CheckSum == null)
                dupInfos[i - 1].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i - 1].FullName));
            dupInfos[i].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i].FullName));
        }
    }
}




After sorting the files by size and CRC, flag the duplicates:

public void GetDuplicates(List<DupInfo> dupInfos)
{
    for (int i = dupInfos.Count - 1; i > 0; i--)
    { // loop runs backwards to allow list item deletion
        if (dupInfos[i].Size     == dupInfos[i - 1].Size &&
            dupInfos[i].CheckSum != null &&
            dupInfos[i].CheckSum == dupInfos[i - 1].CheckSum)
        { // i is a duplicate of i-1
            ... // your code here
            ... // eventually, dupInfos.RemoveAt(i);
        }
    }
}


+2




I think the for loop should be: for (int i = 1; i < dupInfos.Count - 1; i++)

var grps = dupInfos.GroupBy(d => d.Size);
grps.Where(g => g.Count() > 1).ToList().ForEach(g =>
{
    ...
});


+1




Can you take the intersection of your two lists? If you have two lists of filenames, intersecting them should leave only the files that appear in both. I can write an example if you want, but this link should give you the general idea.

fooobar.com/questions/114712 / ...
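A minimal sketch of that overlap idea, assuming plain filename lists (the hypothetical names are placeholders; Enumerable.Intersect does the work):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    public static void Main()
    {
        // Hypothetical filename lists; Intersect keeps only entries present in both.
        var listA = new List<string> { "a.txt", "b.txt", "c.txt" };
        var listB = new List<string> { "b.txt", "c.txt", "d.txt" };
        foreach (var name in listA.Intersect(listB))
            Console.WriteLine(name);
    }
}
```

This prints b.txt and c.txt, the names common to both lists.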

Edit: Sorry, for some reason I thought you were comparing file names, not sizes.

So here's the real answer.

using System;
using System.Collections.Generic;
using System.Linq;


public class ObjectWithSize
{
    public int Size {get; set;}
    public ObjectWithSize(int size)
    {
        Size = size;
    }
}

public class Program
{
    public static void Main()
    {
        Console.WriteLine("start");
        var list = new List<ObjectWithSize>();
        list.Add(new ObjectWithSize(12));
        list.Add(new ObjectWithSize(13));
        list.Add(new ObjectWithSize(14));
        list.Add(new ObjectWithSize(14));
        list.Add(new ObjectWithSize(18));
        list.Add(new ObjectWithSize(15));
        list.Add(new ObjectWithSize(15));
        var duplicates = list.GroupBy(x=>x.Size)
              .Where(g=>g.Count()>1);
        foreach (var dup in duplicates)
            foreach (var objWithSize in dup)
                Console.WriteLine(objWithSize.Size);
    }
}


This will output:

14
14
15
15


Here's a .NET Fiddle for it: https://dotnetfiddle.net/0ub6Bs

Final note: I really think your approach reads better and will run faster. This was just a LINQ implementation.

0


