Trying to find a more efficient way to filter files

Given the list of deviceIds, I'm trying to come up with a more efficient way to handle duplicates. When a duplicate is found in the deviceId list, I only need to keep the last file and delete the rest. What I came up with seems to work fine, but I'm wondering whether it can be made more efficient. My current method doesn't seem to scale very well: for example, it processes 25,000 files in 5 seconds but takes 70 seconds for 100,000 files. Any thoughts?

List<File> filteredList;
for (int i = 0; i < deviceIds.size(); i++) {
    // adjacent entries with the same id mark a duplicate
    if (i < (deviceIds.size() - 1) && deviceIds.get(i).equals(deviceIds.get(i + 1))) {
        // collect every file whose name contains this device id (Guava Iterables.filter)
        filteredList = Lists.newArrayList(Iterables.filter(fileList, new DeviceIdFilter(deviceIds.get(i))));
        // sort oldest first, then delete everything except the last (newest) file
        Collections.sort(filteredList, new OldestFileComparator());
        for (int t = 0; t < (filteredList.size() - 1); t++) {
            filteredList.get(t).delete();
        }
    }
}

private static class DeviceIdFilter implements Predicate<File> {
    private String deviceId;
    private DeviceIdFilter(final String deviceId) {
        this.deviceId = deviceId;
    }
    @Override
    public boolean apply(final File file) {
        return file.getName().contains(deviceId);
    }
}

public class OldestFileComparator implements Comparator<File> {
    public int compare(File filea, File fileb) {
        if (filea.lastModified() > fileb.lastModified()) {
            return +1;
        } else if (filea.lastModified() < fileb.lastModified()) {
            return -1;
        } else {
            return 0;
        }
    }
}
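
On Java 8 or later, the same oldest-first ordering can also be written without a hand-rolled comparator class; a minimal sketch, assuming lastModified() is the only sort key:

    // Oldest-first ordering, equivalent to OldestFileComparator (Java 8+)
    Comparator<File> oldestFirst = Comparator.comparingLong(File::lastModified);
    Collections.sort(filteredList, oldestFirst);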

      

Edit:

I implemented TacticalCoders' solution, which worked great, processing 100,000 files in 0.60 seconds.

    Map<String, List<File>> fileMap = new HashMap<String, List<File>>();
    String deviceId;
    List<File> deviceFileList;
    // First pass: group the files by device id
    for (File file : fileList) {
        deviceId = getDeviceId(file.getName());
        if (fileMap.containsKey(deviceId)) {
            fileMap.get(deviceId).add(file);
        } else {
            deviceFileList = new LinkedList<File>();
            deviceFileList.add(file);
            fileMap.put(deviceId, deviceFileList);
        }
    }

    // Second pass: any device with more than one file has duplicates,
    // so keep only the newest file and delete the rest
    for (Map.Entry<String, List<File>> mapEntry : fileMap.entrySet()) {
        deviceFileList = mapEntry.getValue();
        if (deviceFileList.size() > 1) {
            Collections.sort(deviceFileList, new OldestFileComparator());
            for (int t = 0; t < (deviceFileList.size() - 1); t++) {
                deviceFileList.get(t).delete();
            }
        }
    }
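
On Java 8 or later, the grouping pass above could be collapsed with Map.computeIfAbsent; a minimal sketch, reusing the same getDeviceId helper:

    // Alternative first pass (Java 8+): computeIfAbsent creates each per-device list on demand
    Map<String, List<File>> fileMap = new HashMap<>();
    for (File file : fileList) {
        fileMap.computeIfAbsent(getDeviceId(file.getName()), k -> new ArrayList<>()).add(file);
    }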

      



1 answer


My current method doesn't seem to scale very well, for example it processes 25,000 files in 5 seconds but takes 70 seconds for 100,000 files. Any thoughts?

That's because you have an O(n^2) algorithm (it could potentially degenerate even worse than O(n^2) if you have mostly duplicates, in which case you'd be doing an O(n log n) sort on top of your two loops, but I suppose you don't have 100,000 files that all match the same device ID).
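The timings you quote fit that: 100,000 files is 4x the 25,000-file input, and 70 seconds is 14x the 5-second run, close to the 16x slowdown a purely quadratic algorithm would predict.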

If I read the problem correctly, you could simply do a first pass in which you construct a Map<String, List<File>> (where the key is the (sub)String corresponding to the device ID).

After that first pass, every file that has a duplicate will be in a list with at least two entries, while every file that has no duplicate will be in a list of its own.
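
Since the question already pulls in Guava, that grouping pass could also be written with Multimaps.index; a minimal sketch, assuming a getDeviceId helper that extracts the ID from the file name:

    // Guava alternative: one call builds the deviceId -> files grouping
    ListMultimap<String, File> byDevice = Multimaps.index(fileList,
            new Function<File, String>() {
                @Override
                public String apply(final File file) {
                    return getDeviceId(file.getName());
                }
            });
    // byDevice.asMap().values() then yields one Collection<File> per device id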



Then you iterate over your map and, every time you find a List<File> with more than one entry, you sort that list by date and delete everything but the last file.

Will this work?

EDIT: you have to be careful with your device IDs. I don't know what they look like at all, but if one ID could be, say, "nop100" and another could be, say, "nop1000", then matching with the contains method could cause problems (a "nop1000" file would be mistaken for a file belonging to the "nop100" device). As far as I can tell, this problem also exists in the partial code you posted. There are workarounds, of course, but it's hard to go any further without knowing more about the file names you are processing.
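
One such workaround is to extract the full device ID from the file name and compare it exactly instead of using contains; a minimal sketch, assuming (purely as an example) that names look like "nop100_20130115.log" with the ID ending at the first underscore:

    // Hypothetical name format "<deviceId>_<rest>": the ID ends at the first '_'
    private static String getDeviceId(final String fileName) {
        final int underscore = fileName.indexOf('_');
        return underscore >= 0 ? fileName.substring(0, underscore) : fileName;
    }

    // Exact comparison: a "nop1000" file no longer matches the "nop100" device
    boolean sameDevice = getDeviceId(file.getName()).equals(deviceId);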
