Synchronizing two lists of objects

Problem

I have two lists of objects. Each object contains the following:

  • GUID

    (allows you to determine if objects are the same - from a business point of view).
  • Timestamp

    (the current UTC is updated every time the object is changed)
  • Version

    (positive integer, each time the object is increased the object has changed)
  • Deleted

    (boolean flag, switches to "true" instead of actually deleting the object)
  • Data

    (payload)
  • Any other fields, if necessary

Next, I need to synchronize two lists according to these rules:

  • If an object with some GUID

    is present in only one list, it should be copied to another list
  • If an object with some GUID

    is present in both lists, the instance with the lower value Version

    should be replaced with the one with more Version

    (do nothing if the versions are equal)

Real requirements:

  • Each list has 50k + objects, each object is about 1 Kb
  • The lists are hosted on different computers connected via the Internet (for example, a mobile application and a remote server), so the algorithm should not be very bandwidth or CPU intensive.
  • Most of the time (say 96%) of the lists are already synchronized before the synchronization process, so the algorithm should figure it out with minimal effort.
  • If there are any differences, most of the time they are pretty small (3-5 objects changed / added)
  • Should act OK if one list is empty (and the others still have 50k + items)

Decision # 1 (currently implemented)

  • The client keeps the time of the last sync (say T

    )
  • Both lists are requested for all objects that have Timestamp

    > T

    (i.e. recently changed, during production it ...> ( T

    - day) for better stability)
  • These lists of recently changed objects sync naively:
    • items present only in the first list are kept in the second list
    • items present only in the second list are kept in the first list
    • other elements have Version

      that are compared and stored in the corresponding list (if needed)

Procs:

  • Works great with small changes
  • Almost meets the requirements

Minuses:

  • Depends on T

    what makes the algorithm fragile: it's easy to sync the latest updates, but it's hard to make sure the lists are fully in sync (using a minimal T

    like 1970-01-01 just hangs the sync process)

My questions:

  • Is there a general / better / tested way to keep object lists synchronized?
  • Are there any better [than # 1] solutions in my case?

PS Already Viewed, Not Duplicates:

+3


source to share


3 answers


Summary

All answers have some merit. To summarize, here is the compiled answer I was looking for, based on a finally implemented working sync system:

  • In general, use Merkle trees . They are significantly effective when comparing large amounts of data.

  • If you can, rebuild your hash tree from scratch every time you need it . Check the time it takes to restore the hash tree. Most likely, it is quite fast (for example, in my case on Nexus 4, restoring a tree for 20 thousand elements takes ~ 2 s: 1.8 s to collect data from the database + 0.2 s to build a tree, the server does ~ 20 times faster), so you don't want to, t need to save the tree to the DB and maintain it as the data changes (my first attempt was rebuilt only for the respective branches - it's not too hard to implement, but very fragile).

  • However, it is ok for the cache and tree reuse if no data changes have been made at all. After the modification has occurred, invalidate the entire cache.

Technical details

  • 32 character GUID without hyphens / curly braces, in lower case;
  • I am using a 16-ary tree with height 4 where each branch is associated with a GUID char. It can be implemented as an actual tree or map:
    • 0000

      β†’ (hash of items with GUID 0000*

      )
    • 0001

      β†’ (hash of items with GUID 0001*

      )
    • ...
    • ffff

      β†’ (hash of items with GUID ffff*

      );
    • 000

      β†’ (hash hashes 000_

      )
    • ...
    • 00

      β†’ (hash hashes 00_

      )
    • ...
    • () β†’ (root hash, i.e. hash of hashes _

      )


Thus, the tree has 65536 leaves and requires 2 MB of memory; each sheet covers ~ N / 65536 data items. Binary trees will be twice as memory efficient, but harder to implement.

  • I had to implement these methods:

    • getHash()

      - returns the hash of the root; used for the initial check (as already mentioned, at 96%, that's all we need to check);
    • getHashChildren(x)

      - returns a list of hashes x_

      (no more than 16); used for efficient difference of discovery data with one request;
    • findByIdPrefix(x)

      - returns items with GUIDs x*

      , x must contain exactly 4 characters; used to query sheet items;
    • count(x)

      - returns the number of items with a GUID x*

      ; when we are small enough, we can reject the checkout of a branch on the tree branch and pass a bunch of items with a single request;
  • As far as the synchronization performed on each branch transferring small amounts of data is concerned, it is very responsive (you can check the progress at any time) + very reliable for unexpected termination (for example, due to a network failure), and easily restarts from the last point if necessary.

  • IMPORTANT : sometimes you run into conflicting state: { version_1

    = version_2

    , but hash_1

    ! = hash_2

    }: In this case you have to make some decision (perhaps with user help or compare timestamps as a last resort) and rewrite the item with another one to resolve the conflict otherwise you will end up with unsynchronized and unsynchronized hash trees. li>

Possible improvements

  • Implement the transfer of pairs (GUID, Version)

    without payload for facilitating queries.
+2


source


Two sentences come to mind: First, you may already be doing:

1) Don't send entire lists of items with timestamps> T. Instead, send a list (UUID, Version) of tuples of objects with timestamps> T. Then the other side can figure out which objects need to be updated from this. Send UUIDs back from them to query for actual objects. This avoids sending complete objects if they have a timestamp> T but are nevertheless newer (or present with the latest version) on the other side.



2) Do not process the full list at once, but in chunks, i.e. the first sync is 10%, then the next 10%, etc., to avoid transferring too much data at the same time for large syncs (and to ensure point restarting if the connection should break ). This can be done, for example, starting with all UUIDs with a checksum equivalent to 1 modulo 10, then 1 modulo 10, etc.

Another possibility might be proactive synchronization, for example. by sending odds asynchronously, possibly via UCP (unreliable as opposed to TCP). You still need to sync when you need current information, but chances are most of them are current.

+1


source


You don't want to store the time of the last sync, but the state of the objects (for example, the hash of the object's data) during the last sync. Then you compare each list with the saved list and find which objects have changed on each side.

This is much more reliable than relying on time, because time requires both sides to synchronize a timer that gives the exact time (and this is not the case on most systems). For the same reason, your idea of ​​detecting changes based on time + version may be more error prone than meets the eye.

Also, you will not transfer object data, but only the GUID.

By the way, we have created a framework (free with source code) that exactly solves your problems. I am not giving a link because some alternative talented people will complain.

+1


source







All Articles