Google Dataflow pipeline with local cache + external REST API
We want to build a Dataflow streaming pipeline that consumes events from Pub/Sub and performs multiple ETL-like operations on each individual event. One of these operations: each event carries a device ID that needs to be mapped to a different value (let's call it mapped-id), where the id -> mapped-id mapping is provided by an external REST API service. The same device ID can repeat across many events, so these id -> mapped-id mappings can be cached and reused. Since the pipeline may handle up to 3M events per second at peak, calls to the REST API need to be as fast as possible and made only when actually necessary.
With this setup in mind, I have the following questions.
- To optimize the REST API calls, does Dataflow support any built-in optimizations such as connection pooling, or, if we choose to use our own mechanism, are there any limitations / constraints that we should keep in mind?
- We are looking at some in-memory cache options for caching the mappings locally, some of which also support backing by local files. Are there any limits on how much memory (as a fraction of the total instance memory) this cache can use without affecting normal Dataflow operation on the workers? And if we use a file-backed cache, is there a path on each worker that is safe for the application itself to create such files in?
- The number of unique device IDs can be on the order of several million, so not all of the mappings can be stored on every instance. Therefore, to make better use of the local cache, we need some affinity between a device ID and the worker where it is processed. I can group by the device ID before the stage where this conversion takes place. If I do that, is there any guarantee that the same device ID will more or less be handled by the same worker? If there is a reasonable guarantee, then I wouldn't have to hit the external REST API most of the time, other than for the first call, which should be fine. Or is there a better way to ensure that kind of affinity between the IDs and the workers?
Thanks!
Here are some things you can do:
- Your DoFn can have instance variables and you can put a cache in there.
- It is also possible to use regular Java static variables for a cache local to the VM, as long as you properly control multithreaded access to it. Guava's CacheBuilder can be really helpful here (see the pipeline sketch after this list).
- Using the regular Java APIs for temporary files on the worker is safe (but, again, be mindful of multithreaded / multiprocess access to your files, and don't forget to clean them up - you may find DoFn's @Setup and @Teardown methods helpful; see the temp-file sketch below).
- You can GroupByKey by the device identifier; then, most of the time, at least with the Cloud Dataflow runner, the same key will be processed by the same worker (key assignments may change while the pipeline runs, but not too often). You will probably want a windowing/triggering strategy that fires immediately (the pipeline sketch below combines this with the cache).
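To make this concrete, here is a minimal sketch using the Apache Beam Java SDK (which the Dataflow runner executes): events are keyed by device ID, grouped, and then enriched through a VM-local Guava cache so that only cache misses hit the REST API. The subscription path, the cache sizing, and the extractDeviceId / fetchMappedIdFromRestApi helpers are hypothetical placeholders you would replace with your own.

    import java.util.concurrent.TimeUnit;

    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.CacheLoader;
    import com.google.common.cache.LoadingCache;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.windowing.AfterPane;
    import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
    import org.apache.beam.sdk.transforms.windowing.Repeatedly;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.joda.time.Duration;

    public class DeviceIdMappingPipeline {

      /** Keys each raw event by its device ID so GroupByKey can route it. */
      static class KeyByDeviceIdFn extends DoFn<String, KV<String, String>> {
        @ProcessElement
        public void processElement(ProcessContext c) {
          String event = c.element();
          c.output(KV.of(extractDeviceId(event), event));
        }
      }

      /** Enriches events with mapped-id, memoizing REST lookups in a VM-local cache. */
      static class MapDeviceIdFn extends DoFn<KV<String, Iterable<String>>, String> {

        // Static, so the cache is shared by all DoFn instances (and threads) on the worker
        // VM. Guava's cache is thread-safe, and the size bound keeps memory use predictable.
        private static final LoadingCache<String, String> MAPPED_IDS =
            CacheBuilder.newBuilder()
                .maximumSize(1_000_000)
                .expireAfterWrite(1, TimeUnit.HOURS)
                .build(new CacheLoader<String, String>() {
                  @Override
                  public String load(String deviceId) {
                    // Only cache misses reach the external mapping service.
                    return fetchMappedIdFromRestApi(deviceId);
                  }
                });

        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
          String mappedId = MAPPED_IDS.get(c.element().getKey());
          for (String event : c.element().getValue()) {
            c.output(event + "," + mappedId); // replace with your real enrichment logic
          }
        }
      }

      public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        p.apply(PubsubIO.readStrings()
                .fromSubscription("projects/<project>/subscriptions/<subscription>"))
            .apply(ParDo.of(new KeyByDeviceIdFn()))
            // A streaming GroupByKey needs a windowing/triggering strategy; firing after
            // every element keeps latency low while still giving key-to-worker affinity.
            .apply(Window.<KV<String, String>>into(new GlobalWindows())
                .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
                .withAllowedLateness(Duration.ZERO)
                .discardingFiredPanes())
            .apply(GroupByKey.<String, String>create())
            .apply(ParDo.of(new MapDeviceIdFn()));
        p.run();
      }

      // Hypothetical helpers -- fill these in for your event format and REST client.
      static String extractDeviceId(String event) { return event.split(",")[0]; }
      static String fetchMappedIdFromRestApi(String deviceId) { return "mapped-" + deviceId; }
    }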
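For the file-backed case, here is a minimal sketch of the @Setup / @Teardown pattern, assuming you bring your own file-backed cache library (not shown); FileBackedLookupFn and the directory prefix are hypothetical names, and only the setup/teardown pattern is the point.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Comparator;
    import java.util.stream.Stream;

    import org.apache.beam.sdk.transforms.DoFn;

    public class FileBackedLookupFn extends DoFn<String, String> {

      // One directory per DoFn instance, created on the worker's local disk through the
      // JVM's standard temp-file machinery so it does not collide with Dataflow's own files.
      private transient Path cacheDir;

      @Setup
      public void setup() throws IOException {
        cacheDir = Files.createTempDirectory("device-id-cache-");
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        // ... read from / write to files under cacheDir, as your cache library requires ...
        c.output(c.element());
      }

      @Teardown
      public void teardown() throws IOException {
        // Clean up after ourselves; @Teardown is best-effort, so stale directories may
        // linger until the worker VM is recycled.
        if (cacheDir != null) {
          try (Stream<Path> paths = Files.walk(cacheDir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
          }
        }
      }
    }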