Spark streaming - Does reduceByKeyAndWindow() use constant memory?
I am playing with the idea of having a long-running aggregation (maybe a one-day window). I realize that other answers on this site say you should use batch processing for this.
I am especially interested in understanding this function. It sounds like it would use constant space to aggregate across the window, one interval at a time. If that is true, a one-day aggregation might be viable (especially since it uses checkpointing in case of failure).
Does anyone know if this is the case?
The function is documented here: https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html
A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and "inverse reducing" the old data that leaves the window. An example would be that of "adding" and "subtracting" counts of keys as the window slides. However, it is applicable only to "invertible reduce functions", that is, those reduce functions which have a corresponding "inverse reduce" function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
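To make the incremental behaviour concrete, here is a minimal Scala sketch of how that variant is typically wired up. The socket source, checkpoint path, batch interval, and slide interval are assumptions made for the example, not anything from the question; the point is the reduce / inverse-reduce pair and the mandatory checkpoint.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object WindowedCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowedCounts")
    // Batch interval of 1 minute (placeholder; pick one that suits your ingest rate).
    val ssc = new StreamingContext(conf, Minutes(1))

    // Checkpointing is required for the incremental (invertible) form of reduceByKeyAndWindow.
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint") // placeholder path

    // Hypothetical source: lines arriving on a socket, each line treated as a key.
    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.map(line => (line, 1L))

    // Incremental form: add counts for the interval entering the window and
    // subtract counts for the interval leaving it, so each slide does roughly
    // one interval of work rather than re-reducing the full 24 hours.
    val dailyCounts = pairs.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b,   // reduce: new data entering the window
      (a: Long, b: Long) => a - b,   // inverse reduce: old data leaving the window
      Minutes(24 * 60),              // window length: one day
      Minutes(60)                    // slide interval: update hourly
    )

    dailyCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because the inverse function subtracts only the interval that just left the window, the per-slide work and the state kept per key stay roughly constant regardless of how long the window is.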
After exploring this on the MapR forums, it looks like it will indeed use a constant level of memory, which makes a daily window possible as long as you can fit one day's worth of data into your allocated resources.
The two caveats are as follows:
- The daily aggregation might take only 20 minutes to run. Keeping a one-day window open means you are holding those cluster resources continuously, not just for 20 minutes a day, so offline batch aggregations are far more resource-efficient (a sketch of such a batch job follows this list).
- It is harder to account for late data when you cut off exactly at day boundaries. If your data is stamped with dates, you need to wait until all of that day's data has arrived. A 1-day streaming window is fine if you literally want an analysis of the last 24 hours of data, regardless of what dates it carries.
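For comparison, here is a rough sketch of the offline batch alternative mentioned above: one job per day over that day's data. The input path, directory layout, and column name are assumptions made purely for illustration.

```scala
import org.apache.spark.sql.SparkSession

object DailyBatchCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DailyBatchCounts").getOrCreate()

    // Hypothetical layout: one directory of Parquet files per day.
    val events = spark.read.parquet("hdfs:///data/events/2017-06-01")

    // One aggregation over the whole day; the job holds cluster resources
    // only while it runs, not around the clock.
    val dailyCounts = events.groupBy("key").count() // "key" is an assumed column name

    dailyCounts.write.parquet("hdfs:///data/daily-counts/2017-06-01")

    spark.stop()
  }
}
```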