Spark Streaming - Does reduceByKeyAndWindow() use constant memory?

I am playing with the idea of having long-running aggregations (perhaps a one-day window). I understand that other answers on this site say you should use batch processing for this kind of thing.

I am especially interested in understanding this function. It looks like it would use constant space to aggregate across the window, one batch interval at a time. If that is true, a one-day aggregation sounds viable (especially since it uses checkpointing in case of failure).

Does anyone know if this is the case?

The function is documented here: https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html

A more efficient version of the above reduceByKeyAndWindow(), where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window and "inverse reducing" the old data that leaves the window. An example would be "adding" and "subtracting" counts of keys as the window slides. However, it is applicable only to "invertible reduce functions", that is, those reduce functions which have a corresponding "inverse reduce" function (taken as the parameter invFunc). As with reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled to use this operation.
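To make the incremental add/subtract behaviour concrete, here is a minimal Scala sketch of the invertible-reduce variant of reduceByKeyAndWindow with checkpointing enabled. The socket source, checkpoint path, and interval lengths are illustrative assumptions, not from the question.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object DailyWindowSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DailyWindowSketch").setMaster("local[2]")
    // Batch interval: how often the incremental add/subtract step runs.
    val ssc = new StreamingContext(conf, Minutes(5))

    // Checkpointing is mandatory for the invertible-reduce variant, because
    // the running window state must survive failures. Path is hypothetical.
    ssc.checkpoint("/tmp/spark-checkpoint")

    // Hypothetical source: lines of text arriving on a socket.
    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.flatMap(_.split("\\s+")).map(word => (word, 1L))

    // Incremental 24-hour window, sliding every 5 minutes:
    //   the reduce function adds counts for data entering the window,
    //   the inverse function subtracts counts for data leaving it,
    // so each slide only processes one batch interval of new data.
    val dailyCounts = pairs.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b, // "add" new data
      (a: Long, b: Long) => a - b, // "subtract" old data (invFunc)
      Minutes(60 * 24),            // window length: one day
      Minutes(5)                   // slide interval
    )

    dailyCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The trade-off here is that the per-key state for the whole day has to be held and checkpointed, so memory stays roughly constant per slide but proportional to the number of distinct keys in the window.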





1 answer


After exploring this on the MapR forums, it looks like it would definitely use a constant level of memory, making a one-day window possible provided you can fit one day of data into your allocated resources.

The two downsides are as follows:

  • The daily aggregation may only take 20 minutes to compute. Running a window over the whole day means you are occupying those cluster resources permanently rather than just for 20 minutes a day, so stand-alone batch aggregations are much more resource-efficient (see the batch sketch after this list).
  • It is hard to deal with late data when you are streaming over exactly one day. If your data is tagged with dates, then you need to wait until all of your data arrives. A one-day streaming window would be fine if you were literally just analysing the last 24 hours of data, regardless of its content.
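For contrast with the first point, here is a minimal sketch of the batch alternative: aggregate one day of already-landed data with a plain Spark job that only occupies the cluster while it runs. The input path, partition layout, and column name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object DailyBatchSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DailyBatchSketch").getOrCreate()
    import spark.implicits._

    // Read one day's partition (hypothetical layout: .../date=YYYY-MM-DD/).
    val events = spark.read.parquet("/data/events/date=2017-06-01")

    // The same "count per key" aggregation, but as a one-shot batch job.
    val dailyCounts = events
      .groupBy($"key")
      .count()

    dailyCounts.write.parquet("/data/daily-counts/date=2017-06-01")
    spark.stop()
  }
}
```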








