Efficiency of RethinkDB eqJoin with a multi-value key on a compound index

UPDATE: Looking at this more, I think I just need to denormalize the data and add the appropriate location information (city/state/etc.) to each row of the referrals table, rather than joining against a large set of ZIP codes. Am I on the right track?

Original question:

I have two tables: one containing ZIP code information, and a referrals table with about 4 million rows, each carrying a ZIP code and some other data. Given a specific ZIP code, I want every ZIP code in the same state and/or city (just the state in the example below), and then I want to pull all rows from the referrals table that match those ZIP codes plus some other criteria that come from other parts of the system.

The query below works, but it takes about 10 seconds.

r.table('all_geocodes').getAll('77019', {index: 'postal_code'})
  .pluck('state_name')
  // All geocode rows in the same state as ZIP 77019.
  .eqJoin(r.row('state_name'), r.table('all_geocodes'), {index: 'state_name'})
  .zip().pluck('location_id').distinct()
  // Join each distinct location_id against the compound index on referrals.
  .eqJoin(function(v) {
      return [v('location_id'), '207P00000X', 1];
    },
    r.table('fact_referrals_aggregates'),
    {index: 'location_and_taxonomy_and_entity_type_code'})
  .zip().count()
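
For reference, the second eqJoin targets a compound index on the referrals table. It would have been created along these lines (a sketch only; the field names taxonomy_code and entity_type_code are guesses inferred from the index name, not confirmed anywhere in this post):

    // Assumed setup: a compound index built from three fields of each referral row.
    // Field names below are guesses based on the index name.
    r.table('fact_referrals_aggregates').indexCreate(
      'location_and_taxonomy_and_entity_type_code',
      function(row) {
        return [row('location_id'), row('taxonomy_code'), row('entity_type_code')];
      })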


Several relevant numbers:

  • The first part of the query returns about 2700 ZIP codes.
  • getAll(['207P00000X', 1]) on the referrals table returns about 100,000 rows, and joining from that side back to the ZIP codes is even slower than the query above.
  • About 100,000 rows match the ZIP codes alone.
  • About 8,000 rows match both the ZIP codes and the parameters ['207P00000X', 1].

As you can see from the query profiler output below, RethinkDB inserts a concatMap that accounts for nearly all of the time. This is my first day using RethinkDB, so I'm sure I'm missing something, but I don't know what it is. Any suggestions on how to improve the performance of this query? Is it possible to avoid the concatMap by structuring things differently? (One possible restructuring is sketched right after the profile output.)

    {
      "description": "Evaluating count.",
      "duration(ms)": 9428.348954,
      "sub_tasks": [
        {
          "description": "Evaluating zip.",
          "duration(ms)": 9394.828064,
          "sub_tasks": [
            {
              "description": "Evaluating eq_join.",
              "duration(ms)": 9198.099333,
              "sub_tasks": [
                {
                  "description": "Evaluating concatmap.",
                  "duration(ms)": 9198.095406,
                  "sub_tasks": [
                    {
                      "description": "Evaluating distinct.", // distinct ZIP codes
                      "duration(ms)": 114.880663,
                      "sub_tasks": [
                        { *snipped for brevity* },
                        {
                          "description": "Evaluating elements in distinct.",
                          "mean_duration(ms)": 0.001039,
                          "n_samples": 2743
                        }
                      ]
                    },
                    {
                      "description": "Evaluating func.",
                      "duration(ms)": 0.004119,
                      "sub_tasks": []
                    },
                    {
                      "description": "Evaluating stream eagerly.",
                      "mean_duration(ms)": 1.0934,
                      "n_samples": 8285
                    }
                  ]
                }
              ]
            },
            {
              "description": "Evaluating stream eagerly.",
              "mean_duration(ms)": 0.001005,
              "n_samples": 8285
            }
          ]
        }
      ]
    }

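One possible restructuring (a minimal, untested sketch; it assumes the ~2700 distinct location_ids fit comfortably in a ReQL array) is to materialize the distinct IDs first and pass them to a single multi-key getAll via r.args, so the join never runs a concatMap over the left stream:

    // Sketch: collapse the left side to an in-memory array of compound keys,
    // then do one multi-key getAll against the compound index on referrals.
    r.table('all_geocodes').getAll('77019', {index: 'postal_code'})
      .pluck('state_name')
      .eqJoin(r.row('state_name'), r.table('all_geocodes'), {index: 'state_name'})
      .zip().pluck('location_id').distinct()
      .coerceTo('array')
      .do(function(rows) {
        return r.table('fact_referrals_aggregates')
          .getAll(
            r.args(rows.map(function(row) {
              return [row('location_id'), '207P00000X', 1];
            })),
            {index: 'location_and_taxonomy_and_entity_type_code'})
          .count();
      })

This trades the per-row concatMap for one bulk index lookup, at the cost of building the key array in memory first.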





1 answer


Denormalization turned out to be the answer here. I added state and city fields to the referrals table, eliminating the massive join against the ZIP table, and the speed improved significantly.
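
A sketch of what the denormalized version might look like, assuming the new column is called state_name and a compound index covers it together with the taxonomy and entity-type fields (all names here are illustrative, not taken from the answer):

    // Assumed one-time setup: index the denormalized state alongside the
    // existing taxonomy / entity-type fields (field names are guesses).
    r.table('fact_referrals_aggregates').indexCreate(
      'state_and_taxonomy_and_entity_type_code',
      function(row) {
        return [row('state_name'), row('taxonomy_code'), row('entity_type_code')];
      })

    // The whole query then collapses to a single indexed lookup: find the
    // state for the given ZIP, then count matching referrals directly.
    r.table('all_geocodes').getAll('77019', {index: 'postal_code'})
      .nth(0)('state_name')
      .do(function(state) {
        return r.table('fact_referrals_aggregates')
          .getAll([state, '207P00000X', 1],
                  {index: 'state_and_taxonomy_and_entity_type_code'})
          .count();
      })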











