How to filter jsonb with multiple criteria?

I have the following table structure:

CREATE TABLE mytable (
  id   serial PRIMARY KEY,
  data jsonb
);


And the following data (partial for brevity; note that the years are deliberately irregular, and the sales and expenses years do not match each other):

INSERT INTO mytable (data)
VALUES
('{"employee": "Jim Romo", 
 "sales": [{"value": 10, "yr": "2012"}, {"value": 5, "yr": "2013"}, {"value": 40, "yr": "2014"}],
 "expenses": [{"value": 2, "yr": "2007"}, {"value": 1, "yr": "2013"}, {"value": 3, "yr": "2014"}], 
 "product": "tv", "customer": "1", "updated": "20150501"
}'),
('{"employee": "Jim Romo", 
 "sales": [{"value": 10, "yr": "2012"}, {"value": 5, "yr": "2013"}, {"value": 41, "yr": "2014"}],
 "expenses": [{"value": 2, "yr": "2009"}, {"value": 3, "yr": "2013"}, {"value": 3, "yr": "2014"}], 
 "product": "tv", "customer": "2", "updated": "20150312"
}'),
('{"employee": "Jim Romo", 
 "sales": [{"value": 20, "yr": "2012"}, {"value": 25, "yr": "2013"}, {"value": 33, "yr": "2014"}],
 "expenses": [{"value": 8, "yr": "2012"}, {"value": 12, "yr": "2014"}, {"value": 5, "yr": "2009"}], 
 "product": "radio", "customer": "2", "updated": "20150311"
}'),
('{"employee": "Bill Baker", 
 "sales": [{"value": 1, "yr": "2010"}, {"value": 2, "yr": "2009"}, {"value": 3, "yr": "2014"}],
 "expenses": [{"value": 3, "yr": "2011"}, {"value": 1, "yr": "2012"}, {"value": 7, "yr": "2013"}], 
 "product": "tv", "customer": "1", "updated": "20150205"
}'),
('{"employee": "Bill Baker", 
 "sales": [{"value": 10, "yr": "2010"}, {"value": 12, "yr": "2011"}, {"value": 3, "yr": "2014"}],
 "expenses": [{"value": 4, "yr": "2011"}, {"value": 7, "yr": "2009"}, {"value": 4, "yr": "2013"}], 
 "product": "radio", "customer": "1", "updated": "20150204"
}'),
('{"employee": "Jim Romo",
 "sales": [{"value": 22, "yr": "2009"}, {"value": 17, "yr": "2013"}, {"value": 35, "yr": "2014"}],
 "expenses": [{"value": 14, "yr": "2011"}, {"value": 13, "yr": "2014"}, {"value": 8, "yr": "2013"}], 
 "product": "tv", "customer": "3", "updated": "20150118"
}');

For each employee, I need to evaluate their most recently updated row and find employees with 2014 sales over 30. From there I need to further filter for employees whose average 2013 tv expenses are less than 5. For the average, I need to take ALL of their tv expenses, not just the latest row.

My expected output would be 1 line:

employee    | customer | 2014 tv sales   |  2013 avg tv expenses
------------+----------+-----------------+----------------------
Jim Romo    |    1     |   40            |  4


I can (kind of) do one or the other, but not both:

a. Get 2014 sales > 30 (but I couldn't restrict it to each employee's latest "tv" row) :(

SELECT * FROM mytable WHERE (SELECT (a->>'value')::float FROM
    (SELECT jsonb_array_elements(data->'sales') as a) as b 
    WHERE a @> json_object(ARRAY['yr', '2014'])::jsonb) > 30
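
I suspect something like DISTINCT ON could isolate each employee's most recently updated "tv" row first (this assumes the YYYYMMDD text in "updated" sorts correctly as text), but I haven't worked out how to combine it with the sales filter:

SELECT DISTINCT ON (data->>'employee') *
FROM mytable
WHERE data->>'product' = 'tv'
ORDER BY data->>'employee', data->>'updated' DESC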


b. Get the average expenses for 2013 (this should be the average tv expenses, but it isn't filtered by product)

SELECT avg((a->>'value')::numeric) FROM  
  (SELECT jsonb_array_elements(data->'expenses') as a FROM mytable) as b
  WHERE a @> json_object(ARRAY['yr', '2013'])::jsonb
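
As noted, this averages over all products; I assume the fix is a product filter on the inner query, something like:

SELECT avg((a->>'value')::numeric) FROM
  (SELECT jsonb_array_elements(data->'expenses') as a FROM mytable
   WHERE data->>'product' = 'tv') as b
  WHERE a @> json_object(ARRAY['yr', '2013'])::jsonb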


EDIT: This will potentially be a very large table, so any comments on performance and indexing needs would be appreciated, as I'm new to both PostgreSQL and jsonb.

EDIT #2: I tried both answers and neither performed well on a large table :(



2 answers


This is a (rather long) answer to your problem. The comments inside the query should explain its different parts. The main ideas I followed are: 1) keep each operation simple, build a correct result first, and only then optimize; 2) convert as much of the json structure as reasonable into a more "relational" shape, since relational data has more powerful operators than json data in Postgres. There is certainly room to simplify the query, and even to build a more efficient version, but at least this is a starting point.



with mytable1 as   -- transform the table in a more "relational-like" structure (just for clarity)
  (select id, data->>'employee' as employee, data->>'product' as product, 
      (data->>'updated')::integer as updated, (data->>'customer')::integer as customer,
          data->'sales' as sales, data->'expenses' as expenses 
   from mytable),
avg_exp_for_2013_tv as -- find the average expenses for tv in 2013 for each employee
   (select employee, avg(expenses.value) as avg2013_expenses
    from mytable1 , jsonb_to_recordset(expenses) as expenses(yr text, value float)
    where product = 'tv' and expenses.yr = '2013'
    group by employee),
most_recent_updates_employees as  -- find the most recent updates for each employee 
   (select employee, max(updated) as updated
    from mytable1 t1
    group by employee),
most_recent_updated_rows as   -- find the rows with the most recent updates
   (select t1.*
    from mytable1 t1, most_recent_updates_employees m
    where t1.employee = m.employee and t1.updated = m.updated),
employees_with_2014_tv_sales_gt_30 as
   (select employee, customer, sales.value as sales_value
    from most_recent_updated_rows m, jsonb_to_recordset(m.sales) as sales(yr text, value float)
    where yr = '2014' and value > 30)
select e1.employee, e1.customer, e1.sales_value as "2014 tv sales", e2.avg2013_expenses as "2013 avg tv expenses"
from employees_with_2014_tv_sales_gt_30 e1, avg_exp_for_2013_tv e2
where e1.employee = e2.employee and avg2013_expenses < 5
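
If I have not misread the sample data, running this should return exactly the single expected row:

 employee | customer | 2014 tv sales | 2013 avg tv expenses
----------+----------+---------------+----------------------
 Jim Romo |        1 |            40 |                    4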



The best way to unpack multi-level json is to build the records step by step, one level and one array at a time, picking out the required values along the way. That gives you a nicely layered, logical query. For the case described in the question you need two such chains joined together, since one of them has to compute an average under different conditions.

select distinct on (employee) employee, customer, sales_2014, avg_expenses_2013::numeric(20,2)
from (
    select s.employee, customer, updated, sales_2014, avg_expenses_2013
    from (
        select employee, customer, updated, (sales->>'value')::int sales_2014
        from (
            select employee, customer, updated, jsonb_array_elements(sales) sales
            from (
                select c.*
                from
                    mytable,
                    jsonb_to_record(data) 
                        as c(employee text, product text, customer text, updated text, sales jsonb)
                ) alias
                where product = 'tv'
            ) alias
        where sales->>'yr' = '2014'
    ) s
    join (
        select employee, avg((expenses->>'value')::numeric) avg_expenses_2013
        from (
            select employee, jsonb_array_elements(expenses) expenses
            from (
                select c.*
                from
                    mytable,
                    jsonb_to_record(data) 
                        as c(employee text, product text, expenses jsonb)
                ) alias
                where product = 'tv'
            ) alias
        where expenses->>'yr' = '2013'
        group by 1
    ) e
    on s.employee = e.employee
    where sales_2014 > 30
) alias
order by employee, updated desc;

  employee  | customer | sales_2014 | avg_expenses_2013
------------+----------+------------+-------------------
 Jim Romo   | 1        | 40         | 4.00
(1 row) 



Query performance on a large table will be very disappointing, even with optimizations such as indexes (a sketch of the generic option follows the list below). If you need to do this kind of analysis on this data, you should revise the data model, which is poorly suited to the purpose. I am missing a few key pieces of information to responsibly suggest appropriate changes:



  • Is the table part of a larger model?
  • Is there more data about employees and customers in the model?
  • In what form and how often is the table updated?
  • What other analyses are done based on this table?
  • Is it possible to delete old records from the table, keeping only the most recent ones?
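
If the jsonb table must stay as it is for a while, the generic option is a GIN index, which supports the containment operator (@>) used in your own attempts. A minimal sketch (the index name is mine; jsonb_path_ops gives a smaller index that only supports @>):

create index mytable_data_gin on mytable using gin (data jsonb_path_ops);

It will not help the jsonb_array_elements / jsonb_to_record unpacking in the queries above, though, which is part of why I suggest normalizing.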

However, one thing seems certain. You have to unpack the json data into normalized tables with regular column types. The model might look like this:

create table employees (
    employee_id serial primary key,
    employee_name text);

create table reports (    -- made-up name for the rows of your current table
    report_id serial primary key,
    employee_id int references employees,
    product_name text,   -- product_id references products?
    customer_no int,     -- customer_id references customers?
    updated_at date);

create table sales (
    sale_id serial primary key,
    report_id int references reports,
    year_no int,
    total int);     -- numeric?

create table expenses (
    expense_id serial primary key,
    report_id int references reports,
    year_no int,
    total int);
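
With such a model the original question becomes ordinary joins and aggregates that plain b-tree indexes can support. A rough, untested sketch against the hypothetical tables above:

select e.employee_name, r.customer_no, s.total as sales_2014, x.avg_expenses_2013
from employees e
join lateral (            -- the employee's most recently updated report
    select *
    from reports r
    where r.employee_id = e.employee_id
    order by r.updated_at desc
    limit 1
) r on r.product_name = 'tv'
join sales s              -- 2014 sales over 30 on that report
  on s.report_id = r.report_id and s.year_no = 2014 and s.total > 30
join (                    -- average tv expenses for 2013, per employee
    select r2.employee_id, avg(x2.total) as avg_expenses_2013
    from reports r2
    join expenses x2 on x2.report_id = r2.report_id and x2.year_no = 2013
    where r2.product_name = 'tv'
    group by r2.employee_id
) x on x.employee_id = e.employee_id and x.avg_expenses_2013 < 5;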
