How do I count the columns where the values differ?
I have a large table and I need to find rows that are similar to a given row. I don't require all column values to be the same. Rows must not be "deleted" (determined by a query against another table), no value may be too different (I have already queried for these conditions), and most of the other values should be the same. I expect some ambiguity, so one or two different values should not break the "similarity" (I could get better performance by accepting only "completely equal" rows, but this simplification can cause errors; I'll offer it as an option).
The way I'm going to solve this is PL/pgSQL: a FOR loop iterating over the results of the previous queries. For each column, an IF checks whether the value differs; if so, I increment the difference counter and continue. At the end of each iteration, I compare the counter against the threshold and decide whether to keep the row as "similar" or not.
This PL/pgSQL-heavy approach seems slow compared to pure SQL, or to an SQL query with a few PL/pgSQL functions. It would be easy to test for rows with all but X equal columns if I knew which columns should differ, but the difference can occur in any of the 40 columns. Is there a way to solve this in a single query? If not, is there a faster way than going through all the rows?
EDIT: I mentioned a table; it is actually a group of six tables linked in 1:1 relationships. I'd rather not explain what it is, that's another question. Extrapolating from one table to my situation is easy for me. So I simplified it (but did not oversimplify - it should demonstrate all the difficulties I have) and made an example to demonstrate what I need. NULL and anything else should be considered "different". There is no need to test the entire script - I just need to figure out whether something can be done more efficiently than what I had in mind.
The point is, I don't need to count rows (as usual), but columns.
EDIT2: previous fiddle - it wasn't that short, so I'm leaving it here only for archival purposes.
EDIT3: A simplified example here, with just NOT NULL integers and the preprocessing omitted. Current state of the data:
select * from foo;
 id | bar1 | bar2 | bar3 | bar4 | bar5
----+------+------+------+------+------
  1 |    4 |    2 |    3 |    4 |   11
  2 |    4 |    2 |    4 |    3 |   11
  3 |    6 |    3 |    3 |    5 |   13
When I run select similar_records(1); I should get only row 2 (2 columns with different values, which is within the limit), not row 3 (4 different values, which exceeds the maximum of two differences).
To find rows that differ in no more than a given maximum number of columns:
WITH cte AS (
   SELECT id
        , unnest(ARRAY['bar1', 'bar2', 'bar3', 'bar4', 'bar5']) AS col  -- more
        , unnest(ARRAY[bar1::text, bar2::text, bar3::text
                     , bar4::text, bar5::text]) AS val                  -- more
   FROM   foo
   )
SELECT b.id, count(a.val <> b.val OR NULL) AS cols_different
FROM  (SELECT * FROM cte WHERE id =  1) a
JOIN  (SELECT * FROM cte WHERE id <> 1) b USING (col)
GROUP  BY b.id
HAVING count(a.val <> b.val OR NULL) < 3  -- max. diffs allowed
ORDER  BY 2;
I ignored all other distractions in your question.
Demonstrated with 5 columns. Add as many as you need.
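The counting in the query above relies on count() skipping NULL values: "true OR NULL" yields true, while "false OR NULL" yields NULL and is not counted. A minimal sketch of that idiom on made-up values (the derived table t and its columns x and y are purely illustrative), next to the aggregate FILTER clause, which assumes PostgreSQL 9.4 or later:

```sql
-- Three hypothetical pairs; two of them differ.
SELECT count(x <> y OR NULL)          AS diff_or_null  -- counts only rows where x <> y is true
     , count(*) FILTER (WHERE x <> y) AS diff_filter   -- same count, PostgreSQL 9.4+
FROM  (VALUES (1, 1), (1, 2), (3, 4)) AS t(x, y);
```

Both columns should come out as 2 here. Which spelling to use is mostly a matter of taste.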
If columns can be NULL, you can use IS DISTINCT FROM instead of <>.
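A sketch of that NULL-safe variant, assuming a hypothetical table foo_nullable with three nullable columns (not part of the question's data). IS DISTINCT FROM treats two NULLs as equal and a NULL versus a value as different:

```sql
-- Hypothetical sample data with NULLs; names are for illustration only.
CREATE TEMP TABLE foo_nullable (id int PRIMARY KEY, bar1 int, bar2 int, bar3 int);
INSERT INTO foo_nullable VALUES
   (1, 4, 2,    NULL)
 , (2, 4, NULL, NULL)  -- differs from row 1 in bar2 only (NULL vs. 2)
 , (3, 6, 3,    7);    -- differs from row 1 in all three columns

WITH cte AS (
   SELECT id
        , unnest(ARRAY['bar1', 'bar2', 'bar3']) AS col
        , unnest(ARRAY[bar1::text, bar2::text, bar3::text]) AS val
   FROM   foo_nullable
   )
SELECT b.id, count((a.val IS DISTINCT FROM b.val) OR NULL) AS cols_different
FROM  (SELECT * FROM cte WHERE id =  1) a
JOIN  (SELECT * FROM cte WHERE id <> 1) b USING (col)
GROUP  BY b.id
HAVING count((a.val IS DISTINCT FROM b.val) OR NULL) < 3  -- max. diffs allowed
ORDER  BY 2;
```

With this data, only id 2 should be returned (one differing column); id 3 differs in three columns and is filtered out. With plain <>, comparisons involving NULL would evaluate to NULL and silently go uncounted.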
It uses a somewhat unorthodox but convenient parallel unnest(). Both arrays must have the same number of elements for this to work.
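As an alternative sketch, not what the query above uses: assuming PostgreSQL 9.4 or later, unnest() accepts several arrays when called in the FROM list and expands them in lockstep. Run against the foo table from the question, the alias u(col, val) is just an illustrative name:

```sql
WITH cte AS (
   SELECT f.id, u.col, u.val
   FROM   foo f
        , unnest(ARRAY['bar1', 'bar2', 'bar3', 'bar4', 'bar5']::text[]
               , ARRAY[f.bar1::text, f.bar2::text, f.bar3::text
                     , f.bar4::text, f.bar5::text]) AS u(col, val)  -- implicit LATERAL
   )
SELECT b.id, count(a.val <> b.val OR NULL) AS cols_different
FROM  (SELECT * FROM cte WHERE id =  1) a
JOIN  (SELECT * FROM cte WHERE id <> 1) b USING (col)
GROUP  BY b.id
HAVING count(a.val <> b.val OR NULL) < 3  -- max. diffs allowed
ORDER  BY 2;
```

In this form, arrays of unequal length are padded with NULLs rather than multiplied out, so a mismatch in element counts is easier to spot.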
SQL Fiddle (based on yours).
Instead of looping to compare each row with all the others, do a self-join:
select f0.id, f1.id
from foo f0 inner join foo f1 on f0.id < f1.id
where
    f0.bar1 = f1.bar1 and f0.bar2 = f1.bar2
    and
    @(f0.bar3 - f1.bar3) <= 1
    and
        f0.bar4 = f1.bar4 and f0.bar5 = f1.bar5
        or
        f0.bar4 = f1.bar5 and f0.bar5 = f1.bar4
    and
    @(f0.bar6 - f1.bar6) <= 2
    and
        f0.bar7 is not null and f1.bar7 is not null and @(f0.bar7 - f1.bar7) <= 5
        or
        f0.bar7 is null and f1.bar7 <= 3
        or
        f1.bar7 is null and f0.bar7 <= 3
    and
    f0.bar8 = f1.bar8
    and
    @(f0.bar11 - f1.bar11) <= 5
;
 id | id
----+----
  1 |  4
  1 |  5
  4 |  5
(3 rows)
select * from foo;
 id | bar1 | bar2 | bar3 | bar4 | bar5 | bar6 | bar7 | bar8 | bar9 | bar10 | bar11
----+------+------+------+------+------+------+------+------+------+-------+-------
  1 | abc  |    4 |    2 |    3 |    4 |   11 |    7 | t    | t    | f     |  42.1
  2 | abc  |    5 |    1 |    6 |    2 |    8 |   39 | t    | t    | t     |  19.6
  3 | xyz  |    4 |    2 |    3 |    5 |   14 |   82 | t    | f    |       |    95
  4 | abc  |    4 |    2 |    4 |    3 |   11 |    7 | t    | t    | f     |  42.1
  5 | abc  |    4 |    2 |    3 |    4 |   13 |    6 | t    | t    |       |  37.7
Do you know that the and operator has precedence over or? I am asking because it looks like the where clause in your function is not what you want. I mean that in your expression it is enough for f0.bar7 is null and f1.bar7 <= 3 to be true for the pair to be included.
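A minimal illustration of the precedence rule, using only boolean literals (the column aliases are illustrative):

```sql
-- AND binds tighter than OR, so the first two expressions are equivalent.
SELECT false AND false OR true   AS unparenthesized  -- true
     , (false AND false) OR true AS as_parsed        -- true
     , false AND (false OR true) AS with_parens;     -- false
```

If the intent was for each group of alternatives to act as a single condition (an assumption about what the function is meant to do), the or branches need explicit parentheses. A sketch for the bar7 and bar8 part only, against the foo table shown above:

```sql
select f0.id, f1.id
from foo f0 inner join foo f1 on f0.id < f1.id
where
    f0.bar8 = f1.bar8
    and (   -- the parentheses keep the three bar7 alternatives together
        f0.bar7 is not null and f1.bar7 is not null and @(f0.bar7 - f1.bar7) <= 5
        or
        f0.bar7 is null and f1.bar7 <= 3
        or
        f1.bar7 is null and f0.bar7 <= 3
    )
;
```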