Oracle: fastest way to compare tables containing CLOBs and get diff
Suppose I have two tables with columns Col1, Col2 and Col3, which are VARCHAR2, CLOB and NUMBER respectively.
How can I get the diff of these tables? (i.e. List of records that exist in table B but not in table A)
Table A:
βββββββββ¦ββββββββββββββββββ¦βββββββ
β Col1 β Col2 β Col3 β
β ββββββββ¬ββββββββββββββββββ¬βββββββ£
β P1111 β some_long_text1 β 1234 β
β P1111 β some_long_text1 β 1233 β
β P1111 β some_long_text2 β 1233 β
βββββββββ©ββββββββββββββββββ©βββββββ
Table B:
βββββββββ¦ββββββββββββββββββ¦βββββββ
β Col1 β Col2 β Col3 β
β ββββββββ¬ββββββββββββββββββ¬βββββββ£
β P1111 β some_long_text1 β 1234 β
β P1111 β some_long_text1 β 1235 β
β P1112 β some_long_text2 β 1233 β
βββββββββ©ββββββββββββββββββ©βββββββ
Expected results:
βββββββββ¦ββββββββββββββββββ¦βββββββ
β Col1 β Col2 β Col3 β
β ββββββββ¬ββββββββββββββββββ¬βββββββ£
β P1111 β some_long_text1 β 1235 β
β P1112 β some_long_text2 β 1233 β
βββββββββ©ββββββββββββββββββ©βββββββ
source to share
In Oracle I think you can do this
SELECT * FROM TableB WHERE (Col1, Col2, Col3) NOT IN (SELECT Col1, Col2, Col3 from TABLEA)
Other DBS you have joined:
SELECT * FROM TableB left outer join TableA
on (a.Col1=b.Col1 and a.Col2=b.Col2 and a.Col3=b.Col3)
WHERE a.col1 is null
You will probably need to perform a checksum / hash on the CLOB column, but if included in the comparison.
source to share
As per your expected results, it looks like Left Join will work well.
Something because of this:
Select B.Col1
,B.Col2
,B,Col3
FROM TableB B
LEFT OUTER JOIN TableA A
ON B.Col1 = A.Col1
WHERE B.Col2 = A.Col2
AND B.Col3 = A.Col3
AND A.Col1 IS NULL
Since the condition WHERE
filters the dataset, it makes sense to match everything, including the values ββthat NULL
match Table A. Including only A.Col1 IS NULL
in WHERE
, then you're sure to only see values ββfrom TableB that don't have a matching TableA value.
The entrances are expensive and the Left Joins are even more expensive. Joining only one key value should help efficiency (especially since you want to match all records anyway). By placing other join predicates in the clause WHERE
, you can use a filter.
Now - as far as suits yours CLOB
- there may or may not be any benefit to hashing these values. It depends on the size of the data and the hashing algorithm used. The Oracle optimizer can automatically select the hash column to compare, or you can force it to use a function.
I believe that the author should be allowed to make this choice - and I am sure there will be others who disagree with me, and they will all have good reasons. My point is, why force an extra step if it's not necessary when the optimizer can make that decision on its own?
If this is going to be a widely used query (such as a storage routine) that will be called frequently, then it might be useful to create a column that stores the precomputed hash CLOB
for easy comparison. This change will almost certainly eliminate the overhead by requiring each CLOB to be hashed at runtime, which can be very CPU intensive. I personally would not recommend indexing the hashed column as I expect each CLOB record to be likely to be unique. If so, then the PK of the table should be sufficient for matching based on row uniqueness.
source to share
1) You must create a UDT for the lob objects.
create or replace type lob_wrapper is object
( x clob,
hash varchar2(100),
constructor function lob_wrapper(p_x clob) return self as result,
MAP MEMBER FUNCTION get_hash RETURN varchar2
)
;
create or replace type body lob_wrapper
as
constructor function lob_wrapper(p_x clob)
return self as result
as
temp_ varchar(1000) := p_x;
begin
self.x := p_x;
-- add here better implementation of hashing clob.
select ora_hash(temp_) into self.hash from dual;
return;
end;
MAP MEMBER FUNCTION get_hash RETURN varchar2 is
v_hash varchar2(4000);
begin
return hash;
end;
end;
The object construct computes the hash for the clob. In the example I use ora_hash
, but you should choose the best solution ( dbms_crypto.hash
).
The map function get_hash
on an object is called when the db tries to compare two objects.
select col1,lob_wrapper(col2) col2 ,col3 from test_clob_b
minus
select col1,lob_wrapper(col2) col2 ,col3 from test_clob_a
Add another selection to get the original value from the object.
select col1,t.col2.x oringal_value,col3, t.col2.hash hash_value from (
select col1,lob_wrapper(col2) col2 ,col3 from test_clob_b
minus
select col1,lob_wrapper(col2) col2 ,col3 from test_clob_a
) t;
source to share