Save data as both numeric and double
I have a table with a numeric field, say:
create table data (
id bigserial PRIMARY KEY,
quantity numeric(21,8) NOT NULL
)
I need the numeric type because some queries need a level of precision that cannot be obtained with doubles.
But I also have queries that scan millions of these values, don't care about rounding issues, and should be as fast as possible.
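For example (a standard illustration of binary floating-point rounding; the literals are made up just for demonstration):
SELECT 0.1::float8  + 0.2::float8  = 0.3::float8  AS float8_equal,  -- false: 0.1 + 0.2 is 0.30000000000000004 in double precision
       0.1::numeric + 0.2::numeric = 0.3::numeric AS numeric_equal; -- true: numeric does exact decimal arithmetic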
Is there a standard strategy for this, or should I just duplicate every numeric value:
create table data (
id bigserial PRIMARY KEY,
quantity_exact numeric(21,8) NOT NULL,
quantity double precision NOT NULL
)
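If I do duplicate it, I assume the copy would have to be kept in sync on every write, for example with a trigger along these lines (a rough sketch; sync_quantity is just a name made up for illustration):
CREATE OR REPLACE FUNCTION sync_quantity() RETURNS trigger AS $$
BEGIN
    -- derive the approximate float8 copy from the exact numeric value on every write
    NEW.quantity := NEW.quantity_exact::double precision;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER data_sync_quantity
    BEFORE INSERT OR UPDATE ON data
    FOR EACH ROW EXECUTE PROCEDURE sync_quantity();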
Please read the updates below for the full picture.
Let's compare both cases that you indicated:
- use only one column (which one, float8 or numeric(21,8)?), or
- keep both of them (keep two).
Some observations.
- If both columns are kept, we are talking about duplicate data, which contradicts normalization and introduces ambiguity into the system, requiring special handling. That makes it -1 for the keep two case.
- Column size:
SELECT 'float8' AS coltyp, pg_column_size(random()::float8)
UNION ALL
SELECT 'numeric(21,8)', pg_column_size(random()::numeric(21,8));
Keeping both columns will require almost twice as much space. So that is another -1 for keep two, and +0.5 for the float8 variant, since it is slightly smaller.
- Speed tests show the following:
SET work_mem TO '2000MB'; -- to avoid usage of temp files
EXPLAIN (analyze, buffers, verbose)
  SELECT ((random()*1234567)::float8 / 2 + 3) * 5
  FROM generate_series(1,(1e7)::int) s;
EXPLAIN (analyze, buffers, verbose)
  SELECT ((random()*1234567)::numeric(21,8) / 2 + 3) * 5
  FROM generate_series(1,(1e7)::int) s;
On my i7 2.3GHz MBP, I got (based on 5 runs):
- no more than 3135.238 ms for float8, and
- no more than 17325.514 ms for numeric(21,8).
So here we have a clear +1 for the float8 case. Keep in mind this is an in-memory-only check; a query against a real table (especially a cold one) will take much longer.
It seems like sticking with float8 is the obvious way to go (+1.5 vs -2) given your performance requirements. And you can create a view on top of this table that exposes both the original float8 value and a cast to numeric(21,8) to suit your precision demands, as sketched below.
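A minimal sketch of such a view, assuming the table keeps only the float8 column (the view name data_both is made up for illustration):
CREATE VIEW data_both AS
SELECT id,
       quantity,                                    -- stored float8, fast to aggregate
       quantity::numeric(21,8) AS quantity_numeric  -- cast exposed alongside, as suggested above
FROM data;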
UPDATE: After a comment from a_horse_with_no_name, I decided to repeat the check, this time using real tables. I went for 9.4beta3, as 9.4 comes with the very nice pg_prewarm module.
This is what I did:
export PGDATA=$HOME/9.4b3
initdb -k -E UTF8 -A peer
pg_ctl start
Then I changed some of the defaults using the new ALTER SYSTEM command:
ALTER SYSTEM SET shared_buffers TO '1280MB';
ALTER SYSTEM SET checkpoint_segments TO '99';
ALTER SYSTEM SET checkpoint_completion_target TO '0.9';
Restart the server via pg_ctl restart
and now test:
SELECT id::int, 1::int AS const, (random()*1234567)::float8 as val
INTO f FROM generate_series(1,(1e7)::int) id;
SELECT id::int, 1::int AS const, (random()*1234567)::numeric(21,8) as val
INTO n FROM generate_series(1,(1e7)::int) id;
CREATE EXTENSION pg_prewarm;
VACUUM ANALYZE;
SELECT pg_prewarm('f');
SELECT pg_prewarm('n');
-- checking table size
SELECT relname,pg_size_pretty(pg_total_relation_size(oid))
FROM pg_class WHERE relname IN ('f','n');
-- checking speed
EXPLAIN (analyze, buffers, verbose) SELECT min(id), max(id), sum(val) FROM f;
EXPLAIN (analyze, buffers, verbose) SELECT min(id), max(id), sum(val) FROM n;
The results are now completely different:
- size: 422 MB vs 498 MB
- average time for float8: 2272.833 ms
- and for numeric(21,8): 3289.542 ms
Now this doesn't exactly reflect the real situation, but in my opinion:
- using numeric adds something (for me, about 20%) to the size of the relation;
- it makes queries somewhat slower (about 44% for me).
I was very surprised by these numbers, to be honest. Both tables are fully cached, so time was spent only on tuple processing and math. I thought it would make a bigger difference.
Personally, I would now go for numeric, given the fairly small difference in performance and the accuracy it offers.
I'll add my five cents. PG 9.4:
CREATE TABLE orders(
  count integer not null,
  ...
  cost character varying(15), -- cost as a string, e.g. '10.22'
  ncost numeric(10,2)         -- the same cost as numeric 10.22
)
~260,000 rows:
explain analyze
select sum(count*ncost) from orders
"Total execution time: 743.259 ms" (hot data after 10 tests) explain the analysis
select sum(count*cost::numeric(10,2)) from orders
"Total execution time: 577.289 ms" So, keep the cost as a string faster for sum ().