Reconcile file with table in postgres and determine compliance level

I have a table with 4 columns and it is loaded with data. Data can range from 1000 to a maximum of 2 million.

I am getting a file (say a tab) as part of a daily process with data for 4 columns. I have to prepare a report where for every column match I have to increase the match level.

eg:

The file.col1 data exists in db.col1; then matchlevel = 1

The files file.col2 exist in db.col2; then matchlevel = 2

The file.col3 data exists in db.col3; then matchlevel = 3

The file.col4 data exists in db.col4; then matchlevel = 4

Example of data in the database: (all fields are strings)

1367    37991   11111   sometext1
1365    37993   11112   sometext2
1369    34521   sample1 sometext1
1359    76583   sample2 sometext2

      

Sample data in file: (all fields are strings)

1367    37991   11111   sometext1
1365    8993    sample3  sometext5
1369    34521   sample4 sometext6
1359    76583   sample2 sometext7
1651    875637  notpresentindb    notpresentindb

      

The way out should be

id from file    Match Level

1367            Max  (all fields match in db)
1365            Low  (only column1 matches 
1369            Med  (column1 and 2 match)
1359            High (First 3 columns match)
1651            No Match (no columns match)

      

I am currently doing this Java. Declared 4 arraylists and copied all the data in the columns into the arraylists and used the apache collections to determine if the data of each column exists in each of the 4 arraylists for testing. But this is redflagged because we need to load 4 columns of data with 1 million records and all the server memory can be used.

Sample code below: Arraylist1, Arraylist2, Arraylist3, Arraylist4 are defined before the code below and loaded with data from db.

while ((sCurrentLine = br.readLine()) != null) {

                String[] temp;
                temp = sCurrentLine.split("\\t");
                value = "no match";
                valueInt = 0;
                if(arraylist1.contains(temp[0])){
                    value = "low";
                    int retval=arraylist1.indexOf(temp[0]);
                    if (arraylist2.get(retval).equals(temp[1]))  {                          
                        value = "med";

                        if (arraylist3.get(retval).equals(temp[2])){
                            value = "High";

                            if (arraylist4.get(retval).equals(temp[3])) {
                                value = "Max";

                            }
                        }
                    }
                }

      

+3


source to share


2 answers


Since Postgresql 9.1 you can use external data wrappers to view files as tables. http://www.postgresql.org/docs/9.4/static/file-fdw.html

Once you see your Csv as a table, you can use the power of SQL to generate a report.



CREATE TABLE t (
    id SERIAL PRIMARY KEY,
    c1 text,
    c2 text,
    c3 text
) ;

CREATE EXTENSION file_fdw;
CREATE SERVER my_csv_server FOREIGN DATA WRAPPER file_fdw;

CREATE TABLE csv (
    id integer,
    c1 text,
    c2 text,
    c3 text
) 
SERVER my_csv_server 
OPTIONS ( filename '/home/me/data.csv', format 'csv' );

SELECT
     id,
     CASE 
         WHEN t.c1 = csv.c1 AND t.c2 = csv.c2 AND t.c3 = csv.c3 THEN
             'Max'
         WHEN t.c1 = csv.c1 AND t.c2 = csv.c2 THEN
             'High'
         WHEN t.c1 = csv.c1 THEN
             'Mid'
         ELSE
             'Low' -- only id match
    END as report
FROM 
    t JOIN csv USING (id)

      

0


source


If you want to do the processing in the database, you must first download the contents of the file first. I can think of two ways to do this:

  • create another table and import the contents of the file into it eg. using commandcopy

  • use external data wrapper to directly access file contents.

Once you can access the contents of the file via SQL, you can do so using a simple outer join:

select fc.col1,
       case 
         when fc.col1 = bt.col1 and fc.col2 = bt.col2 and fc.col3 = bt.col3 and fc.col4 = bt.col4 then 'Max'
         when fc.col1 = bt.col1 and fc.col2 = bt.col2 and fc.col3 = bt.col3 then 'High'
         when fc.col1 = bt.col1 and fc.col2 = bt.col2 then 'Med'
         when fc.col1 = bt.col1 then 'Low'
         else 'No Match'
       end as match_level
from file_content fc
  left join base_table bt on fc.col1 = bt.col1;

      



If you also need to identify lines that are in base_table

but not contained in the file, you need full outer join

, not the left join:

select fc.col1,
       case 
         when fc.col1 = bt.col1 and fc.col2 = bt.col2 and fc.col3 = bt.col3 and fc.col4 = bt.col4 then 'Max'
         when fc.col1 = bt.col1 and fc.col2 = bt.col2 and fc.col3 = bt.col3 then 'High'
         when fc.col1 = bt.col1 and fc.col2 = bt.col2 then 'Med'
         when fc.col1 = bt.col1 then 'Low'
         else 
           case when bt.col1 is null then 'Not in database'
           else 'Not in file'
         end
       end as match_level
from file_content fc
  full join base_table bt on fc.col1 = bt.col1;

      

SQLFiddle example: http://sqlfiddle.com/#!15/1ff38/2

+1


source







All Articles