Strategy for localizing sensitive paginated sort

I am working on an application that is deployed to the internet. Part of the application is the search functions that present the result in a sorted list. The app is intended for users in several countries using different locales (= sorting rules). I need to find a solution to sort correctly for all users.

I am currently sorting with ORDER BY in my SQL query, so the sort is done according to the locale (or LC_LOCATE) setting for the database. These rules are incorrect for users whose locale differs from the one set for the database.

Also, to further complicate the problem, I am using pagination in the application, so when I query the database I am asking for rows 1-15, 16-30, etc. depending on the page you want. However, because the sort is wrong, each page contains records that are not correctly sorted. In the worst case, the entire result set for a given page may be out of order, depending on the locale / collation rules of the current user.

If I had to sort (server-side) code, I need to get all the rows from the database and then sort. This leads to an enormous impact, given the amount of data. So I would like to avoid this.

Does anyone have a strategy (or even a technical solution) to attack this issue that will result in correctly sorted lists without having to degrade performance when loading all the data?

Technical info: PostgreSQL 8.3 database, EJB3 application using EJB QL to query data, running on JBoss 4.5.

+2


source to share


5 answers


Are you ready to develop a small custom Postgres function module in C? (Probably just a few days for a seasoned C coder.)

strxfrm()

is a function that converts a language-specific text string based on the current LC_COLLATE parameter (more or less the current language) into a converted string which results in the correct sort order in that language if sorted as a binary byte (for example strcmp()

).

If you implement this for Postgres, say it takes the string and the collation, then you can order strxfrm (textbox, collation_order). I think you can even create multiple functional indexes on a text column (for example, one per language) using this function to store the strxfrm () results so that the optimizer uses the index.



Alternatively, you can join the Postgres developers implementing this mostly Postgres. Below are the wiki pages about these issues: Collation , ICU (which is also used by Java as far as I know).


As a less complicated solution, if data input is only through Java, you can compute these strxfrm () values ​​in Java (Java will probably have a different name for this concept) when you add data to the database and then let Postgres index and order these are pre-calculated values.

+1


source


How are you tied to PostgreSQL? the documentation doesn't promise:

The nature of some categories of locales is that their meaning must be corrected for the lifetime of the database cluster. That is, once initdb is up, you can no longer modify them. LC_COLLATE

and LC_CTYPE

- these categories. They affect the sort order of indexes, so they must be fixed or indexes on text columns become corrupted. PostgreSQL does this by writing the values LC_COLLATE

and LC_CTYPE

that are visible to initdb. The server automatically assumes these two values ​​at startup.

(Sorting rules determine how text is sorted.)



Google is throwing out the patch discussed :

PostgreSQL currently only supports one collation at a time, captured by the LC_COLLATE variable during database cluster initialization.

I'm not sure if I want to manage this outside of the database, although I would be interested to know how this can be done. (Anyone wanting a good technical overview of issues should check out Sorting Your Linguistic Data in an Oracle Database on the Oracle Globalization site .)

0


source


I don't know how to switch the database order order by

. Therefore, other decisions must be considered.

If the number of results is really large (hundreds of thousands?), I have no solution other than displaying only the number of results and asking the user to make a more precise query. Otherwise, the server side might work depending on the exact conditions ....

In particular, using the cache can improve the situation significantly. The first database query (no limit) will not be as slow as for a query limited by the number of results. And subsequent requests will be much faster. Often times paging and re-sorting is done for multiple requests, so the cache will work well (even with a few minutes).

I am using EhCache as a technical solution. Sorting and paging go together, sorting and then paging. The original results can be stored in the cache.

To reduce performance, here are some tips:

  • you can run the query once for the size of the result set and alert the user if there are too many results (ask to either confirm the slow query or add some select fields)
  • query only the columns that you need , drop all other columns (usually some data is not shown at once for all results, but for example is displayed when you move the mouse, this data can only be queried as needed, so shortening the columns requested for all results)
  • if you have computed values cache less between database columns and computed values
  • If you have duplicate values in multiple results, you can query the data / columns separately (so you fetch from the database once and cache them only once), only fetching the key (usually, and id) in the main query.
0


source


You might want to check out this package: http://www.fi.muni.cz/~adelton/l10n/postgresql-nls-string/ . It hasn't been updated for a long time and may not work anymore, but it seems like a sane starting point if you want to create a function that can do this for you.

0


source


This module does not work for Postgres 8.4.3. I fixed it - you can download the corrected version from http://www.itreport.eu/__cw_files/.01/.17/.ee7844ba6716aa36b19abbd582a31701/nls_string.c and you will have to compile and install it manually (as described in the respective README and INSTALL from the original module), but sorting does not work correctly anyway. I tried this on FreeBSD 8.0, LC_COLLATE is cs_CZ.UTF-8

0


source







All Articles