Postgres, JavaScript and sorting

This week, while setting up local project for work, I encountered some weired issue during the unit test and this has something to do with postgres and its default settings under Windows and other OS.

Problem

As an example, consider the following array: [ 'D', 'd', 'a', 'A', 'c', 'b', 'CD', 'Capacitor' ]

Sorting this in JavaScript results in case sensitive result, where upper case always come first:

1
2
>>> [ 'D', 'd', 'a', 'A', 'c', 'b', 'CD', 'Capacitor' ].sort()
[ "A", "CD", "Capacitor", "D", "a", "b", "c", "d" ]

Sorting this in Postgres SQL with default installation will yield a case insensitive sorting where upper and lower case are mixed:

1
2
3
4
5
6
7
8
9
10
11
12
SELECT regexp_split_to_table('D d a A c b CD Capacitor', ' ') ORDER BY 1;

regexp_split_to_table
-----------------------
a
A
b
c
Capacitor
CD
d
D

The goal here is to make sorting consistent, so we can either fix the Postgres side or fix the JavaScript side.

Investigation

Looking carefully at each of these results, it is not difficult to realize that the sorting in JavaScript is based on ASCII, i.e. character A has an ASCII code of 32 where as a has ASCII code of 97, so A comes first. This IS NOT the proper way to do string sorting in JavaScript.

For the result that Postgres gave, it is a bit more complex. Postgres uses LC_COLLATE to determine the sort order of the array.This variable comes from the system and different OS have different implementation, when using locale C or POSIX, strings are sorted according to their ASCII value, any other locales will result in a case insensitive result.

Solution

Here comes the solution part, as mentioned earlier, we can either fix the JavaScript or fix the Postgres, so I'll present the solutions separately.

There are several ways to make the sorting case insensitive in JavaScript:

  • [ 'D', 'd', 'a', 'A', 'c', 'b', 'CD', 'Capacitor' ].sort((a,b) => a.localeCompare(b))
    This uses localeCompare from string prototype which have some performance implication on larger array.

  • [ 'D', 'd', 'a', 'A', 'c', 'b', 'CD', 'Capacitor' ].sort(new Intl.Collator('en_us').compare)
    This is the recommended by MDN to sort larger arrays.

For Postgres, the first thing that needs to be noted is that Postgres recommends against using locales if it can be avoided, from their documentation:

The drawback of using locales other than C or POSIX in PostgreSQL is its performance impact. It slows character handling and prevents ordinary indexes from being used by LIKE. For this reason use locales only if you actually need them.

The one-off way to fix the sorting is by specifiying the LC_COLLATE value when creating the database, for example:

1
2
3
4
5
CREATE DATABASE db 
WITH TEMPLATE = template0
ENCODING = 'UTF8'
LC_COLLATE = 'C'
LC_CTYPE = 'C';

The created database (db in this case) will use C as LC_COLLATE overriding the default LC_COLLATE value from the OS. With the new database created, you easily verify it will sort in a case sensitive way by the ASCII value once you connect to the database and run the query presented previously.

This one-off way is good enough only if you care about creating such database once. Imaging next time you create a new database, you will still have to manually override the LC_COLLATE value. So the way to go is to modify the template database, because LC_COLLATE can't be changed once the database has been created, we will have to create a new database and set it as template.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21

--- Unset template1 as template before dropping
UPDATE pg_database
SET datistemplate='false'
WHERE datname='template1';

--- Create a new database that uses C as locale
CREATE DATABASE template1
IS_TEMPLATE true
ENCODING = 'UTF8'
LC_COLLATE = 'C'
LC_CTYPE = 'C'
CONNECTION LIMIT = -1
TEMPLATE template0;

--- Create a new databse now should have C as locale.
CREATE DATABASE db3;

--- Should return C
SHOW lc_collate;

Another way to do this, is to initialize the database cluster with C locale as below:

1
2
chown -R postgres:postgres /var/lib/postgres/
su - postgres -c "initdb --locale C -D '/var/lib/postgres/data'"

this will create template database with C locale.

Takeaways

String sorting with the default method in JavaScript and most other languages is merely a comparsion based on the ASCII code, this results in upper case letters always comes first.

String sorting in PostgreSQL depends on the LC_COLLATE setting of the table which depends on the setting of the operating system, default sorting will yield results that mixes upper case and lower case, in other words, sorting is not case sensitive. There are many ways to get case sensitive sorting, but the most reliable way should be specifiying the LC_COLLATE when creating the database.

References

  • https://www.postgresql.org/docs/13/locale.html
  • https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/localeCompare
  • Stackexchange question
  • Postgres FAQ