There are times when you need to sort in a locale-aware manner.
One of the more obvious cases is probably when generating cryptographic signatures for web services. These often require you to create a hash-based message authentication code (HMAC) based on inputs including a canonicalized URI, several HTTP header values including a timestamp, a secret key, and perhaps other items. These items normally have to be sorted so that the server end can reproduce the same HMAC by calculation, and that means both ends have to agree on the collating sequence.
Often you can get away with a lot because most of the characters are going to fall within the 7-bit ASCII range. But when they don't you need to be sure you are using the "invariant, string-oriented" collating sequence and not your user session collating sequence or one that takes language quirks into account.
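Many HMAC sigs require that you hash the UTF-8 too, but it works if you first sort the UTF-16LE Unicode and then re-encode as UTF-8. For a binary comparison the two encodings give the same sequence as long as every character is inside the Basic Multilingual Plane; surrogate pairs can order differently against U+E000 through U+FFFF.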
And of course sorting gets used all over - though most uses aren't as sensitive as crypto processes can be.
Subtleties
Accented characters may sort earlier or later depending on the language. Ligatures (e.g. mediæval vs. mediaeval) need to be considered. "String compare" and "linguistic compare" differ. And on and on it goes.
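To make the accented-character point concrete, here is a rough sketch that compares the same two strings under two different locale names. The Declare is my own Long-pointer rendering of the documented CompareStringEx() signature, and CompareIn is a hypothetical helper name, not something lifted from the demo.

```vb
Option Explicit

' Assumed declaration: pointers passed as Long (32-bit VB6), Windows Vista or later.
Private Declare Function CompareStringEx Lib "kernel32" ( _
    ByVal lpLocaleName As Long, ByVal dwCmpFlags As Long, _
    ByVal lpString1 As Long, ByVal cchCount1 As Long, _
    ByVal lpString2 As Long, ByVal cchCount2 As Long, _
    ByVal lpVersionInformation As Long, ByVal lpReserved As Long, _
    ByVal lParam As Long) As Long

' Hypothetical helper: returns the raw API result,
' 1 = less than, 2 = equal, 3 = greater than, 0 = failure.
Public Function CompareIn(ByVal LocaleName As String, _
                          ByVal A As String, ByVal B As String) As Long
    CompareIn = CompareStringEx(StrPtr(LocaleName), 0, _
                                StrPtr(A), Len(A), StrPtr(B), Len(B), 0, 0, 0)
End Function

Public Sub ShowAccentDifference()
    Dim AUmlaut As String
    ' a with diaeresis, built from its code point so the ANSI source file does not matter
    AUmlaut = ChrW$(&HE4)

    Debug.Print "de-DE:", CompareIn("de-DE", AUmlaut, "z")
    Debug.Print "sv-SE:", CompareIn("sv-SE", AUmlaut, "z")
End Sub
```

German collation treats the character as a variant of "a", so the first line should print 1 (less than "z"). Swedish puts it near the end of the alphabet, after "z", so the second should print 3.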
Demo
This demo uses a simplistic Insertion Sort. This is quick and dirty, understood by most sorting fans, and importantly it is a stable sort so it will help showcase my point here.
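For anyone who wants to see why stability falls out of Insertion Sort, here is a minimal sketch. It is not the demo's actual code. CompareItems is a hypothetical stand-in that uses StrComp() with vbTextCompare so the block runs on its own; the demo does its comparing through CompareStringEx().

```vb
Option Explicit

' Hypothetical stand-in comparer returning -1, 0 or 1.
Private Function CompareItems(ByVal A As String, ByVal B As String) As Long
    CompareItems = StrComp(A, B, vbTextCompare)
End Function

' Sorts Items in place. Items must already be dimensioned.
Public Sub InsertionSort(ByRef Items() As String)
    Dim I As Long
    Dim J As Long
    Dim Held As String

    For I = LBound(Items) + 1 To UBound(Items)
        Held = Items(I)
        J = I - 1
        ' VB6 "And" does not short-circuit, so the bounds test and the
        ' comparison are kept apart.
        Do While J >= LBound(Items)
            ' Only strictly greater items move up; equal items stay in
            ' their original order, which is what makes the sort stable.
            If CompareItems(Items(J), Held) <= 0 Then Exit Do
            Items(J + 1) = Items(J)
            J = J - 1
        Loop
        Items(J + 1) = Held
    Next
End Sub

Public Sub TryInsertionSort()
    Dim Names() As String
    Names = Split("b,B,a", ",")
    InsertionSort Names
    Debug.Print Join(Names, ",")   ' prints a,b,B
End Sub
```

Since "b" and "B" compare equal under vbTextCompare, they come out in the order they went in. That is the property that makes differences between collating options easy to spot.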
Basically there is nothing special about it except that it uses CompareStringEx() in Kernel32.dll to compare strings within the sort. If you are still on the unsupported Windows XP or earlier, you will have to hack it a bit to make use of the aging CompareString() instead.
While the new entrypoint accepts locale string values instead of LCIDs, it may be worth noting that the older one comes in both ANSI and Unicode flavors.
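A sort routine usually wants a -1/0/1 answer rather than the API's 1/2/3, so here is a sketch of one way to wrap the call. CompareByLocaleName, its error number, and the Long-pointer Declare are my own assumptions rather than the demo's code.

```vb
Option Explicit

' Assumed declaration: pointers passed as Long (32-bit VB6), Windows Vista or later.
Private Declare Function CompareStringEx Lib "kernel32" ( _
    ByVal lpLocaleName As Long, ByVal dwCmpFlags As Long, _
    ByVal lpString1 As Long, ByVal cchCount1 As Long, _
    ByVal lpString2 As Long, ByVal cchCount2 As Long, _
    ByVal lpVersionInformation As Long, ByVal lpReserved As Long, _
    ByVal lParam As Long) As Long

' Hypothetical wrapper: returns -1, 0 or 1 like StrComp().
' LocaleName is a locale string such as "en-US" or "fr-FR".
Public Function CompareByLocaleName(ByVal A As String, ByVal B As String, _
        ByVal LocaleName As String, Optional ByVal Flags As Long = 0) As Long
    Dim Result As Long

    Result = CompareStringEx(StrPtr(LocaleName), Flags, _
                             StrPtr(A), Len(A), StrPtr(B), Len(B), 0, 0, 0)
    If Result = 0 Then
        ' 0 signals failure, for example an unknown locale name or bad flags.
        Err.Raise vbObjectError + 1000, "CompareByLocaleName", _
                  "CompareStringEx failed, error " & Err.LastDllError
    End If

    CompareByLocaleName = Result - 2   ' 1/2/3 becomes -1/0/1
End Function
```

For example, CompareByLocaleName("apple", "banana", "en-US") returns -1. One thing to watch: a String with a null pointer (vbNullString) passed as the locale name is read by the API as "user default", so pass a real locale name if you want a predictable collating sequence.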
The demo includes a sample list of string data as a Unicode text file. You can modify this with interesting cases you may know of. It has a brief list of "western" languages. You can add or remove values to that list within the code or change the program to load them from a file too.
The list is loaded up and displayed in a flexgrid with back-colors from white through deepening greenish-blue shades that help make sorting differences easier to see when you try various collating sequence modifications. Because of the not-so-clever way this is done a string list of more than 255 elements will crash the program. ;)
[Screenshot: sshot.png]
Requirements
VB6, because VB6 comes with MSHFlexgrid which is Unicode-aware. VB5 will work if you substitute another Unicode grid or use the crusty old MSFlexgrid and avoid "invalid in your locale's ANSI" characters.
Windows Vista or later, because of the new CompareStringEx() used here. If you modify the program to call CompareString() instead, it works on downlevel unsupported Windows versions, but you can't use locale strings and will have to change the pick list to use LCID values instead.
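If you do go the downlevel route, the change might look roughly like this sketch. It assumes the Unicode entrypoint CompareStringW (XP is fine; Win9x would need the ANSI flavor), and CompareByLcid plus the constant names are my own. The LCID values are the standard ones for those locales.

```vb
Option Explicit

' LCIDs standing in for the locale strings "en-US", "de-DE" and "sv-SE".
Private Const LCID_EN_US As Long = &H409
Private Const LCID_DE_DE As Long = &H407
Private Const LCID_SV_SE As Long = &H41D

' Assumed declaration: pointers passed as Long (32-bit VB6).
Private Declare Function CompareStringW Lib "kernel32" ( _
    ByVal Locale As Long, ByVal dwCmpFlags As Long, _
    ByVal lpString1 As Long, ByVal cchCount1 As Long, _
    ByVal lpString2 As Long, ByVal cchCount2 As Long) As Long

' Hypothetical helper: returns the raw API result,
' 1 = less than, 2 = equal, 3 = greater than, 0 = failure.
Public Function CompareByLcid(ByVal A As String, ByVal B As String, _
        ByVal Lcid As Long, Optional ByVal Flags As Long = 0) As Long
    CompareByLcid = CompareStringW(Lcid, Flags, _
                                   StrPtr(A), Len(A), StrPtr(B), Len(B))
End Function

Public Sub TryCompareByLcid()
    Debug.Print CompareByLcid("apple", "banana", LCID_EN_US)   ' prints 1
End Sub
```

The pick list would then carry these numbers (for instance in each item's ItemData) instead of the locale strings.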
Sticking with Unicode support means "eastern" languages can be tested too.
Running the Demo
Nothing special required, and it should just unzip, open, and run even without compiling to EXE first. MSHFlexgrid comes with VB6 so you're set. VB5 users see Requirements section above.
Click the "Sort" button. Change the settings and "Sort" again. Scroll through the list of interesting cases - the scroll position should be stable between "Sorts" so look at the O'Leary case and flip sorting between "String Sort" and "Linguistic Sort" (i.e. "String Sort" not chosen). Ancien Régime is another interesting case.
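If you want to poke at the O'Leary case outside the demo, here is a small sketch of the two modes. It assumes "String Sort" maps to the SORT_STRINGSORT flag of CompareStringEx(). RawCompare and the comparison strings "Olden", "co-op" and "con" are my own picks for illustration, not items from the demo's data file.

```vb
Option Explicit

Private Const SORT_STRINGSORT As Long = &H1000

' Assumed declaration: pointers passed as Long (32-bit VB6), Windows Vista or later.
Private Declare Function CompareStringEx Lib "kernel32" ( _
    ByVal lpLocaleName As Long, ByVal dwCmpFlags As Long, _
    ByVal lpString1 As Long, ByVal cchCount1 As Long, _
    ByVal lpString2 As Long, ByVal cchCount2 As Long, _
    ByVal lpVersionInformation As Long, ByVal lpReserved As Long, _
    ByVal lParam As Long) As Long

' Hypothetical helper: raw API result for en-US,
' 1 = less than, 2 = equal, 3 = greater than, 0 = failure.
Private Function RawCompare(ByVal A As String, ByVal B As String, _
                            ByVal Flags As Long) As Long
    Dim LocaleName As String
    LocaleName = "en-US"
    RawCompare = CompareStringEx(StrPtr(LocaleName), Flags, _
                                 StrPtr(A), Len(A), StrPtr(B), Len(B), 0, 0, 0)
End Function

Public Sub ShowStringVsLinguistic()
    ' Default (linguistic, "word") sort: apostrophe and hyphen carry very little weight.
    Debug.Print "Linguistic:", RawCompare("O'Leary", "Olden", 0), _
                               RawCompare("co-op", "con", 0)
    ' SORT_STRINGSORT: punctuation is compared like any other symbol.
    Debug.Print "String:", RawCompare("O'Leary", "Olden", SORT_STRINGSORT), _
                           RawCompare("co-op", "con", SORT_STRINGSORT)
End Sub
```

Going by Microsoft's description of word sort versus string sort, the linguistic line should show 3 and 3 (the punctuated strings land after their neighbors) and the string line 1 and 1 (punctuation sorts ahead of letters).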