# pyuca: Python Unicode Collation Algorithm implementation
(http://jtauber.com/blog/2006/01/27/python_unicode_collation_algorithm/)

This is my preliminary attempt at a Python implementation of the
[Unicode Collation Algorithm (UCA)](http://unicode.org/reports/tr10/).
I originally posted it to my blog in 2006, but it seems to get enough
usage that it really belongs here (and on PyPI).

What do you use it for? In short, sorting non-English strings properly.

The core of the algorithm involves multi-level comparison. For example,
``café`` comes before ``caff`` because at the primary level, the accent
is ignored and the first word is treated as if it were ``cafe``.
The secondary level (which considers accents) then applies only to words
that are equivalent at the primary level.

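The two-level idea can be illustrated with a minimal sketch. This is a toy, not pyuca's implementation: the helper name `toy_sort_key` is hypothetical, and it approximates the primary level by stripping combining marks with the standard `unicodedata` module instead of consulting a collation table.

```python
import unicodedata


def toy_sort_key(word):
    """Return a (primary, secondary) key: base letters first, accents second."""
    decomposed = unicodedata.normalize("NFD", word)
    # Primary level: base letters only, with combining marks (accents) dropped.
    primary = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Secondary level: for each base letter, the accents attached to it.
    secondary = []
    for ch in decomposed:
        if unicodedata.combining(ch) and secondary:
            secondary[-1] += ch
        else:
            secondary.append("")
    return (primary, tuple(secondary))


# "caf\u00e9" is "café" written with a precomposed e-acute.
words = ["caff", "caf\u00e9", "cafe"]
print(sorted(words))                    # code point order: ['cafe', 'caff', 'café']
print(sorted(words, key=toy_sort_key))  # two-level order:  ['cafe', 'café', 'caff']
```

Because Python compares tuples element by element, the accent information in the second element is consulted only when the primary strings are equal, which mirrors the behaviour described above.
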
The Unicode Collation Algorithm and pyuca also support contraction and
expansion. **Contraction** is where multiple letters are treated as a
single unit. In traditional Spanish ordering, ``ch`` is treated as a
single letter coming between ``c`` and ``d`` so that, for example, words
beginning with ``ch`` sort after all other words beginning with ``c``.
**Expansion** is where a single letter is treated as though it were
multiple letters. In German phone-book ordering, ``ä`` is sorted as if it
were ``ae``, i.e. after ``ad`` but before ``af``.

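Contraction and expansion can be sketched with two toy key functions. Both `spanish_key` and `german_key` are hypothetical names invented for this illustration; they handle lowercase input only and hard-code a single rule each, whereas a real collator reads such rules from collation data.

```python
def spanish_key(word):
    """Toy primary key in which 'ch' is one unit sorting between 'c' and 'd'."""
    key = []
    i = 0
    while i < len(word):
        if word.startswith("ch", i):
            # Contraction: two letters map to a single element just after 'c'.
            key.append((ord("c"), 1))
            i += 2
        else:
            key.append((ord(word[i]), 0))
            i += 1
    return key


def german_key(word):
    """Toy primary key in which a-umlaut expands to 'a' followed by 'e'."""
    expansions = {"\u00e4": "ae"}  # "\u00e4" is "ä"
    # Expansion: one letter maps to several elements.
    return "".join(expansions.get(ch, ch) for ch in word)


print(sorted(["chico", "cuna", "dama", "casa"], key=spanish_key))
# ['casa', 'cuna', 'chico', 'dama']
print(sorted(["af", "\u00e4", "ad"], key=german_key))
# ['ad', 'ä', 'af']
```
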
## Here is how to use the ``pyuca`` module:
```
git clone https://github.com/jtauber/pyuca.git
cd pyuca
pip install .  # installs from this clone; `pip install pyuca` would fetch the PyPI release instead
```

**Usage example:**
```
from pyuca import Collator

# Load the collation table from a local copy of allkeys.txt
c = Collator("allkeys.txt")

words = ["caff", "café", "cafe"]  # sample input
sorted_words = sorted(words, key=c.sort_key)
```

``allkeys.txt`` (1 MB) is available at

http://www.unicode.org/Public/UCA/latest/allkeys.txt

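If it helps, fetching the file can be scripted. The following is a minimal sketch using only the standard library; the helper name `ensure_allkeys` and its default path are assumptions made for this example and are not part of pyuca.

```python
import os
import urllib.request

ALLKEYS_URL = "http://www.unicode.org/Public/UCA/latest/allkeys.txt"


def ensure_allkeys(path="allkeys.txt", url=ALLKEYS_URL):
    """Download the table to `path` unless a file already exists there."""
    if not os.path.exists(path):
        # Network access happens only when no local copy is present.
        urllib.request.urlretrieve(url, path)
    return path
```

The returned path could then be passed to the collator, for example `Collator(ensure_allkeys())`.
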