2005-02-09
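λ. Solving Text.Regex's inability to handle Japanese with Oniguruma (鬼車)
A Haskell topic for the first time in a while. As of GHC-6.2.2, Text.Regex cannot properly handle non-ASCII characters, and this had been getting on my nerves more and more, so I solved it by using Oniguruma (鬼車).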
The current (GHC-6.2.2) implementation of Text.Regex does not handle the full range of Unicode. It handles only ASCII (or perhaps Latin-1) characters, which are a very small subset of Unicode. This limits the usefulness of Text.Regex, since many natural languages (such as Japanese) need non-ASCII characters.
I therefore modified the implementation of Text.Regex to use Oniguruma (鬼車), a very powerful regular expression library. Happily, it supports UTF-32, so the new implementation can handle the full range of Unicode.
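As a rough illustration of what this affects, here is a minimal sketch that uses only `mkRegex` and `matchRegex` from Text.Regex. The helper `firstGroup` and the names `hiraganaRun` and `sample` are hypothetical, introduced here for illustration only. The non-ASCII characters are written as `\x` escapes so that the source file itself stays ASCII. The result of the second call depends on which backend Text.Regex was built with, so no output is claimed for it here.

```haskell
import Text.Regex (mkRegex, matchRegex)

-- Hypothetical helper: return the first capture group of the first match,
-- or Nothing if the pattern does not match or has no groups.
firstGroup :: String -> String -> Maybe String
firstGroup pat s =
  case matchRegex (mkRegex pat) s of
    Just (g:_) -> Just g
    _          -> Nothing

-- One or more hiragana characters (U+3041 .. U+3093), as a bracket range.
hiraganaRun :: String
hiraganaRun = "([\x3041-\x3093]+)"

-- "Haskell" followed by five hiragana characters.
sample :: String
sample = "Haskell\x306f\x305f\x306e\x3057\x3044"

main :: IO ()
main = do
  -- ASCII-only sanity check; this should behave the same on any backend.
  print (firstGroup "(a+)" "baaac")
  -- Non-ASCII case; a backend limited to ASCII or Latin-1 may fail here
  -- or return a wrong result, while a Unicode-aware one should find the run.
  print (firstGroup hiraganaRun sample)
```

Since a Haskell `Char` is a Unicode code point, a `String` maps naturally onto UTF-32, which is presumably why Oniguruma's UTF-32 support is such a good fit.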
- for Hugs98-Mar2005 (updated 2005-05-07)
  - Install Oniguruma version 3 (or later).
  - Apply hugs98-Mar2005-oniguruma3-2.patch to the Hugs source tree.
  - Run autoconf in the top directory and in the `libraries/base' directory.
  - Build Hugs as usual.
- for GHC-6.4 (updated 2005-05-04)
  - Install Oniguruma version 3 (or later).
  - Apply ghc-6.4-oniguruma3-1.patch to the GHC source tree.
  - Run autoconf in the `libraries/base' directory.
  - Build GHC as usual.
- for GHC-6.2.2
  - ghc-6.2.2-onigd20050204-1.patch.gz
  - Apply this patch, run autoconf, and pass --enable-oniguruma to ./configure when building; Oniguruma will then be used.