
日々の流転


2005-02-09

λ. Solving Text.Regex's inability to handle Japanese with Oniguruma (鬼車)

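A Haskell topic for the first time in a while. As of GHC-6.2.2, Text.Regex cannot properly handle non-ASCII characters, and this had been annoying me more and more, so I solved it by using Oniguruma (鬼車).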

The current (GHC-6.2.2) implementation of Text.Regex doesn't handle the full range of Unicode. It can handle only ASCII (or maybe Latin-1) characters, which are a very small subset of Unicode. This limits the usefulness of Text.Regex, since many natural languages (such as Japanese) need non-ASCII characters.

Therefore I modified the implementation of Text.Regex to use Oniguruma (鬼車), a very powerful regular expression library. Happily, it supports UTF-32, so the new implementation can handle the full range of Unicode.
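
To illustrate why UTF-32 is a convenient fit here, the following is a minimal sketch, not the code from the patches. A Haskell Char is a single Unicode code point, so a String maps one-to-one onto 32-bit code units, whereas squeezing each character into 8 bits loses information. The names toUtf32, fromUtf32 and toBytesLossy are hypothetical helpers made up for this example, and the lossy 8-bit conversion is only an assumption about how a byte-oriented interface might behave.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8, Word32)

-- Hypothetical helper: encode a String as UTF-32 code units.
-- Each Char is one code point, so this is a one-to-one mapping.
toUtf32 :: String -> [Word32]
toUtf32 = map (fromIntegral . ord)

-- Hypothetical helper: decode UTF-32 code units back into a String.
-- Assumes every unit is a valid code point (true for toUtf32 output).
fromUtf32 :: [Word32] -> String
fromUtf32 = map (chr . fromIntegral)

-- Assumption for illustration only: a byte-oriented interface that
-- keeps just the low 8 bits of each code point.
toBytesLossy :: String -> [Word8]
toBytesLossy = map (fromIntegral . ord)

main :: IO ()
main = do
  let s = "鬼車"
  -- Full code points survive the UTF-32 round trip.
  print (toUtf32 s)
  print (fromUtf32 (toUtf32 s) == s)
  -- The 8-bit view no longer identifies the original characters.
  print (toBytesLossy s)
```

With a UTF-32 interface the regex engine sees exactly the characters the Haskell program sees, so no multi-byte decoding step is needed on either side.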

for Hugs98-Mar2005 (updated 2005-05-07)
  1. Install Oniguruma version 3 (or later).
  2. Apply hugs98-Mar2005-oniguruma3-2.patch to the Hugs source tree.
  3. Run autoconf in the top directory and in the `libraries/base' directory.
  4. Build hugs as usual.
for GHC-6.4 (updated 2005-05-04)
  1. Install Oniguruma version 3 (or later).
  2. Apply ghc-6.4-oniguruma3-1.patch to the GHC source tree.
  3. Run autoconf in the `libraries/base' directory.
  4. Build GHC as usual.
for GHC-6.2.2
ghc-6.2.2-onigd20050204-1.patch.gz
Apply this patch, run autoconf, and build with --enable-oniguruma passed to ./configure; Oniguruma will then be used.
Tags: haskell