unicode - Getting MySQL to properly distinguish Japanese characters in SELECT calls
I'm setting up a database for linguistic analysis, and Japanese kana are giving me a bit of trouble.
Unlike the other questions on this so far, I don't know that it's an encoding issue, per se. I've set the collation to utf8_unicode_ci, and on the surface it's saving and recalling things all right.
The problem, however, arises with related kana, such as キ (ki) and ギ (gi). For sorting purposes, Japanese doesn't distinguish between the two unless they are in direct conflict. For example:
- ぎ (gi) comes before きかい (kikai)
- きる (kiru) comes before ぎわく (giwaku)
- き (ki) comes before ぎ (gi)
It's this behavior that I think is at the root of the problem. When loading the data set from an external file, I had a SELECT call verify whether a specific Japanese reading had already been logged. If it was there, its id was fetched and paired with the headword; otherwise a new entry was added and paired thereafter.
What I noticed after everything was put in is that wherever two such similar readings occurred, only the first one encountered was logged, and it then showed up as a false positive for the other one whenever that appeared. For example:
- キョウ (kyou) appeared first, so characters read as ギョウ (gyou) got paired with kyou instead
- ズ (zu) appeared before ス (su), so likewise more characters got incorrectly matched.
I can go through and manually sort it out if need be, but I'd rather set the database to take a stricter view when differentiating between characters (e.g. if two characters have different UTF-8 code points, treat them as different characters). Is there a way to get this behavior?
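One way to check that the collation, rather than the encoding, is responsible is to compare a problem pair directly under different collations. The query below is only a sketch: it assumes the client sends the literals as UTF-8, and the column aliases are made up for illustration.

```sql
-- Compare one of the problem pairs under three collations.
-- The _utf8 introducer marks each literal as utf8, so the COLLATE clause
-- stays valid even if the connection character set is something else.
-- COLLATE binds to the right-hand literal and decides how "=" compares.
SELECT
    _utf8'キョウ' = _utf8'ギョウ' COLLATE utf8_unicode_ci AS equal_unicode_ci,
    _utf8'キョウ' = _utf8'ギョウ' COLLATE utf8_general_ci AS equal_general_ci,
    _utf8'キョウ' = _utf8'ギョウ' COLLATE utf8_bin        AS equal_bin;
```

A 1 in a column means that collation treats the two readings as equal, which is the condition under which the lookup would return a false positive.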
You can use the utf8_bin collation, which compares characters by their Unicode code points.
The utf8_general_ci collation also distinguishes キョウ and ギョウ.
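As a rough sketch of how a stricter collation might be applied, the statements below show two options. The table `readings`, the column `reading`, and the `VARCHAR(255) NOT NULL` definition are hypothetical, since the question does not show the schema; adjust them to the real column definition.

```sql
-- Hypothetical schema: readings(id, reading).

-- Option 1: make the column itself compare by code point.
-- MODIFY restates the whole column definition, so the type and
-- NULL-ability here must match the real column.
ALTER TABLE readings
    MODIFY reading VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL;

-- Option 2: keep the column's collation and force a code-point
-- comparison only in the lookup that checks for an existing reading.
SELECT id
FROM readings
WHERE reading = _utf8'ギョウ' COLLATE utf8_bin;
```

Option 1 also changes how `ORDER BY reading` sorts, because utf8_bin orders strictly by code point. If the looser ordering described in the question is still wanted for display, a per-query `COLLATE` clause in the `ORDER BY` is one way to keep it. Option 2 leaves sorting untouched but may keep MySQL from using an index on `reading` for that comparison.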