Question

Is there any difference in performance between utf8_general_ci and utf8_unicode_ci?



Solution

Both of these collations are for the UTF-8 character encoding. They differ in how text is sorted and compared.

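Note: Since MySQL 5.5.3 you should use utf8mb4 rather than utf8. Both refer to the UTF-8 encoding, but the older utf8 has a MySQL-specific limitation: it stores at most three bytes per character, so it cannot hold code points above U+FFFF (characters outside the Basic Multilingual Plane, such as most emoji).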
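
To make the limitation concrete, here is a minimal Python sketch (not MySQL code) that checks whether a string would fit in a three-byte-per-character encoding such as MySQL's legacy utf8. The function names fits_in_legacy_utf8 and utf8_byte_lengths are hypothetical helpers invented for this illustration.

```python
def fits_in_legacy_utf8(text: str) -> bool:
    """Return True if every code point is at most U+FFFF.

    Such code points need at most 3 bytes in UTF-8, which is the
    per-character ceiling assumed here for MySQL's legacy utf8.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


def utf8_byte_lengths(text: str) -> list:
    """Return the number of UTF-8 bytes each character occupies."""
    return [len(ch.encode("utf-8")) for ch in text]


if __name__ == "__main__":
    sample = "a\u00e9\u20ac\U0001F600"  # 'a', e-acute, euro sign, an emoji
    print(utf8_byte_lengths(sample))    # byte count per character
    print(fits_in_legacy_utf8(sample))  # the emoji is above U+FFFF
```

A check like this could be run over existing data before deciding whether a column still declared as utf8 is silently unable to hold some of the input.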

  • Accuracy

    utf8mb4_unicode_ci is based on the Unicode standard for sorting and comparison, which sorts accurately in a very wide range of languages.

    utf8mb4_general_ci fails to implement all of the Unicode sorting rules, which will result in undesirable sorting in some situations, such as when using particular languages or characters.

  • Performance

    utf8mb4_general_ci is faster at comparisons and sorting, because it takes a bunch of performance-related shortcuts.

    On modern servers, this performance boost will be all but negligible. It was devised in a time when servers had a tiny fraction of the CPU performance of today's computers.

    utf8mb4_unicode_ci, which uses the Unicode rules for sorting and comparison, employs a fairly complex algorithm for correct sorting in a wide range of languages and when using a wide range of special characters. These rules need to take into account language-specific conventions; not everybody sorts their characters in what we would call 'alphabetical order'.

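As far as Latin (i.e. "European") languages are concerned, there is not much difference between the Unicode sorting and the simplified utf8mb4_general_ci sorting in MySQL, but there are still a few differences: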

  • For example, the Unicode collation sorts "ß" like "ss", and "Œ" like "OE", as people using those characters would normally want, whereas utf8mb4_general_ci sorts them as single characters (presumably like "s" and "e" respectively).

  • Some Unicode characters are defined as ignorable, which means they shouldn't count toward the sort order and the comparison should move on to the next character instead. utf8mb4_unicode_ci handles these properly.
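
The two differences above can be illustrated with a toy model in Python. This is only a sketch of the ideas (expansions and ignorable characters), not MySQL's implementation: the tables ONE_TO_ONE and EXPANSIONS and the functions simple_key and expanding_key are hypothetical, and treating every Unicode format character (category Cf, such as the soft hyphen) as ignorable is a simplifying assumption.

```python
import unicodedata

# Hypothetical, deliberately tiny tables for illustration only.
ONE_TO_ONE = {"\u00df": "s"}                    # sharp s treated as a single "s"
EXPANSIONS = {"\u00df": "ss", "\u0153": "oe"}   # sharp s -> "ss", oe ligature -> "oe"


def simple_key(text: str) -> tuple:
    """Toy 'general' key: one entry per character, nothing is skipped."""
    return tuple(ONE_TO_ONE.get(ch.lower(), ch.lower()) for ch in text)


def expanding_key(text: str) -> tuple:
    """Toy 'unicode' key: expands some characters and skips ignorable ones."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            continue  # assumed ignorable: move on to the next character
        # A multi-character expansion contributes several entries.
        out.extend(EXPANSIONS.get(ch.lower(), ch.lower()))
    return tuple(out)


if __name__ == "__main__":
    # Under the expanding key, sharp s compares equal to "ss".
    print(expanding_key("Stra\u00dfe") == expanding_key("STRASSE"))
    # Under the simple key, sharp s compares equal to a single "s".
    print(simple_key("Stra\u00dfe") == simple_key("strase"))
    # A soft hyphen is skipped only by the expanding key.
    print(expanding_key("co\u00adop") == expanding_key("coop"))
```

Sorting with sorted(words, key=expanding_key) instead of key=simple_key shows how the same list can come out in a different order depending on which rule set is applied.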

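In non-Latin languages, such as Asian languages or languages with different alphabets, there may be many more differences between Unicode sorting and the simplified utf8mb4_general_ci sorting. The suitability of utf8mb4_general_ci depends heavily on the language used. For some languages it will be quite inadequate.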

What should you use?

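There is almost certainly no reason to use utf8mb4_general_ci anymore, because we have left behind the era when CPUs were slow enough for the performance difference to matter. Your database will almost certainly be limited by other bottlenecks.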

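The difference in performance is only going to be measurable in extremely specialised situations, and if that applies to you, you probably already know about it. If you are experiencing slow sorting, in almost all cases it will be an issue with your indexes or query plan. Changing the collation should not be high on the list of things to troubleshoot.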

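In the past, some people recommended using utf8mb4_general_ci unless accurate sorting was important enough to justify the performance cost. Today that performance cost has all but disappeared, and developers are treating internationalization more seriously.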

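One other thing I would add: even if you know your application only supports English, it may still need to handle people's names, which often contain characters used in other languages, and sorting those correctly is just as important. Using the Unicode rules for everything adds peace of mind, because the very smart Unicode people have worked very hard to make sorting work properly.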



