I would like to understand how R's sort works when it is applied to a character vector.
a = c("aa(150)", "aa(1)S")
sort(a)
# [1] "aa(150)" "aa(1)S"
a = c("aa(150)", "aa(1)")
sort(a)
# [1] "aa(1)" "aa(150)"
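Doesn't R compare the integer values of the characters one by one, from left to right? Why does appending a character change the result?
I would expect the order to be decided by the characters "5" and ")", with everything after them ignored.
Compare with Python: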
In [1]: a=["aa(150)","aa(1)"]
In [2]: sorted(a)
Out[2]: ['aa(1)', 'aa(150)']
In [3]: a=["aa(150)","aa(1)S"]
In [4]: sorted(a)
Out[4]: ['aa(1)S', 'aa(150)']
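The Python result can be checked by hand. The sketch below walks both strings from left to right and reports the first position where they differ, along with the code points involved. The helper name first_difference is made up for this illustration and is not part of any library.

```python
# Python orders str values by Unicode code point, character by character.
def first_difference(x, y):
    """Return (index, char_x, char_y) at the first differing position, or None."""
    for i, (cx, cy) in enumerate(zip(x, y)):
        if cx != cy:
            return i, cx, cy
    return None  # one string is a prefix of the other, or they are equal

i, cx, cy = first_difference("aa(150)", "aa(1)S")
print(i, repr(cx), ord(cx), repr(cy), ord(cy))  # 4 '5' 53 ')' 41
print(sorted(["aa(150)", "aa(1)S"]))            # ['aa(1)S', 'aa(150)']
```

Since ")" (41) has a smaller code point than "5" (53), "aa(1)S" sorts first and the trailing "S" never takes part in the comparison.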
Answer:
In most cases, setting the collation locale to "C" turns off locale-specific sorting:
Sys.setlocale("LC_COLLATE", "C")
a=c("aa(150)","aa(1)S")
sort(a)
#[1] "aa(1)S" "aa(150)"
Because languages differ, string collation rules have to be locale-specific. From ?sort:
The sort order for character vectors will depend on the collating
sequence of the locale in use: see Comparison.
From there we can go to ?Comparison, which says:
Comparison of strings in character vectors is lexicographic within the
strings using the collating sequence of the locale in use: see
locales. The collating sequence of locales such as en_US is normally
different from C (which should use ASCII) and can be surprising.
Beware of making any assumptions about the collation order: e.g. in
Estonian Z comes between S and T, and collation is not necessarily
character-by-character – in Danish aa sorts as a single letter, after
z. In Welsh ng may or may not be a single sorting unit: if it is it
follows g.
As noted above, each language uses its letters differently, so the locale is essential to how strings are sorted.
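One plausible reading of the R output in the question is that the locale in use ignores punctuation on its first comparison pass and only falls back to it to break ties. That is an assumption about the system's collation table, not something the help pages state. The Python sketch below imitates such a rule with a made-up key function, rough_locale_key, and it is far simpler than a real collation.

```python
import string

def rough_locale_key(s):
    # ASSUMPTION: punctuation is ignored on the first pass and case is
    # folded; the untouched string is used only as a tie-breaker.
    # This is an illustration, not R's or the C library's actual algorithm.
    primary = "".join(ch for ch in s if ch not in string.punctuation).lower()
    return (primary, s)

print(sorted(["aa(150)", "aa(1)S"], key=rough_locale_key))  # ['aa(150)', 'aa(1)S']
print(sorted(["aa(150)", "aa(1)"], key=rough_locale_key))   # ['aa(1)', 'aa(150)']
```

Under this rule the comparison is effectively "aa150" against "aa1s", where "5" precedes "s", and "aa1" against "aa150", where the shorter prefix comes first. Both orderings match the R output shown in the question.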