「Linux」 - Detecting and Converting File Encodings @20210213

2022-01-11 08:59:29

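Problem Description

When we open a file in an editor, the contents sometimes appear garbled, which means the editor did not decode the file with the correct encoding. In that case we need to switch the editor's encoding and reopen the file with the correct one.

But how do we find out the file's correct encoding (other than being told)?

This note records how to determine a file's encoding on Linux and how to convert a file from one encoding to another.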

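Cause

It is not really possible to read a file's encoding directly. The leading bytes of a file may hint at the encoding, but no standard requires them to, and there are exceptions. For example, a file that starts with 0xEF, 0xBB, 0xBF may be a UTF-8 file with a byte order mark, but it may equally be an ISO-8859-1 file that begins with the string ï»¿, or text in some other encoding in which those bytes are valid characters.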
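The byte check described above can be sketched in shell. The function name has_utf8_bom is made up for this note, and the sketch assumes head and od from GNU coreutils. A match is only a hint that the file may be UTF-8, not proof.

```shell
#!/bin/sh
# has_utf8_bom FILE (hypothetical helper): succeed if FILE starts with
# the bytes EF BB BF. A match only hints at UTF-8; it does not prove it.
has_utf8_bom() {
    # Print the first 3 bytes as hex without offsets, then strip spaces.
    first=$(head -c 3 < "$1" | od -An -tx1 | tr -d ' \n')
    [ "$first" = "efbbbf" ]
}
```

It can be called as: has_utf8_bom somefile.txt && echo "starts with EF BB BF"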

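Mainstream editors identify a file's encoding by guessing (even the file command sometimes reports the wrong encoding), which is why editors provide a File Encoding feature for switching encodings.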

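Solution: Detecting the File Encoding

Notes:
1) As described above, tools determine a file's encoding by guessing, so any tool may return the wrong encoding; better tools are simply wrong less often.
2) Once a tool has guessed an encoding, we can try opening the file with that encoding to verify that the guess is correct.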
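Besides opening the file in an editor, a guess can be partly checked from the command line by asking iconv to decode the file with the guessed encoding. The sketch below uses a made-up helper name, decodes_as, and assumes the glibc iconv command. A successful decode is necessary but not sufficient: single-byte encodings such as ISO-8859-1 accept any byte sequence, so the text should still be inspected by eye.

```shell
#!/bin/sh
# decodes_as ENCODING FILE (hypothetical helper): succeed if every byte
# sequence in FILE is valid in ENCODING. The converted text is discarded;
# only the exit status of iconv is used.
decodes_as() {
    iconv -f "$1" -t UTF-8 < "$2" > /dev/null 2>&1
}
```

For example: decodes_as GB2312 page.html && echo "valid as GB2312"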

Option 1: Using the file command

The file command can report a file's encoding:

# file Fontconfig_-_fonts.conf.txt 
Fontconfig_-_fonts.conf.txt: text/x-zim-wiki, UTF-8 Unicode text
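When the result is needed in a script, the file command can print only the encoding through its -b (brief) and --mime-encoding options. The wrapper name guess_encoding below is made up for this note, and the result is still a guess.

```shell
#!/bin/sh
# guess_encoding FILE (hypothetical helper): print only the encoding
# name that file guesses for FILE, for example "utf-8" or "us-ascii".
guess_encoding() {
    file -b --mime-encoding -- "$1"
}
```

For example: enc=$(guess_encoding Fontconfig_-_fonts.conf.txt)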

Option 2: Using the enca command

# apt-get install -y enca

# enca Fontconfig_-_fonts.conf.txt 
enca: Cannot determine (or understand) your language preferences.
Please use `-L language', or `-L none' if your language is not supported
(only a few multibyte encodings can be recognized then).
Run `enca --list languages' to get a list of supported languages.

# enca --list language
belarusian: CP1251 IBM866 ISO-8859-5 KOI8-UNI maccyr IBM855 KOI8-U
 bulgarian: CP1251 ISO-8859-5 IBM855 maccyr ECMA-113
     czech: ISO-8859-2 CP1250 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
  estonian: ISO-8859-4 CP1257 IBM775 ISO-8859-13 macce baltic
  croatian: CP1250 ISO-8859-2 IBM852 macce CORK
 hungarian: ISO-8859-2 CP1250 IBM852 macce CORK
lithuanian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
   latvian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
    polish: ISO-8859-2 CP1250 IBM852 macce ISO-8859-13 ISO-8859-16 baltic CORK
   russian: KOI8-R CP1251 ISO-8859-5 IBM866 maccyr
    slovak: CP1250 ISO-8859-2 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
   slovene: ISO-8859-2 CP1250 IBM852 macce CORK
 ukrainian: CP1251 IBM855 ISO-8859-5 CP1125 KOI8-U maccyr
   chinese: GBK BIG5 HZ
      none:

# enca -L chinese Fontconfig_-_fonts.conf.txt 
Universal transformation format 8 bits; UTF-8

In the following example, the file command fails to guess the file's encoding, while the enca command returns the correct one:

# file './html/gndy/jddy/20201217/60852.html'
./html/gndy/jddy/20201217/60852.html: HTML document, Non-ISO extended-ASCII text, with very long lines, with CRLF line terminators

# enca -L chinese './html/gndy/jddy/20201217/60852.html'
Simplified Chinese National Standard; GB2312
  CRLF line terminators

Solution: Converting the File Encoding

The enconv command (installed together with enca) can convert file encodings, but here we use the iconv command.

iconv is used as follows:

# iconv -c -f gb2312 -t UTF-8//IGNORE --output='outputfile' 'inputfile'
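The command above writes to a separate output file, and -c together with //IGNORE drops characters that cannot be converted. As a variation, the sketch below converts a file in place and leaves out -c, so that a wrong source encoding makes iconv fail instead of silently losing bytes. The function name convert_to_utf8 is made up for this note, and the sketch assumes glibc iconv and mktemp.

```shell
#!/bin/sh
# convert_to_utf8 FROM_ENCODING FILE (hypothetical helper): rewrite FILE
# in place as UTF-8, leaving it untouched if the conversion fails.
convert_to_utf8() {
    tmp=$(mktemp) || return 1
    # Without -c, iconv exits non-zero on invalid input.
    if iconv -f "$1" -t UTF-8 < "$2" > "$tmp"; then
        # Overwrite the contents so the original permissions are kept.
        cat "$tmp" > "$2"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

For example: convert_to_utf8 GB2312 inputfile || echo "conversion failed, file unchanged"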

For more usage, see the man 1 iconv manual page.

References

How to detect the encoding of a file? - Software Engineering Stack Exchange
shell - How to find encoding of a file via script on Linux? - Stack Overflow
text processing - iconv illegal input sequence- why? - Unix & Linux Stack Exchange
utf 8 - wget and encoding. how to force utf-8? - Stack Overflow
