How to detect the encoding of a file or byte stream (without a BOM)

2022-10-11 19:29:07

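Preface:

Yesterday, under the article "At last: the latest CYQ.Data V5 series (ORM data layer, with .NET Core support) is now open source",

I happened to notice a comment:

So I went to look at the address it mentioned, and that look turned into a full day of tinkering.

Today I want to share what came out of it.

The open-source repository at that address has both a C++ and a C# version of the code, and the overall coding style leans toward C++.

Most of my time went into the part that detects BOM-less input, and along the way I brushed up on the basics of the various encodings.

Before reading on, it helps to be familiar with the concepts of text encodings and the BOM.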

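Detecting encodings with a BOM

A file, or a byte stream, is just a pile of binary data:

If the data was written with a BOM, meaning the first two or three bytes are fixed values such as 255, 254, then decoding is simple.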
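For reference, .NET exposes the BOM bytes of each built-in encoding through Encoding.GetPreamble(). The following is a minimal sketch that formats them as hex; the names BomTable and PreambleHex are made up for this illustration and are not part of the article's code.

```csharp
using System;
using System.Text;

public static class BomTable
{
    // Returns the BOM (preamble) bytes that .NET associates with an encoding,
    // formatted as hyphen-separated hex, e.g. "EF-BB-BF".
    public static string PreambleHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }
}
```

PreambleHex(Encoding.UTF8) gives "EF-BB-BF", Encoding.Unicode gives "FF-FE", Encoding.BigEndianUnicode gives "FE-FF", and Encoding.UTF32 gives "FF-FE-00-00". The UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM, so a detector has to look at four bytes before settling on UTF-16 LE.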

For example, the file-reading code inside IOHelper used to be written like this:

        /// <summary>
        /// Auto-detects UTF8 first; otherwise reads with the Default encoding.
        /// </summary>
        /// <returns></returns>
        public static string ReadAllText(string fileName)
        {
            return ReadAllText(fileName, DefaultEncoding);
        }
        public static string ReadAllText(string fileName, Encoding encoding)
        {
            try
            {
                if (!File.Exists(fileName))
                {
                    return string.Empty;
                }
                Byte[] buff = null;
                lock (GetLockObj(fileName.Length))
                {
                    if (!File.Exists(fileName)) // re-check inside the lock for the multi-threaded case
                    {
                        return string.Empty;
                    }
                    buff = File.ReadAllBytes(fileName);
                }
                if (buff.Length == 0) { return ""; }
                if (buff.Length >= 3 && buff[0] == 239 && buff[1] == 187 && buff[2] == 191) // EF BB BF: UTF8
                {
                    return Encoding.UTF8.GetString(buff, 3, buff.Length - 3);
                }
                else if (buff.Length >= 2 && buff[0] == 255 && buff[1] == 254) // FF FE: UTF16 LE or UTF-32 LE
                {
                    // The UTF-32 LE BOM is FF FE 00 00, so it has to be tested before settling on UTF16 LE.
                    if (buff.Length > 3 && buff[2] == 0 && buff[3] == 0)
                    {
                        return Encoding.UTF32.GetString(buff, 4, buff.Length - 4);
                    }
                    return Encoding.Unicode.GetString(buff, 2, buff.Length - 2);
                }
                else if (buff.Length >= 2 && buff[0] == 254 && buff[1] == 255) // FE FF: UTF16 BE
                {
                    return Encoding.BigEndianUnicode.GetString(buff, 2, buff.Length - 2);
                }
                return encoding.GetString(buff);
            }
            catch (Exception err)
            {
                Log.WriteLogToTxt(err);
            }
            return string.Empty;
        }

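Put plainly, the code checks the BOM header, identifies the encoding from it, and decodes with the matching encoding.

Test results:

Chinese text displays correctly in every case.

On Windows, the text editor's Save As dialog offers only ANSI, UTF8, Unicode (UTF16LE), and BigEndianUnicode (UTF16BE).

The three of these that carry a BOM are detected easily; ANSI has no BOM and is read with the default encoding.

But what if the file or the bytes have no BOM header? Decoding with the default encoding then has a real chance of producing garbled text.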
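To make the garbling concrete, here is a small sketch that encodes text as BOM-less UTF8 and then decodes the same bytes with a different encoding. WrongDecoderDemo and DecodeUtf8BytesAs are hypothetical names used only for this illustration. ISO-8859-1 stands in for "the wrong default" because it is built into every .NET runtime, whereas GBK may need an extra encoding provider on .NET Core.

```csharp
using System;
using System.Text;

public static class WrongDecoderDemo
{
    // Encodes text as UTF-8 without a BOM, then decodes those bytes
    // with whatever encoding the caller (wrongly) assumes.
    public static string DecodeUtf8BytesAs(string text, Encoding assumed)
    {
        byte[] bytes = new UTF8Encoding(false).GetBytes(text);
        return assumed.GetString(bytes);
    }
}
```

Calling DecodeUtf8BytesAs("中文", Encoding.GetEncoding("iso-8859-1")) returns six unrelated Latin characters, one per UTF8 byte, while passing Encoding.UTF8 returns the original two characters.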

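Detecting encodings without a BOM

When a byte stream carries no BOM, working out its encoding is genuinely hard.

It requires some familiarity with the rules of each encoding.

First, here is the original source from the GitHub repository the commenter pointed to: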

        public Encoding DetectEncoding(byte[] buffer, int size)
        {
            // First check if we have a BOM and return that if so
            Encoding encoding = CheckBom(buffer, size);
            if (encoding != Encoding.None)
            {
                return encoding;
            }
            // Now check for valid UTF8
            encoding = CheckUtf8(buffer, size);
            if (encoding != Encoding.None)
            {
                return encoding;
            }
            // Now try UTF16
            encoding = CheckUtf16NewlineChars(buffer, size);
            if (encoding != Encoding.None)
            {
                return encoding;
            }
            encoding = CheckUtf16Ascii(buffer, size);
            if (encoding != Encoding.None)
            {
                return encoding;
            }
            // ANSI or None (binary) then
            if (!DoesContainNulls(buffer, size))
            {
                return Encoding.Ansi;
            }
            // Found a null, return based on the preference in null_suggests_binary_
            return _nullSuggestsBinary ? Encoding.None : Encoding.Ansi;
        }

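Translated into plain words, the flow (and intent) of the code is:

1. Check for a BOM header. This one is easy.

2. Check for UTF8 (a rather inventive step): if the bytes fully conform to the UTF8 encoding rules, treat them as UTF8.

3. Look for newline characters in the bytes (the position of the 0 byte within the newline distinguishes UTF16 BE, big-endian, from LE, little-endian).

   Whether this succeeds depends on the length of the byte sample and on whether it contains a newline at all.

4. Measure how often 0 appears at odd and at even positions and compare against a preset expected ratio (essentially useless for Chinese text). The author was probably not a Chinese speaker and only analyzed the statistics of English text.

5. Check whether any 0 byte appears at all; if none does, return the system default encoding (which differs between system environments).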
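As a cross-check for step 2, .NET can validate UTF8 on its own if the decoder is configured to throw on invalid bytes. The sketch below assumes that approach; Utf8Check and IsValidUtf8 are names invented here. Unlike the hand-written check in the article's code, it does not report pure ASCII separately and does not reject 0 bytes.

```csharp
using System;
using System.Text;

public static class Utf8Check
{
    // Returns true when the bytes decode as UTF-8 without any invalid sequence.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // Second argument: throwOnInvalidBytes = true.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (ArgumentException) // DecoderFallbackException derives from ArgumentException
        {
            return false;
        }
    }
}
```

For example, the UTF8 bytes of "中文" pass, while the two bytes D6 D0 (the GBK encoding of "中") fail, because D0 is not a valid continuation byte.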

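First, credit where it is due: the original author clearly had some good ideas.

Apart from the rule-based UTF8 analysis, none of the other checks hold up when applied to Chinese text.

Still, the approach itself offers plenty of inspiration.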
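Why the zero-ratio heuristic breaks down for Chinese can be seen by counting 0 bytes directly. The following sketch (ZeroByteStats and OddZeroRatio are hypothetical names) measures the share of 0 bytes at odd offsets, which in UTF16 LE are the high bytes of each code unit.

```csharp
using System;
using System.Text;

public static class ZeroByteStats
{
    // Fraction of the bytes at odd offsets (1, 3, 5, ...) that are 0.
    public static double OddZeroRatio(byte[] bytes)
    {
        int oddPositions = bytes.Length / 2;
        if (oddPositions == 0) { return 0.0; }
        int zeros = 0;
        for (int i = 1; i < bytes.Length; i += 2)
        {
            if (bytes[i] == 0) { zeros++; }
        }
        return (double)zeros / oddPositions;
    }
}
```

For Encoding.Unicode.GetBytes("Hello") the ratio is 1.0, since every ASCII character has a 0 high byte. For Encoding.Unicode.GetBytes("中文") it is 0.0, because CJK code units have no 0 byte at all, so a threshold tuned for English text never fires.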

So it cost me the better part of a day to rewrite and rework the code and test it against Chinese text.

Reworking the BOM-less detection code:

After the rework, the code flow looks like this:

        public Encoding DetectWithoutBom(byte[] buffer, int size)
        {
            // Now check for valid UTF8
            Encoding encoding = CheckUtf8(buffer, size);
            if (encoding != Encoding.None)
            {
                return encoding;
            }
            // ANSI or None (binary) then: the case where there is not a single 0 byte.
            if (!ContainsZero(buffer, size))
            {
                CheckChinese(buffer, size);
                return Encoding.Ansi;
            }
            // Now try UTF16: decide by looking for newline characters first
            encoding = CheckByNewLineChar(buffer, size);
            if (encoding != Encoding.None)
            {
                return encoding;
            }
            // Nothing else left: make a rough guess from the ratio of 0 bytes
            encoding = CheckByZeroNumPercent(buffer, size);
            if (encoding != Encoding.None)
            {
                return encoding;
            }
            // Contains 0 bytes but matched nothing above: unknown or binary
            return Encoding.None;
        }

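In plain words, the flow is:

1. The UTF8 detection rules are general and effective, so they are kept.

2. Reordered: first check whether the bytes contain any 0 byte; if not, run an added check for Chinese encodings (GB2312, GBK, Big5).

   This comes in handy later.

3. Newline check: added detection of UTF-32 (the original approach only handled UTF16).

4. Ratio-based guess: reworked so that it also works for Chinese text.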
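The newline check in step 3 relies on how a line feed looks in each encoding. A minimal sketch that prints those byte patterns (NewlineBytes and Hex are names made up for this example):

```csharp
using System;
using System.Text;

public static class NewlineBytes
{
    // Encodes a single "\n" with the given encoding (no BOM is emitted by GetBytes)
    // and returns the bytes as hyphen-separated hex.
    public static string Hex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes("\n"));
    }
}
```

Hex(Encoding.Unicode) gives "0A-00", Hex(Encoding.BigEndianUnicode) gives "00-0A", and Hex(Encoding.UTF32) gives "0A-00-00-00". A 0A followed by three 0 bytes therefore points to UTF-32 LE rather than UTF16 LE, which is what the added UTF-32 branch looks for.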

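The test results look like this:

A. Pure Chinese text:

In this test, BigEndianUnicode input comes out garbled.

B. Text that is not purely Chinese:

Every encoding is handled correctly.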
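The pure-Chinese failure is hard to avoid because BOM-less UTF16 BE bytes of CJK text are also perfectly decodable as UTF16 LE; they simply yield different characters. A small sketch under that assumption (EndianAmbiguity and DecodeBeAsLe are hypothetical names):

```csharp
using System;
using System.Text;

public static class EndianAmbiguity
{
    // Encodes text as UTF-16 BE (no BOM) and decodes the same bytes as UTF-16 LE.
    public static string DecodeBeAsLe(string text)
    {
        byte[] bigEndianBytes = Encoding.BigEndianUnicode.GetBytes(text);
        return Encoding.Unicode.GetString(bigEndianBytes);
    }
}
```

DecodeBeAsLe("中文") returns two characters, U+2D4E and U+8765, with no decoding error and no replacement character. Without a newline or an ASCII character to anchor the byte order, the bytes alone give the detector nothing to distinguish the two readings.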

The complete improved source:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace CYQ.Data.Tool
{
    internal static class IOHelper
    {
        internal static Encoding DefaultEncoding = Encoding.Default;
        // Pool of ten lock objects; one is picked by file-name length.
        private static List<object> tenObj = new List<object>();
        private static List<object> TenObj
        {
            get
            {
                if (tenObj.Count == 0)
                {
                    for (int i = 0; i < 10; i++)
                    {
                        tenObj.Add(new object());
                    }
                }
                return tenObj;
            }
        }
        private static object GetLockObj(int length)
        {
            int i = length % 10;
            return TenObj[i];
        }

        /// <summary>
        /// Auto-detects UTF8 first; otherwise reads with the Default encoding.
        /// </summary>
        /// <returns></returns>
        public static string ReadAllText(string fileName)
        {
            return ReadAllText(fileName, DefaultEncoding);
        }
        public static string ReadAllText(string fileName, Encoding encoding)
        {
            try
            {
                if (!File.Exists(fileName))
                {
                    return string.Empty;
                }
                Byte[] buff = null;
                lock (GetLockObj(fileName.Length))
                {
                    if (!File.Exists(fileName)) // re-check inside the lock for the multi-threaded case
                    {
                        return string.Empty;
                    }
                    buff = File.ReadAllBytes(fileName);
                    return BytesToText(buff, encoding);
                }
            }
            catch (Exception err)
            {
                Log.WriteLogToTxt(err);
            }
            return string.Empty;
        }

        public static bool Write(string fileName, string text)
        {
            return Save(fileName, text, false, DefaultEncoding, true);
        }
        public static bool Write(string fileName, string text, Encoding encode)
        {
            return Save(fileName, text, false, encode, true);
        }
        public static bool Append(string fileName, string text)
        {
            return Save(fileName, text, true, true);
        }
        internal static bool Save(string fileName, string text, bool isAppend, bool writeLogOnError)
        {
            // Pass the caller's isAppend through instead of a hard-coded true.
            return Save(fileName, text, isAppend, DefaultEncoding, writeLogOnError);
        }
        internal static bool Save(string fileName, string text, bool isAppend, Encoding encode, bool writeLogOnError)
        {
            try
            {
                string folder = Path.GetDirectoryName(fileName);
                if (!Directory.Exists(folder))
                {
                    Directory.CreateDirectory(folder);
                }
                lock (GetLockObj(fileName.Length))
                {
                    using (StreamWriter writer = new StreamWriter(fileName, isAppend, encode))
                    {
                        writer.Write(text);
                    }
                }
                return true;
            }
            catch (Exception err)
            {
                if (writeLogOnError)
                {
                    Log.WriteLogToTxt(err);
                }
                else
                {
                    Error.Throw("IOHelper.Save() : " + err.Message);
                }
            }
            return false;
        }

        internal static bool Delete(string fileName)
        {
            try
            {
                if (File.Exists(fileName))
                {
                    lock (GetLockObj(fileName.Length))
                    {
                        if (File.Exists(fileName))
                        {
                            File.Delete(fileName);
                            return true;
                        }
                    }
                }
            }
            catch
            {
            }
            return false;
        }
        public static bool IsLastFileWriteTimeChanged(string fileName, ref DateTime compareTimeUtc)
        {
            bool isChanged = false;
            IOInfo info = new IOInfo(fileName);
            if (info.Exists && info.LastWriteTimeUtc != compareTimeUtc)
            {
                isChanged = true;
                compareTimeUtc = info.LastWriteTimeUtc;
            }
            return isChanged;
        }

        public static string BytesToText(byte[] buff, Encoding encoding)
        {
            if (buff.Length == 0) { return ""; }
            TextEncodingDetect detect = new TextEncodingDetect();
            // Check for a BOM first and skip it when decoding.
            switch (detect.DetectWithBom(buff))
            {
                case TextEncodingDetect.Encoding.Utf8Bom:
                    return Encoding.UTF8.GetString(buff, 3, buff.Length - 3);
                case TextEncodingDetect.Encoding.UnicodeBom:
                    return Encoding.Unicode.GetString(buff, 2, buff.Length - 2);
                case TextEncodingDetect.Encoding.BigEndianUnicodeBom:
                    return Encoding.BigEndianUnicode.GetString(buff, 2, buff.Length - 2);
                case TextEncodingDetect.Encoding.Utf32Bom:
                    return Encoding.UTF32.GetString(buff, 4, buff.Length - 4);
            }
            if (encoding != DefaultEncoding && encoding != Encoding.ASCII) // an explicitly chosen encoding takes priority
            {
                return encoding.GetString(buff);
            }
            // Auto-detect on a leading sample of the bytes.
            // The sample size was lost from this listing; 1000 is an assumed value.
            int sampleSize = buff.Length > 1000 ? 1000 : buff.Length;
            switch (detect.DetectWithoutBom(buff, sampleSize))
            {
                case TextEncodingDetect.Encoding.Utf8Nobom:
                    return Encoding.UTF8.GetString(buff);
                case TextEncodingDetect.Encoding.UnicodeNoBom:
                    return Encoding.Unicode.GetString(buff);
                case TextEncodingDetect.Encoding.BigEndianUnicodeNoBom:
                    return Encoding.BigEndianUnicode.GetString(buff);
                case TextEncodingDetect.Encoding.Utf32NoBom:
                    return Encoding.UTF32.GetString(buff);
                case TextEncodingDetect.Encoding.Ansi:
                    if (IsChineseEncoding(DefaultEncoding) && !IsChineseEncoding(encoding))
                    {
                        if (detect.IsChinese)
                        {
                            return Encoding.GetEncoding("gbk").GetString(buff);
                        }
                        else // not Chinese: fall back to a default pick
                        {
                            return Encoding.Unicode.GetString(buff);
                        }
                    }
                    else
                    {
                        return encoding.GetString(buff);
                    }
                case TextEncodingDetect.Encoding.Ascii:
                    return Encoding.ASCII.GetString(buff);
                default:
                    return encoding.GetString(buff);
            }
        }
        private static bool IsChineseEncoding(Encoding encoding)
        {
            return encoding == Encoding.GetEncoding("gb2312") || encoding == Encoding.GetEncoding("gbk") || encoding == Encoding.GetEncoding("big5");
        }
    }

    internal class IOInfo : FileSystemInfo
    {
        public IOInfo(string fileName)
        {
            base.FullPath = fileName;
        }
        public override void Delete()
        {
        }
        public override bool Exists
        {
            get
            {
                return File.Exists(base.FullPath);
            }
        }
        public override string Name
        {
            get
            {
                return null;
            }
        }
    }

    /// <summary>
    /// Text encoding detection for byte buffers
    /// </summary>
    internal class TextEncodingDetect
    {
        private readonly byte[] _UTF8Bom = { 0xEF, 0xBB, 0xBF };
        // utf16le _UnicodeBom
        private readonly byte[] _UTF16LeBom = { 0xFF, 0xFE };
        // utf16be _BigUnicodeBom
        private readonly byte[] _UTF16BeBom = { 0xFE, 0xFF };
        // utf-32le
        private readonly byte[] _UTF32LeBom = { 0xFF, 0xFE, 0x00, 0x00 };
        // utf-32be
        //private readonly byte[] _UTF32BeBom = { 0x00, 0x00, 0xFE, 0xFF };

        /// <summary>
        /// Whether the bytes look like Chinese text
        /// </summary>
        public bool IsChinese = false;
        public enum Encoding
        {
            None, // Unknown or binary
            Ansi, // 0-255
            Ascii, // 0-127
            Utf8Bom, // UTF8 with BOM
            Utf8Nobom, // UTF8 without BOM
            UnicodeBom, // UTF16 LE with BOM
            UnicodeNoBom, // UTF16 LE without BOM
            BigEndianUnicodeBom, // UTF16-BE with BOM
            BigEndianUnicodeNoBom, // UTF16-BE without BOM
            Utf32Bom, // UTF-32LE with BOM
            Utf32NoBom // UTF-32 without BOM
        }

        public Encoding DetectWithBom(byte[] buffer)
        {
            if (buffer != null)
            {
                int size = buffer.Length;
                // Check for BOM
                if (size >= 2 && buffer[0] == _UTF16LeBom[0] && buffer[1] == _UTF16LeBom[1])
                {
                    // The UTF-32 LE BOM (FF FE 00 00) begins with the UTF16 LE BOM, so test for it here.
                    if (size >= 4 && buffer[2] == _UTF32LeBom[2] && buffer[3] == _UTF32LeBom[3])
                    {
                        return Encoding.Utf32Bom;
                    }
                    return Encoding.UnicodeBom;
                }
                if (size >= 2 && buffer[0] == _UTF16BeBom[0] && buffer[1] == _UTF16BeBom[1])
                {
                    return Encoding.BigEndianUnicodeBom;
                }
                if (size >= 3 && buffer[0] == _UTF8Bom[0] && buffer[1] == _UTF8Bom[1] && buffer[2] == _UTF8Bom[2])
                {
                    return Encoding.Utf8Bom;
                }
            }
            return Encoding.None;
        }

        /// <summary>
        ///     Automatically detects the Encoding type of a given byte buffer.
        /// </summary>
        /// <param name="buffer">The byte buffer.</param>
        /// <param name="size">The size of the byte buffer.</param>
        /// <returns>The Encoding type or Encoding.None if unknown.</returns>
        public Encoding DetectWithoutBom(byte[] buffer, int size)
        {
            // Now check for valid UTF8
            Encoding encoding = CheckUtf8(buffer, size);
            if (encoding != Encoding.None)
            {
                return encoding;
            }
            // ANSI or None (binary) then: the case where there is not a single 0 byte.
            if (!ContainsZero(buffer, size))
            {
                CheckChinese(buffer, size);
                return Encoding.Ansi;
            }
            // Now try UTF16: decide by looking for newline characters first
            encoding = CheckByNewLineChar(buffer, size);
            if (encoding != Encoding.None)
            {
                return encoding;
            }
            // Nothing else left: make a rough guess from the ratio of 0 bytes
            encoding = CheckByZeroNumPercent(buffer, size);
            if (encoding != Encoding.None)
            {
                return encoding;
            }
            // Contains 0 bytes but matched nothing above: unknown or binary
            return Encoding.None;
        }

        /// <summary>
        ///     Checks if a buffer contains text that looks like utf16 by scanning for
        ///     newline chars that would be present even in non-english text.
        ///     Decides by looking for newline markers.
        /// </summary>
        /// <param name="buffer">The byte buffer.</param>
        /// <param name="size">The size of the byte buffer.</param>
        /// <returns>Encoding.None, Encoding.UnicodeNoBom, Encoding.BigEndianUnicodeNoBom or Encoding.Utf32NoBom.</returns>
        private static Encoding CheckByNewLineChar(byte[] buffer, int size)
        {
            if (size < 2)
            {
                return Encoding.None;
            }
            // Reduce size by 1 so we don't need to worry about bounds checking for pairs of bytes
            size--;
            int le16 = 0;
            int be16 = 0;
            int le32 = 0; // newlines that look like utf32le
            int zeroCount = 0; // in utf32le the trailing bytes of most 4-byte groups are 0
            uint pos = 0;
            while (pos < size)
            {
                byte ch1 = buffer[pos++];
                byte ch2 = buffer[pos++];
                if (ch1 == 0)
                {
                    if (ch2 == 0x0a || ch2 == 0x0d) // \n or \r: newline check
                    {
                        ++be16;
                    }
                }
                if (ch2 == 0)
                {
                    zeroCount++;
                    if (ch1 == 0x0a || ch1 == 0x0d)
                    {
                        ++le16;
                        // A newline followed by two more 0 bytes suggests a 4-byte utf32le code unit.
                        if (pos + 1 <= size && buffer[pos] == 0 && buffer[pos + 1] == 0)
                        {
                            ++le32;
                        }
                    }
                }
                // If we are getting both LE and BE control chars then this file is not utf16
                if (le16 > 0 && be16 > 0)
                {
                    return Encoding.None;
                }
            }
            if (le16 > 0)
            {
                if (le16 == le32 && buffer.Length % 4 == 0)
                {
                    return Encoding.Utf32NoBom;
                }
                return Encoding.UnicodeNoBom;
            }
            else if (be16 > 0)
            {
                return Encoding.BigEndianUnicodeNoBom;
            }
            else if (buffer.Length % 4 == 0 && zeroCount >= buffer.Length / 4)
            {
                return Encoding.Utf32NoBom;
            }
            return Encoding.None;
        }

        /// <summary>
        /// Checks if a buffer contains any nulls. Used to check for binary vs text data.
        /// </summary>
        /// <param name="buffer">The byte buffer.</param>
        /// <param name="size">The size of the byte buffer.</param>
        private static bool ContainsZero(byte[] buffer, int size)
        {
            uint pos = 0;
            while (pos < size)
            {
                if (buffer[pos++] == 0)
                {
                    return true;
                }
            }
            return false;
        }

        /// <summary>
        ///     Checks if a buffer contains text that looks like utf16. This is done based
        ///     on the use of nulls which in ASCII/script like text can be useful to identify.
        ///     Guesses from the proportion of 0 bytes.
        /// </summary>
        /// <param name="buffer">The byte buffer.</param>
        /// <param name="size">The size of the byte buffer.</param>
        /// <returns>Encoding.None, Encoding.UnicodeNoBom or Encoding.BigEndianUnicodeNoBom.</returns>
        private Encoding CheckByZeroNumPercent(byte[] buffer, int size)
        {
            // 0 bytes at odd offsets
            int oddZeroCount = 0;
            // 0 bytes at even offsets
            int evenZeroCount = 0;
            // Get even nulls
            uint pos = 0;
            while (pos < size)
            {
                if (buffer[pos] == 0)
                {
                    evenZeroCount++;
                }
                pos += 2;
            }
            // Get odd nulls
            pos = 1;
            while (pos < size)
            {
                if (buffer[pos] == 0)
                {
                    oddZeroCount++;
                }
                pos += 2;
            }
            double evenZeroPercent = evenZeroCount * 2.0 / size;
            double oddZeroPercent = oddZeroCount * 2.0 / size;
            // Some odd nulls, low number of even nulls (condition changed from the original)
            if (evenZeroPercent < 0.1 && oddZeroPercent > 0)
            {
                return Encoding.UnicodeNoBom;
            }
            // Some even nulls, low number of odd nulls (condition also changed from the original)
            if (oddZeroPercent < 0.1 && evenZeroPercent > 0)
            {
                return Encoding.BigEndianUnicodeNoBom;
            }
            // Don't know
            return Encoding.None;
        }

        /// <summary>
        ///     Checks if a buffer contains valid utf8.
        ///     Validates the bytes against the UTF8 byte ranges.
        /// </summary>
        /// <param name="buffer">The byte buffer.</param>
        /// <param name="size">The size of the byte buffer.</param>
        /// <returns>
        ///     Encoding type of Encoding.None (invalid UTF8), Encoding.Utf8Nobom (valid utf8 multibyte strings) or
        ///     Encoding.Ascii (data in 0..127 range).
        /// </returns>
        private Encoding CheckUtf8(byte[] buffer, int size)
        {
            // UTF8 Valid sequences
            // 0xxxxxxx  ASCII
            // 110xxxxx 10xxxxxx  2-byte
            // 1110xxxx 10xxxxxx 10xxxxxx  3-byte
            // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  4-byte
            //
            // Width in UTF8
            // Decimal      Width
            // 0-127        1 byte
            // 194-223      2 bytes
            // 224-239      3 bytes
            // 240-244      4 bytes
            //
            // Subsequent chars are in the range 128-191
            bool onlySawAsciiRange = true;
            uint pos = 0;
            while (pos < size)
            {
                byte ch = buffer[pos++];
                if (ch == 0)
                {
                    return Encoding.None;
                }
                int moreChars;
                if (ch <= 127)
                {
                    // 1 byte
                    moreChars = 0;
                }
                else if (ch >= 194 && ch <= 223)
                {
                    // 2 Byte
                    moreChars = 1;
                }
                else if (ch >= 224 && ch <= 239)
                {
                    // 3 Byte
                    moreChars = 2;
                }
                else if (ch >= 240 && ch <= 244)
                {
                    // 4 Byte
                    moreChars = 3;
                }
                else
                {
                    return Encoding.None; // Not utf8
                }
                // Check secondary chars are in range if we are expecting any
                while (moreChars > 0 && pos < size)
                {
                    onlySawAsciiRange = false; // Seen non-ascii chars now
                    ch = buffer[pos++];
                    if (ch < 128 || ch > 191)
                    {
                        return Encoding.None; // Not utf8
                    }
                    --moreChars;
                }
            }
            // If we get to here then only valid UTF-8 sequences have been processed
            // If we only saw chars in the range 0-127 then we can't assume UTF8 (the caller will need to decide)
            return onlySawAsciiRange ? Encoding.Ascii : Encoding.Utf8Nobom;
        }

        /// <summary>
        /// Checks whether the bytes look like a Chinese encoding (GB2312, GBK, Big5)
        /// </summary>
        private void CheckChinese(byte[] buffer, int size)
        {
            IsChinese = false;
            if (size < 2)
            {
                return;
            }
            // Reduce size by 1 so we don't need to worry about bounds checking for pairs of bytes
            size--;
            uint pos = 0;
            bool isCN = false;
            while (pos < size)
            {
                // GB2312
                //   first byte  0xB0-0xF7 (176-247)
                //   second byte 0xA0-0xFE (160-254)
                // GBK
                //   first byte  0x81-0xFE (129-254)
                //   second byte 0x40-0xFE (64-254)
                // Big5
                //   first byte  0x81-0xFE (only the lower bound, 129, is checked below)
                //   second byte 0x40-0x7E (64-126) or 0xA1-0xFE (161-254)
                byte ch1 = buffer[pos++];
                byte ch2 = buffer[pos++];
                isCN = (ch1 >= 176 && ch1 <= 247 && ch2 >= 160 && ch2 <= 254)
                    || (ch1 >= 129 && ch1 <= 254 && ch2 >= 64 && ch2 <= 254)
                    || (ch1 >= 129 && ((ch2 >= 64 && ch2 <= 126) || (ch2 >= 161 && ch2 <= 254)));
                if (!isCN)
                {
                    return;
                }
            }
            IsChinese = true;
        }
    }
}

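Later updates are at: https://github.com/cyq1162/cyqdata/blob/master/Tool/IOHelper.cs

Summary:

1. UTF7 is obsolete, so I ignored it entirely.

2. For pure Chinese text in UTF16, I have not yet thought of a good way to tell BE from LE, so the code defaults to the more common LE, i.e. Unicode.

3. Everything else works fine. As far as I know, this is probably the only publicly available C# version in the country.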
