
Detecting Whether a Byte Stream Is UTF8 Encoded

Tags:
Architecture

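A few days ago I came across a forum post asking "how to automatically determine whether a Chinese parameter in a URL is encoded as GB2312 or UTF-8".

I also read wcwtitxu's algorithm, which detects UTF8 encoding with an impressively elaborate regular expression.

However, a regular expression built from that many alternations tends to perform poorly.

I happened to have a similar requirement in a past project, so here are the approach and the cleaned-up source code for reference.

First, the principle:

The UTF8 encoding rules are shown in the table below.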

UTF8 Encoding Rule

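It looks complicated, but it boils down to the following:

ASCII characters (U+0000 - U+007F) are stored unchanged, one byte each, with the high bit set to 0.

All other characters are encoded as follows:

•In binary, the first byte starts with n 1s followed by a single 0 (n >= 2). The bits after the 0 hold part of the actual character code, and n is the total number of bytes in the multi-byte sequence (including the first byte).
•It is followed by n - 1 bytes that each start with 10; the low 6 bits of each byte hold the rest of the character code.
So by analyzing the bit structure of the whole byte stream, we can judge whether it is UTF8 encoded.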
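As a worked example of the rule above, the sketch below encodes a string with the framework's `Encoding.UTF8` and formats each byte in binary, so the `1110…`, `10…`, `10…` prefixes become visible. The class and method names (`Utf8LayoutDemo`, `Utf8Bits`) are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Linq;
using System.Text;

public static class Utf8LayoutDemo
{
    // Hypothetical helper: encodes the text as UTF-8 and returns every byte
    // as an 8-digit binary string, separated by spaces.
    public static string Utf8Bits(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return string.Join(" ",
            bytes.Select(b => Convert.ToString(b, 2).PadLeft(8, '0')));
    }
}
```

For the character U+4E2D ('中'), `Utf8LayoutDemo.Utf8Bits("\u4E2D")` yields `11100100 10111000 10101101`: a first byte with three leading 1s (n = 3) followed by two bytes that start with 10.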

Based on this rule, my C# code is as follows:

/// <summary>
///   Determines whether the given <paramref name="inputStream"/> contains UTF8 encoded bytes.
/// </summary>
/// <param name="inputStream">
///   The input bytes.
/// </param>
/// <returns>
///   <see langword="true"/> if the given bytes are UTF8 encoded; otherwise, <see langword="false"/>.
/// </returns>
/// <remarks>
///   Input that contains only ASCII characters is not regarded as UTF8 encoded.
/// </remarks>
public static bool IsTextUTF8(ref byte[] inputStream)
{
    int encodingBytesCount = 0;
    bool allTextsAreASCIIChars = true;

    for (int i = 0; i < inputStream.Length; i++)
    {
        byte current = inputStream[i];

        if ((current & 0x80) == 0x80)
        {
            allTextsAreASCIIChars = false;
        }

        // First byte
        if (encodingBytesCount == 0)
        {
            if ((current & 0x80) == 0)
            {
                // ASCII chars, from 0x00-0x7F
                continue;
            }

            if ((current & 0xC0) == 0xC0)
            {
                encodingBytesCount = 1;
                current <<= 2;

                // More than two bytes are used to encode one Unicode char.
                // Count the remaining leading 1 bits to get the real length.
                while ((current & 0x80) == 0x80)
                {
                    current <<= 1;
                    encodingBytesCount++;
                }

                // UTF8 as defined by RFC 3629 uses at most 4 bytes per char,
                // so a first byte may announce at most 3 following bytes.
                if (encodingBytesCount > 3)
                {
                    return false;
                }
            }
            else
            {
                // Invalid bit structure for the UTF8 encoding rule.
                return false;
            }
        }
        else
        {
            // Following bytes must start with 10.
            if ((current & 0xC0) == 0x80)
            {
                encodingBytesCount--;
            }
            else
            {
                // Invalid bit structure for the UTF8 encoding rule.
                return false;
            }
        }
    }

    if (encodingBytesCount != 0)
    {
        // Invalid bit structure for the UTF8 encoding rule:
        // the stream ended before all following bytes arrived.
        return false;
    }

    // Although UTF8 can also encode ASCII chars, an input whose contents
    // are all ASCII is treated as being in the default encoding.
    return !allTextsAreASCIIChars;
}
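As a cross-check for the hand-written scanner, one option is to let the framework validate the bytes. The sketch below is a different technique from the method above, not a rewrite of it: it uses a `UTF8Encoding` constructed with `throwOnInvalidBytes: true`, so decoding malformed input throws instead of silently substituting replacement characters. The names `StrictUtf8Check` and `IsWellFormedUtf8` are hypothetical. Note one deliberate difference: pure-ASCII input counts as well-formed here, whereas `IsTextUTF8` returns false for it.

```csharp
using System;
using System.Text;

public static class StrictUtf8Check
{
    // No BOM on encode; throw on invalid byte sequences when decoding.
    private static readonly UTF8Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Hypothetical helper: returns true when the bytes decode as well-formed UTF-8.
    public static bool IsWellFormedUtf8(byte[] bytes)
    {
        if (bytes == null)
        {
            throw new ArgumentNullException(nameof(bytes));
        }

        try
        {
            Strict.GetString(bytes);
            return true;
        }
        catch (ArgumentException)
        {
            // The decoder reports malformed input with DecoderFallbackException,
            // which derives from ArgumentException.
            return false;
        }
    }
}
```

This allocates a string for every call, so it trades speed for strictness; it also rejects overlong forms and unpaired surrogates, which a pure bit-prefix scan does not look for.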

 

 

The unit test code follows:

 


using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using Microsoft.VisualStudio.TestTools.UnitTesting;

/// <summary>
/// This is a test class for EncodingHelperTest and is intended
/// to contain all EncodingHelperTest Unit Tests
/// </summary>
[TestClass()]
public class EncodingHelperTest
{
    /// <summary>
    ///   Normal test for this method.
    /// </summary>
    [TestMethod()]
    public void IsTextUTF8Test()
    {
        for (int i = 0; i < 1000; i++)
        {
            List<char> chars = new List<char>();
            chars.Add('中');

            Random rd = new Random((int)(DateTime.Now.Ticks & 0x7FFFFFFF));

            for (int j = 0; j < 255; j++)
            {
                char ch = (char)rd.Next(0xFFFF);
                UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(ch);
                if (uc == UnicodeCategory.Surrogate ||   // A single surrogate cannot be encoded correctly.
                    uc == UnicodeCategory.PrivateUse ||  // Private use blocks should be excluded.
                    uc == UnicodeCategory.OtherNotAssigned)
                {
                    // Rejected char: retry this slot.
                    j--;
                }
                else
                {
                    chars.Add(ch);
                }
            }

            string str = new string(chars.ToArray());

            byte[] inputStream = Encoding.UTF8.GetBytes(str);
            bool expected = true;

            bool actual;
            actual = EncodingHelper.IsTextUTF8(ref inputStream);
            Assert.AreEqual(expected, actual, string.Format("UTF8_Assert Fails at:{0}", str));

            // Code page 932 is Shift-JIS. On .NET Core and later it is only available
            // after registering CodePagesEncodingProvider.Instance.
            inputStream = Encoding.GetEncoding(932).GetBytes(str);
            expected = false;

            actual = EncodingHelper.IsTextUTF8(ref inputStream);
            Assert.AreEqual(expected, actual, string.Format("ShiftJIS_Assert Fails at:{0}", str));
        }
    }

    /// <summary>
    ///   Check with all ASCII chars.
    /// </summary>
    [TestMethod]
    public void IsTextUTF8Test_AllASCII()
    {
        string str = "ABCDEFGHKLHSJKLDFHJKLHAJKLSHJKLHAJKLSHDJKLAHSDJKLHAJKLSDHJKLASHDJKLHASJKLDHJKLASD";

        byte[] inputStream = Encoding.UTF8.GetBytes(str);
        bool expected = false;
        bool actual;
        actual = EncodingHelper.IsTextUTF8(ref inputStream);
        Assert.AreEqual(expected, actual, string.Format("UTF8_Assert Fails at:{0}", str));
    }
}

 

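Additionally:

If you only need to judge whether a file is UTF8 encoded, this method is not always necessary. Files saved as UTF8 often begin with a three-byte BOM (EF BB BF) that marks the file as UTF8. The BOM is optional, though, so a missing BOM does not rule out UTF8.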
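A minimal sketch of the BOM check, assuming the file's leading bytes have already been read into a buffer. The UTF-8 BOM is the three bytes EF BB BF; the class and method names (`BomCheck`, `HasUtf8Bom`) are made up for illustration.

```csharp
using System;

public static class BomCheck
{
    // Hypothetical helper: returns true when the buffer starts with the
    // UTF-8 byte order mark EF BB BF.
    public static bool HasUtf8Bom(byte[] bytes)
    {
        return bytes != null
            && bytes.Length >= 3
            && bytes[0] == 0xEF
            && bytes[1] == 0xBB
            && bytes[2] == 0xBF;
    }
}
```

A `true` result is strong evidence of UTF-8, but `false` only means the file has no BOM, so a content scan is still needed in that case.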

References:

Wikipedia: http://en.wikipedia.org/wiki/UTF-8
