【292】Working with Chinese Strings in Python

2023-08-05 08:11:10

I. Background

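In Python 2, a plain string (str) is a sequence of bytes, so operations such as len() and slicing work byte by byte. That is fine for English letters, which take one byte each, but a Chinese character takes several bytes in UTF-8, so the same operations return wrong results or raise errors. The methods below show how to handle Chinese strings correctly.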

Before we start: how do you get the system's default encoding?

>>> import sys

>>> print sys.getdefaultencoding()

ascii

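The code below checks which encoding different strings use. For details, see: Using chardet to Detect Character Encodings (用chardet判断字符编码的方法)

The results below show that a string of English letters is detected as ASCII, while any string containing Chinese characters is detected as UTF-8, where each Chinese character occupies several bytes. A Chinese string therefore needs to be decoded to Unicode before character-level operations work correctly.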

>>> import chardet

>>> print chardet.detect("abc")

{'confidence': 1.0, 'language': '', 'encoding': 'ascii'}

>>> print chardet.detect("我是中国人")

{'confidence': 0.9690625, 'language': '', 'encoding': 'utf-8'}

>>> print chardet.detect("abc-我是中国人")

{'confidence': 0.9690625, 'language': '', 'encoding': 'utf-8'}
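The detection results above follow from how UTF-8 works: ASCII characters produce the same single byte under both ASCII and UTF-8, while each of these Chinese characters takes three bytes. Here is a minimal sketch, written to run on both Python 2.7 and Python 3. The variable names are illustrative and are not from the original post.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; variable names are assumptions, not from the original post.
from __future__ import print_function

ascii_text = u"abc"
chinese_text = u"我是中国人"

# Pure-ASCII text produces identical bytes under ASCII and UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Each of these Chinese characters occupies three bytes in UTF-8.
utf8_bytes = chinese_text.encode("utf-8")
bytes_per_char = len(utf8_bytes) // len(chinese_text)

print(same_bytes)
print(len(chinese_text), len(utf8_bytes), bytes_per_char)
```

This is why a detector can only report "ascii" for a string that contains nothing but ASCII bytes, even if the file it came from is UTF-8.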

Decoding a Chinese string with decode('utf-8') makes it behave normally. Whenever you want to apply string functions to Chinese characters, proceed as follows.

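decode converts a byte string in some other encoding into a unicode object. For example, str1.decode('utf-8') converts the UTF-8 encoded string str1 to unicode.
encode converts a unicode object into a byte string in some other encoding. For example, str2.encode('utf-8') converts the unicode string str2 to UTF-8.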

>>> m = "我是中国人"

>>> m

'\xe6\x88\x91\xe6\x98\xaf\xe4\xb8\xad\xe5\x9b\xbd\xe4\xba\xba'

>>> print m

我是中国人

>>> # Before decoding, the length is 15 (UTF-8 bytes)

>>> len(m)

15

>>> n = m.decode('utf-8')

>>> n

u'\u6211\u662f\u4e2d\u56fd\u4eba'

>>> print n

我是中国人

>>> # After decoding, the length is 5 (Unicode characters), and normal operations work

>>> len(n)

5
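The session above can be extended into a round trip. The sketch below reuses the same UTF-8 bytes, decodes them, encodes them back, and shows what goes wrong when the undecoded bytes are sliced directly. It is meant to run on both Python 2.7 and Python 3, and the variable names are illustrative only.

```python
# Illustrative sketch; variable names are assumptions, not from the original post.
from __future__ import print_function

# The UTF-8 bytes of the five-character string shown in the session above.
raw = b"\xe6\x88\x91\xe6\x98\xaf\xe4\xb8\xad\xe5\x9b\xbd\xe4\xba\xba"

text = raw.decode("utf-8")         # bytes -> Unicode text
round_trip = text.encode("utf-8")  # Unicode text -> bytes again

print(len(raw), len(text), round_trip == raw)

# Slicing the decoded text works per character.
first_two = text[:2]

# Slicing the raw bytes cuts the first character in half,
# so the result is no longer valid UTF-8.
broken = raw[:2]
try:
    broken.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False

print(cut_is_valid)
```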

Functions for converting between UTF-8 and Unicode:

def decodeChinese(string):
	"Convert a UTF-8 encoded Chinese string to Unicode"
	return string.decode('utf-8')

def encodeChinese(string):
	"Convert a Unicode string to UTF-8"
	return string.encode('utf-8')
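The two helpers above assume the caller already knows which kind of string it holds. In Python 2, calling decode('utf-8') on a value that is already unicode first tries to encode it with the default ASCII codec, which fails for Chinese text. One possible way to guard against that is to check the type first. The helpers to_text and to_bytes below are hypothetical names introduced for illustration, not part of the original post, and the sketch runs on both Python 2.7 and Python 3.

```python
# Hypothetical helpers for illustration; not from the original post.

def to_text(value, encoding="utf-8"):
    """Return value as Unicode text, decoding only if it is a byte string."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Return value as a byte string, encoding only if it is Unicode text."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


# Both calls return the same Unicode character, whatever the input type.
print(to_text(b"\xe4\xb8\xad") == to_text(u"\u4e2d"))
```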

II. Slicing Mixed Chinese and English Strings

The code is as follows:

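def cutChinese(string, *se):
	"Slice a UTF-8 string by character: se[0] is the start index (default 0); se[1], if given, is the end index, otherwise slice to the end"
	# Decode first so that indices count characters, not bytes
	text = string.decode('utf-8')
	start = se[0] if se else 0
	end = se[1] if len(se) > 1 else len(text)
	# Re-encode so the caller gets UTF-8 back
	return text[start:end].encode('utf-8')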

It is called as follows:

>>> from win_diy import *

>>> print win.cutChinese("我是一个abc", 2)

一个abc

>>> print win.cutChinese("我是一个abc", 2, 4)

一个

>>> print win.cutChinese("我是一个abc", 2, 5)

一个a

>>> print win.cutChinese("我是一个abc", 2, 6)

一个ab
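The same decode, slice, encode idea can also be written with keyword defaults instead of *se. The sketch below is one possible variant, not the author's implementation. The name cut_utf8 is hypothetical, and the code is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical variant for illustration; cut_utf8 is not from the original post.

def cut_utf8(data, start=0, end=None):
    """Slice UTF-8 encoded bytes by character position; return UTF-8 bytes."""
    text = data.decode("utf-8")              # indices now count characters
    return text[start:end].encode("utf-8")   # end=None slices to the end


sample = u"我是一个abc".encode("utf-8")

print(cut_utf8(sample, 2).decode("utf-8"))
print(cut_utf8(sample, 2, 4).decode("utf-8"))
```

Using None as the default end avoids having to compute a length at all, because a slice with an end of None always runs to the last character.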

Reference: Slicing Chinese Strings in Python (python截取中文字符串)

III. Determining a Variable's Encoding

The isinstance or type function tells you the type of a string (str or unicode).
The chardet.detect function tells you the encoding of a byte string.

>>> import chardet

>>> a = "abc"

>>> isinstance(a, str)

True

>>> chardet.detect(a)['encoding']

'ascii'

>>> isinstance(a, unicode)

False

>>> b = "中国"

>>> isinstance(b, str)

True

>>> chardet.detect(b)['encoding']

'utf-8'

>>> isinstance(b, unicode)

False

>>> # Calling chardet.detect on a unicode object raises an error

>>> c = b.decode('utf-8')

>>> isinstance(c, unicode)

True
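The checks above can be combined into a single function that looks at the type first and only then inspects the bytes. The sketch below deliberately swaps chardet for a strict UTF-8 decode attempt so that it has no third-party dependency. A successful decode only shows that the bytes are valid UTF-8, not that UTF-8 was the intended encoding. The name classify and its return labels are hypothetical, and the code is meant to run on both Python 2.7 and Python 3.

```python
# Hypothetical helper for illustration; uses a strict decode instead of chardet.

text_type = type(u"")  # unicode on Python 2, str on Python 3


def classify(value):
    """Return 'text', 'utf-8 bytes', or 'other bytes' for a string value."""
    if isinstance(value, text_type):
        return "text"
    if isinstance(value, bytes):
        try:
            value.decode("utf-8")
        except UnicodeDecodeError:
            return "other bytes"
        return "utf-8 bytes"
    raise TypeError("expected a text or byte string")


print(classify(u"\u4e2d\u56fd"))               # Unicode text
print(classify(b"\xe4\xb8\xad\xe5\x9b\xbd"))   # the same text as UTF-8 bytes
print(classify(b"\xd6\xd0"))                   # bytes that are not valid UTF-8
```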

Reference: Detecting Character Encodings in Python (Python 字符编码判断)

码农公寓
