Detecting a web page's encoding with chardet

Environment: Win7_x64 + Python 3.4.3

chardet has to be downloaded and installed first. Download URL: https://pypi.python.org/packages/source/c/chardet/chardet-2.3.0.tar.gz

Installation: change into the extracted directory and run this in a command window: python setup.py install
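
Before running the script, it can help to confirm that the installation succeeded. The snippet below is a minimal sketch using only the standard library; the helper name is_installed is made up for illustration and is not part of chardet.

```python
import importlib.util


def is_installed(module_name):
    """Return True if module_name can be located by the import system."""
    # find_spec returns None when a top-level module cannot be found
    return importlib.util.find_spec(module_name) is not None


if __name__ == '__main__':
    print('chardet installed:', is_installed('chardet'))
```

If this prints False, the setup.py step most likely ran under a different Python installation than the one used here.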

Here is a small test script in Python (DetectURLCoding.py):

# coding: utf-8
'''python 3.x'''
import sys
import urllib.request

import chardet


def writeFile(fname, data):
    '''Write data (bytes) to the file fname.'''
    with open(fname, "wb") as f:
        f.write(data)


def blog_detect(blogurl):
    '''Detect the encoding of the page at blogurl.'''
    try:
        fp = urllib.request.urlopen(blogurl)
    except Exception as e:
        print(e)
        print('download exception-[%s]' % blogurl)
        return 0
    blog = fp.read()  # in Python 3.x, read() returns the HTML as bytes
    fp.close()
    # writeFile("t.html", blog)
    # get the encoding name
    codedetect = chardet.detect(blog)['encoding']
    print('%s <- %s' % (blogurl, codedetect))
    return 1


if __name__ == '__main__':
    if len(sys.argv) == 1:
        print('''usage:
    python DetectURLCoding.py http://xxx.com''')
    else:
        v = blog_detect(sys.argv[1])
        print(v)

Results (the script was saved as de.py for this run):

D:\profile\Desktop>PYTHON de.py http://hovertree.com/
http://hovertree.com/ <- utf-8
1

D:\profile\Desktop>PYTHON de.py http://photo.cankaoxiaoxi.com/roll10/2015/0318/709734.shtml
http://photo.cankaoxiaoxi.com/roll10/2015/0318/709734.shtml <- utf-8
1
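
The script only prints the encoding name. To turn the downloaded bytes into text, that name can be passed to bytes.decode. One thing to watch for: chardet.detect may report None as the encoding when it cannot make a guess. The helper below is a rough sketch of one way to handle that; the name decode_page and the utf-8 fallback are assumptions for illustration, not something the original script does.

```python
def decode_page(data, detected, fallback='utf-8'):
    """Decode bytes using the detected encoding name, or fallback if none."""
    encoding = detected or fallback  # detected may be None
    try:
        return data.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # unknown codec name or a wrong guess: decode leniently instead
        return data.decode(fallback, errors='replace')


if __name__ == '__main__':
    raw = '编码检测'.encode('gbk')
    print(decode_page(raw, 'gbk'))
    print(decode_page(b'plain ascii', None))
```

In blog_detect this could be called as decode_page(blog, codedetect) right after the chardet.detect line.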
