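Problem: when data read from Kafka contains Chinese text, decoding it does not produce the result we want.
Approach: use Python's urllib (urllib.parse.unquote).
Solution: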
import json
import urllib.parse

# The data read from Kafka has the following format
keyword = b'{"@timestamp":"2021-06-22T06:29:26.241Z","@metadata":{"beat":"filebeat","type":"doc","version":"6.7.1","topic":"sdk"},"message":"{\\"timestamp\\":\\"1624343365.918\\",\\"remote_addr\\":\\"58.63.214.220\\",\\"domain\\":\\"www.miaoshou.net\\",\\"url\\":\\"https://www.miaoshou.net/voice/293732.html\\",\\"referrer\\":\\"https://www.so.com/link?m=b2Go08A1NtCZtl2Vwc%2BGH2Nowlffi3fJBmbpIJYK%2B9ajwOnRqFSfJZ5z8yNSfuNqdlYB%2Bud71f7Nq4y47MLl1RPJdQ8h8W2bRo5uBFmAxFYOFimzZ6daToIhDLnqSZjT1ztcKwkTPR5fY3nXJWOc6ePogmFV0uEdnT1B7JnyQJf55lW4P2zwY1nH46fvOh0CS\\",\\"http_user_agent\\":\\"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36\\",\\"u_utrace\\":\\"9129aba83d10e2bf58e4f05718f12e50\\",\\"u_uid\\":\\"0\\",\\"pagetitle\\":\\"\\\\xE5\\\\x92\\\\xB3\\\\xE5\\\\x97\\\\xBD\\\\xE5\\\\x92\\\\xB3\\\\xE5\\\\x87\\\\xBA\\\\xE8\\\\xA1\\\\x80_\\\\xE7\\\\x86\\\\x8A\\\\xE6\\\\xB4\\\\xAA\\\\xE6\\\\xB5\\\\xB7\\\\xE5\\\\x8C\\\\xBB\\\\xE7\\\\x94\\\\x9F\\\\xE7\\\\x9A\\\\x84\\\\xE8\\\\xAF\\\\xAD\\\\xE9\\\\x9F\\\\xB3\\\\xE7\\\\xA7\\\\x91\\\\xE6\\\\x99\\\\xAE_\\\\xE5\\\\xA6\\\\x99\\\\xE6\\\\x89\\\\x8B\\\\xE5\\\\x8C\\\\xBB\\\\xE7\\\\x94\\\\x9F\\",\\"click_id\\":\\"\\",\\"click_url\\":\\"\\",\\"params\\":\\"{\\\\x22id\\\\x22:\\\\x22293732\\\\x22,\\\\x22title\\\\x22:\\\\x22\\\\x5Cu54b3\\\\x5Cu55fd\\\\x5Cu54b3\\\\x5Cu51fa\\\\x5Cu8840\\\\x22,\\\\x22url\\\\x22:\\\\x22http:\\\\x5C/\\\\x5C/www.miaoshou.net\\\\x5C/video\\\\x5C/293732.html\\\\x22,\\\\x22fkname\\\\x22:\\\\x22\\\\x5Cu5185\\\\x5Cu79d1\\\\x22,\\\\x22skname\\\\x22:\\\\x22\\\\x5Cu666e\\\\x5Cu901a\\\\x5Cu5185\\\\x5Cu79d1\\\\x22,\\\\x22type_name\\\\x22:\\\\x22\\\\x5Cu97f3\\\\x5Cu9891\\\\x22,\\\\x22device_source\\\\x22:\\\\x22pc\\\\x22}\\"}"}'
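# Decode the raw bytes and parse the outer JSON envelope
keyword = keyword.decode('utf-8')
keyword = json.loads(keyword)
# "message" is itself a JSON string whose non-ASCII bytes are escaped as \xNN
keyword = keyword.get('message')
# \xNN is not a valid JSON escape, so rewrite it as percent-encoding (%NN) first
keyword = str(keyword).replace('\\x', '%')
keyword = json.loads(keyword)
print(keyword)
print(type(keyword))  # <class 'dict'>
# Percent-decode every value; unquote() treats the bytes as UTF-8 by default
for k, v in dict(keyword).items():
    keyword[k] = urllib.parse.unquote(v)
print(keyword)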
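The steps above can be wrapped in one function. This is only a minimal sketch: the name decode_message and the shortened sample record are hypothetical, and it assumes that every record has the same shape as the sample above. It also parses the "params" field a second time, since after percent-decoding that field is itself a JSON document containing \uXXXX escapes.

```python
import json
from urllib.parse import unquote


def decode_message(raw: bytes) -> dict:
    """Decode a Kafka record whose "message" field uses \\xNN escapes (hypothetical helper)."""
    envelope = json.loads(raw.decode("utf-8"))
    # \xNN is not valid JSON, so rewrite it as percent-encoding before parsing.
    message = json.loads(envelope["message"].replace("\\x", "%"))
    # Percent-decode string values only; other types are passed through unchanged.
    fields = {k: unquote(v) if isinstance(v, str) else v for k, v in message.items()}
    # "params" holds another JSON document (with \uXXXX escapes), so parse it again.
    if fields.get("params"):
        fields["params"] = json.loads(fields["params"])
    return fields


# Hypothetical, shortened record in the same shape as the sample above.
sample = (
    b'{"message":"{\\"pagetitle\\":\\"\\\\xE5\\\\x92\\\\xB3\\\\xE5\\\\x97\\\\xBD\\",'
    b'\\"params\\":\\"{\\\\x22title\\\\x22:\\\\x22\\\\x5Cu54b3\\\\x5Cu55fd\\\\x22}\\"}"}'
)
result = decode_message(sample)
print(result["pagetitle"])        # 咳嗽
print(result["params"]["title"])  # 咳嗽
```

Note that replacing \x with % and then calling unquote() also decodes percent sequences that were already present in the data. In the sample above, the %2B sequences in the referrer URL become +.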
The actual problem boils down to this:
keyword = '%E5%92%B3%E5%97%BD%E5%92%B3%E5%87%BA%E8%A1%80_%E7%86%8A%E6%B4%AA%E6%B5%B7%E5%8C%BB%E7%94%9F%E7%9A%84%E8%AF%AD%E9%9F%B3%E7%A7%91%E6%99%AE_%E5%A6%99%E6%89%8B%E5%8C%BB%E7%94%9F'
How can keyword be printed as Chinese text?
import urllib.parse  # a bare "import urllib" does not reliably load the parse submodule
keyword = urllib.parse.unquote(keyword)
print(keyword)
# 咳嗽咳出血_熊洪海医生的语音科普_妙手医生
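Why this works can be shown in two steps. The following is an illustrative sketch using a shortened string, not part of the original solution: unquote() first turns each %NN into one raw byte, then decodes the resulting bytes as UTF-8, which is its default encoding.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

encoded = "%E5%92%B3%E5%97%BD"

# Step 1: every %NN becomes one raw byte.
raw = unquote_to_bytes(encoded)
print(raw)                       # b'\xe5\x92\xb3\xe5\x97\xbd'

# Step 2: decode the bytes as UTF-8; this matches what unquote() does by default.
text = raw.decode("utf-8")
print(text == unquote(encoded))  # True

# quote() reverses the operation.
print(quote(text) == encoded)    # True
```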