Python crawlers -- handling emoji with XPath

Preface

This post is short. It just records a problem I ran into by chance.

 

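Reproducing the problem

Here is what happened. I was parsing a website with XPath. The site serves plain HTML rather than JSON strings, so the only option was to parse the DOM. Wherever a regular expression could do the job I used one, and where it could not I used the beautifulsoup library or pyquery. Even so, neither is as general-purpose as XPath, and since I had already written one version, reworking it in the limited time available would have been a real nuisance.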
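The post does not show its parsing code, so the following is only a minimal sketch of the kind of XPath-based DOM extraction described above. It assumes a simplified, well-formed stand-in fragment (hypothetical sample data; only the `story_body_container` and `_5ayu` class names are borrowed from the real page) and uses the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset, so that it runs without extra dependencies. A real crawler would more likely feed the raw HTML to lxml, which implements full XPath 1.0.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for a post body (hypothetical sample data).
SAMPLE = """
<div class="story_body_container">
  <p>Watch the <a href="/hashtag/nbaplayoffs"><span class="_5ayu">NBAPlayoffs</span></a>
     and <a href="/hashtag/thatsgame"><span class="_5ayu">ThatsGame</span></a></p>
</div>
"""

root = ET.fromstring(SAMPLE)

# ElementTree supports only a subset of XPath: descendant search and
# attribute predicates work, but text() node tests do not.
hashtags = [span.text for span in root.findall(".//span[@class='_5ayu']")]
links = [a.get("href") for a in root.findall(".//p/a")]

# itertext() yields every text node under the element in document order;
# split/join collapses the whitespace left over from indentation.
full_text = " ".join("".join(root.itertext()).split())

print(hashtags)
print(links)
print(full_text)
```

The same path expressions carry over to lxml's `xpath()` method, which is one reason an existing XPath-based version is costly to rewrite with another library.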

Enough talk, let's look at the problem.

First, here is part of the site's source:

 

<article class="_55wo _5rgr _5gh8 _3drq async_like"
         data-ft='{"mf_story_key":"10159935560038463","top_level_post_id":"10159935560038463","tl_objid":"10159935560038463","content_owner_id_new":"8245623462","throwback_story_xxid":"10159935560038463","page_id":"8245623462","story_location":4,"story_attachment_style":"video_inline","tds_flgs":3,"ott":"AX90AyHPzJSMfPjF","tn":"-R"}'
         data-sigil="story-div story-popup-metadata story-popup-metadata feed-ufi-metadata"
         data-store='{"linkdata":"mf_story_key.10159935560038463:top_level_post_id.10159935560038463:tl_objid.10159935560038463:content_owner_id_new.8245623462:throwback_story_xxid.10159935560038463:page_id.8245623462:story_location.4:story_attachment_style.video_inline:tds_flgs.3:ott.AX90AyHPzJSMfPjF","share_id":"10159935560038463","feedback_target":"10159935560038463","feedback_source":0,"action_source":0,"actor_id":100065274592441}'
         data-xt="2.mf_story_key.10159935560038463:top_level_post_id.10159935560038463:tl_objid.10159935560038463:content_owner_id_new.8245623462:throwback_story_xxid.10159935560038463:page_id.8245623462:story_location.4:story_attachment_style.video_inline:tds_flgs.3:ott.AX90AyHPzJSMfPjF"
         data-xt-vimp='{"pixel_in_percentage":0,"duration_in_ms":1,"subsequent_gap_in_ms":60000,"log_initial_nonviewable":false,"should_batch":true,"require_horizontally_onscreen":false}'
         id="u_0_5_iv">
    <div class="story_body_container">
        <header class="_7om2 _1o88 _77kd _5qc1">
            <div class="_5s61 _2pii _5i2i _52wc">
                <div class="_5xu4">
                    <div class="_67lm _77kc" data-gt='{"tn":"~"}' data-sigil="feed_story_ring8245623462"><a
                            data-click='{"event":"click_post_avatar_image","target_id":"10159935560038463"}'
                            data-gt='{"tn":"~"}' href="/nba/?__tn__=%7E%7E-R"><i aria-label="NBA, profile picture"
                                                                                 class="img _1-yc profpic" role="img"
                                                                                 ></i></a>
                    </div>
                </div>
            </div>
            <div class="_4g34 _5i2i _52we">
                <div class="_5xu4">
                    <div class="_7om2 _52wc">
                        <div class="_4g34"><h3 class="_52jd _52jb _52jh _5qc3 _4vc- _3rc4 _4vc-" data-gt='{"tn":"C"}'>
                            <span><strong><a href="/nba/?__tn__=C-R">NBA</a></strong><span aria-label="Verified Page"
                                                                                           class="_56_f _5dzy _5dz- _3twv"
                                                                                           id="u_0_e_x2"
                                                                                           role="img"></span></span>
                        </h3>
                            <div class="_52jc _5qc4 _78cz _24u0 _36xo" data-sigil="m-feed-voice-subtitle"><a
                                    href="/story.php?story_xxid=10159935560038463&id=8245623462&__tn__=-R"><abbr>6
                                hrs</abbr></a><span aria-hidden="true"> · </span><span><div class="_7jwi"><span
                                    data-sigil="audience-icon"><i aria-label="Public"
                                                                  class="feedAudienceIcon img sp_eXcmc5QyINt_2x sx_e966fc"
                                                                  role="img"></i></span><div class="_7jwh"></div></div></span>
                            </div>
                        </div>
                        <div class="_5s61">
                            <div class="_2pir" id="feed_story_fan_8245623462"></div>
                        </div>
                        <div class="_5s61"></div>
                        <div class="_5s61 _2pis">
                            <div class="_yff" data-sigil="story-popup-causal-init"
                                 data-store='{"feedobjectsIdentifiers":"S:_I8245623462:10159935560038463","feedContext":"{\"use_m_feed\":true,\"m_entstream_source\":\"timeline\",\"is_pages_timeline\":true,\"story_node_id\":\"u_0_5_iv\",\"show_attachments\":true,\"is_attached_story\":false}"}'
                                 id="u_0_b_35"><a aria-haspopup="true" class="_4s19 sec" data-sigil="touchable" href="#"
                                                  role="button"></a><i class="img sp_eXcmc5QyINt_2x sx_b9866d"
                                                                       data-sigil="story-popup-context-init"><u>More
                                options</u></i></div>
                        </div>
                    </div>
                </div>
            </div>
        </header>
        <div class="_5rgt _5nk5 _5msi" data-ft='{"tn":"*s"}' data-gt='{"tn":"*s"}' style="">
            <div><span><p>Watch the BEST DEEP 3'S from the <a href="/LAClippers/?__tn__=%2As-R">L.A. Clippers</a> during the <a
                    class="_5ayv _qdx" href="/hashtag/nbaplayoffs?__tn__=%2As-R"><span class="_5aw4 _qdz">#</span><span
                    class="_5ayu">NBAPlayoffs</span></a>! </p><p> <a class="_5ayv _qdx"
                                                                     href="/hashtag/thatsgame?__tn__=%2As-R"><span
                    class="_5aw4 _qdz">#</span><span class="_5ayu">ThatsGame</span></a> <span class="_5mfr"><span
                    class="_6qdm"
                    style='height: 16px; width: 16px; font-size: 16px; background-image: url("https://static.xx.xxcdn.net/images/emoji.php/v9/tdf/2/16/1f4a5.png")'>