Preface
This post is short. It just records a problem I ran into by chance.
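Reproducing the problem
Here is the situation. I was parsing a website with XPath. The site serves plain HTML rather than JSON strings, so I had no choice but to parse the DOM. Wherever a regular expression would do the job, I used one; where it would not, I used BeautifulSoup or pyquery. Even so, neither is as general-purpose as XPath, and since I had already written one version, reworking it in the limited time available would have been a real nuisance.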
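To make the regex versus XPath trade-off concrete, here is a minimal sketch. It assumes a tiny, well-formed fragment modeled on the page's header markup, and it uses the standard library's `xml.etree.ElementTree` (which supports only a limited XPath subset) as a stand-in; the XPath library actually used in this post is not named, so treat the calls below as an illustration rather than the original code.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment modeled on the page header markup.
snippet = (
    '<h3 class="_52jd"><span><strong>'
    '<a href="/nba/?__tn__=C-R">NBA</a>'
    '</strong></span></h3>'
)

# XPath-style lookup: describe the element's position in the tree.
root = ET.fromstring(snippet)
link = root.find(".//strong/a")
name = link.text
href = link.get("href")

# Regex lookup: match the raw text, which breaks as soon as the
# attribute order or the nesting changes.
m = re.search(r'<a href="([^"]+)">([^<]+)</a>', snippet)

print(name, href)
print(m.groups())
```

The path expression keeps working if extra attributes are added to the `<a>` tag, whereas the regular expression would need to be rewritten, which is roughly why XPath tends to be the more general tool here.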
Enough talk; let's look at the problem.
First, part of the site's source is as follows:
<article class="_55wo _5rgr _5gh8 _3drq async_like" data-ft='{"mf_story_key":"10159935560038463","top_level_post_id":"10159935560038463","tl_objid":"10159935560038463","content_owner_id_new":"8245623462","throwback_story_xxid":"10159935560038463","page_id":"8245623462","story_location":4,"story_attachment_style":"video_inline","tds_flgs":3,"ott":"AX90AyHPzJSMfPjF","tn":"-R"}' data-sigil="story-div story-popup-metadata story-popup-metadata feed-ufi-metadata" data-store='{"linkdata":"mf_story_key.10159935560038463:top_level_post_id.10159935560038463:tl_objid.10159935560038463:content_owner_id_new.8245623462:throwback_story_xxid.10159935560038463:page_id.8245623462:story_location.4:story_attachment_style.video_inline:tds_flgs.3:ott.AX90AyHPzJSMfPjF","share_id":"10159935560038463","feedback_target":"10159935560038463","feedback_source":0,"action_source":0,"actor_id":100065274592441}' data-xt="2.mf_story_key.10159935560038463:top_level_post_id.10159935560038463:tl_objid.10159935560038463:content_owner_id_new.8245623462:throwback_story_xxid.10159935560038463:page_id.8245623462:story_location.4:story_attachment_style.video_inline:tds_flgs.3:ott.AX90AyHPzJSMfPjF" data-xt-vimp='{"pixel_in_percentage":0,"duration_in_ms":1,"subsequent_gap_in_ms":60000,"log_initial_nonviewable":false,"should_batch":true,"require_horizontally_onscreen":false}' id="u_0_5_iv"> <div class="story_body_container"> <header class="_7om2 _1o88 _77kd _5qc1"> <div class="_5s61 _2pii _5i2i _52wc"> <div class="_5xu4"> <div class="_67lm _77kc" data-gt='{"tn":"~"}' data-sigil="feed_story_ring8245623462"><a data-click='{"event":"click_post_avatar_image","target_id":"10159935560038463"}' data-gt='{"tn":"~"}' href="/nba/?__tn__=%7E%7E-R"><i aria-label="NBA, profile picture" class="img _1-yc profpic" role="img" ></i></a> </div> </div> </div> <div class="_4g34 _5i2i _52we"> <div class="_5xu4"> <div class="_7om2 _52wc"> <div class="_4g34"><h3 class="_52jd _52jb _52jh _5qc3 _4vc- _3rc4 _4vc-" data-gt='{"tn":"C"}'> 
<span><strong><a href="/nba/?__tn__=C-R">NBA</a></strong><span aria-label="Verified Page" class="_56_f _5dzy _5dz- _3twv" id="u_0_e_x2" role="img"></span></span> </h3> <div class="_52jc _5qc4 _78cz _24u0 _36xo" data-sigil="m-feed-voice-subtitle"><a href="/story.php?story_xxid=10159935560038463&id=8245623462&__tn__=-R"><abbr>6 hrs</abbr></a><span aria-hidden="true"> · </span><span><div class="_7jwi"><span data-sigil="audience-icon"><i aria-label="Public" class="feedAudienceIcon img sp_eXcmc5QyINt_2x sx_e966fc" role="img"></i></span><div class="_7jwh"></div></div></span> </div> </div> <div class="_5s61"> <div class="_2pir" id="feed_story_fan_8245623462"></div> </div> <div class="_5s61"></div> <div class="_5s61 _2pis"> <div class="_yff" data-sigil="story-popup-causal-init" data-store='{"feedobjectsIdentifiers":"S:_I8245623462:10159935560038463","feedContext":"{\"use_m_feed\":true,\"m_entstream_source\":\"timeline\",\"is_pages_timeline\":true,\"story_node_id\":\"u_0_5_iv\",\"show_attachments\":true,\"is_attached_story\":false}"}' id="u_0_b_35"><a aria-haspopup="true" class="_4s19 sec" data-sigil="touchable" href="#" role="button"></a><i class="img sp_eXcmc5QyINt_2x sx_b9866d" data-sigil="story-popup-context-init"><u>More options</u></i></div> </div> </div> </div> </div> </header> <div class="_5rgt _5nk5 _5msi" data-ft='{"tn":"*s"}' data-gt='{"tn":"*s"}' style=""> <div><span><p>Watch the BEST DEEP 3'S from the <a href="/LAClippers/?__tn__=%2As-R">L.A. Clippers</a> during the <a class="_5ayv _qdx" href="/hashtag/nbaplayoffs?__tn__=%2As-R"><span class="_5aw4 _qdz">#</span><span class="_5ayu">NBAPlayoffs</span></a>! </p><p> <a class="_5ayv _qdx" href="/hashtag/thatsgame?__tn__=%2As-R"><span class="_5aw4 _qdz">#</span><span class="_5ayu">ThatsGame</span></a> <span class="_5mfr"><span class="_6qdm" style='height: 16px; width: 16px; font-size: 16px; background-image: url("https://static.xx.xxcdn.net/images/emoji.php/v9/tdf/2/16/1f4a5.png")'>