Crawling dynamically rendered pages: an introduction to Splash and how to use it

2022-11-14 19:44:57

Splash is a JavaScript rendering service: a lightweight browser with an HTTP API, built on Python's Twisted and Qt libraries.
With it, we can likewise scrape dynamically rendered pages.

1. Features and a basic example

### Using Splash


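## Features

# 1. Renders multiple pages asynchronously

# 2. Returns the rendered page's source code or a screenshot

# 3. Speeds up rendering by disabling images or applying Adblock rules

# 4. Executes custom JavaScript in the page

# 5. Lets Lua scripts control the rendering process

# 6. Reports the details of the rendering process in HAR (HTTP Archive) format

## Basic example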

function main(splash, args)

  splash:go("http://www.baidu.com")

  splash:wait(0.5)

  local title = splash:evaljs("document.title")

  return {

    title = title

  }

end

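2. Basic use of Lua scripts to crawl pages with Splash

2.1 Asynchronous processing

# ipairs iterates over a sequence and yields (index, value) pairs, with indices starting at 1, much like Python's enumerate

# Lua concatenates strings with ..

# splash:wait() resembles Python's time.sleep()

# While a script is inside wait, Splash switches to other tasks and resumes the script once the time has elapsed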

function main(splash, args)

  local example_urls = {"www.baidu.com", "www.taobao.com", "www.zhihu.com"}

  local urls = args.urls or example_urls

  local results = {}

  for index, url in ipairs(urls) do

    local ok, reason = splash:go("http://" .. url)

    if ok then

      splash:wait(2)

      results[url] = splash:png()

    end

  end

  return results

end
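
The script above falls back to example_urls when args.urls is absent. As a hedged sketch of how a caller might supply that list, the helper below builds the JSON body for a POST to Splash's execute endpoint, where values sent in an application/json body are merged into splash.args. The helper name build_execute_payload and the shortened Lua script are illustrative assumptions; the host in the comment is the one used later in this article.

```python
import json

# Shortened version of the script above; it reads the URL list from args.urls.
LUA_SCRIPT = """
function main(splash, args)
  local results = {}
  for index, url in ipairs(args.urls) do
    local ok, reason = splash:go("http://" .. url)
    if ok then
      splash:wait(2)
      results[url] = splash:png()
    end
  end
  return results
end
"""


def build_execute_payload(lua_source, urls):
    """Return the JSON body for POST /execute (illustrative helper)."""
    # Keys other than lua_source are expected to reach the script as args.<key>.
    return json.dumps({"lua_source": lua_source, "urls": list(urls)})


payload = build_execute_payload(LUA_SCRIPT, ["www.baidu.com", "www.zhihu.com"])

# Sending it would look roughly like this (needs a running Splash instance):
# requests.post("http://10.0.0.100:8050/execute", data=payload,
#               headers={"Content-Type": "application/json"})
```

Posting the script also avoids putting a long Lua source into the query string.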

2.2 Properties of the splash object

2.2.1 args: the request arguments. The second parameter of main, args, is the same table as splash.args, so args.url == splash.args.url

2.2.2 js_enabled: switch for JavaScript execution in the page, true by default

function main(splash, args)

  splash:go("http://www.baidu.com")

  splash.js_enabled = False

  local title = splash:evaljs("document.title")

  return {

    title = title

  }

end

'''

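Result. Lua's boolean literals are lowercase, so False is an undefined global that evaluates to nil; Splash receives None instead of a bool and rejects the assignment. Write splash.js_enabled = false to actually disable JavaScript: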

{

    "error": 400,

    "type": "ScriptError",

    "description": "Error happened while executing Lua script",

    "info": {

        "type": "SPLASH_LUA_ERROR",

        "message": "[string \"function main(splash, args)\r...\"]:3: setAttribute(self, QWebSettings.WebAttribute, bool): argument 2 has unexpected type 'NoneType'",

        "source": "[string \"function main(splash, args)\r...\"]",

        "line_number": 3,

        "error": "setAttribute(self, QWebSettings.WebAttribute, bool): argument 2 has unexpected type 'NoneType'"

    }

}

'''
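
When a script fails, the useful details sit inside the info object of the error body. The sketch below pulls out the error type, the Lua line number and the message. The helper name summarise_splash_error is hypothetical, and SAMPLE_ERROR is a trimmed copy of the output shown above rather than a live response.

```python
import json

# Trimmed copy of the error body shown above.
SAMPLE_ERROR = json.dumps({
    "error": 400,
    "type": "ScriptError",
    "description": "Error happened while executing Lua script",
    "info": {
        "type": "SPLASH_LUA_ERROR",
        "line_number": 3,
        "error": "setAttribute(self, QWebSettings.WebAttribute, bool): "
                 "argument 2 has unexpected type 'NoneType'",
    },
})


def summarise_splash_error(body):
    """Return a one-line summary of a Splash ScriptError body (hypothetical helper)."""
    data = json.loads(body)
    info = data.get("info", {})
    return "{} at Lua line {}: {}".format(
        data.get("type"), info.get("line_number"), info.get("error"))


summary = summarise_splash_error(SAMPLE_ERROR)
print(summary)
```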

2.2.3 resource_timeout: timeout for loading page resources, in seconds

function main(splash, args)

  splash.resource_timeout = 0.01

  assert(splash:go("http://www.baidu.com"))

  return splash:png()

end

'''

{

    "error": 400,

    "type": "ScriptError",

    "description": "Error happened while executing Lua script",

    "info": {

        "source": "[string \"function main(splash, args)\r...\"]",

        "line_number": 3,

        "error": "network5",

        "type": "LUA_ERROR",

        "message": "Lua error: [string \"function main(splash, args)\r...\"]:3: network5"

    }

}

'''

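2.2.4 images_enabled: whether images are loaded, true by default

## Disabling images saves bandwidth and speeds up page loads, but it may affect how JavaScript renders the page

## Splash uses a cache: images fetched on an earlier visit may still be displayed from the cache even after image loading is disabled

## In the example below, the returned png screenshot contains no images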

function main(splash, args)

  splash.images_enabled = false

  assert(splash:go("https://www.bilibili.com/"))

  return {

    png = splash:png()

  }

end

2.2.5 plugins_enabled: whether browser plugins (such as Flash) are enabled, false by default

2.2.6 scroll_position: scrolls the page, with x for the horizontal offset and y for the vertical offset; the example below scrolls to y=800

function main(splash, args)

  assert(splash:go("https://www.mi.com"))

  splash.scroll_position = {y = 800}

  return {png = splash:png()}

end

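2.3 Methods of the splash object

2.3.1 go: requests a URL. It can issue GET or POST requests and accepts headers, form data and so on; named arguments are passed in {}

## go returns two values, ok and reason; if ok is nil the page failed to load and reason holds the cause

# url: the URL to request

# baseurl: optional, empty by default; the base URL used to resolve relative resource paths

# headers: optional, empty by default; request headers

# http_method: optional, GET by default; POST is also supported

# body: optional, empty by default; the POST request body as a string, typically JSON

# formdata: optional, empty by default; form data for a POST, sent with Content-Type application/x-www-form-urlencoded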

function main(splash, args)

  local ok, reason = splash:go{url="http://httpbin.org/post", http_method="POST", body="name=dmr"}

  if ok then

    return splash:html()

  end

end
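
The difference between body and formdata comes down to how the payload is serialised. The sketch below reproduces the two encodings with Python's standard library, only to make the wire formats concrete; it does not call Splash, and the variable names are illustrative.

```python
import json
from urllib.parse import urlencode

fields = {"name": "dmr"}

# What formdata={name="dmr"} would put on the wire (x-www-form-urlencoded).
form_encoded = urlencode(fields)

# What a JSON body looks like; in a Lua script this string would be passed as body=...
json_encoded = json.dumps(fields)

print(form_encoded)
print(json_encoded)
```

Note that the Lua example above passes the form-style string "name=dmr" through body, so the server sees the same bytes that formdata would produce for this field.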

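2.3.2 wait: pauses for a given time; it accepts several arguments, passed by name in {}

## It also returns two values, ok and reason; if ok is nil the wait ended early and reason holds the cause

## ok, reason = splash:wait{time, cancel_on_redirect=false, cancel_on_error=true}

# time: number of seconds to wait

# cancel_on_redirect: optional, false by default; if true, a redirect stops the wait early

# cancel_on_error: optional, true by default according to the Splash documentation; if true, a page load error stops the wait early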

function main(splash, args)

  local ok, reason = splash:go("http://www.mi.com")

  splash:wait(2)

  if ok then

    return splash:html()

  end

end

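2.3.3 JavaScript-related methods

## jsfunc: wraps a JavaScript function so that it can be called from Lua; the JavaScript source is passed as a string, here a Lua long string in double square brackets [[ ]]

## The example below defines a JavaScript function that returns the page title and the number of div elements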

function main(splash, args)

  local get_div_count = splash:jsfunc([[

    function(){

    var title = document.title;

    var body = document.body;

    var divs = body.getElementsByTagName('div');

    var div_count = divs.length;

    return {div_count, title};

      }

    ]])

  ok, reason = splash:go("https://www.mi.com")

  result = get_div_count()

  return ("This page's title is %s, there are %s divs"):format(result.title, result.div_count)

end

## evaljs: executes JavaScript code and returns the value of the last statement

## In the example below, only the result of the last JavaScript statement is returned

function main(splash, args)

  splash:go("https://www.mi.com")

  result = splash:evaljs("document.title;document.body.getElementsByTagName('div').length;")

  return result

end

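## runjs: executes JavaScript code like evaljs, but is meant for performing actions or declaring functions rather than returning values

## The example below declares a JavaScript function with runjs, then calls it with evaljs to get the result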

function main(splash, args)

  splash:go("https://www.mi.com")

  splash:runjs("fofo = function(){return 'dmr'}")

  result = splash:evaljs('fofo()')

  return result

end

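## autoload: registers JavaScript to be loaded automatically on every page visit; Lua's nil corresponds to Python's None

## ok, reason = splash:autoload{source_or_url, source=nil, url=nil}

# source_or_url: JavaScript code or the URL of a JavaScript library

# source: JavaScript code

# url: URL of a JavaScript library

# Example 1: define a get_path_title function and call it with evaljs to get the result. This resembles runjs, except that the code is loaded on each page visit; multi-line source goes in a [[ ]] long string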

function main(splash, args)

  splash:autoload([[

    function get_path_title(){

    return document.title;

  }

    ]])

  splash:go("https://www.mi.com")

  result = splash:evaljs('get_path_title()')

  return result

end

# Example 2: load the jQuery library with autoload

function main(splash, args)

  assert(splash:autoload("https://code.jquery.com/jquery-2.2.4.min.js"))

  assert(splash:go("https://www.baidu.com"))

  local version = splash:evaljs("$.fn.jquery")

  return 'Version is '..version

end

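2.3.4 call_later: schedules a function to run after a delay; the returned timer can cancel the task before it runs via cancel()

# The example below schedules a timer that takes a screenshot 0.2 seconds after it is set, waits 1 more second, then takes another

# At the first screenshot the page has not loaded yet, so it comes back blank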

function main(splash, args)

  local pngs = {}

  local timer = splash:call_later(function()

    pngs['a'] = splash:png()

    splash:wait(1)

    pngs['b'] = splash:png()

    end, 0.2)

  splash:go("https://www.mi.com")

  splash:wait(3)  -- give the timer time to finish before main returns

  return pngs

end

2.3.5 http_get: sends an HTTP GET request

## response = splash:http_get{url, headers=nil, follow_redirects=true}

# url: the URL to request

# headers: optional, empty by default; request headers

# follow_redirects: optional, true by default; whether redirects are followed automatically

function main(splash, args)

  local t = require("treat")

  local response = splash:http_get("https://www.taobao.com")

  if response.status == 200 then

    return {

      b_html = response.body,

      html = t.as_string(response.body),

      url = response.url,

      status = response.status,

    }

  end

end

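2.3.6 http_post: sends an HTTP POST request; like http_get, with an additional body parameter

## response = splash:http_post{url, headers=nil, follow_redirects=true, body=nil}

# url: the URL to request

# headers: optional, empty by default; request headers

# follow_redirects: optional, true by default; whether redirects are followed automatically

# body: optional, empty by default; the request body

# The example below submits JSON data, which httpbin echoes back in the json field of its response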

function main(splash, args)

  local t = require("treat")

  local json = require("json")

  local response = splash:http_post{

    "http://httpbin.org/post",

    body=json.encode({name="dmr"}),

    headers={["content-type"]="application/json"}

  }

  if response.status == 200 then

    return {

      'response',

      b_html = response.body,

      html = t.as_string(response.body),

      url = response.url,

      status = response.status,

    }

  end

end

2.3.7 set_content: sets the content of the page

## In the example below, the rendered page shows the text dmr

function main(splash, args)

  assert(splash:set_content("<h1>dmr</h1>"))

  return splash:png()

end

2.3.8 html: returns the source code of the page

function main(splash, args)

  assert(splash:go("https://www.baidu.com"))

  return splash:html()

end

2.3.9 png and jpeg: return a screenshot of the page in PNG or JPEG format

function main(splash, args)

  assert(splash:go("https://www.baidu.com"))

  return {

    png = splash:png(),

    jpeg = splash:jpeg()

    }

end

2.3.10 har: returns a description of the page loading process (HAR data)

function main(splash, args)

  assert(splash:go("https://www.baidu.com"))

  return {

    har = splash:har()

    }

end

2.3.11 url: returns the URL of the page currently loaded

function main(splash, args)

  assert(splash:go("https://www.baidu.com"))

  return {url = splash:url()}

end

2.3.12 Cookie operations

## get_cookies: returns the cookies of the current page

function main(splash, args)

  assert(splash:go("https://www.baidu.com"))

  return {cookies = splash:get_cookies()}

end

## add_cookie: adds a cookie

## splash:add_cookie{name, value, path=nil, domain=nil, expires=nil, httpOnly=nil, secure=nil}

function main(splash, args)

  splash:add_cookie{'name', 'dmr'}

  assert(splash:go("https://www.baidu.com"))

  return {cookies = splash:get_cookies()}

end

## clear_cookies: clears all cookies

function main(splash, args)

  splash:add_cookie{'name', 'dmr'}

  assert(splash:go("https://www.baidu.com"))

  splash:clear_cookies()

  return {cookies = splash:get_cookies()}

end

2.3.13 Viewport operations

## get_viewport_size: returns the current viewport size, i.e. width and height

function main(splash, args)

  assert(splash:go("https://www.taobao.com"))

  return splash:get_viewport_size()

end

## set_viewport_size: sets the viewport size, i.e. width and height

function main(splash, args)

  splash:set_viewport_size(400, 400)

  assert(splash:go("https://www.taobao.com"))

  return splash:png()

end

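## set_viewport_full: resizes the viewport to fit the whole page; the Splash documentation recommends calling it after the page has loaded

function main(splash, args)

  assert(splash:go("https://www.taobao.com"))

  splash:set_viewport_full()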

  return splash:png()

end

2.3.14 set_user_agent: sets the browser's User-Agent

function main(splash, args)

  splash:set_user_agent('Splash')

  assert(splash:go("https://httpbin.org/get"))

  return splash:html()

end

2.3.15 set_custom_headers: sets the request headers

function main(splash, args)

  splash:set_custom_headers({

      ["User-Agent"] = "Splash",

      ["Host"] = "Splash.org"

    })

  assert(splash:go("https://httpbin.org/get"))

  return splash:html()

end

2.3.16 select: returns the first element that matches a CSS selector

function main(splash, args)

  assert(splash:go("https://www.taobao.com"))

  input = splash:select("#q")

  input:send_text("数码")

  splash:wait(2)

  return splash:jpeg()

end

2.3.17 select_all: returns all elements that match a CSS selector

function main(splash, args)

  local treat = require("treat")

  assert(splash:go("https://movie.douban.com/top250"))

  assert(splash:wait(1))

  local items = splash:select_all(".inq")

  local sum = {}

  for index, item in ipairs(items) do

    sum[index] = item.node.innerHTML

  end

  return {

    obj1 = sum,

    obj2 = treat.as_array(sum)

  }

end
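
The example returns the same table twice to show what treat.as_array changes. The sketch below assumes that, without the marker, the Lua table reaches the client as a JSON object keyed by "1", "2" and so on, while obj2 arrives as a real array. Under that assumption, a client could normalise obj1 as follows; lua_table_to_list and the sample data are hypothetical.

```python
def lua_table_to_list(table):
    """Turn {"1": a, "2": b, ...} into [a, b, ...], ordered by numeric key."""
    return [table[key] for key in sorted(table, key=int)]


# Hand-written stand-in for the obj1 part of the response.
obj1 = {"2": "second quote", "1": "first quote", "10": "tenth quote"}

quotes = lua_table_to_list(obj1)
print(quotes)
```

Sorting with key=int matters here: a plain string sort would place "10" before "2".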

2.3.18 mouse_click: simulates a mouse click

## 1. Pass x and y coordinates to click at that position

## 2. Or select an element and call mouse_click on it

function main(splash, args)

  splash:go("https://www.baidu.com")

  input = splash:select("#kw")

  input:send_text("Python")

  splash:wait(1)

  search = splash:select('#su')

  search:mouse_click()

  splash:wait(2)

  return splash:png()

end

3. Calling the Splash API, using Python as the example

Official API documentation: https://splash.readthedocs.io/en/stable/api.html

3.1 render.html: returns the HTML of the page after JavaScript rendering

## Example URL for Baidu's page source: http://localhost:8050/render.html?url=https://www.baidu.com

# Example: fetch Baidu's rendered source through render.html with a wait of 4 seconds

import requests

url = 'http://10.0.0.100:8050/render.html?url=https://www.baidu.com&wait=4'

response = requests.get(url)

print(response.text)
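
Concatenating the target URL into the query string works for simple addresses, but a target that itself contains & or ? would be misread as extra Splash parameters. A minimal sketch of a safer way to build the request URL follows; the helper name splash_url is an assumption, and the host is the one used in this article.

```python
from urllib.parse import urlencode


def splash_url(endpoint, base="http://10.0.0.100:8050", **params):
    """Build a Splash HTTP API URL with properly escaped query parameters."""
    return "{}/{}?{}".format(base.rstrip("/"), endpoint, urlencode(params))


render_html = splash_url("render.html", url="https://www.baidu.com", wait=4)
print(render_html)
# http://10.0.0.100:8050/render.html?url=https%3A%2F%2Fwww.baidu.com&wait=4
```

The same helper covers the other render endpoints, for example splash_url("render.png", url=..., width=1000, height=700).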

3.2 render.png and render.jpeg: return a screenshot of the page as binary data

## Parameters: url, wait, width, height

## render.jpeg has an extra parameter, quality, which sets the image quality from 1 to 100 (default 75); values above 95 are best avoided

## Example URL: http://localhost:8050/render.png?url=https://www.baidu.com&wait=2&width=400&height=400

# Example: fetch 1000x700 screenshots of the page through render.png and render.jpeg and save them to files

import requests

url = 'http://10.0.0.100:8050/render.png?url=https://www.taobao.com&wait=5&width=1000&height=700'

url2 = 'http://10.0.0.100:8050/render.jpeg?url=https://www.taobao.com&wait=5&width=1000&height=700&quality=90'

response = requests.get(url)

with open('taobao.png', 'wb') as f:

    f.write(response.content)

response2 = requests.get(url2)

with open('taobao.jpeg', 'wb') as f:

    f.write(response2.content)

3.3 render.har: returns the HAR data of the page load, in JSON format

## Example URL: http://localhost:8050/render.har?url=https://www.baidu.com&wait=2

import requests, json

url = 'http://10.0.0.100:8050/render.har?url=https://www.baidu.com&wait=2'

response = requests.get(url)

print(response.content)

with open('har.text', 'w') as f:

    f.write(json.dumps(json.loads(response.content), indent=2))
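
Once the HAR data is parsed, it is an ordinary nested dictionary with the requests listed under log.entries. As a rough illustration, the sketch below counts responses per HTTP status code. The function name status_counts is an assumption, and sample_har is a tiny hand-written fragment standing in for a real render.har response.

```python
from collections import Counter


def status_counts(har):
    """Count responses per HTTP status code in a parsed HAR document."""
    entries = har.get("log", {}).get("entries", [])
    return Counter(entry["response"]["status"] for entry in entries)


# Hand-written HAR fragment, used here instead of a live render.har response.
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "https://www.baidu.com/"},
             "response": {"status": 200}},
            {"request": {"url": "https://www.baidu.com/favicon.ico"},
             "response": {"status": 200}},
            {"request": {"url": "https://www.baidu.com/missing"},
             "response": {"status": 404}},
        ]
    }
}

counts = status_counts(sample_har)
print(dict(counts))
```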

3.4 render.json: combines the features of all the endpoints above and returns JSON

## Example URL: http://localhost:8050/render.json?url=https://www.baidu.com&html=1&png=1&jpeg=1&har=1

## Default result: {'url': 'https://www.baidu.com/', 'requestedUrl': 'https://www.baidu.com/', 'geometry': [0, 0, 1024, 768], 'title': '百度一下，你就知道'}

## Set the html, png, jpeg and har parameters to 1 to include the corresponding data

import requests, json

url = 'http://10.0.0.100:8050/render.json?url=https://www.baidu.com&html=1&png=1&jpeg=1&har=1'

response = requests.get(url)

data = json.loads(response.content)

print(data)

print(data.get('html'))

print(data.get('png'))

print(data.get('jpeg'))

print(data.get('har'))
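
In a render.json result the png and jpeg values are text rather than raw bytes, because Splash base64-encodes binary data so that it fits into JSON. A minimal sketch of turning such a value back into image bytes is shown below; decode_image is a hypothetical helper, and fake_result stands in for a real response by carrying only the PNG file signature.

```python
import base64


def decode_image(render_json, key="png"):
    """Return the raw image bytes stored under key in a render.json result."""
    return base64.b64decode(render_json[key])


# Stand-in for a real response: only the 8-byte PNG signature, base64-encoded.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
fake_result = {"png": base64.b64encode(PNG_SIGNATURE).decode("ascii")}

image_bytes = decode_image(fake_result)

# With a real result, the bytes could be written out like this:
# with open("baidu.png", "wb") as f:
#     f.write(image_bytes)
```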

3.5 execute: the most powerful endpoint; it runs a Lua script, which makes interactive operations possible

## Example URL: http://localhost:8050/execute?lua_source=

# Example 1: a minimal script that returns the result of the Lua code

from urllib.parse import quote

import requests

lua = '''

function main(splash)

  return 'hello'

end

'''

url = 'http://10.0.0.100:8050/execute?lua_source=%s' % quote(lua)

response = requests.get(url)

print(response.text)

# Example 2: run a Lua script through execute to get the page's url, png and html

from urllib.parse import quote

import requests, json

lua = '''

function main(splash)

  splash:go('https://www.baidu.com')

  splash:wait(2)

  return {

    html = splash:html(),

    png = splash:png(),

    url = splash:url()

  }

end

'''

url = 'http://10.0.0.100:8050/execute?lua_source=%s' % quote(lua)

response = requests.get(url)

print(type(response.text), response.text)

dic = json.loads(response.text)

print(type(dic), len(dic), dic.keys())

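4. Basic approach to load balancing Splash with Nginx

1. Set up several Splash servers

2. Install nginx on one of them, or on a separate server

3. Edit nginx.conf so that it looks roughly like this: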

http {

    upstream splash {

        least_conn; # least-connections balancing; omit this line for round-robin, or use ip_hash for IP-hash balancing

        server 10.0.0.100:8050;

        server 10.0.0.99:8050;

        server 10.0.0.98:8050;

        server 10.0.0.97:8050;

    }

    server {

        listen 8050;

        location / {

            proxy_pass http://splash;   # forward to the upstream group defined above

            auth_basic "Restricted";    # enable basic authentication

            auth_basic_user_file /etc/nginx/conf.d/.htpasswd;   # file holding the user names and passwords

        }

    }

}

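4. Without authentication, simply reload nginx: sudo nginx -s reload. With authentication, first create the password file (the htpasswd command is recommended), then reload nginx

5. Test the setup with the following script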

from urllib.parse import quote

import requests, re

lua = '''

function main(splash)

  local treat = require('treat')

  response = splash:http_get('https://www.baidu.com')

  return treat.as_string(response.body)

end

'''

url = 'http://splash/execute?lua_source=%s' % quote(lua)

response = requests.get(url)

ip = re.search(r'(\d+\.\d+\.\d+\.\d+)', response.text).group(1)

print(ip)
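
If the auth_basic lines from the nginx configuration are active, a request without credentials is expected to be rejected. The sketch below shows how the Authorization header for HTTP Basic authentication is formed, using only the standard library. The helper name basic_auth_header is an assumption, and "user" and "pass" are placeholders rather than credentials from this article.

```python
import base64


def basic_auth_header(username, password):
    """Return the Authorization header value for HTTP Basic authentication."""
    token = base64.b64encode("{}:{}".format(username, password).encode("utf-8"))
    return "Basic " + token.decode("ascii")


# Placeholder credentials; use the ones stored in the htpasswd file.
headers = {"Authorization": basic_auth_header("user", "pass")}
print(headers)

# requests can build the same header itself:
# requests.get(url, auth=("user", "pass"))
```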
