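Prerequisite: Docker is installed and working.
After installation, add China-based registry mirrors to Docker's JSON configuration file (daemon.json). The Aliyun mirror address is account-specific, so you have to apply for your own.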
"registry-mirrors": [
"https://********.mirror.aliyuncs.com",
"https://registry.docker-cn.com",
"http://hub-mirror.c.163.com",
"https://docker.mirrors.ustc.edu.cn"
]
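The mirror list above has to end up under the top-level "registry-mirrors" key of the JSON config. Below is a minimal Python sketch of merging it into an existing config without dropping other keys. The function name merge_mirrors and the sample config are hypothetical, and the Aliyun address is left out because it is account-specific. Writing the result to disk and restarting Docker is not shown, since the file location differs by platform.

```python
import json

MIRRORS = [
    "https://registry.docker-cn.com",
    "http://hub-mirror.c.163.com",
    "https://docker.mirrors.ustc.edu.cn",
]


def merge_mirrors(config, mirrors):
    """Return a copy of config with mirrors appended to "registry-mirrors".

    Existing entries keep their order; duplicates are skipped.
    """
    merged = dict(config)
    current = list(merged.get("registry-mirrors", []))
    for url in mirrors:
        if url not in current:
            current.append(url)
    merged["registry-mirrors"] = current
    return merged


if __name__ == "__main__":
    # Hypothetical existing config; a real one would be read with json.load().
    existing = {"experimental": False}
    print(json.dumps(merge_mirrors(existing, MIRRORS), indent=2))
```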
Other prerequisites:
Anaconda has been added to PATH.
Installing Scrapy
In JupyterLab, enter:
pip install scrapy
import scrapy
Installing Splash
In a terminal, enter:
docker run -p 8050:8050 scrapinghub/splash
A successful install prints output similar to the following:
Digest: sha256:b4173a88a9d11c424a4df4c8a41ce67ff6a6a3205bd093808966c12e0b06dacf
Status: Downloaded newer image for scrapinghub/splash:latest
2021-02-01 04:53:28+0000 [-] Log opened.
2021-02-01 04:53:29.033164 [-] Xvfb is started: ['Xvfb', ':846388905', '-screen', '0', '1024x768x24', '-nolisten', 'tcp']
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-splash'
2021-02-01 04:53:29.354172 [-] Splash version: 3.5
2021-02-01 04:53:29.420819 [-] Qt 5.14.1, PyQt 5.14.2, WebKit 602.1, Chromium 77.0.3865.129, sip 4.19.22, Twisted 19.7.0, Lua 5.2
2021-02-01 04:53:29.421057 [-] Python 3.6.9 (default, Jul 17 2020, 12:50:27) [GCC 8.4.0]
2021-02-01 04:53:29.421504 [-] Open files limit: 1048576
2021-02-01 04:53:29.421711 [-] Can't bump open files limit
2021-02-01 04:53:29.441758 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2021-02-01 04:53:29.442007 [-] memory cache: enabled, private mode: enabled, js cross-domain access: disabled
2021-02-01 04:53:29.616771 [-] verbosity=1, slots=20, argument_cache_max_entries=500, max-timeout=90.0
2021-02-01 04:53:29.617331 [-] Web UI: enabled, Lua: enabled (sandbox: enabled), Webkit: enabled, Chromium: enabled
2021-02-01 04:53:29.618280 [-] Site starting on 8050
2021-02-01 04:53:29.618402 [-] Starting factory <twisted.web.server.Site object at 0x7f07b402c5c0>
2021-02-01 04:53:29.618800 [-] Server listening on http://0.0.0.0:8050
2021-02-01 04:54:39.943377 [-] "172.17.0.1" - - [01/Feb/2021:04:54:39 +0000] "GET / HTTP/1.1" 200 7675 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"
2021-02-01 04:54:40.007321 [-] "172.17.0.1" - - [01/Feb/2021:04:54:39 +0000] "GET /_ui/style.css HTTP/1.1" 200 2591 "http://localhost:8050/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"
2021-02-01 04:54:40.025381 [-] "172.17.0.1" - - [01/Feb/2021:04:54:39 +0000] "GET /_ui/main.js HTTP/1.1" 200 13055 "http://localhost:8050/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"
2021-02-01 04:54:42.573986 [-] "172.17.0.1" - - [01/Feb/2021:04:54:42 +0000] "GET /_ui/inspections/splash-auto.json HTTP/1.1" 200 177721 "http://localhost:8050/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"
2021-02-01 04:54:42.698853 [-] "172.17.0.1" - - [01/Feb/2021:04:54:42 +0000] "GET /_ui/favicon.ico HTTP/1.1" 200 4286 "http://localhost:8050/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"
2021-02-01 04:55:39.940831 [-] Timing out client: IPv4Address(type='TCP', host='172.17.0.1', port=55762)
2021-02-01 04:55:42.699230 [-] Timing out client: IPv4Address(type='TCP', host='172.17.0.1', port=55764)
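Once the log shows the server listening on port 8050, the instance can also be checked from Python. This is a minimal sketch, assuming Splash is reachable at localhost:8050 as published by the docker run command. The helper names port_is_open and render_url are hypothetical; render.html with its url and wait arguments is Splash's HTTP rendering endpoint.

```python
import socket
from urllib.parse import urlencode


def port_is_open(host="localhost", port=8050, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def render_url(target, wait=0.5, base="http://localhost:8050"):
    """Build a Splash render.html URL that asks Splash to render target."""
    query = urlencode({"url": target, "wait": wait})
    return f"{base}/render.html?{query}"


if __name__ == "__main__":
    print("Splash reachable:", port_is_open())
    print(render_url("https://example.com"))
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should return the rendered HTML of the target page if the container is running.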
Shutting down Splash
Stop the container first, then remove it:
sudo docker ps -a
sudo docker stop CONTAINER_ID
sudo docker rm CONTAINER_ID
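Looking up CONTAINER_ID by hand can be avoided by filtering on the image name. The sketch below mostly just builds the command lines, and nothing runs unless cleanup() is called explicitly. The function names are hypothetical, the image name comes from the docker run command used to start Splash, and sudo is kept as in the manual commands.

```python
import subprocess

IMAGE = "scrapinghub/splash"


def list_cmd(image=IMAGE):
    """Command that prints the IDs of all containers created from image."""
    return ["sudo", "docker", "ps", "-a", "-q", "--filter", f"ancestor={image}"]


def cleanup_cmds(container_id):
    """Stop first, then remove, in the same order as the manual steps."""
    return [
        ["sudo", "docker", "stop", container_id],
        ["sudo", "docker", "rm", container_id],
    ]


def cleanup(image=IMAGE):
    """Stop and remove every container created from image; return their IDs."""
    ids = subprocess.run(
        list_cmd(image), check=True, capture_output=True, text=True
    ).stdout.split()
    for container_id in ids:
        for cmd in cleanup_cmds(container_id):
            subprocess.run(cmd, check=True)
    return ids
```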
Installing Scrapy-Splash
In JupyterLab, enter:
pip install scrapy-splash
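Note: I could not import it here. The module name uses an underscore (scrapy_splash), not the hyphenated package name, so an import statement with the hyphenated name fails.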
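For reference, a Scrapy project refers to this package by its module name scrapy_splash in settings.py. Below is a minimal sketch of the settings usually paired with a Splash instance on localhost:8050. The class paths and priority numbers follow the scrapy-splash README as I recall it, so treat them as assumptions and check them against the installed version.

```python
# settings.py fragment for a Scrapy project (assumed layout, not from the notes above)

# Address of the Splash instance started with docker run -p 8050:8050
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Makes request fingerprints take Splash arguments into account
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```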
Installing Scrapy-Redis
In JupyterLab, enter:
pip install scrapy-redis
import scrapy_redis
Installing Scrapyd and related tools
pip install scrapyd
pip install scrapyd-client
pip install python-scrapyd-api
Installing Scrapyrt (a lightweight alternative to Scrapyd)
pip install scrapyrt
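With everything installed, a quick way to confirm the result is to ask Python whether each module can be located, without actually importing it. This is a minimal sketch; the function name check_installed is hypothetical. The mapping from pip package name to module name is my understanding of these projects (for example, python-scrapyd-api is imported as scrapyd_api), so adjust it if an entry reports missing unexpectedly.

```python
from importlib.util import find_spec

# pip package name -> importable module name
MODULES = {
    "scrapy": "scrapy",
    "scrapy-splash": "scrapy_splash",
    "scrapy-redis": "scrapy_redis",
    "scrapyd": "scrapyd",
    "scrapyd-client": "scrapyd_client",
    "python-scrapyd-api": "scrapyd_api",
    "scrapyrt": "scrapyrt",
}


def check_installed(modules=MODULES):
    """Map each package name to True if its module can be located."""
    return {dist: find_spec(mod) is not None for dist, mod in modules.items()}


if __name__ == "__main__":
    for dist, ok in check_installed().items():
        print(f"{dist}: {'ok' if ok else 'missing'}")
```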