Using a Proxy in the Linux Terminal

Introduction

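I recently ran a GitHub project that uses Hugging Face's Datasets library. The library downloads the raw dataset files from the network on its own, but it fetches them from each dataset's original source. For the Spider dataset, for example, the source is the Google Drive link published by the original authors. The servers at my school, however, cannot reach external networks, so a proxy is needed for the program to access Google Drive.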

Problem

The short script and error output below illustrate the problem.

from datasets import load_dataset

dataset = load_dataset('spider')

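Running the code above as-is makes the program try to download the Spider dataset from Google Drive automatically. Because of the network restriction, it fails with the following error.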

(slurm) jxqi@main-2:~/Text-to-SQL/tmp$ python test_google.py 
Using the latest cached version of the module from /home/jxqi/.cache/huggingface/modules/datasets_modules/datasets/spider/edbe505fd96c6218feb563fa547869bbc170052a1484d675f9d96d090a9473cf (last modified on Wed Oct 20 15:33:00 2021) since it couldn't be found locally at spider/spider.py or remotely (ConnectionError).
Downloading and preparing dataset spider/spider (download: 95.12 MiB, generated: 5.17 MiB, post-processed: Unknown size, total: 100.29 MiB) to /home/jxqi/.cache/huggingface/datasets/spider/spider/1.0.0/edbe505fd96c6218feb563fa547869bbc170052a1484d675f9d96d090a9473cf...
Traceback (most recent call last):
  File "test_google.py", line 3, in <module>
    dataset = load_dataset('spider')
  File "/home/jxqi/anaconda3/envs/slurm/lib/python3.8/site-packages/datasets/load.py", line 742, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/jxqi/anaconda3/envs/slurm/lib/python3.8/site-packages/datasets/builder.py", line 574, in download_and_prepare
    self._download_and_prepare(
  File "/home/jxqi/anaconda3/envs/slurm/lib/python3.8/site-packages/datasets/builder.py", line 630, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/home/jxqi/.cache/huggingface/modules/datasets_modules/datasets/spider/edbe505fd96c6218feb563fa547869bbc170052a1484d675f9d96d090a9473cf/spider.py", line 78, in _split_generators
    downloaded_filepath = dl_manager.download_and_extract(_URL)
  File "/home/jxqi/anaconda3/envs/slurm/lib/python3.8/site-packages/datasets/utils/download_manager.py", line 287, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/home/jxqi/anaconda3/envs/slurm/lib/python3.8/site-packages/datasets/utils/download_manager.py", line 195, in download
    downloaded_path_or_paths = map_nested(
  File "/home/jxqi/anaconda3/envs/slurm/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 195, in map_nested
    return function(data_struct)
  File "/home/jxqi/anaconda3/envs/slurm/lib/python3.8/site-packages/datasets/utils/download_manager.py", line 218, in _download
    return cached_path(url_or_filename, download_config=download_config)
  File "/home/jxqi/anaconda3/envs/slurm/lib/python3.8/site-packages/datasets/utils/file_utils.py", line 281, in cached_path
    output_path = get_from_cache(
  File "/home/jxqi/anaconda3/envs/slurm/lib/python3.8/site-packages/datasets/utils/file_utils.py", line 623, in get_from_cache
    raise ConnectionError("Couldn't reach {}".format(url))
ConnectionError: Couldn't reach https://drive.google.com/uc?export=download&id=1_AckYkinAnhqmRQtGsQgUKAnTHxxX5J0

As the last line shows, the error is raised because the server cannot reach the Google Drive link.

Solution

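Searching for similar problems led me to the article "Linux 让终端走代理的几种方法" (several ways to route terminal traffic through a proxy on Linux) [1]. Following it, you can edit the shell configuration file .bashrc so that programs started by the current user go through the proxy.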

Specifically, open the .bashrc file and append the following two lines to the end of the file:

export http_proxy="http://proxy_host:port"
export https_proxy="http://proxy_host:port"

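Replace proxy_host with the hostname of your proxy server and port with the proxy port. If the proxy requires authentication, you may also need to add a username and password, that is: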

export http_proxy="http://username:password@proxy_host:port"
export https_proxy="http://username:password@proxy_host:port"
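
If the username or password contains characters such as @, : or /, the proxy URL above becomes ambiguous. The following is a minimal sketch for producing a safe value, assuming the proxy accepts percent-encoded credentials (common, but not something the referenced article confirms). build_proxy_url is a hypothetical helper name, not part of any library.

```python
from urllib.parse import quote


def build_proxy_url(host, port, username=None, password=None, scheme="http"):
    """Return a proxy URL, percent-encoding the credentials if given.

    Hypothetical helper for illustration; host and port are placeholders
    for your own proxy settings.
    """
    if username is None:
        return f"{scheme}://{host}:{port}"
    # safe="" also encodes "/", so nothing in the credentials
    # can be mistaken for URL syntax.
    user = quote(username, safe="")
    if password is None:
        return f"{scheme}://{user}@{host}:{port}"
    return f"{scheme}://{user}:{quote(password, safe='')}@{host}:{port}"


if __name__ == "__main__":
    # Paste the printed value into the export lines in .bashrc.
    print(build_proxy_url("proxy_host", 8080, "username", "p@ss:w/rd"))
```

The printed string can then be used as the value of both http_proxy and https_proxy.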

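After that, reload the configuration in the current shell (newly opened shells pick it up automatically). Use the following command: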

source ~/.bashrc

Once the configuration is reloaded, programs started from that shell can reach external networks through the proxy.
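
Before rerunning the download, it can help to check that Python actually sees the proxy settings. The sketch below uses urllib.request.getproxies() from the standard library, which reads the http_proxy and https_proxy environment variables. detected_proxies is a hypothetical helper name, and whether a given library honors these variables is an assumption to verify for your setup (HTTP clients such as requests generally do by default).

```python
import os
import urllib.request


def detected_proxies():
    """Return the HTTP and HTTPS proxies that urllib detects, or None if unset."""
    proxies = urllib.request.getproxies()
    return {scheme: proxies.get(scheme) for scheme in ("http", "https")}


if __name__ == "__main__":
    # Report only whether each setting exists, so that credentials
    # embedded in the proxy URL are not printed to the terminal.
    for name in ("http_proxy", "https_proxy"):
        print(f"{name} is", "set" if os.environ.get(name) else "not set")
    for scheme, url in detected_proxies().items():
        print(f"{scheme} traffic:", "via proxy" if url else "direct")
```

If either variable is reported as not set, the shell that launched Python has not loaded the updated .bashrc yet.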

References

[1] Linux 让终端走代理的几种方法, https://zhuanlan.zhihu.com/p/46973701
