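Tracing the proxy flow through the Requests and urllib source code
When Python (>= 3.5) sends a request with requests, it automatically detects the system proxy. The detection is implemented by urllib.request.getproxies.
Python#26307 fixed a bug in urllib.request.getproxies that produced a wrong value for the https proxy field. The fix landed on 2022-05-12, so the analysis below applies to Python versions released after that date.
The walkthrough below follows the source code, using Python 3.10.9.
Let's start with an example: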
import requests
import cfg

# resp = requests.get(url='https://google.com', proxies=cfg.MY_PROXY_JSON)
resp = requests.get(url='https://google.com')
print(resp)  # <Response [200]>
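The system proxy is already enabled. The request to Google is sent from behind the Great Firewall, and it succeeds even though no proxies argument was passed.
Reading the requests and urllib source
A request sent through requests passes through sessions.py: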
settings = self.merge_environment_settings(
    prep.url, proxies, stream, verify, cert
)
Here is the body of that function:
# requests/sessions.py
def merge_environment_settings(self, url, proxies, stream, verify, cert):
    """
    Check the environment and merge it with some settings.

    :rtype: dict
    """
    # Gather clues from the surrounding environment.
    if self.trust_env:
        # Set environment's proxies.
        no_proxy = proxies.get("no_proxy") if proxies is not None else None
        env_proxies = get_environ_proxies(url, no_proxy=no_proxy)
        for (k, v) in env_proxies.items():
            proxies.setdefault(k, v)

        # Look for requests environment configuration
        # and be compatible with cURL.
        if verify is True or verify is None:
            verify = (
                os.environ.get("REQUESTS_CA_BUNDLE")
                or os.environ.get("CURL_CA_BUNDLE")
                or verify
            )

    # Merge all the kwargs.
    proxies = merge_setting(proxies, self.proxies)
    stream = merge_setting(stream, self.stream)
    verify = merge_setting(verify, self.verify)
    cert = merge_setting(cert, self.cert)

    return {"proxies": proxies, "stream": stream, "verify": verify, "cert": cert}
The get_environ_proxies function is where the system proxy is read:
# requests/utils.py
def get_environ_proxies(url, no_proxy=None):
    """
    Return a dict of environment proxies.

    :rtype: dict
    """
    if should_bypass_proxies(url, no_proxy=no_proxy):  # should this URL skip the proxy?
        return {}
    else:
        return getproxies()  # fetch the system proxies
should_bypass_proxies checks whether the URL should bypass the proxy, that is, whether the URL should not be proxied, and getproxies fetches the system proxy. This leads into the urllib source:
# urllib/request.py
def getproxies():
    """Return a dictionary of scheme -> proxy server URL mappings.

    Returns settings gathered from the environment, if specified,
    or the registry.

    """
    return getproxies_environment() or getproxies_registry()
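The `or` in that return statement matters: an empty dict is falsy, so the registry is consulted only when the environment yields no proxy at all. The sketch below shows the environment half of that logic in isolation. The proxy address and the helper name detect_env_proxies are made up for illustration, and the real environment is hidden so the result depends only on the variables passed in.

```python
import os
import urllib.request
from unittest import mock

# Hypothetical local proxy address, used only for illustration.
FAKE_PROXY = "http://127.0.0.1:7890"


def detect_env_proxies(env):
    """Return what getproxies_environment() sees for the given variables."""
    # clear=True hides the real environment, so only `env` is visible.
    with mock.patch.dict(os.environ, env, clear=True):
        return urllib.request.getproxies_environment()


with_var = detect_env_proxies({"http_proxy": FAKE_PROXY})
without_var = detect_env_proxies({})

print(with_var)     # {'http': 'http://127.0.0.1:7890'}
print(without_var)  # {} -> falsy, so getproxies() would fall through
```

When any *_proxy variable is set, the result is non-empty and the registry lookup on the right-hand side of the `or` is never reached.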
getproxies_environment reads fields such as http_proxy from the environment variables, while getproxies_registry reads the proxy values from the registry:
# urllib/request.py
def getproxies_registry():
    """Return a dictionary of scheme -> proxy server URL mappings.

    Win32 uses the registry to store proxies.

    """
    proxies = {}
    try:
        import winreg
    except ImportError:
        # Std module, so should be around - but you never know!
        return proxies
    try:
        internetSettings = winreg.OpenKey(winreg.HKEY_CURRENT_USER,
            r'Software\Microsoft\Windows\CurrentVersion\Internet Settings')
        proxyEnable = winreg.QueryValueEx(internetSettings,
                                           'ProxyEnable')[0]
        if proxyEnable:
            # Returned as Unicode but problems if not converted to ASCII
            proxyServer = str(winreg.QueryValueEx(internetSettings,
                                                   'ProxyServer')[0])
            if '=' not in proxyServer and ';' not in proxyServer:
                # Use one setting for all protocols.
                proxyServer = 'http={0};https={0};ftp={0}'.format(proxyServer)
            for p in proxyServer.split(';'):
                protocol, address = p.split('=', 1)
                # See if address has a type:// prefix
                if not re.match('(?:[^/:]+)://', address):
                    # Add type:// prefix to address without specifying type
                    if protocol in ('http', 'https', 'ftp'):
                        # The default proxy type of Windows is HTTP
                        address = 'http://' + address
                    elif protocol == 'socks':
                        address = 'socks://' + address
                proxies[protocol] = address
            # Use SOCKS proxy for HTTP(S) protocols
            if proxies.get('socks'):
                # The default SOCKS proxy type of Windows is SOCKS4
                address = re.sub(r'^socks://', 'socks4://', proxies['socks'])
                proxies['http'] = proxies.get('http') or address
                proxies['https'] = proxies.get('https') or address
        internetSettings.Close()
    except (OSError, ValueError, TypeError):
        # Either registry key not found etc, or the value in an
        # unexpected format.
        # proxies already set up to be empty so nothing to do
        pass
    return proxies
On win32 the proxy values are stored in the registry, at:
HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Internet Settings
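The system proxy can be found by reading the values under this key. A ProxyEnable value of 1 means the proxy is enabled, and ProxyServer holds the address of the proxy server.
This is also where the bug mentioned at the start was located. The https proxy value used to be built as https:// plus the proxy server address, which is wrong: the proxy service runs locally and does not need HTTPS. The fixed code above prefixes http:// instead.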
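To try the parsing logic without a Windows registry, the string handling can be lifted into a standalone function. This is only a sketch: parse_proxy_server is a hypothetical name, the ProxyServer values are invented, and the body mirrors the fixed code shown above rather than calling into urllib.

```python
import re


def parse_proxy_server(proxy_server):
    """Parse a ProxyServer registry string the way the fixed code does."""
    proxies = {}
    if "=" not in proxy_server and ";" not in proxy_server:
        # A bare "host:port" applies to every protocol.
        proxy_server = "http={0};https={0};ftp={0}".format(proxy_server)
    for part in proxy_server.split(";"):
        protocol, address = part.split("=", 1)
        # Only add a scheme when the address does not already carry one.
        if not re.match("(?:[^/:]+)://", address):
            if protocol in ("http", "https", "ftp"):
                # The prefix is http:// even for the "https" entry.
                address = "http://" + address
            elif protocol == "socks":
                address = "socks://" + address
        proxies[protocol] = address
    if proxies.get("socks"):
        # A SOCKS entry is reused for HTTP(S) when those are not set.
        address = re.sub(r"^socks://", "socks4://", proxies["socks"])
        proxies["http"] = proxies.get("http") or address
        proxies["https"] = proxies.get("https") or address
    return proxies


print(parse_proxy_server("127.0.0.1:7890"))
# {'http': 'http://127.0.0.1:7890', 'https': 'http://127.0.0.1:7890', 'ftp': 'http://127.0.0.1:7890'}
```

Note that the https key maps to an http:// URL: the key names the kind of traffic being proxied, and the URL says how to reach the proxy.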
Finally, back to the requests source to see how the proxy values are merged:
# requests/sessions.py
def merge_environment_settings(self, url, proxies, stream, verify, cert):
    """
    Check the environment and merge it with some settings.

    :rtype: dict
    """
    # Gather clues from the surrounding environment.
    if self.trust_env:
        # Set environment's proxies.
        no_proxy = proxies.get("no_proxy") if proxies is not None else None
        env_proxies = get_environ_proxies(url, no_proxy=no_proxy)
        for (k, v) in env_proxies.items():
            proxies.setdefault(k, v)

        # Look for requests environment configuration
        # and be compatible with cURL.
        if verify is True or verify is None:
            verify = (
                os.environ.get("REQUESTS_CA_BUNDLE")
                or os.environ.get("CURL_CA_BUNDLE")
                or verify
            )

    # Merge all the kwargs.
    proxies = merge_setting(proxies, self.proxies)
    stream = merge_setting(stream, self.stream)
    verify = merge_setting(verify, self.verify)
    cert = merge_setting(cert, self.cert)

    return {"proxies": proxies, "stream": stream, "verify": verify, "cert": cert}
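With the system proxy values read, they now have to be merged. proxies holds the proxies passed explicitly when sending the request, and proxies.setdefault(k, v) adds the system values just read. The key is setdefault: if proxies already contains a key, it is not overwritten. Explicitly configured proxies therefore take priority over the automatically detected system settings.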
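The effect of setdefault is easy to check with plain dicts. In this small sketch both proxy addresses are invented, and env_proxies stands in for whatever get_environ_proxies would return.

```python
# Hypothetical values: what the caller passed and what the system reports.
explicit_proxies = {"https": "http://10.0.0.1:8080"}
env_proxies = {
    "http": "http://127.0.0.1:7890",
    "https": "http://127.0.0.1:7890",
}

proxies = dict(explicit_proxies)
for scheme, proxy_url in env_proxies.items():
    # Fills in missing schemes only; existing keys are left untouched.
    proxies.setdefault(scheme, proxy_url)

print(proxies)
# {'https': 'http://10.0.0.1:8080', 'http': 'http://127.0.0.1:7890'}
```

The explicit https entry survives, and only the missing http entry is taken from the system settings.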
Next comes a second merge:
def merge_setting(request_setting, session_setting, dict_class=OrderedDict):
    """Determines appropriate setting for a given request, taking into account
    the explicit setting on that request, and the setting in the session. If a
    setting is a dictionary, they will be merged together using `dict_class`
    """

    if session_setting is None:
        return request_setting

    if request_setting is None:
        return session_setting

    # Bypass if not a dictionary (e.g. verify)
    if not (
        isinstance(session_setting, Mapping) and isinstance(request_setting, Mapping)
    ):
        return request_setting

    merged_setting = dict_class(to_key_val_list(session_setting))
    merged_setting.update(to_key_val_list(request_setting))

    # Remove keys that are set to None. Extract keys first to avoid altering
    # the dictionary during iteration.
    none_keys = [k for (k, v) in merged_setting.items() if v is None]
    for key in none_keys:
        del merged_setting[key]

    return merged_setting
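request_setting is the result of the first merge, while session_setting is whatever was configured on the current session. Because the request here is sent directly through requests.get, with no session configured beforehand, session_setting is empty.
merged_setting is first built from session_setting and then updated with request_setting, so settings on the current request take priority over settings on the current session.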
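The same ordering can be reproduced with a stripped-down stand-in for merge_setting. The function below is a simplified sketch for plain dicts only (it skips the Mapping check and to_key_val_list), and the name merge_proxy_settings and the proxy hosts are hypothetical.

```python
def merge_proxy_settings(request_setting, session_setting):
    """Simplified stand-in for merge_setting, for plain dicts only."""
    if session_setting is None:
        return request_setting
    if request_setting is None:
        return session_setting
    # Start from the session values, then let the request override them.
    merged = dict(session_setting)
    merged.update(request_setting)
    # A value of None on the request removes the key entirely.
    return {key: value for key, value in merged.items() if value is not None}


session_proxies = {
    "http": "http://session-proxy:3128",
    "https": "http://session-proxy:3128",
}
request_proxies = {"https": "http://request-proxy:8080", "http": None}

merged = merge_proxy_settings(request_proxies, session_proxies)
print(merged)  # {'https': 'http://request-proxy:8080'}
```

The request-level https value wins over the session value, and setting http to None on the request drops the session's http proxy rather than keeping it.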
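That completes the reading of the proxy configuration.
Afterword
I first started looking into this proxy problem because of issue #110 in a translator project. Some Python versions built the proxy values incorrectly, which caused bugs in projects that use Python, and following that trail gradually led here.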