Analyzing the Proxy Process from the Requests and urllib Source Code


When sending a request with the requests library, Python (>=3.5) automatically detects the system proxy, implemented via urllib.request.getproxies.

In addition, Python#26307 fixed a bug in the urllib.request.getproxies function. The fix landed on 2022-05-12 and corrected the erroneous https proxy field, so the analysis below is based on Python versions released after that date.

The following walks through the proxy process at the source level; the Python version is 3.10.9.

Let's start with an example:

import requests
import cfg
# resp = requests.get(url='https://google.com', proxies=cfg.MY_PROXY_JSON)
resp = requests.get(url='https://google.com')
print(resp) # <Response [200]>

The system proxy is already enabled; the request to Google from behind the firewall succeeds even though no proxies argument was configured.

Analyzing the requests and urllib source code

A request sent through requests passes through sessions.py:

settings = self.merge_environment_settings(
    prep.url, proxies, stream, verify, cert
)

Reading the body of this function:

# requests/sessions.py
def merge_environment_settings(self, url, proxies, stream, verify, cert):
    """
    Check the environment and merge it with some settings.

    :rtype: dict
    """
    # Gather clues from the surrounding environment.
    if self.trust_env:
        # Set environment's proxies.
        no_proxy = proxies.get("no_proxy") if proxies is not None else None
        env_proxies = get_environ_proxies(url, no_proxy=no_proxy)
        for (k, v) in env_proxies.items():
            proxies.setdefault(k, v)

        # Look for requests environment configuration
        # and be compatible with cURL.
        if verify is True or verify is None:
            verify = (
                os.environ.get("REQUESTS_CA_BUNDLE")
                or os.environ.get("CURL_CA_BUNDLE")
                or verify
            )

    # Merge all the kwargs.
    proxies = merge_setting(proxies, self.proxies)
    stream = merge_setting(stream, self.stream)
    verify = merge_setting(verify, self.verify)
    cert = merge_setting(cert, self.cert)

    return {"proxies": proxies, "stream": stream, "verify": verify, "cert": cert}

The get_environ_proxies function is where the system proxy is read:

# requests/utils.py
def get_environ_proxies(url, no_proxy=None):
    """
    Return a dict of environment proxies.

    :rtype: dict
    """
    if should_bypass_proxies(url, no_proxy=no_proxy):  # check whether this url should bypass the proxy
        return {}
    else:
        return getproxies()  # read the system proxies

should_bypass_proxies checks whether the url should be bypassed, i.e. not proxied at all, while getproxies is what actually reads the system proxy. Moving into the urllib source:
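Before the proxy dict is consulted at all, the bypass check decides whether to skip it. The stdlib side of this no_proxy matching can be tried with urllib.request.proxy_bypass_environment, passing the list explicitly so the real environment does not interfere (the hostnames below are made up for illustration):

```python
import urllib.request

# An explicit no-proxy list; a leading dot matches whole subdomains.
no = {"no": "localhost,127.0.0.1,.internal.example.com"}

print(urllib.request.proxy_bypass_environment("localhost", no))                  # bypass
print(urllib.request.proxy_bypass_environment("api.internal.example.com", no))  # bypass
print(urllib.request.proxy_bypass_environment("google.com", no))                 # no bypass
```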

# urllib/request.py
def getproxies():
    """Return a dictionary of scheme -> proxy server URL mappings.

    Returns settings gathered from the environment, if specified,
    or the registry.

    """
    return getproxies_environment() or getproxies_registry()

getproxies_environment reads fields such as http_proxy from the environment variables, while getproxies_registry reads the proxy values from the registry:
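The environment-variable branch is easy to observe directly. A minimal sketch, setting the variables in-process instead of in the shell (the address is made up):

```python
import os
import urllib.request

# Simulate shell-exported proxy variables; the address is made up.
os.environ["http_proxy"] = "http://127.0.0.1:7890"
os.environ["https_proxy"] = "http://127.0.0.1:7890"
# getproxies_environment drops http_proxy when REQUEST_METHOD is set
# (a CGI safeguard), so make sure it is absent here.
os.environ.pop("REQUEST_METHOD", None)

proxies = urllib.request.getproxies_environment()
print(proxies["http"], proxies["https"])
```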

# urllib/request.py
def getproxies_registry():
    """Return a dictionary of scheme -> proxy server URL mappings.

    Win32 uses the registry to store proxies.

    """
    proxies = {}
    try:
        import winreg
    except ImportError:
        # Std module, so should be around - but you never know!
        return proxies
    try:
        internetSettings = winreg.OpenKey(winreg.HKEY_CURRENT_USER,
            r'Software\Microsoft\Windows\CurrentVersion\Internet Settings')
        proxyEnable = winreg.QueryValueEx(internetSettings,
                                            'ProxyEnable')[0]
        if proxyEnable:
            # Returned as Unicode but problems if not converted to ASCII
            proxyServer = str(winreg.QueryValueEx(internetSettings,
                                                    'ProxyServer')[0])
            if '=' not in proxyServer and ';' not in proxyServer:
                # Use one setting for all protocols.
                proxyServer = 'http={0};https={0};ftp={0}'.format(proxyServer)
            for p in proxyServer.split(';'):
                protocol, address = p.split('=', 1)
                # See if address has a type:// prefix
                if not re.match('(?:[^/:]+)://', address):
                    # Add type:// prefix to address without specifying type
                    if protocol in ('http', 'https', 'ftp'):
                        # The default proxy type of Windows is HTTP
                        address = 'http://' + address
                    elif protocol == 'socks':
                        address = 'socks://' + address
                proxies[protocol] = address
            # Use SOCKS proxy for HTTP(S) protocols
            if proxies.get('socks'):
                # The default SOCKS proxy type of Windows is SOCKS4
                address = re.sub(r'^socks://', 'socks4://', proxies['socks'])
                proxies['http'] = proxies.get('http') or address
                proxies['https'] = proxies.get('https') or address
        internetSettings.Close()
    except (OSError, ValueError, TypeError):
        # Either registry key not found etc, or the value in an
        # unexpected format.
        # proxies already set up to be empty so nothing to do
        pass
    return proxies

Win32 stores the proxy values in the registry, at:

HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Internet Settings

Reading the values stored there reveals the system proxy.

A ProxyEnable value of 1 means the proxy is enabled; reading ProxyServer then yields the proxy server's address.

This code is also where the bug mentioned at the beginning was fixed. Previously, the https proxy variable was built as https:// plus the proxy server address, which is wrong: the proxy service runs locally and does not require https.
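The fixed parsing can be replayed in isolation. A minimal sketch that extracts the ProxyServer-parsing lines from getproxies_registry above (the address is made up, and the SOCKS-fallback step is omitted):

```python
import re

def parse_proxy_server(proxy_server):
    """Replays getproxies_registry's parsing of a ProxyServer value."""
    result = {}
    if '=' not in proxy_server and ';' not in proxy_server:
        # A single address applies to all protocols.
        proxy_server = 'http={0};https={0};ftp={0}'.format(proxy_server)
    for p in proxy_server.split(';'):
        protocol, address = p.split('=', 1)
        if not re.match('(?:[^/:]+)://', address):
            # The default proxy type of Windows is HTTP, even for the
            # https scheme: this is exactly what the bug fix corrected.
            if protocol in ('http', 'https', 'ftp'):
                address = 'http://' + address
            elif protocol == 'socks':
                address = 'socks://' + address
        result[protocol] = address
    return result

parsed = parse_proxy_server('127.0.0.1:7890')
print(parsed)
# {'http': 'http://127.0.0.1:7890', 'https': 'http://127.0.0.1:7890', 'ftp': 'http://127.0.0.1:7890'}
```

Note that the https entry gets an http:// prefix: the local proxy server speaks plain HTTP.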

Finally, back in the requests source, let's see how the proxy variables are merged:

# requests/sessions.py
def merge_environment_settings(self, url, proxies, stream, verify, cert):
    """
    Check the environment and merge it with some settings.

    :rtype: dict
    """
    # Gather clues from the surrounding environment.
    if self.trust_env:
        # Set environment's proxies.
        no_proxy = proxies.get("no_proxy") if proxies is not None else None
        env_proxies = get_environ_proxies(url, no_proxy=no_proxy)
        for (k, v) in env_proxies.items():
            proxies.setdefault(k, v)

        # Look for requests environment configuration
        # and be compatible with cURL.
        if verify is True or verify is None:
            verify = (
                os.environ.get("REQUESTS_CA_BUNDLE")
                or os.environ.get("CURL_CA_BUNDLE")
                or verify
            )

    # Merge all the kwargs.
    proxies = merge_setting(proxies, self.proxies)
    stream = merge_setting(stream, self.stream)
    verify = merge_setting(verify, self.verify)
    cert = merge_setting(cert, self.cert)

    return {"proxies": proxies, "stream": stream, "verify": verify, "cert": cert}

Once the system proxy variables have been read, they need to be merged. Here proxies is the proxies dict you pass in yourself when sending the request, and proxies.setdefault(k, v) fills in the system values just read. The key is the setdefault method: if the key already exists in proxies, it is not overwritten! So the proxies you configure yourself take precedence over the automatically detected system configuration.
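The setdefault behavior can be shown in a few lines. A minimal sketch with made-up addresses:

```python
# setdefault only fills keys that are missing, so explicitly passed
# proxies win over the ones detected from the system (addresses made up).
proxies = {"https": "http://127.0.0.1:1080"}        # passed by the caller
env_proxies = {"http": "http://127.0.0.1:7890",
               "https": "http://127.0.0.1:7890"}    # detected from the system
for k, v in env_proxies.items():
    proxies.setdefault(k, v)

print(proxies["https"])  # http://127.0.0.1:1080  (caller's value kept)
print(proxies["http"])   # http://127.0.0.1:7890  (filled from the system)
```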

Then comes a second merge:

def merge_setting(request_setting, session_setting, dict_class=OrderedDict):
    """Determines appropriate setting for a given request, taking into account
    the explicit setting on that request, and the setting in the session. If a
    setting is a dictionary, they will be merged together using `dict_class`
    """

    if session_setting is None:
        return request_setting

    if request_setting is None:
        return session_setting

    # Bypass if not a dictionary (e.g. verify)
    if not (
        isinstance(session_setting, Mapping) and isinstance(request_setting, Mapping)
    ):
        return request_setting

    merged_setting = dict_class(to_key_val_list(session_setting))
    merged_setting.update(to_key_val_list(request_setting))

    # Remove keys that are set to None. Extract keys first to avoid altering
    # the dictionary during iteration.
    none_keys = [k for (k, v) in merged_setting.items() if v is None]
    for key in none_keys:
        del merged_setting[key]

    return merged_setting

request_setting is the configuration produced by the first merge, while session_setting is what was configured on the current session. Since the request here was sent directly via requests.get, without a previously configured session, session_setting is empty.

merged_setting first takes session_setting and is then updated with request_setting, so the configuration on the current request takes precedence over the configuration on the current session.
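This precedence is easy to verify with a condensed stand-in for merge_setting (the Mapping type check is omitted and the addresses are made up):

```python
from collections import OrderedDict

def merge_setting(request_setting, session_setting, dict_class=OrderedDict):
    # Condensed from requests/sessions.py for illustration.
    if session_setting is None:
        return request_setting
    if request_setting is None:
        return session_setting
    merged = dict_class(session_setting)   # session values first...
    merged.update(request_setting)         # ...then request values win
    # A request-level None removes a session-level setting entirely.
    return dict_class((k, v) for k, v in merged.items() if v is not None)

session_proxies = {"http": "http://127.0.0.1:7890", "https": "http://127.0.0.1:7890"}
request_proxies = {"https": "http://127.0.0.1:1080", "http": None}

merged = merge_setting(request_proxies, session_proxies)
print(dict(merged))  # {'https': 'http://127.0.0.1:1080'}
```

Note how setting a key to None at the request level deletes the session's value instead of keeping it.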

At this point, reading the proxy configuration is complete.

Afterword

I started paying attention to this proxy issue through issue #110 of a translator project: some Python versions built the proxy values incorrectly, causing bugs in projects that use Python, and that gradually led me to explore this far.