PEP: 470
Title: Using Multi Index Support for External to PyPI Package File Hosting
Version: 3128e9d38937
Last-Modified: 2014-06-06 07:57:08 -0400 (Fri, 06 Jun 2014)
Author: Donald Stufft <donald at stufft.io>
BDFL-Delegate: Richard Jones <richard@python.org>
Discussions-To: distutils-sig at python.org
Status: Draft
Type: Process
Content-Type: text/x-rst
Created: 12-May-2014
Post-History: 14-May-2014, 05-Jun-2014

Abstract

This PEP proposes that the official means of having an installer locate package files which are hosted externally to PyPI become the use of multi index support, instead of the practice of using external links on the simple installer API.

It is important to remember that this is not about forcing anyone to host their files on PyPI. If someone does not wish to do so, they will never be under any obligation to. They can still list their project in PyPI as an index, and the tooling will still allow them to host it elsewhere.

This PEP is strictly concerned with the Simple Installer API and how automated installers interact with PyPI. It has no bearing on the informational pages, which are primarily for human consumption.

Rationale

There is a long history documented in PEP 438 that explains why externally hosted files exist today in the state that they do on PyPI. For the sake of brevity I will not duplicate that and instead urge readers to first take a look at PEP 438 for background.

There are currently two primary ways for a project to make itself available without directly hosting the package files on PyPI. They can either include links to the package files in the simple installer API or they can publish a custom package index which contains their project.

Custom Additional Index

Each installer which speaks to PyPI offers a mechanism for the user invoking that installer to provide additional custom locations to search for files during the dependency resolution phase. For pip these locations can be configured per invocation, per shell environment, per requirements file, per virtual environment, and per user. The mechanisms for specifying additional locations have existed within pip and setuptools for many years. By comparison, the mechanisms in PEP 438 and any other new mechanism will have existed for only a short period of time (if they exist at all currently).

The use of additional indexes instead of external links on the simple installer API provides a simple, clean interface which is consistent with the way most Linux package systems work (apt-get, yum, etc). More importantly, it works the same even for projects which are commercial or otherwise have their access restricted in some form (private networks, passwords, IP ACLs, etc), while the external links method only realistically works for projects which do not have their access restricted.

Compared to the complex rules which a project must be aware of to prevent itself from being considered unsafely hosted, setting up an index is fairly trivial and in the simplest case does not require anything more than a filesystem and a standard web server such as Nginx or Twisted Web. Even if using simple static hosting without autoindexing support, it is still straightforward to generate appropriate index pages as static HTML.
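As an illustration of the static HTML case, the following is a minimal sketch rather than a reference implementation. It assumes the directory layout used in the example below (one sub-directory per project, each holding that project's files). The function name write_static_index is hypothetical, and the exact markup an installer expects is not specified here.

```python
import html
from pathlib import Path


def write_static_index(root):
    """Write an index.html for the root and for each project directory.

    Assumes one sub-directory per project, each holding that project's
    package files. Returns the list of project names that were indexed.
    """
    root = Path(root)
    projects = sorted(p for p in root.iterdir() if p.is_dir())

    # Root page: one link per project directory.
    links = "\n".join(
        '<a href="{0}/">{0}</a><br>'.format(html.escape(p.name))
        for p in projects
    )
    (root / "index.html").write_text(
        "<html><body>\n" + links + "\n</body></html>\n"
    )

    # Project pages: one link per package file, skipping the index itself.
    for project in projects:
        names = sorted(
            f.name for f in project.iterdir()
            if f.is_file() and f.name != "index.html"
        )
        links = "\n".join(
            '<a href="{0}">{0}</a><br>'.format(html.escape(name))
            for name in names
        )
        (project / "index.html").write_text(
            "<html><body>\n" + links + "\n</body></html>\n"
        )
    return [p.name for p in projects]
```

Running write_static_index("/var/www/index.example.com/") after adding or removing package files would regenerate every page, so any static file host can serve the result.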

Example Index with Twisted Web

  1. Create a root directory for your index. For the purposes of this example I'll assume you've chosen /var/www/index.example.com/.
  2. Inside of this root directory, create a directory for each project, for example with mkdir -p /var/www/index.example.com/{foo,bar,other}/.
  3. Place the package files for each project in their respective folder, creating paths like /var/www/index.example.com/foo/foo-1.0.tar.gz.
  4. Configure Twisted Web to serve the root directory, ideally with TLS.
$ twistd -n web --path /var/www/index.example.com/

Examples of Additional indexes with pip

Invocation:

$ pip install --extra-index-url https://pypi.example.com/ foobar

Shell Environment:

$ export PIP_EXTRA_INDEX_URL=https://pypi.example.com/
$ pip install foobar

Requirements File:

$ printf '%s\n' '--extra-index-url https://pypi.example.com/' 'foobar' > requirements.txt
$ pip install -r requirements.txt

Virtual Environment:

$ python -m venv myvenv
$ printf '[global]\nextra-index-url = https://pypi.example.com/\n' > myvenv/pip.conf
$ myvenv/bin/pip install foobar

User:

$ mkdir -p ~/.pip
$ printf '[global]\nextra-index-url = https://pypi.example.com/\n' > ~/.pip/pip.conf
$ pip install foobar
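The two configuration file examples above write the same [global] section. A deployment script could produce it with the standard library instead of the shell. This is only a sketch: the helper name add_extra_index is hypothetical and not part of pip.

```python
import configparser
from pathlib import Path


def add_extra_index(conf_path, url):
    """Set extra-index-url in the [global] section of a pip config file."""
    conf_path = Path(conf_path)
    # Interpolation is disabled so a literal "%" in a URL is stored as-is.
    config = configparser.ConfigParser(interpolation=None)
    config.read(conf_path)  # a missing file is silently skipped
    if not config.has_section("global"):
        config.add_section("global")
    config.set("global", "extra-index-url", url)
    conf_path.parent.mkdir(parents=True, exist_ok=True)
    with conf_path.open("w") as handle:
        config.write(handle)
```

Note that rewriting a file this way keeps its existing settings but drops any comments it contained.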

External Index Discovery

One problem with using an additional index is discovery. Users will not generally be aware that an additional index is required at all, much less where that index can be found. Projects can attempt to convey this information using their description on the PyPI page; however, that excludes people who discover their project organically through pip search.

To support projects that wish to externally host their files and to enable users to easily discover what additional indexes are required, PyPI will gain the ability for projects to register external index URLs and, additionally, an associated comment for each. These URLs will be made available on the simple page; however, they will not be linked or provided in a form that would cause older installers to search them automatically.

When an installer fetches the simple page for a project, if it finds this additional metadata and it cannot find any files for that project in its configured URLs, then it should use this data to tell the user how to add one or more of the additional URLs to search in. This message should include any comments that the project has included, to enable the project to communicate with the user and provide hints as to which URL they might want if some are only useful or compatible with certain platforms or situations. When an installer has implemented the auto discovery mechanism, it should also deprecate any of the mechanisms added for PEP 438 (such as --allow-external) for removal at the end of the deprecation period proposed by the PEP.
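The decision described above can be sketched as follows. This is an illustration under stated assumptions, not a specification: this PEP does not define how the registered URLs and comments are represented, so the (url, comment) pairs and the function name external_index_hint are hypothetical. The --extra-index-url option is the existing pip option shown earlier.

```python
def external_index_hint(project, found_files, external_indexes):
    """Return a hint for the user, or None when no hint is needed.

    found_files: files the installer located in its configured URLs.
    external_indexes: (url, comment) pairs registered by the project.
    The shape of this data is an assumption made for this sketch.
    """
    # Only hint when nothing was found and the project registered indexes.
    if found_files or not external_indexes:
        return None
    lines = [
        "No files were found for %r in the configured indexes." % project,
        "The project lists these additional indexes:",
    ]
    for url, comment in external_indexes:
        entry = "  " + url
        if comment:
            entry += "  (" + comment + ")"
        lines.append(entry)
    lines.append("To search one of them, pass: --extra-index-url <URL>")
    return "\n".join(lines)
```

The installer only reports the URLs. It never searches them on its own, which keeps the choice of trusting an additional index with the user.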

This feature must be added to PyPI prior to starting the deprecation and removal process for link spidering.

Impact

The largest impact will be on projects whose maintainers are no longer maintaining them, for one reason or another. For these projects it's unlikely that a maintainer will arrive to set the external index metadata which would allow the auto discovery mechanism to find it.

Looking at the numbers, and factoring out PIL (which has been special cased above), the actual impact should be quite low: it affects just 6.9% of projects which host only externally, or 2.8% which have their latest version hosted externally. This represents a mere 3883 unique IP addresses. The breakdown is that, of those 3883 addresses, 100% installed something that could not be verified while only 3% installed something which could be.

Projects Which Rely on Externally Hosted Files

This is determined by crawling the simple index and looking for installable files using a detection method similar to the one pip and setuptools use. The "latest" version is determined using pkg_resources.parse_version sort order, and it is used to show whether the latest version is hosted externally or only old versions are.
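The column assignment can be sketched as below. Two things are assumptions made for illustration: simple_version_key is a simplified numeric stand-in for pkg_resources.parse_version (it ignores pre-release and other non-numeric markers), and a version counts as externally hosted only when no file for it is on PyPI. The safe/unsafe distinction is not modelled.

```python
import re


def simple_version_key(version):
    """Simplified stand-in for pkg_resources.parse_version (digits only)."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def classify(files):
    """Classify one project from (version, hosted_on_pypi) pairs.

    Returns "PyPI", "External (old)" or "External (latest)".
    """
    on_pypi = {version for version, hosted in files if hosted}
    # Assumption: a version is external only if PyPI has no file for it.
    external = {version for version, hosted in files if not hosted} - on_pypi
    if not external:
        return "PyPI"
    latest = max(on_pypi | external, key=simple_version_key)
    return "External (latest)" if latest in external else "External (old)"
```

For example, classify([("0.9", False), ("1.0", True)]) lands in the "External (old)" column because the newest release is available from PyPI.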

          PyPI    External (old)   External (latest)   Total
Safe      38716   31               35                  38782
Unsafe    0       1659             1169                2828
Total     38716   1690             1204                41610

Top Externally Hosted Projects by Requests

This is determined by looking at the number of requests the /simple/<project>/ page received in a single day. The total number of requests during that day was 17,960,467.

Project                          Requests
PIL                              13470
mysql-connector-python           321
salesforce-python-toolkit        54
pyodbc                           50
elementtree                      44
atfork                           39
RBTools                          29
django-contrib-requestprovider   28
wadofstuff-django-serializers    23
Pygame                           21

Top Externally Hosted Projects by Unique IPs

This is determined by looking at the IP addresses of the requests the /simple/<project>/ page received in a single day. The total number of unique IP addresses during that day was 105,587.

Project                          Unique IPs
PIL                              3515
mysql-connector-python           117
pyodbc                           34
elementtree                      21
RBTools                          19
egenix-mx-base                   16
Pygame                           14
salesforce-python-toolkit        13
django-contrib-requestprovider   12
wxPython                         11
python-apt                       10

Rejected Proposals

Keep the current classification system but adjust the options

This PEP rejects several related proposals which attempt to fix some of the usability problems with the current system while still keeping the general gist of PEP 438.

This includes:

  • Default to allowing safely externally hosted files, but disallow unsafely hosted.
  • Default to disallowing safely externally hosted files with only a global flag to enable them, but disallow unsafely hosted.
  • Continue on the suggested path of PEP 438 and remove the option to unsafely host externally but continue to allow the option to safely host externally.

These proposals are rejected because:

  • The classification "system" is complex, hard to explain, and requires an intimate knowledge of how the simple API works in order to be able to reason about which classification is required. This is reflected in the fact that the code to implement it is complicated and hard to understand as well.

  • People are generally surprised that PyPI allows externally linking to files and doesn't require people to host on PyPI. In contrast, most of them are familiar with the concept of multiple software repositories, as used by many operating systems.

  • PyPI is fronted by a globally distributed CDN which has improved the reliability and speed for end users. It is unlikely that any particular external host has something comparable. This can lead to extremely bad performance for end users when the external host is located in different parts of the world or does not generally have good connectivity.

    As a data point, many users reported sub-DSL speeds and high latency when accessing PyPI from parts of Europe and Asia prior to the use of the CDN.

  • PyPI has monitoring and an on-call rotation of sysadmins who can respond to downtime quickly. Again, it is unlikely that any particular external host will have this. This can lead to single packages in a dependency chain being un-installable. This will often confuse users, who often have no idea that the package relies on an external host and cannot figure out why PyPI appears to be up while the installer cannot find a package.

  • PyPI supports mirroring, both for private organizations and public mirrors. The legal terms of uploading to PyPI ensure that mirror operators, both public and private, have the right to distribute the software found on PyPI. However, software that is hosted externally does not have this, causing private organizations to need to investigate each package individually and manually to determine if the license allows them to mirror it.

    For public mirrors this essentially means that these externally hosted packages cannot be reasonably mirrored. This is particularly troublesome in countries such as China, where the bandwidth to outside of China is highly congested, often making a mirror within China a massively better experience.

  • Installers have no method to determine whether they should expect any particular URL to be available. It is not unusual for the simple API to reference old packages and URLs which have long since stopped working, so installers have to assume that it is OK for any particular URL to be inaccessible. This causes problems when a URL that should be expected to be up is temporarily down or otherwise unavailable (a common cause is using a copy of Python linked against a very old copy of OpenSSL which is unable to verify the SSL certificate on PyPI). In this case installers will typically silently ignore the URL, and later the user will get a confusing error stating that the installer couldn't find any versions instead of the real error indicating that the URL was unavailable.

  • In the long run, global opt-in flags like --allow-all-external will become little annoyances that developers cargo cult around in order to make their installer work. When they run into a project that requires it, they will most likely simply add it to their configuration file for that installer and continue on with whatever they were actually trying to do. This will continue until they try to install their requirements on another computer or attempt to deploy to a server, where their install will fail again until they add the "make it work" flag to their configuration file.

  • The URL classification only works for a certain subset of projects. It does not allow for any project which needs additional restrictions such as access controls. This means that there would be two methods of doing the same thing: linking to a file safely and hosting an index. Hosting an index works in all situations, and by relying on this we make for a more consistent experience no matter the reason for external hosting.

  • The safe external hosting option hampers the ability of PyPI to upgrade its security infrastructure. For instance, if MD5 becomes broken in the future, there will be no way for PyPI to upgrade the hashes of the projects which rely on safe external hosting via MD5, while files that are hosted on PyPI can simply be processed again with a new hash function.
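To illustrate the last point: because PyPI holds the bytes of the files it hosts, a digest under a newer hash function can be computed at any time, which is not possible when only a link and an MD5 value are recorded. The following is a minimal sketch using the standard library. The function name file_digest and the choice of SHA-256 as the replacement are illustrative assumptions, not something this PEP specifies.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large package files are not
        # loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Applying such a function to every hosted file is a batch job over local storage, whereas an externally hosted file would have to be fetched again and trusted to be unchanged since its MD5 was recorded.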