|Title:||Using Multi Repository Support for External to PyPI Package File Hosting|
|Last-Modified:||2014-10-13 13:59:48 -0400 (Mon, 13 Oct 2014)|
|Author:||Donald Stufft <donald at stufft.io>|
|BDFL-Delegate:||Richard Jones <email@example.com>|
|Discussions-To:||distutils-sig at python.org|
|Post-History:||14-May-2014, 05-Jun-2014, 03-Oct-2014, 13-Oct-2014|
- Multiple Repository/Index Support
- External Index Discovery
- Deprecation and Removal of Link Spidering
- Summary of Changes
- Rejected Proposals
This PEP proposes a mechanism for project authors to register with PyPI an external repository where their project's downloads can be located. This information can then be included as part of the simple API so that installers can use it to tell users where the item they are attempting to install is located and what they need to do to enable this additional repository. In addition to adding discovery information to make explicit multiple repositories easy to use, this PEP also deprecates and removes the implicit multiple repository support which currently functions through directly or indirectly linking off site via the simple API. Finally, this PEP also proposes deprecating and removing the functionality added by PEP 438, particularly the additional rel information and the meta tag to indicate the API version.
This PEP does not propose mandating that all authors upload their projects to PyPI in order to exist in the index nor does it propose any change to the human facing elements of PyPI.
Historically PyPI did not have any method of hosting files nor any method of automatically retrieving installables. It was instead focused on providing a central registry of names, to prevent naming collisions, and on serving as a means of discovery for finding projects to use. In the course of time setuptools began to scrape these human facing pages, as well as pages linked from those pages, looking for things it could automatically download and install. Eventually this became the "Simple" API, which used a similar URL structure but eliminated the extraneous links and information to make the API more efficient. Additionally, PyPI grew the ability for a project to upload release files directly to PyPI, enabling PyPI to act as a repository in addition to an index.
This gives PyPI two equally important roles that it plays in the Python ecosystem, that of index to enable easy discovery of Python projects and central repository to enable easy hosting, download, and installation of Python projects. Due to the history behind PyPI and the very organic growth it has experienced the lines between these two roles are blurry, and this blurring has caused confusion for the end users of both of these roles and this has in turn caused ire between people attempting to use PyPI in different capacities, most often when end users want to use PyPI as a repository but the author wants to use PyPI solely as an index.
This confusion comes down to end users of projects not realizing whether a project is hosted on PyPI or relies on an external service. This often manifests itself when the external service is down but PyPI is not. People will see that PyPI works, and that other projects work, but that this one specific project does not. They often do not realize who they need to contact in order to get this fixed or what their remediation steps are.
By moving to using explicit multiple repositories we can make the lines between these two roles much more explicit and remove the "hidden" surprises caused by the current implementation of handling people who do not want to use PyPI as a repository. However simply moving to explicit multiple repositories is a regression in discoverability, and for that reason this PEP adds an extension to the current simple API which will enable easy discovery of the specific repository that a project can be found in.
PEP 438 attempted to solve this issue by allowing projects to explicitly declare if they were using the repository features or not, and if they were not, it had the installers classify the links it found as either "internal", "verifiable external" or "unverifiable external". PEP 438 was accepted and implemented in pip 1.4 (released on Jul 23, 2013) with the final transition implemented in pip 1.5 (released on Jan 2, 2014).
PEP 438 was successful in bringing about more people to utilize PyPI's repository features, an altogether good thing given the global CDN powering PyPI providing speed ups for a lot of people, however it did so by introducing a new point of confusion and pain for both the end users and the authors.
- Easily allow external hosting to "just work" when appropriately configured at the system, user or virtual environment level.
- Easily allow package authors to tell PyPI "my releases are hosted <here>" and have that advertised in such a way that tools can clearly communicate it to users, without silently introducing unexpected dependencies on third party services.
- Eliminate any and all references to the confusing "verifiable external" and "unverifiable external" distinction from the user experience (both when installing and when releasing packages).
- The repository aspects of PyPI should become just the default package hosting location (i.e. the only one that is treated as opt-out rather than opt-in by most client tools in their default configuration). Aside from that aspect, hosting on PyPI should not otherwise provide an enhanced user experience over hosting your own package repository.
- Do all of the above while providing default behaviour that is secure against most attackers below the nation state adversary level.
The two common installer tools, pip and easy_install/setuptools, both support the concept of additional locations to search for files to satisfy the installation requirements and have done so for many years. This means that there is no need to "phase" in a new flag or concept and the solution to installing a project from a repository other than PyPI will function regardless of how old (within reason) the end user's installer is. Not only has this concept existed in the Python tooling for some time, but it is a concept that exists across languages and even extending to the OS level with OS package tools almost universally using multiple repository support making it extremely likely that someone is already familiar with the concept.
Additionally, the multiple repository approach is a concept that is useful outside of the narrow scope of allowing projects which wish to be included on the index portion of PyPI but do not wish to utilize the repository portion of PyPI. This includes places where a company may wish to host a repository that contains their internal packages or where a project may wish to have multiple "channels" of releases, such as alpha, beta, release candidate, and final release. This could also be used for projects wishing to host files which cannot be uploaded to PyPI, such as multi-gigabyte data files or, currently at least, Linux Wheels.
Why Not PEP 438 or Similar?
While the additional search location support has existed in pip and setuptools for quite some time, support for PEP 438 has only existed in pip since the 1.4 version, and has yet to be implemented in setuptools. The design of PEP 438 did mean that users still benefited for projects which did not require external files even with older installers. However, for projects which did require external files, users are still silently being given either potentially unreliable or, even worse, unsafe files to download. This system is also unique to Python as it arises out of the history of PyPI. This means that it is almost certain that this concept will be foreign to most, if not all, users until they encounter it while attempting to use the Python toolchain.
Additionally, the classification system proposed by PEP 438 has, in practice, turned out to be extremely confusing to end users, so much so that it is a position of this PEP that the situation as it stands is completely untenable. The common pattern for a user with this system is to attempt to install a project and possibly get an error message (or maybe not, if the project ever uploaded something to PyPI but later switched without removing old files). The user sees that the error message suggests --allow-external and reissues the command with that flag, most likely getting another error message. This time the message suggests also adding --allow-unverified, so the user issues the command a third time, finally getting the thing they wish to install.
This UX failure exists for several reasons.
If pip can locate files at all for a project on the Simple API it will simply use that instead of attempting to locate more. This is generally the right thing to do as attempting to locate more would erase a large part of the benefit of PEP 438. This means that if a project ever uploaded a file that matches what the user has requested for install that will be used regardless of how old it is.
PEP 438 makes an implicit assumption that most projects would either upload themselves to PyPI or would update themselves to directly link to release files. While a large number of projects did ultimately decide to upload to PyPI, some of them did so only because the UX around PEP 438 was so bad that they felt forced to do so. More concerning, however, is the fact that very few projects have opted to directly and safely link to files; instead they still simply link to pages which must be scraped in order to find the actual files, thus rendering the safe variant (--allow-external) largely useless.
Even if an author wishes to directly link to their files, doing so safely is non-obvious. It requires the inclusion of an MD5 hash (for historical reasons) in the fragment of the URL (the part after the #). If they do not include this then their files will be considered "unverified".
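To illustrate what a "verified" link involves, the following is a minimal sketch of how an installer might check a downloaded file against a hash carried in the URL fragment. The function name verify_md5_fragment and the example URL are hypothetical, and the sketch assumes the fragment takes the form md5=<hex digest>; real installers differ in detail.

```python
import hashlib
from urllib.parse import parse_qs, urlsplit


def verify_md5_fragment(url, content):
    """Check `content` (bytes) against an md5=<hex> fragment on `url`.

    Returns True or False when a hash is present, and None when the URL
    carries no hash at all (the "unverified" case).
    """
    fragment = urlsplit(url).fragment         # text after '#'
    expected = parse_qs(fragment).get("md5")  # e.g. ['9e107d9d...'] or None
    if not expected:
        return None
    actual = hashlib.md5(content).hexdigest()
    return actual == expected[0].lower()


# Hypothetical release file and link, for illustration only.
data = b"example release file"
url = ("https://files.example.com/myproject-1.0.tar.gz#md5="
       + hashlib.md5(data).hexdigest())

print(verify_md5_fragment(url, data))
print(verify_md5_fragment(url, b"tampered"))
print(verify_md5_fragment("https://files.example.com/myproject-1.0.tar.gz", data))
```

The three-way result mirrors the distinction the text draws: a link without a hash is not wrong, it simply cannot be verified.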
PEP 438 takes a security centric view and disallows any form of a global opt in for unverified projects. While this is generally a good thing, it creates extremely verbose and repetitive command invocations such as:
    $ pip install --allow-external myproject --allow-unverified myproject myproject
    $ pip install --allow-all-external --allow-unverified myproject myproject
Installers SHOULD implement or continue to offer, the ability to point the installer at multiple URL locations. The exact mechanisms for a user to indicate they wish to use an additional location is left up to each individual implementation.
Additionally, the mechanism for discovering an installation candidate when multiple repositories are being used is also up to each individual implementation. However, once configured, an implementation should not discourage, warn, or otherwise cast a negative light upon the use of a repository simply because it is not the default repository.
Currently both pip and setuptools implement multiple repository support by using the best installation candidate they can find across all configured repositories, essentially treating them as if they were one large repository.
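The "one large repository" behaviour can be sketched as follows. This is an illustration only, not the code of pip or setuptools: the Candidate type, the best_candidate function, the tuple based versions, and all URLs are assumptions made for the example.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    version: tuple  # simplified version, e.g. (1, 2); real tools parse strings
    url: str


def best_candidate(repositories):
    """Pick the highest version found across every configured repository.

    `repositories` maps a repository URL to the candidates found there.
    The repository a candidate came from plays no part in the choice.
    """
    merged = [c for found in repositories.values() for c in found]
    if not merged:
        return None
    return max(merged, key=lambda c: c.version)


# Hypothetical search results for a single project.
found = {
    "https://default.example.org/simple/": [
        Candidate((1, 0), "https://default.example.org/files/myproject-1.0.tar.gz"),
    ],
    "https://index.example.com/": [
        Candidate((1, 2), "https://index.example.com/myproject-1.2.tar.gz"),
    ],
}
best = best_candidate(found)
print(best.url)
```

Because the candidates are merged before selection, a newer release on an additional repository wins over an older one on the default repository.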
Installers SHOULD also implement some mechanism for removing or otherwise disabling use of the default repository. The exact specifics of how that is achieved is up to each individual implementation.
Installers SHOULD also implement some mechanism for whitelisting and blacklisting which projects a user wishes to install from a particular repository. The exact specifics of how that is achieved is up to each individual implementation.
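Since the specifics are left to each implementation, the following is only one possible shape for such a check. The function name repository_allowed, its parameters, and the lowercase name matching are assumptions made for this sketch.

```python
def repository_allowed(project, allow=None, deny=()):
    """Decide whether `project` may be installed from one repository.

    `allow` is an optional whitelist (None means "any project");
    `deny` is a blacklist and always takes precedence.
    Names are compared in lowercase, which is a simplification.
    """
    name = project.lower()
    if name in deny:
        return False
    return allow is None or name in allow


# A repository restricted to a single project.
print(repository_allowed("MyProject", allow={"myproject"}))
print(repository_allowed("otherproject", allow={"myproject"}))
```

Letting the blacklist win over the whitelist keeps the result predictable when a project appears in both.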
One of the problems with using an additional index is one of discovery. Users will not generally be aware that an additional index is required at all much less where that index can be found. Projects can attempt to convey this information using their description on the PyPI page however that excludes people who discover their project organically through pip search.
To support projects that wish to externally host their files and to enable users to easily discover what additional indexes are required, PyPI will gain the ability for projects to register external index URLs along with an associated comment for each. These URLs will be made available on the simple page however they will not be linked or provided in a form that older installers will automatically search them.
This ability will take the form of a <meta> tag. The name of this tag must be set to repository or find-link and the content will be a link to the location of the repository. An optional data-description attribute will convey any comments or description that the author has provided.
An example would look something like:
    <meta name="repository" content="https://index.example.com/" data-description="Primary Repository">
    <meta name="repository" content="https://index.example.com/Ubuntu-14.04/" data-description="Wheels built for Ubuntu 14.04">
    <meta name="find-link" content="https://links.example.com/find-links/" data-description="A flat index for find links">
When an installer fetches the simple page for a project, if it finds this additional meta-data then it should use this data to tell the user how to add one or more of the additional URLs to search in. This message should include any comments that the project has included to enable them to communicate to the user and provide hints as to which URL they might want (e.g. if some are only useful or compatible with certain platforms or situations). When the installer has implemented the auto discovery mechanisms they should also deprecate any of the mechanisms added for PEP 438 (such as --allow-external) for removal at the end of the deprecation period proposed by the PEP.
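A minimal sketch of this discovery step, using only the standard library, might look like the following. The class and function names are hypothetical, and the suggested flags (--extra-index-url and --find-links) are pip's existing options; other installers would substitute their own.

```python
from html.parser import HTMLParser


class RepositoryMetaParser(HTMLParser):
    """Collect <meta name="repository"> and <meta name="find-link"> tags."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (kind, url, description)

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        kind, url = attrs.get("name"), attrs.get("content")
        if kind in ("repository", "find-link") and url:
            self.found.append((kind, url, attrs.get("data-description") or ""))


def discovery_hints(html):
    """Return one suggested installer option per registered external URL."""
    parser = RepositoryMetaParser()
    parser.feed(html)
    flags = {"repository": "--extra-index-url", "find-link": "--find-links"}
    hints = []
    for kind, url, description in parser.found:
        hint = f"{flags[kind]} {url}"
        if description:
            hint += f"  ({description})"
        hints.append(hint)
    return hints


page = """
<meta name="repository" content="https://index.example.com/" data-description="Primary Repository">
<meta name="find-link" content="https://links.example.com/find-links/" data-description="A flat index for find links">
"""
hints = discovery_hints(page)
for line in hints:
    print(line)
```

The sketch only reports the URLs. It deliberately does not add them to the search path, since the point of the proposal is that enabling an extra repository is an explicit user decision.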
In addition to the API for programmatic access to the registered external repositories, PyPI will also present these URLs in the UI so that users with an installer that does not implement the discovery mechanism can still easily discover what repository the project is using to host itself.
This feature MUST be added to PyPI and be contained in a released version of pip prior to starting the deprecation and removal process for the implicit offsite hosting functionality.
- Implement simple API changes to allow the addition of an external repository.
- (Optional, Mandatory on PyPI) Deprecate and remove the hosting modes as defined by PEP 438.
- (Optional, Mandatory on PyPI) Restrict simple API to only list the files that are contained within the repository and the external repository metadata.
The largest impact of this PEP will be that users of older installation clients will not get a discovery mechanism built into the install command. This will require them to browse to the PyPI web UI and discover the repository there. Since any URLs required to install a project will be automatically migrated to the new format, the biggest change for users will be the need for a new option to install these projects.
Looking at the numbers the actual impact should be quite low, with it affecting just 3.8% of projects which host any files only externally or 2.2% which have their latest version hosted only externally.
6674 unique IP addresses have accessed the Simple API for these 3.8% of projects in a single day (2014-09-30). Of those, 99.5% of them installed something which could not be verified, and thus they were open to a Remote Code Execution via a Man-In-The-Middle attack, while 7.9% installed something which could be verified and only 0.4% only installed things which could be verified.
This means that 99.5% of users of these features, both new and old, are doing something unsafe, and anyone using an older copy of pip, or using setuptools at all, is silently unsafe.
This is determined by crawling the simple index and looking for installable files using a similar detection method as pip and setuptools use. The "latest" version is determined using pkg_resources.parse_version sort order and it is used to show whether or not the latest version is hosted externally or only old versions are.
|PyPI||External (old)||External (latest)||Total|
This is determined by looking at the number of requests the /simple/<project>/ page had gotten in a single day. The total number of requests during that day was 10,623,831.
This is determined by looking at the IP addresses of requests the /simple/<project>/ page had gotten in a single day. The total number of unique IP addresses during that day was 124,604.
This PEP rejects several related proposals which attempt to fix some of the usability problems with the current system but while still keeping the general gist of PEP 438.
- Default to allowing safely externally hosted files, but disallow unsafely hosted.
- Default to disallowing safely externally hosted files with only a global flag to enable them, but disallow unsafely hosted.
- Continue on the suggested path of PEP 438 and remove the option to unsafely host externally but continue to allow the option to safely host externally.
These proposals are rejected because:
- The classification system introduced in PEP 438 is an entirely unique concept to PyPI which is not generically applicable even in the context of Python packaging. Adding additional concepts comes at a cost.
- The classification system itself is non-obvious to explain and to pre-determine what classification of link a project will require entails inspecting the project's /simple/<project>/ page, and possibly any URLs linked from that page.
- The ability to host externally while still being linked for automatic discovery is mostly a historic relic which causes a fair amount of pain and complexity for little reward.
- The installer's ability to optimize or clean up the user interface is limited due to the nature of the implicit link scraping which would need to be done. This extends to the --allow-* options as well as the inability to determine if a link is expected to fail or not.
- The mechanism paints with a very broad brush when enabling an option, even though PEP 438 attempts to limit this with per package options. A project that has existed for an extended period of time may often have several different URLs listed in its simple index, and it is not unusual for at least one of these to no longer be under the control of the project. While an unregistered domain will sit there relatively harmlessly most of the time, pip will continue to attempt to install from it on every discovery phase. This means that an attacker simply needs to look at projects which rely on unsafe external URLs and register expired domains to attack users.
This document has been placed in the public domain.