Re: Help: how can I use Python to parse out URLs?

Steven D. Majewski (sdm7g@elvis.med.virginia.edu)
Fri, 29 Jul 1994 14:30:35 -0400

On Jul 28, 18:32, Ray Johnson wrote:
> Subject: Help: how can I use Python to parse out URLs?
>
> What I need to do is parse out all the URLs in a given string
> (which came from an HTML page). However, I've been having a hard
> time with the regexp modules. First of all, has anyone already
> done something like this? Could someone give me some pointers
> on what I need to do? I'm sure if I understood the libraries
> a little better it would be fairly simple to do. At least I hope...
>

The standard library module urllib (Python/Lib/urllib.py) has
routines to parse and split URL components, as well as routines to
quote and unquote hex-escaped characters:

# Utilities to parse URLs:
# unwrap('<URL:type://host/path>') --> 'type://host/path'
# splittype('type:opaquestring') --> 'type', 'opaquestring'
# splithost('//host[:port]/path') --> 'host[:port]', '/path'
# splitport('host:port') --> 'host', 'port'
# splitquery('/path?query') --> '/path', 'query'
# splittag('/path#tag') --> '/path', 'tag'
# splitgophertype('/Xselector') --> 'X', 'selector'
# unquote('abc%20def') --> 'abc def'
# quote('abc def') --> 'abc%20def'

- and even a urlretrieve( url ) function.
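
For what it's worth, here is a minimal sketch of how these pieces might
fit together for the original question. It assumes a present-day
Python 3, where the splitting and quoting helpers are reached through
urllib.parse (urlsplit, quote, unquote) rather than the split*
functions listed above. The names LinkCollector, extract_urls,
describe, and the sample page string are made up for illustration, and
html.parser is used to find the URLs so that no regexp is needed:

```python
from html.parser import HTMLParser
from urllib.parse import quote, unquote, urlsplit


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect href/src attribute values from HTML."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append(value)


def extract_urls(html):
    """Return every href/src URL found in html, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.urls


def describe(url):
    """Split one URL into the pieces named in the urllib comments."""
    parts = urlsplit(url)
    return {
        "type": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "path": parts.path,
        "query": parts.query,
        "tag": parts.fragment,
    }


# Sample input, invented for this sketch.
page = ('<a href="http://www.python.org:80/doc/index.html?x=1#top">docs</a> '
        '<img src="/pics/logo.gif">')
urls = extract_urls(page)
info = describe(urls[0])

# Hex escaping round trip, as in the quote/unquote comments above.
escaped = quote("abc def")
restored = unquote(escaped)
```

This only picks up URLs that appear as href or src attributes; URLs
sitting in plain text or in other attributes would need extra handling.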

- Steve Majewski (804-982-0831) <sdm7g@Virginia.EDU>
- UVA Department of Molecular Physiology and Biological Physics
