Re: Help: how can I use Python to parse out URLs?

Steven Miale (smiale@cs.indiana.edu)
Thu, 28 Jul 1994 19:23:54 -0500

In article <1994Jul28.183207.27838@enterprise.rdd.lmsc.lockheed.com>,
Ray Johnson <rjohnson@freedom.rdd.lmsc.lockheed.com> wrote:
>Help, I'm not much of a Python user but I need to use it for
>a project I'm working on.
>
>What I need to do is parse out all the URLs of a given string
>(which came from an HTML page). However, I've been having a hard
>time with the regexp modules. First of all, has anyone already
>done something like this? Could someone give me some pointers
>on what I need to do? I'm sure if I understood the libraries
>a little better it would be fairly simple to do. At least I hope...

It's pretty simple. There is an HTML parsing library called "htmllib.py"
(included in the standard distribution, I think). You can define a class
with methods 'start_a' and 'end_a' (among others), which get called for
each opening and closing anchor tag. For instance, you might have:

class AnchorEater(FormattingParser):
    def start_a(self, attrs):
        ...
    def end_a(self):
        ...
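For anyone reading this on a current Python, where htmllib is no longer in the standard library, roughly the same idea can be sketched with html.parser.HTMLParser. This is a swapped-in technique, not the htmllib interface described above: there are no per-tag 'start_a'/'end_a' methods, just a generic handle_starttag. The names AnchorCollector, hrefs, and extract_urls are made up for illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the HREF value of every <A> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []  # hypothetical attribute: URLs found so far

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_urls(text):
    """Return the HREFs of all anchors in text, in document order."""
    parser = AnchorCollector()
    parser.feed(text)
    parser.close()
    return parser.hrefs
```

Calling extract_urls() on a string of HTML returns a plain list of the anchor targets, which is probably all the original question needs.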

Alternatively, you can use some of the code I wrote for 'dancer',
specifically htmlparsemodule.c. All you need to do is pass it a class
with these methods:

handledata
unknown_starttag
unknown_endtag

and methods called 'start_XXX', 'do_XXX', and 'end_XXX', where XXX is
any markup type (A, IMG, EM, etc.). For instance:

class AnchorEater:
    def handledata(self, text):
        pass # ignore text
    def unknown_starttag(self, tag, attrs):
        pass # ignore every tag but A's
    def unknown_endtag(self, tag):
        pass
    def start_a(self, attrs):
        # print already puts one space between comma-separated items
        print 'Found an anchor! HREF is', attrs['href']
    def end_a(self):
        print 'end of tag'

import htmlparse
text = 'this is a <A HREF="http:foo"> test </A> \
of a <A HREF="ftp://bob.com/bar.ps"> parser </A>'
htmlparse.parse(text, AnchorEater())

Results in:

Found an anchor! HREF is http:foo
end of tag
Found an anchor! HREF is ftp://bob.com/bar.ps
end of tag
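
To make the 'start_XXX' / 'end_XXX' lookup less mysterious, here is a rough pure-Python sketch of how such dispatch might work. It is not the code in htmlparsemodule.c: the function parse_sketch, the two regular expressions, the quoted-attributes-only handling, and the guess that 'do_XXX' receives the attribute dictionary are all assumptions for illustration. UrlCollector is likewise a hypothetical handler that stores the HREFs instead of printing them.

```python
import re

# Simplified patterns, assumed for illustration; real HTML needs a real parser.
TAG_RE = re.compile(r'<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>')
ATTR_RE = re.compile(r'([A-Za-z][-A-Za-z0-9]*)\s*=\s*"([^"]*)"')


def parse_sketch(text, handler):
    """Walk text and call handler methods named after each tag."""
    pos = 0
    for match in TAG_RE.finditer(text):
        if match.start() > pos:
            handler.handledata(text[pos:match.start()])
        closing, name, rest = match.groups()
        tag = name.lower()
        if closing:
            method = getattr(handler, 'end_' + tag, None)
            if method is not None:
                method()
            else:
                handler.unknown_endtag(tag)
        else:
            # Only double-quoted attribute values are recognized here.
            attrs = {k.lower(): v for k, v in ATTR_RE.findall(rest)}
            method = (getattr(handler, 'start_' + tag, None)
                      or getattr(handler, 'do_' + tag, None))
            if method is not None:
                method(attrs)
            else:
                handler.unknown_starttag(tag, attrs)
        pos = match.end()
    if pos < len(text):
        handler.handledata(text[pos:])


class UrlCollector:
    """Hypothetical handler: remember every HREF seen on an A tag."""

    def __init__(self):
        self.urls = []

    def handledata(self, text):
        pass  # ignore text

    def unknown_starttag(self, tag, attrs):
        pass  # ignore every tag but A's

    def unknown_endtag(self, tag):
        pass

    def start_a(self, attrs):
        if 'href' in attrs:
            self.urls.append(attrs['href'])

    def end_a(self):
        pass
```

The point of the getattr lookup is that supporting a new tag means adding one method to the handler class, with the unknown_* methods as the fallback for everything else.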

-- 
Steven Miale [smiale@cs.indiana.edu] HTTP://cs.indiana.edu/hyplan/smiale.html