A simple "pull API" for HTML parsing, after Perl's HTML::TokeParser. Many simple HTML parsing tasks are simpler this way than with the HTMLParser module.
pullparser.PullParser is a subclass of HTMLParser.HTMLParser.
Examples:
This program extracts all links from a document. It will print one line for each link, containing the URL and the textual description between the <a>...</a> tags:
    import pullparser, sys

    f = file(sys.argv[1])
    p = pullparser.PullParser(f)
    for token in p.tags("a"):
        # p.tags() yields both start and end tags; only start tags carry href
        if token.type == "endtag":
            continue
        url = dict(token.attrs).get("href", "-")
        # collect the link text up to the closing </a>
        text = p.get_compressed_text(endat=("endtag", "a"))
        print "%s\t%s" % (url, text)
This program extracts the <title> from the document:
    import pullparser, sys

    f = file(sys.argv[1])
    p = pullparser.PullParser(f)
    # get_tag() skips ahead to the next <title> tag, if there is one
    if p.get_tag("title"):
        title = p.get_compressed_text()
        print "Title: %s" % title
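To make the pull style concrete without the module installed, here is a minimal sketch of how a pull-style interface can be layered over the standard parser's callbacks. It is not pullparser's implementation. It assumes Python 3, where the HTMLParser module is named html.parser, and it buffers the whole document eagerly for simplicity. The names Token, SketchPullParser and extract_links are hypothetical.

```python
from collections import namedtuple
from html.parser import HTMLParser

# Hypothetical token record; pullparser's real token class may differ.
Token = namedtuple("Token", "type data attrs")


class SketchPullParser(HTMLParser):
    """Buffers parser callbacks as tokens so the caller can pull them."""

    def __init__(self, html):
        super().__init__()
        self._tokens = []
        # Assumption: parse everything up front rather than lazily.
        self.feed(html)
        self.close()

    def handle_starttag(self, tag, attrs):
        self._tokens.append(Token("starttag", tag, attrs))

    def handle_endtag(self, tag):
        self._tokens.append(Token("endtag", tag, None))

    def handle_data(self, data):
        self._tokens.append(Token("data", data, None))

    def tags(self, *names):
        """Yield start and end tag tokens whose tag name is in names."""
        for token in self._tokens:
            if token.type in ("starttag", "endtag") and token.data in names:
                yield token


def extract_links(html):
    """Return the href of every <a> start tag, or "-" if it has none."""
    return [dict(t.attrs).get("href", "-")
            for t in SketchPullParser(html).tags("a")
            if t.type == "starttag"]
```

The caller drives the loop and asks for the tokens it wants, instead of overriding handler methods and tracking state between callbacks. That is the difference the two programs above rely on.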
Thanks to Gisle Aas, who wrote HTML::TokeParser.
All documentation (including this web page) is included in the distribution.
Stable release.
For installation instructions, see the INSTALL file included in the distribution.
The Subversion (SVN) trunk is http://codespeak.net/svn/wwwsearch/pullparser/trunk, so to check out the source:
    svn co http://codespeak.net/svn/wwwsearch/pullparser/trunk
Beautiful Soup is widely recommended, and is more robust than this module.
Python 2.2.1 or above is required.
pullparser is released under the BSD license (included in the distribution).
Parsing can fail on some HTML because module HTMLParser is fussy. Try pullparser.TolerantPullParser instead, which uses module sgmllib. Note that if you use this class, self-closing tags (<foo/>) will show up as 'starttag' tags, not 'startendtag' tags; this is a limitation of module sgmllib.
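The 'starttag' versus 'startendtag' distinction comes from the underlying parser's callbacks. The sketch below shows only the HTMLParser side, and assumes Python 3, where that module is named html.parser (sgmllib is not available there, so its behaviour is not demonstrated). TagTypeRecorder, tag_types and is_opening are hypothetical names, not part of pullparser.

```python
from html.parser import HTMLParser


class TagTypeRecorder(HTMLParser):
    """Records which tag callback the parser fires for each tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append(("starttag", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing tags such as <foo/>.
        self.seen.append(("startendtag", tag))

    def handle_endtag(self, tag):
        self.seen.append(("endtag", tag))


def tag_types(html):
    """Return the (type, tag) pairs reported for the given markup."""
    recorder = TagTypeRecorder()
    recorder.feed(html)
    recorder.close()
    return recorder.seen


def is_opening(token_type):
    """Treat both opening token types alike, for code meant to work
    with either parser class."""
    return token_type in ("starttag", "startendtag")
```

Code that has to work with both PullParser and TolerantPullParser can test token types with a check like is_opening() rather than comparing against 'startendtag' directly.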
HTMLParser.HTMLParser isn't very robust. It would be fairly easy to (perhaps optionally) rebase on the other standard library HTML parsing module, sgmllib.SGMLParser (which is really an HTML parser, not a full SGML parser, despite the name). I'm not going to do that, though.
I prefer questions and comments to be sent to the mailing list rather than direct to me.
John J. Lee, November 2005.