Browser.py

Another bit of code put together: this time, an automated web browser
for Python. It’s something like Perl’s WWW::Mechanize: use it to
navigate to a page, follow links, fill out forms, and the like. Get the
code, or look at the documentation or the syntax-highlighted code.
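
As a rough illustration of what "following a link" involves (this is not Browser.py's own code), the sketch below finds a link by its visible text and resolves it against the page's URI. It assumes a current Python 3 and uses only the standard library; `LinkCollector`, `find_link` and the example.com address are names invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a href=...> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (text, href) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def find_link(html, base_uri, link_text):
    """Return the absolute URI of the first link whose text matches, or None."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    for text, href in collector.links:
        if text == link_text:
            return urljoin(base_uri, href)
    return None


page = '<p><a href="/mail/">Mail</a> <a href="news.html">News</a></p>'
print(find_link(page, "http://www.example.com/index.html", "Mail"))
# prints http://www.example.com/mail/
```

Resolving the href with `urljoin` is what lets relative links such as `news.html` work regardless of which page they were found on.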

19 Responses to “Browser.py”

  1. It’s not working for me - what do I need to install for the “from xml.dom.ext.reader import HtmlLib” bit to work?

    Simon Willison
  2. Darn, I thought that came with Python. :) It’s the Python/XML distribution (Debian package python2.3-xml); I think the HtmlLib stuff is from 4DOM, the Fourthought Python XML stuff. It appears that the Debian people collect together a load of Python stuff and bung it into one package, which is nice of them but makes it difficult to recommend a package to non-Debian-Linux-people.

    I’m open to other suggestions for an HTML parser; all I really need is something that can parse even broken HTML and give me a DOM tree out of it. I tried using the Twisted people’s microdom first but gave it up in favour of HtmlLib, and even that I had to patch twice in the code (once to cope with bad namespaced elements, and once to fix a bug). If you have any better suggestions for an HTML -> DOM parser in Python, say the word, especially if it’s either easy to distribute with browser.py or comes with Python by default!

    sil
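
For readers on a current Python 3, the standard library's `html.parser` can serve as a starting point for the kind of forgiving HTML-to-tree parser asked for above. The following is only a minimal sketch, not a DOM implementation and not what Browser.py uses: `Node`, `TolerantTreeBuilder` and `parse` are invented names, and the builder does not apply HTML's implicit-close rules (an unclosed `<p>` stays open until an ancestor is closed).

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []   # child Nodes and text strings, in document order

    def find_all(self, tag):
        """Return every descendant element with the given tag name."""
        found = []
        for child in self.children:
            if isinstance(child, Node):
                if child.tag == tag:
                    found.append(child)
                found.extend(child.find_all(tag))
        return found


class TolerantTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self._current)
        self._current.children.append(node)
        if tag not in VOID_TAGS:
            self._current = node

    def handle_data(self, data):
        self._current.children.append(data)

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name and close it,
        # implicitly closing anything left open inside it. A close tag that
        # matches no open element is ignored instead of raising an error.
        node = self._current
        while node is not self.root:
            if node.tag == tag:
                self._current = node.parent
                return
            node = node.parent


def parse(html):
    builder = TolerantTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# A stray </iframe>, plus <p> and <b> elements that are never closed.
tree = parse("<div><p>one</iframe><b>bold</div><p>after")
print([child.tag for child in tree.children])
# prints ['div', 'p']
```

The stray `</iframe>` is dropped and `</div>` closes the `<p>` and `<b>` left open inside it, so the parser never raises on mismatched tags.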
  3. You might want to take a look at Fredrik Lundh’s ElementTidy, or mxTidy, both of which use a library version of Dave Raggett’s HTML Tidy utility to fix any problems with the HTML before converting it to XHTML (the XML version of HTML).

    http://www.effbot.org/zone/element-tidylib.htm

    Chris
  4. The webunit package contains a module called SimpleDOM:
    http://www.mechanicalcat.net/tech/webunit/README.html#simpledom
    It seems useful. Besides, webunit seems in some ways similar to Browser.py.

    Henning
  5. SimpleDOM.NestingError: Open tags <html>, <body>, <table>, <tr>, <td>, <table>, <tr>, <td>, <script> do not match close tag </iframe>, at line 76, column 290

    The major problem here is needing to parse invalid HTML, which most of the web is. The effbot interface to tidylib would work but requires a C extension. mxTidy certainly didn’t use to be an interface to tidylib; instead, it called the tidy executable directly. There are other tidylib interfaces, some of which need ctypes or similar…

    sil
  6. The twisted guys have microdom, which is specifically designed to correct nonstandard HTML, and can probably be easily lifted out of the package.

    Geoff
  7. This is great stuff. Thanks. The XML parser from:  http://pyxml.sourceforge.net/topics/download.html
    works well.
    I added a little patch for titles:

      def getTitle(self):
          try:
              titleNode = self.__HtmlDom.getElementsByTagName('title')[0]
              return self.__getInnerText(titleNode)
          except:
              return ""
    
    Kevin Dahlhausen
  8. I used PyXML as well and indeed it works fine. I just noticed that if I install it with the setup.py defaults, it gets installed in site-packages/_xmlplus… Python can’t find it there; it must sit in site-packages/xml. (I may be wrong, I’m a newbie at Python.)

    Stefan Champailler
  9. I have had similar problems. I chose to compile tidy as a standalone tool, pipe the data I needed to parse through it using popen2, and then parse the result. No problem, except figuring out the encoding stuff for strange Swedish letters.

    Martin Eliasson
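
A rough sketch of the pipe-through-tidy approach described above, written for a current Python 3 where `subprocess` has replaced `popen2`. The helper name `pipe_through` is invented here, the tidy flags are an assumed invocation, and a stand-in filter that simply echoes its input is used so the example runs without tidy installed.

```python
import subprocess
import sys


def pipe_through(command, text, encoding="utf-8"):
    """Send text to an external filter on stdin and return its stdout as text."""
    # No check=True: tidy signals mere warnings with a non-zero exit status.
    result = subprocess.run(
        command,
        input=text.encode(encoding),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # Decode explicitly; relying on a platform default is where non-ASCII
    # letters such as å, ä and ö tend to get mangled.
    return result.stdout.decode(encoding)


# Assumed tidy invocation: -q (quiet), -asxml (emit XHTML), -utf8 (UTF-8 in and out).
TIDY_COMMAND = ["tidy", "-q", "-asxml", "-utf8"]

# Stand-in filter for demonstration: copies stdin to stdout unchanged.
ECHO_COMMAND = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]

original = "<p>Smörgåsbord</p>"
print(pipe_through(ECHO_COMMAND, original) == original)
# prints True
```

With tidy on the PATH, `pipe_through(TIDY_COMMAND, html)` would return the cleaned-up markup, ready to hand to an XML parser. Passing bytes in both directions and naming the encoding on each side is one way to sidestep the encoding trouble mentioned above.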
  10. Geoff: I did use microdom first, but I stopped using it in favour of HtmlLib (although annoyingly I can’t remember why!).

    sil
  11. Sounds like the Debian package is PyXML; non-Debian people can simply download and install PyXML from the link Kevin gives.

    Stefan: PyXML should work fine installed in libs/site-packages/_xmlplus; the base Python distribution’s libs/xml/__init__.py has a special hook which loads it in place of the basic libraries.

    James Kew
  12. Being the person who maintains the python-xml package, I feel obliged to say that there’s no “collection of different stuff” in Debian packages. What often happens is that one upstream package ends up split into several binary packages (which is the case for the Python interpreter). python-xml is PyXML as released on http://pyxml.sf.net/ (with the xbel parts split off into separate packages).

    Alexandre Fayolle
  13. I used the following modification to be able to use images as submit buttons:

      def submit(self, submitName=None):
          """Submit the currently selected L{form}.

          @param submitName: In case there are several submit buttons/images,
              one can select the one to be used by specifying its name here.
          @type submitName: string
          @raise NoFormSpecifiedError: If no form is currently selected.
          """
          if not self.__form: raise NoFormSpecifiedError
          method = self.__form.getAttribute('method') or 'GET'
          action = self.__form.getAttribute('action') or self.__uri

          if submitName:
              submitters = self.__form.getElementsByTagName('input')
              matchingSubmitters = [x for x in submitters
                                    if x.getAttribute('name') == submitName]

              try:
                  submitter = matchingSubmitters[0]
                  submitterType = submitter.getAttribute('type')
                  print "OK to submit for " + submitName + " of type " + submitterType
                  if submitterType.lower() == "image":
                      # Some hopefully harmless defaults...
                      self.__form.fieldValues[submitName + ".x"] = "1"
                      self.__form.fieldValues[submitName + ".y"] = "1"
              except:
                  print "Impossible to submit via " + submitName
                  raise FieldNotFound, submitName

          self.get(action, method, self.__form.fieldValues)
    
    S. Champailler
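
The `.x` and `.y` fields in the patch above mirror what a browser sends when an image button is clicked: the click position, under the button's name with `.x` and `.y` appended. A small stand-alone sketch of the resulting form data, assuming a current Python 3; `add_image_click`, the `q` field and the button name `go` are made up for the example.

```python
from urllib.parse import urlencode


def add_image_click(field_values, submit_name, x=1, y=1):
    """Return a copy of field_values with the image button's click position added."""
    values = dict(field_values)
    values[submit_name + ".x"] = str(x)
    values[submit_name + ".y"] = str(y)
    return values


fields = add_image_click({"q": "python"}, "go")
print(urlencode(fields))
# prints q=python&go.x=1&go.y=1
```

Returning a copy rather than mutating the form's stored values means a later submit through a different button does not carry stale coordinates along.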
  14. Getting an error I can’t resolve.  I am guessing that this arises when a returned page is over a certain size (64k).

    I am using the standard package distributions from OpenBSD 3.4: Python 2.21 and PyXML 0.7.1.

    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "browser.py", line 233, in get
        self.__htmldom = self.__reader.fromString(self.__data)
      File "/usr/obj/i386/py-xml-0.7.1/fake-i386/usr/local/lib/python2.2/site-packages/_xmlplus/dom/ext/reader/HtmlLib.py", line 70, in fromString
      File "/usr/obj/i386/py-xml-0.7.1/fake-i386/usr/local/lib/python2.2/site-packages/_xmlplus/dom/ext/reader/HtmlLib.py", line 28, in fromStream
      File "/usr/obj/i386/py-xml-0.7.1/fake-i386/usr/local/lib/python2.2/site-packages/_xmlplus/dom/ext/reader/Sgmlop.py", line 57, in parse
    ValueError: character reference too large

    Any thoughts anyone?

    mk
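
Going only by the wording of the message, the error above may have less to do with page size than with a numeric character reference (something like `&#8217;`) whose value is larger than the parser accepts; that reading is a guess, not something confirmed in this thread. Under that assumption, one workaround is to rewrite such references before handing the page to the parser. The sketch below assumes a current Python 3; the function name, the limit of 255 and the `?` placeholder are all arbitrary choices for illustration.

```python
import re

# Matches decimal (&#8217;) and hexadecimal (&#x2019;) character references.
CHARREF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def replace_large_charrefs(html, max_code=255, placeholder="?"):
    """Replace numeric character references above max_code with a placeholder."""
    def swap(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return match.group(0) if code <= max_code else placeholder
    return CHARREF.sub(swap, html)


print(replace_large_charrefs("caf&#233; &#8217;quoted&#x2019; &#99999999;"))
# prints caf&#233; ?quoted? ?
```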
  15. Hi, another newbie I’m afraid. I’ve picked up on Browser.py so that I could process my yahoo email (I’m using a Perl program to do so at the moment). Unfortunately I am getting errors that I don’t understand. The first occurs when simply changing the example in the documentation from http://www.yahoo.com to http://www.yahoo.co.uk, I get the following error …

    >>> from browser import Browser
    >>> b=Browser()
    >>> b.get('http://www.yahoo.co.uk/')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "browser.py", line 233, in get
        self.__htmldom = self.__reader.fromString(self.__data)
      File "C:\Python23\Lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 69, in fromString
        return self.fromStream(stream, ownerDoc, charset)
      File "C:\Python23\Lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 27, in fromStream
        self.parser.parse(stream)
      File "C:\Python23\Lib\site-packages\_xmlplus\dom\ext\reader\Sgmlop.py", line 57, in parse
        self._parser.parse(stream.read())
      File "C:\Python23\Lib\site-packages\_xmlplus\dom\ext\reader\Sgmlop.py", line 160, in finish_starttag
        unicode(value, self._charset))
      File "browser.py", line 95, in newSetAttributeNS
        Element.setAttributeNS(self,ns,qname.upper(),value)
      File "C:\Python23\Lib\site-packages\_xmlplus\dom\Element.py", line 170, in setAttributeNS
        raise InvalidCharacterErr()
    xml.dom.InvalidCharacterErr: Invalid or illegal character
    >>>

    The error is way beyond my knowledge so any comments would be welcome.

    The second hurdle comes when, using yahoo.com, I get to the login page. I can correctly load the form, but when setting field values and submitting, the following error occurs. This appears to be related to https, but again the error is beyond me.

    >>> from browser import Browser
    >>> b=Browser()
    >>> b.get('http://www.yahoo.com/')
    >>> b.follow_link('Mail')
    >>> b.dump_forms()
    Form login_form
      Action: https://login.yahoo.com/config/login?1c907898i8vmr
      Method: POST
      Hidden: .tries 1
      Hidden: .src ym
      Hidden: .md5 (no value)
      Hidden: .hash (no value)
      Hidden: .js (no value)
      Hidden: .last (no value)
      Hidden: promo (no value)
      Hidden: .intl us
      Hidden: .bypass (no value)
      Hidden: .partner (no value)
      Hidden: .u 06ta7l8vvr0ds
      Hidden: .v 0
      Hidden: .challenge hmb1p1cZHX45VKiGm9wQchCdirYw
      Hidden: .yplus (no value)
      Hidden: .emailCode (no value)
      Hidden: pkg (no value)
      Hidden: stepid (no value)
      Hidden: .ev (no value)
      Hidden: hasMsgr 0
      Hidden: .chkP Y
      Hidden: .done http://mail.yahoo.com
      Textbox: login (no value)
      Password: passwd (no value)
      Checkbox: .persistent y (off)
      Button: .save Sign In
    >>> b.form('login_form')
    >>> b.field('login','fred')
    >>> b.submit()
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "browser.py", line 340, in submit
        self.get(action,method,self.__form.fieldValues)
      File "browser.py", line 221, in get
        fp = ClientCookie.urlopen(newuri,urllib.urlencode(data))
      File "C:\Python23\Lib\site-packages\ClientCookie\_urllib2_support.py", line 829, in urlopen
        return _opener.open(url, data)
      File "C:\Python23\Lib\site-packages\ClientCookie\_urllib2_support.py", line 520, in open
        response = urllib2.OpenerDirector.open(self, req, data)
      File "C:\Python23\lib\urllib2.py", line 338, in open
        'unknown_open', req)
      File "C:\Python23\lib\urllib2.py", line 313, in _call_chain
        result = func(*args)
      File "C:\Python23\lib\urllib2.py", line 862, in unknown_open
        raise URLError('unknown url type: %s' % type)
    urllib2.URLError: <urlopen error unknown url type: https>
    >>>

    Comments and help appreciated

    Phil Rubini
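
The `unknown url type: https` error in the second traceback is what urllib2 reports when no handler for the https scheme is registered. One common cause is a Python build without SSL support, although the traceback alone cannot rule out other causes, such as how the ClientCookie opener was put together. A quick check, sketched for a current Python 3 where urllib2 has become `urllib.request`; the function name is invented for the example.

```python
import urllib.request


def https_supported():
    """Return True if urllib in this interpreter defines an HTTPS handler."""
    # urllib.request only defines HTTPSHandler when the interpreter
    # was built with SSL support.
    return hasattr(urllib.request, "HTTPSHandler")


print("https supported:", https_supported())
```

If this reports False, the fix lies with the Python installation rather than with the script making the request.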
  16. Can someone e-mail me and tell me what kind of program Python 2.21 is and what it is used for in XP?

    G. Belzman
  17. Gary,
    Python is a programming language. Browser.py, the subject of this page, is an automated web testing tool which is written in Python (and therefore needs Python present on your system to run). Get Python from http://www.python.org/. If you do any programming then you will find it a simpler and more powerful way to work than whatever you’re currently using.

    sil
  18. This program cannot be interpreted correctly at all. It keeps telling me that it cannot find a specific module, xml.dom.ext.reader. Just an FYI.

    Weezy
  19. I can’t go to Yahoo email and Yahoo Messenger. Can you tell me what the reason is?

    amie katindig
