This is as days pass by, by Stuart Langridge

And this is Browser.py, written , and concerning Uncategorized

Another bit of code put together: this time, an automated web browser for Python. It's something like Perl's WWW::Mechanize -- use it to navigate to a page, follow links, fill out forms, and the like. Get the code, or look at the documentation or the syntax-highlighted code.

Comments

Simon Willison

It's not working for me - what do I need to install for the "from xml.dom.ext.reader import HtmlLib" bit to work?

sil

Darn, I thought that came with Python. :) It's the Python/XML distribution (Debian package python2.3-xml) -- I think the HtmlLib stuff is from 4DOM, the Fourthought Python XML stuff. It appears that the Debian people collect together a load of Python stuff and bung it into one package, which is nice of them but makes it difficult to recommend a package to non-Debian-Linux-people.


I'm open to other suggestions for an HTML parser; all I really need is something that can parse even broken HTML and give me a DOM tree out of it. I tried using the Twisted people's microdom first but gave it up in favour of HtmlLib, and even that I had to patch twice in the code (once to cope with bad namespaced elements, and once to fix a bug). Any better suggestions for an HTML -> DOM parser in Python, say the word, especially if it's either easy to distribute with browser.py or comes with Python by default!
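As an illustration of what "parse even broken HTML and give me a tree" can look like, here is a minimal sketch using only the standard library of modern Python 3 (`html.parser`), which was not an option when this thread was written. The `Node` class, `TolerantTreeBuilder`, and `parse_html` are hypothetical names made up for this example; they are not part of browser.py, HtmlLib, or microdom, and the tree is far simpler than a real DOM.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """A minimal tree node; a hypothetical stand-in for a real DOM element."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Node objects and plain text strings

    def find_all(self, tag):
        """Return all descendant nodes with the given tag, in document order."""
        found = []
        for child in self.children:
            if isinstance(child, Node):
                if child.tag == tag:
                    found.append(child)
                found.extend(child.find_all(tag))
        return found

    def text(self):
        """Concatenate all text beneath this node."""
        return "".join(c if isinstance(c, str) else c.text()
                       for c in self.children)


class TolerantTreeBuilder(HTMLParser):
    """Build a Node tree, ignoring close tags that match nothing open."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag. If there is
        # none, the stray close tag is dropped instead of raising an error.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def parse_html(markup):
    """Parse possibly broken HTML and return the root Node."""
    builder = TolerantTreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root
```

With this sketch, unclosed elements are closed implicitly when an ancestor closes, and a close tag with no matching open tag is skipped, so badly nested markup yields a tree rather than an exception.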

Chris

You might want to take a look at Fredrik Lundh's ElementTidy, or mxTidy, both of which use a library version of Dave Raggett's HTML Tidy utility to fix any problems with the HTML before converting it to XHTML (the XML version of HTML).


http://www.effbot.org/zone/element-tidylib.htm

sil

SimpleDOM.NestingError: Open tags <html>, <body>, <table>, <tr>, <td>, <table>, <tr>, <td>, <script> do not match close tag </iframe>, at line 76, column 290

The major problem here is needing to parse invalid HTML, which most of the web is. The effbot interface to tidylib would work but requires a C extension. mxTidy certainly didn't use to be an interface to tidylib -- instead, it called the tidy executable directly. There are other tidylib interfaces, some of which need ctypes or similar...

Geoff

The twisted guys have microdom, which is specifically designed to correct nonstandard HTML, and can probably be easily lifted out of the package.

Kevin Dahlhausen

This is great stuff. Thanks. The XML parser from http://pyxml.sourceforge.net/topics/download.html works well.

I added a little patch for titles:


  def getTitle(self):
      """Return the text of the page's first <title> element, or "" if none."""
      try:
          titleNode = self.__HtmlDom.getElementsByTagName('title')[0]
          return self.__getInnerText(titleNode)
      except:
          return ""

Stefan Champailler

I used PyXML as well and indeed it works fine. I just noticed that if I install it with setup.py's defaults, it gets installed in site-packages/_xmlplus... Python can't find it there; it must sit in site-packages/xml. (I may be wrong, I'm a newbie at Python.)

Martin Eliasson

I have had similar problems. I chose to compile tidy as a standalone tool, pipe the data I needed to parse through it using popen2, and then parse the result. No problem, except figuring out the encoding stuff for strange Swedish letters.
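The pipe-through-tidy approach can be sketched as follows. This is a rough illustration in modern Python 3, using `subprocess` rather than the `popen2` module mentioned above, and it assumes a `tidy` executable is on the PATH. The helper names `pipe_through` and `tidy_html` are made up for this example. Encoding and decoding explicitly as UTF-8, and passing `-utf8` to tidy, is one way to sidestep the kind of encoding trouble described above.

```python
import subprocess

# Assumed command line; requires the `tidy` executable to be installed.
TIDY_CMD = ["tidy", "-q", "-asxhtml", "-utf8"]


def pipe_through(cmd, data, encoding="utf-8"):
    """Send text to a command's stdin and return (exit_code, stdout_text)."""
    result = subprocess.run(cmd, input=data.encode(encoding),
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return result.returncode, result.stdout.decode(encoding, errors="replace")


def tidy_html(markup):
    """Clean up markup with tidy and return the XHTML it produces.

    tidy exits with 1 when it only issued warnings, so only higher
    exit codes are treated as failures here.
    """
    code, cleaned = pipe_through(TIDY_CMD, markup)
    if code > 1:
        raise RuntimeError("tidy reported errors (exit code %d)" % code)
    return cleaned
```

The output of `tidy_html` could then be handed to any ordinary XML parser, since tidy has already repaired the markup.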

sil

Geoff: I did use microdom first, but I stopped using it in favour of HtmlLib (although annoyingly I can't remember why!).

James Kew

Sounds like the Debian package is PyXML -- non-Debian people can simply download and install PyXML from the link Kevin gives.


Stefan: PyXML should work fine installed in lib/site-packages/_xmlplus; the base Python distribution's lib/xml/__init__.py has a special hook which loads it in place of the basic libraries.
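The idea behind such a hook can be imitated in a few lines. The sketch below is not the code from Python's xml/__init__.py; it only illustrates the general technique of aliasing one module name to an enhanced replacement in `sys.modules` when that replacement is importable and recent enough. The function name `prefer_enhanced` and the `version_info` check are assumptions made for this example.

```python
import importlib
import sys


def prefer_enhanced(basic_name, enhanced_name, minimum_version):
    """Make imports of basic_name resolve to the enhanced module.

    Returns True if the replacement was made, False if the enhanced
    module is missing or older than minimum_version.
    """
    try:
        enhanced = importlib.import_module(enhanced_name)
    except ImportError:
        return False  # enhanced package not installed; keep the basic one
    if getattr(enhanced, "version_info", (0,)) < minimum_version:
        return False  # too old to be trusted as a replacement
    # Later "import basic_name" statements now get the enhanced module.
    sys.modules[basic_name] = enhanced
    return True
```

To see which implementation is actually in use on a given system, printing the `__file__` attribute of the imported package shows the directory it was loaded from.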

Alexandre Fayolle

Being the person who maintains the python-xml package, I feel obliged to say that there's no "collection of different stuff" in Debian packages. What often happens is one upstream package ending up split into several binary packages (which is the case for the Python interpreter). python-xml is PyXML as released on http://pyxml.sf.net/ (with the xbel parts split off in separate packages).

S. Champailler

I used the following modification to be able to use images as submit buttons:



  def submit(self, submitName=None):
      """Submit the currently selected L{form}.

      @param submitName: In case there are several submit buttons/images,
          one can select the one to be used by specifying its name here.
      @type submitName: string
      @raise NoFormSpecifiedError: If no form is currently selected.
      @raise FieldNotFound: If the form has no input named submitName.
      """
      if not self.__form: raise NoFormSpecifiedError
      method = self.__form.getAttribute('method') or 'GET'
      action = self.__form.getAttribute('action') or self.__uri

      if submitName:
          submitters = self.__form.getElementsByTagName('input')
          matchingSubmitters = [x for x in submitters
                                if x.getAttribute('name') == submitName]
          try:
              submitter = matchingSubmitters[0]
          except IndexError:
              # No input in the form carries the requested name.
              print("Impossible to submit via " + submitName)
              raise FieldNotFound(submitName)
          submitterType = submitter.getAttribute('type')
          print("OK to submit for " + submitName + " of type " + submitterType)
          if submitterType.lower() == "image":
              # Image buttons send click coordinates as name.x and name.y.
              # Some hopefully harmless defaults...
              self.__form.fieldValues[submitName + ".x"] = "1"
              self.__form.fieldValues[submitName + ".y"] = "1"

      self.get(action, method, self.__form.fieldValues)
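The reason for the extra `.x` and `.y` fields is that a browser submitting a form through an image button sends the click position as two values named after the button. A standalone sketch of that convention, written for modern Python 3 and independent of browser.py, might look like this; `with_image_click` is a hypothetical helper name, and the default coordinates of 1 simply mirror the patch above.

```python
from urllib.parse import urlencode


def with_image_click(fields, submit_name, x=1, y=1):
    """Return a copy of fields plus the click coordinates for an image button."""
    data = dict(fields)
    data[submit_name + ".x"] = str(x)
    data[submit_name + ".y"] = str(y)
    return data


# Example: a search form whose submit control is <input type="image" name="go">.
query = urlencode(with_image_click({"q": "python"}, "go"))
```

Here `query` holds the encoded form data that would be sent as the query string or POST body.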


mk

Getting an error I can't resolve.  I am guessing that this arises when a returned page is over a certain size (64k).


I am using the standard package distributions from OpenBSD 3.4: python 2.21 and PyXML 0.7.1.


Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "browser.py", line 233, in get
    self.__htmldom = self.__reader.fromString(self.__data)
  File "/usr/obj/i386/py-xml-0.7.1/fake-i386/usr/local/lib/python2.2/site-packages/_xmlplus/dom/ext/reader/HtmlLib.py", line 70, in fromString
  File "/usr/obj/i386/py-xml-0.7.1/fake-i386/usr/local/lib/python2.2/site-packages/_xmlplus/dom/ext/reader/HtmlLib.py", line 28, in fromStream
  File "/usr/obj/i386/py-xml-0.7.1/fake-i386/usr/local/lib/python2.2/site-packages/_xmlplus/dom/ext/reader/Sgmlop.py", line 57, in parse
ValueError: character reference too large


Any thoughts anyone?
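The wording of the ValueError suggests that the parser met a numeric character reference (something like `&#8217;`) with a value it could not handle, rather than a page-size limit. That reading is an assumption, and so is the threshold used below. One possible workaround, sketched here in modern Python 3, is to replace such references before the page is handed to the parser. The function name `replace_large_charrefs` and its `limit` and `replacement` parameters are hypothetical.

```python
import re

# Matches decimal (&#233;) and hexadecimal (&#xE9;) character references.
CHARREF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def replace_large_charrefs(markup, limit=255, replacement="?"):
    """Replace numeric character references whose value exceeds limit.

    The default limit of 255 is an assumption about what an 8-bit
    parser can represent; adjust it to suit the parser in use.
    """
    def fix(match):
        hex_digits, dec_digits = match.groups()
        value = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return match.group(0) if value <= limit else replacement

    return CHARREF.sub(fix, markup)
```

References at or below the limit are left untouched, so ordinary Latin-1 references such as `&#233;` survive the clean-up.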

Phil Rubini

Hi, another newbie I'm afraid. I've picked up on Browser.py so that I could process my Yahoo email (I'm using a Perl program to do so at the moment). Unfortunately I am getting errors that I don't understand. The first occurs when I simply change the example in the documentation from www.yahoo.com to www.yahoo.co.uk; I get the following error:


>>> from browser import Browser
>>> b=Browser()
>>> b.get('http://www.yahoo.co.uk/')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "browser.py", line 233, in get
    self.__htmldom = self.__reader.fromString(self.__data)
  File "C:\Python23\Lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 69, in fromString
    return self.fromStream(stream, ownerDoc, charset)
  File "C:\Python23\Lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 27, in fromStream
    self.parser.parse(stream)
  File "C:\Python23\Lib\site-packages\_xmlplus\dom\ext\reader\Sgmlop.py", line 57, in parse
    self._parser.parse(stream.read())
  File "C:\Python23\Lib\site-packages\_xmlplus\dom\ext\reader\Sgmlop.py", line 160, in finish_starttag
    unicode(value, self._charset))
  File "browser.py", line 95, in newSetAttributeNS
    Element.setAttributeNS(self,ns,qname.upper(),value)
  File "C:\Python23\Lib\site-packages\_xmlplus\dom\Element.py", line 170, in setAttributeNS
    raise InvalidCharacterErr()
xml.dom.InvalidCharacterErr: Invalid or illegal character
>>>


The error is way beyond my knowledge so any comments would be welcome.


The second hurdle comes when, using yahoo.com, I get to the login page. I can correctly load the form, but when setting field values and submitting, the following error occurs. This appears to be related to https, but again the error is beyond me.


>>> from browser import Browser
>>> b=Browser()
>>> b.get('http://www.yahoo.com/')
>>> b.follow_link('Mail')
>>> b.dump_forms()
Form login_form
  Action: https://login.yahoo.com/config/login?1c907898i8vmr
  Method: POST
  Hidden: .tries 1
  Hidden: .src ym
  Hidden: .md5 (no value)
  Hidden: .hash (no value)
  Hidden: .js (no value)
  Hidden: .last (no value)
  Hidden: promo (no value)
  Hidden: .intl us
  Hidden: .bypass (no value)
  Hidden: .partner (no value)
  Hidden: .u 06ta7l8vvr0ds
  Hidden: .v 0
  Hidden: .challenge hmb1p1cZHX45VKiGm9wQchCdirYw
  Hidden: .yplus (no value)
  Hidden: .emailCode (no value)
  Hidden: pkg (no value)
  Hidden: stepid (no value)
  Hidden: .ev (no value)
  Hidden: hasMsgr 0
  Hidden: .chkP Y
  Hidden: .done http://mail.yahoo.com
  Textbox: login (no value)
  Password: passwd (no value)
  Checkbox: .persistent y (off)
  Button: .save Sign In
>>> b.form('login_form')
>>> b.field('login','fred')
>>> b.submit()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "browser.py", line 340, in submit
    self.get(action,method,self.__form.fieldValues)
  File "browser.py", line 221, in get
    fp = ClientCookie.urlopen(newuri,urllib.urlencode(data))
  File "C:\Python23\Lib\site-packages\ClientCookie\_urllib2_support.py", line 829, in urlopen
    return _opener.open(url, data)
  File "C:\Python23\Lib\site-packages\ClientCookie\_urllib2_support.py", line 520, in open
    response = urllib2.OpenerDirector.open(self, req, data)
  File "C:\Python23\lib\urllib2.py", line 338, in open
    'unknown_open', req)
  File "C:\Python23\lib\urllib2.py", line 313, in _call_chain
    result = func(*args)
  File "C:\Python23\lib\urllib2.py", line 862, in unknown_open
    raise URLError('unknown url type: %s' % type)
urllib2.URLError: <urlopen error unknown url type: https>
>>>


Comments and help appreciated
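An "unknown url type: https" error from urllib2 typically means that no HTTPS handler was registered, which is what happens when the Python build in use lacks SSL support. Whether that is the cause in this particular case is an assumption. The check below is a sketch for modern Python 3, where the relevant modules are `http.client` and `urllib.request` rather than the Python 2.3 modules shown in the traceback; `https_supported` is a name made up for this example.

```python
import http.client
import urllib.request


def https_supported():
    """Return True if this interpreter can open https:// URLs.

    Both classes are only defined when the ssl module could be
    imported at the time the standard library modules were loaded.
    """
    return (hasattr(http.client, "HTTPSConnection")
            and hasattr(urllib.request, "HTTPSHandler"))


if __name__ == "__main__":
    print("HTTPS available:", https_supported())
```

If the check reports False, the usual remedy is to install a Python build that includes SSL support rather than to change the calling code.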

G. Belzman

can someone e-mail me and tell me what kind of program python 2.21 is and what is it used for in XP?

sil

Gary,
Python is a programming language. Browser.py, the subject of this page, is an automated web testing tool which is written in Python (and therefore needs Python present on your system to run). Get Python from http://www.python.org/. If you do any programming then you will find it a simpler and more powerful way to work than whatever you’re currently using.

Weezy

This program cannot be interpreted correctly at all. It keeps telling me that it cannot find a specific module, xml.dom.ext.reader. Just an FYI.

amie katindig

I can't get to Yahoo email and Yahoo Messenger. Can you tell me what’s the reason?
