<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Browser.py</title>
	<atom:link href="http://www.kryogenix.org/days/2003/11/30/pybrowser/feed" rel="self" type="application/rss+xml" />
	<link>http://www.kryogenix.org/days/2003/11/30/pybrowser</link>
	<description>scratched tallies on the prison wall</description>
	<pubDate>Wed, 03 Dec 2008 23:10:39 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.5</generator>
		<item>
		<title>By: Simon Willison</title>
		<link>http://www.kryogenix.org/days/2003/11/30/pybrowser#comment-1956</link>
		<dc:creator>Simon Willison</dc:creator>
		<pubDate>Thu, 01 Jan 1970 01:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.kryogenix.org/adpb/2003/11/30/pybrowser/#comment-1956</guid>
		<description>It's not working for me - what do I need to install for the "from xml.dom.ext.reader import HtmlLib" bit to work?</description>
		<content:encoded><![CDATA[<p>It&#8217;s not working for me - what do I need to install for the &#8220;from xml.dom.ext.reader import HtmlLib&#8221; bit to work?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: sil</title>
		<link>http://www.kryogenix.org/days/2003/11/30/pybrowser#comment-1957</link>
		<dc:creator>sil</dc:creator>
		<pubDate>Thu, 01 Jan 1970 01:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.kryogenix.org/adpb/2003/11/30/pybrowser/#comment-1957</guid>
		<description>Darn, I thought that came with Python. :) It's the Python/XML distribution (Debian package python2.3-xml) -- I think the HtmlLib stuff is from 4DOM, the Fourthought Python XML stuff. It appears that the Debian people collect together a load of Python stuff and bung it into one package, which is nice of them but makes it difficult to recommend a package to non-Debian-Linux-people.&lt;br /&gt;
&lt;br /&gt;
I'm open to other suggestions for an HTML parser; all I really need is something that can parse even broken HTML and give me a DOM tree out of it. I tried using the Twisted people's microdom first but gave it up in favour of HtmlLib, and even that I had to patch twice in the code (once to cope with bad namespaced elements, and once to fix a bug). If you have any better suggestions for an HTML -&#62; DOM parser in Python, say the word, especially if it's either easy to distribute with browser.py or comes with Python by default!</description>
		<content:encoded><![CDATA[<p>Darn, I thought that came with Python. :) It&#8217;s the Python/XML distribution (Debian package python2.3-xml) -- I think the HtmlLib stuff is from 4DOM, the Fourthought Python XML stuff. It appears that the Debian people collect together a load of Python stuff and bung it into one package, which is nice of them but makes it difficult to recommend a package to non-Debian-Linux-people.</p>
<p>I&#8217;m open to other suggestions for an HTML parser; all I really need is something that can parse even broken HTML and give me a DOM tree out of it. I tried using the Twisted people&#8217;s microdom first but gave it up in favour of HtmlLib, and even that I had to patch twice in the code (once to cope with bad namespaced elements, and once to fix a bug). If you have any better suggestions for an HTML -&gt; DOM parser in Python, say the word, especially if it&#8217;s either easy to distribute with browser.py or comes with Python by default!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Chris</title>
		<link>http://www.kryogenix.org/days/2003/11/30/pybrowser#comment-1958</link>
		<dc:creator>Chris</dc:creator>
		<pubDate>Thu, 01 Jan 1970 01:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.kryogenix.org/adpb/2003/11/30/pybrowser/#comment-1958</guid>
		<description>You might want to take a look at Fredrik Lundh's ElementTidy or mxTidy, both of which use a library version of Dave Raggett's HTML Tidy utility to fix any problems with the HTML before converting it to XHTML (the XML version of HTML).&lt;br /&gt;
&lt;br /&gt;
&lt;a href="http://www.effbot.org/zone/element-tidylib.htm"&gt;http://www.effbot.org/zone/element-tidylib.htm&lt;/a&gt;</description>
		<content:encoded><![CDATA[<p>You might want to take a look at Fredrik Lundh&#8217;s ElementTidy or mxTidy, both of which use a library version of Dave Raggett&#8217;s HTML Tidy utility to fix any problems with the HTML before converting it to XHTML (the XML version of HTML).</p>
<p><a href="http://www.effbot.org/zone/element-tidylib.htm">http://www.effbot.org/zone/element-tidylib.htm</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Henning</title>
		<link>http://www.kryogenix.org/days/2003/11/30/pybrowser#comment-1959</link>
		<dc:creator>Henning</dc:creator>
		<pubDate>Thu, 01 Jan 1970 01:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.kryogenix.org/adpb/2003/11/30/pybrowser/#comment-1959</guid>
		<description>The webunit package contains a module called SimpleDOM:&lt;br /&gt;
&lt;a href="http://www.mechanicalcat.net/tech/webunit/README.html#simpledom"&gt;http://www.mechanicalcat.net/tech/webunit/README.html#simpledom&lt;/a&gt;&lt;br /&gt;
It seems useful. Besides, webunit seems in some ways similar to Browser.py.</description>
		<content:encoded><![CDATA[<p>The webunit package contains a module called SimpleDOM:<br />
<a href="http://www.mechanicalcat.net/tech/webunit/README.html#simpledom">http://www.mechanicalcat.net/tech/webunit/README.html#simpledom</a><br />
It seems useful. Besides, webunit seems in some ways similar to Browser.py.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: sil</title>
		<link>http://www.kryogenix.org/days/2003/11/30/pybrowser#comment-1960</link>
		<dc:creator>sil</dc:creator>
		<pubDate>Thu, 01 Jan 1970 01:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.kryogenix.org/adpb/2003/11/30/pybrowser/#comment-1960</guid>
		<description>&lt;code&gt;SimpleDOM.NestingError: Open tags
&#60;html&#62;, &#60;body&#62;, &#60;table&#62;,
&#60;tr&#62;, &#60;td&#62;, &#60;table&#62;,
&#60;tr&#62;, &#60;td&#62;, &#60;script&#62; do not
match close tag &#60;/iframe&#62;, at line 76, column
290&lt;/code&gt;&lt;br /&gt;
The major problem here is needing to parse invalid HTML, which most of the web is. The effbot interface to tidylib would work but requires a C extension. mxTidy certainly didn't use to be an interface to tidylib -- instead, it called the tidy executable directly. There are other tidylib interfaces, some of which need ctypes or similar...</description>
		<content:encoded><![CDATA[<p><code>SimpleDOM.NestingError: Open tags<br />
&lt;html&gt;, &lt;body&gt;, &lt;table&gt;,<br />
&lt;tr&gt;, &lt;td&gt;, &lt;table&gt;,<br />
&lt;tr&gt;, &lt;td&gt;, &lt;script&gt; do not<br />
match close tag &lt;/iframe&gt;, at line 76, column<br />
290</code><br />
The major problem here is needing to parse invalid HTML, which most of the web is. The effbot interface to tidylib would work but requires a C extension. mxTidy certainly didn&#8217;t use to be an interface to tidylib -- instead, it called the tidy executable directly. There are other tidylib interfaces, some of which need ctypes or similar&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Geoff</title>
		<link>http://www.kryogenix.org/days/2003/11/30/pybrowser#comment-1961</link>
		<dc:creator>Geoff</dc:creator>
		<pubDate>Thu, 01 Jan 1970 01:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.kryogenix.org/adpb/2003/11/30/pybrowser/#comment-1961</guid>
		<description>The Twisted guys have microdom, which is specifically designed to correct nonstandard HTML, and can probably be easily lifted out of the package.</description>
		<content:encoded><![CDATA[<p>The Twisted guys have microdom, which is specifically designed to correct nonstandard HTML, and can probably be easily lifted out of the package.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Kevin Dahlhausen</title>
		<link>http://www.kryogenix.org/days/2003/11/30/pybrowser#comment-1962</link>
		<dc:creator>Kevin Dahlhausen</dc:creator>
		<pubDate>Thu, 01 Jan 1970 01:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.kryogenix.org/adpb/2003/11/30/pybrowser/#comment-1962</guid>
		<description>This is great stuff. Thanks. The XML parser from &lt;a href="http://pyxml.sourceforge.net/topics/download.html"&gt;http://pyxml.sourceforge.net/topics/download.html&lt;/a&gt;&lt;br /&gt;
works well.&lt;br /&gt;
I added a little patch for titles:&lt;br /&gt;
&lt;pre&gt;
def getTitle(self):
    try:
        titleNode = self.__HtmlDom.getElementsByTagName('title')[0]
        return self.__getInnerText(titleNode)
    except IndexError:
        # No title element in the document.
        return ""
&lt;/pre&gt;</description>
		<content:encoded><![CDATA[<p>This is great stuff. Thanks. The XML parser from <a href="http://pyxml.sourceforge.net/topics/download.html">http://pyxml.sourceforge.net/topics/download.html</a><br />
works well.<br />
I added a little patch for titles:</p>
<pre>
def getTitle(self):
    try:
        titleNode = self.__HtmlDom.getElementsByTagName('title')[0]
        return self.__getInnerText(titleNode)
    except IndexError:
        # No title element in the document.
        return ""
</pre>
]]></content:encoded>
	</item>
	<item>
		<title>By: Stefan Champailler</title>
		<link>http://www.kryogenix.org/days/2003/11/30/pybrowser#comment-1963</link>
		<dc:creator>Stefan Champailler</dc:creator>
		<pubDate>Thu, 01 Jan 1970 01:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.kryogenix.org/adpb/2003/11/30/pybrowser/#comment-1963</guid>
		<description>I used PyXML as well and indeed it works fine. I just noticed that if I install it with setup.py's defaults, it gets installed in site-packages/_xmlplus... Python can't find it there; it must sit in site-packages/xml. (I may be wrong, I'm a newbie at Python.)</description>
		<content:encoded><![CDATA[<p>I used PyXML as well and indeed it works fine. I just noticed that if I install it with setup.py&#8217;s defaults, it gets installed in site-packages/_xmlplus&#8230; Python can&#8217;t find it there; it must sit in site-packages/xml. (I may be wrong, I&#8217;m a newbie at Python.)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Martin Eliasson</title>
		<link>http://www.kryogenix.org/days/2003/11/30/pybrowser#comment-1964</link>
		<dc:creator>Martin Eliasson</dc:creator>
		<pubDate>Thu, 01 Jan 1970 01:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.kryogenix.org/adpb/2003/11/30/pybrowser/#comment-1964</guid>
		<description>I have had similar problems. I chose to compile tidy as a standalone tool, pipe the data I needed to parse through it using popen2, and then parse it. No problem, except figuring out the encoding stuff for strange Swedish letters.</description>
		<content:encoded><![CDATA[<p>I have had similar problems. I chose to compile tidy as a standalone tool, pipe the data I needed to parse through it using popen2, and then parse it. No problem, except figuring out the encoding stuff for strange Swedish letters.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: sil</title>
		<link>http://www.kryogenix.org/days/2003/11/30/pybrowser#comment-1965</link>
		<dc:creator>sil</dc:creator>
		<pubDate>Thu, 01 Jan 1970 01:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.kryogenix.org/adpb/2003/11/30/pybrowser/#comment-1965</guid>
		<description>Geoff: I did use microdom first, but I stopped using it in favour of HtmlLib (although annoyingly I can't remember why!).</description>
		<content:encoded><![CDATA[<p>Geoff: I did use microdom first, but I stopped using it in favour of HtmlLib (although annoyingly I can&#8217;t remember why!).</p>
]]></content:encoded>
	</item>
</channel>
</rss>
