How to Parse HTML in Mozilla

The question comes up frequently in extension development.

The usual answer involves loading the HTML into a hidden iframe, which requires a DOMWindow. Another option is the SAXParser that appeared in Firefox 2, though you don’t get a DOM that way.

Thankfully, the answer is going to be a lot easier in the next release. Rob Sayre has a patch for parsing “text/html” using the existing DOMParser API. Currently, DOMParser only handles “text/xml”, hence the warning at the top of the MDC page. This limits its usefulness given the lack of XHTML in the wild. The new feature will make it very easy to load HTML (even partial tags) into a DOM.
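For illustration, here is a minimal sketch of what calling the API might look like, assuming the patch lands as described and parseFromString accepts “text/html”. The names HtmlParserLike and parseHtml are hypothetical helpers invented for this example, not part of any Mozilla API; the only real call is parseFromString(source, mimeType), which DOMParser already exposes.

```typescript
// Hypothetical structural type: anything with a DOMParser-style
// parseFromString method. Declared here so the sketch does not
// depend on browser typings.
interface HtmlParserLike<Doc> {
  parseFromString(source: string, mimeType: string): Doc;
}

// Hypothetical helper: hand an HTML string to the parser using the
// "text/html" MIME type instead of "text/xml".
// Assumption: the parser accepts "text/html", as the patch proposes.
function parseHtml<Doc>(parser: HtmlParserLike<Doc>, html: string): Doc {
  return parser.parseFromString(html, "text/html");
}
```

In a window context, the intended use would be something like parseHtml(new DOMParser(), "<p>Hello <b>world"), with the returned document then queried like any other DOM. How unclosed or partial markup is repaired is up to the parser, so the sketch makes no promise about the resulting tree.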

I need to remember to update those MDC pages when the patch lands.

3 Comments

  1. Bo Yang said,

    March 10, 2007 @ 4:00 am

    I have been waiting for this for a long time.
    Also, I think it would be better if DOMParser could work
    without a window involved.

    I find there are always times when I need a library for parsing
    HTML into a DOM. Maybe Mozilla can provide a convenient one.

  2. leslie said,

    March 22, 2007 @ 1:50 pm

    I’ve been hacking around with JRuby to try to get an Hpricot-like lib built around embedded Gecko+JavaXPCom+mozdom4java, and I have it at least getting all the libs loaded.

    Is there anyone else trying to use headless-Gecko to parse HTML and get the programmatic DOM?

    ~L

  3. Better Flash Geek said,

    March 30, 2007 @ 7:23 pm

    Hidden iframe sounds and feels ancient nowadays ((( It’s a pity we still need to hack to get essential things.
