How to Parse HTML in Mozilla

The question of how to parse HTML comes up frequently in extension development.

The usual answer involves loading the HTML into a hidden iframe, which requires a DOMWindow. Another option is the SAXParser that appeared in Firefox 2, though you don’t get a DOM that way.

Thankfully, the answer is going to be a lot easier in the next release. Rob Sayre has a patch for parsing “text/html” using the existing DOMParser API. Currently, DOMParser only handles “text/xml”, hence the warning at the top of the MDC page. This limits its usefulness given the lack of XHTML in the wild. The new feature will make it very easy to load HTML (even partial tags) into a DOM.
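
As a rough sketch of what this could look like, assuming a DOMParser whose parseFromString accepts “text/html” as described above: the helper name parseHtml and its ParserCtor parameter are made up for illustration and are not part of any Mozilla API.

```javascript
// Sketch only: parse an HTML string into a DOM document via DOMParser.
// Assumption: the available DOMParser accepts the "text/html" content type.
// `parseHtml` and `ParserCtor` are hypothetical names used for illustration.
function parseHtml(html, ParserCtor) {
  // Use the supplied constructor, or fall back to the global DOMParser.
  const Ctor = ParserCtor || globalThis.DOMParser;
  if (typeof Ctor !== "function") {
    throw new Error("No DOMParser available in this environment");
  }
  const parser = new Ctor();
  // The second argument selects the HTML parser instead of the XML one.
  return parser.parseFromString(html, "text/html");
}

// Usage, only where a DOMParser global exists (e.g. in a window context).
// The input deliberately leaves its tags unclosed.
if (typeof DOMParser !== "undefined") {
  const doc = parseHtml("<p>Hello <b>world");
  console.log(doc.body.textContent);
}
```

Passing the constructor in as a parameter is just a convenience for trying the function outside a window; in an extension you would normally rely on the DOMParser that is already in scope.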

I need to remember to update those MDC pages when the patch lands.

3 Replies to “How to Parse HTML in Mozilla”

  1. I have been waiting for this for far too long.
    Also, I think it would be better if the DOMParser could work
    without a window involved.

    I find there are always times when I need a library for parsing
    HTML into a DOM. Maybe Mozilla could provide a convenient one.

  2. I’ve been hacking around with JRuby to try to get an Hpricot-like lib built around embedded Gecko+JavaXPCom+mozdom4java, and I have it at least getting all the libs loaded.

    Is there anyone else trying to use headless-Gecko to parse HTML and get the programmatic DOM?

    ~L
