Posted to modules-dev@httpd.apache.org by Whut  Jia <wh...@163.com> on 2011/03/24 13:10:46 UTC

how to parse html content in handler

Hi all,
I want to parse HTML content and extract some elements in my own Apache handler. Could you tell me how to do it?
Thanks,
Jia

Re: how to parse html content in handler

Posted by MK <mk...@cognitivedissonance.ca>.
On Thu, 24 Mar 2011 22:58:07 +0800 (CST)
"Whut  Jia" <wh...@163.com> wrote:
> Hi,
> Thank you!
> But I want to parse a JSP page in my handler. How can I do it?

I've never used JSP, but it looks to me like all the processing
instructions are in the form <% instruction %>.  If that's the case,
the simple parser I mentioned in my other post rips XML-style PIs,
which are in the form <? instruction ?>.  So a very simple tweak to
the source code would do that.
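
A minimal sketch of that kind of scan, for illustration only. This is not
MK's parser, which is not shown in this thread. It assumes the buffer is a
NUL-terminated string that still contains <% ... %> blocks, and it does not
handle a "%>" that appears inside a quoted string. The names
scan_jsp_blocks, block_handler and print_block are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Callback invoked once per <% ... %> block.
 * body points just past "<%" and is NOT NUL-terminated; use len. */
typedef void (*block_handler)(const char *body, size_t len, void *user);

/* Scan buf for <% ... %> blocks and pass each body to on_block.
 * Returns the number of complete blocks found. An unterminated "<%"
 * stops the scan and is not counted. on_block may be NULL. */
size_t scan_jsp_blocks(const char *buf, block_handler on_block, void *user)
{
    size_t count = 0;
    const char *p = buf;

    assert(buf != NULL);

    while ((p = strstr(p, "<%")) != NULL) {
        const char *body = p + 2;
        const char *end = strstr(body, "%>");
        if (end == NULL)
            break;                      /* no closing delimiter */
        if (on_block != NULL)
            on_block(body, (size_t)(end - body), user);
        count++;
        p = end + 2;                    /* resume after "%>" */
    }
    return count;
}

/* Example handler: print each block body in brackets, one per line. */
void print_block(const char *body, size_t len, void *user)
{
    (void)user;
    printf("[%.*s]\n", (int)len, body);
}
```

Calling scan_jsp_blocks(page, print_block, NULL) on a buffer would print
the body of every block it finds. Changing the two delimiter strings to
"<?" and "?>" gives the XML-style PI variant described above.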

-- 
"Enthusiasm is not the enemy of the intellect." (said of Irving Howe)
"The angel of history[...]is turned toward the past." (Walter Benjamin)


Re: how to parse html content in handler

Posted by Nick Kew <ni...@apache.org>.
On Thu, 24 Mar 2011 22:58:07 +0800 (CST)
"Whut  Jia" <wh...@163.com> wrote:

> Hi,
> Thank you!
> But I want to parse a JSP page in my handler. How can I do it?
> Please help me! In my handler, I make a request (http://www.xxx/xxx.jsp) with libcurl, then parse the returned response and extract some information. How can I parse this JSP response?

Last time I looked, JSP 2 insisted on XML well-formedness,
and would work well under an XML parser.  You could dispense
with parsing altogether and just implement event handlers
under an existing parser such as mod_xmlns or mod_xml2.

JSP 1 was an SSI-like language.  It would be a little more
work, but mod_include would be a good starting point.

-- 
Nick Kew

Available for work, contract or permanent.
http://www.webthing.com/~nick/cv.html

Re:Re: how to parse html content in handler

Posted by Whut Jia <wh...@163.com>.
Hi,
Thank you!
But I want to parse a JSP page in my handler. How can I do it?
Please help me! In my handler, I make a request (http://www.xxx/xxx.jsp) with libcurl, then parse the returned response and extract some information. How can I parse this JSP response?
Thanks,
Jia 




At 2011-03-24 20:25:11,"Ben Noordhuis" <in...@bnoordhuis.nl> wrote:

>On Thu, Mar 24, 2011 at 13:10, Whut  Jia <wh...@163.com> wrote:
>> Hi all,
>> I want to parse HTML content and extract some elements in my own Apache handler. Could you tell me how to do it?
>> Thanks,
>> Jia
>
>Hey, have a look at how mod_proxy_html[1] does it.
>
>[1] http://apache.webthing.com/mod_proxy_html/

Re: how to parse html content in handler

Posted by Ben Noordhuis <in...@bnoordhuis.nl>.
On Thu, Mar 24, 2011 at 13:10, Whut  Jia <wh...@163.com> wrote:
> Hi all,
> I want to parse HTML content and extract some elements in my own Apache handler. Could you tell me how to do it?
> Thanks,
> Jia

Hey, have a look at how mod_proxy_html[1] does it.

[1] http://apache.webthing.com/mod_proxy_html/

Re: how to parse html content in handler

Posted by Mike Meyer <mw...@mired.org>.
On Fri, 25 Mar 2011 09:28:01 -0400
MK <mk...@cognitivedissonance.ca> wrote:

> On Thu, 24 Mar 2011 20:10:46 +0800 (CST)
> Whut  Jia <wh...@163.com> wrote:
> > Hi all,
> > I want to parse HTML content and extract some elements in my own
> > Apache handler. Could you tell me how to do it? Thanks,
> > Jia
> 
> I think right now the only public C library for parsing html is in the
> venerable and long unmaintained libwww.  

How about the HTMLparser module in libxml2?

    <mike
-- 
Mike Meyer <mw...@mired.org>		http://www.mired.org/consulting.html
Independent Software developer/SCM consultant, email for more information.

O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

Re: how to parse html content in handler

Posted by MK <mk...@cognitivedissonance.ca>.
On Fri, 25 Mar 2011 10:19:43 -0400
Joshua Marantz <jm...@google.com> wrote:

> mod_pagespeed's event-driven HTML parser is open source, and is
> written in C++:

There are quite a few around in C++, Boost also has (at least) one.  

-- 
"Enthusiasm is not the enemy of the intellect." (said of Irving Howe)
"The angel of history[...]is turned toward the past." (Walter Benjamin)


Re: how to parse html content in handler

Posted by Joshua Marantz <jm...@google.com>.
mod_pagespeed's event-driven HTML parser is open source, and is written in
C++:
http://code.google.com/p/modpagespeed/source/browse/trunk/src/net/instaweb/htmlparse/public/html_parse.h

This parser is tested using HTML from large numbers of web sites.  The build
process for this module (http://code.google.com/p/modpagespeed/wiki/HowToBuild)
generates a separate .a for the HTML parser, although it's got a few
dependencies that would need to be linked in.  These are all included in
mod_pagespeed.so which is self-contained but larger.

If there were much interest, we could try to package up a
self-contained library that would make it easier to call from other modules.

See also libxml2, which has an HTML mode.

-Josh

On Fri, Mar 25, 2011 at 9:28 AM, MK <mk...@cognitivedissonance.ca> wrote:

> On Thu, 24 Mar 2011 20:10:46 +0800 (CST)
> Whut  Jia <wh...@163.com> wrote:
> > Hi all,
> > I want to parse HTML content and extract some elements in my own
> > Apache handler. Could you tell me how to do it? Thanks,
> > Jia
>
> I think right now the only public C library for parsing html is in the
> venerable and long unmaintained libwww.
>
> However, I wrote a quick and simple, event driven parser library a few
> months ago -- I have been meaning to open source this on CCAN or
> somewhere but have not gotten around to it, so if you are interested
> you can send me a message directly, I have some basic scraper demos
> etc.   It is not on the scale of libwww -- it is just a low level HTML
> parser -- but I am sure it could do what you want, and you can either
> compile it in or link to with an apache module (it has no further
> dependencies).
>
>
> --
> "Enthusiasm is not the enemy of the intellect." (said of Irving Howe)
> "The angel of history[...]is turned toward the past." (Walter Benjamin)
>
>

Re: how to parse html content in handler

Posted by MK <mk...@cognitivedissonance.ca>.
On Thu, 24 Mar 2011 20:10:46 +0800 (CST)
Whut  Jia <wh...@163.com> wrote:
> Hi all,
> I want to parse HTML content and extract some elements in my own
> Apache handler. Could you tell me how to do it? Thanks,
> Jia

I think right now the only public C library for parsing html is in the
venerable and long unmaintained libwww.  

However, I wrote a quick and simple, event driven parser library a few
months ago -- I have been meaning to open source this on CCAN or
somewhere but have not gotten around to it, so if you are interested
you can send me a message directly, I have some basic scraper demos
etc.   It is not on the scale of libwww -- it is just a low level HTML
parser -- but I am sure it could do what you want, and you can either
compile it in or link to with an apache module (it has no further
dependencies).


-- 
"Enthusiasm is not the enemy of the intellect." (said of Irving Howe)
"The angel of history[...]is turned toward the past." (Walter Benjamin)