Posted to modperl@perl.apache.org by Mike Henderson <ni...@gmail.com> on 2005/09/22 20:30:15 UTC

HTML::Parser not mod_perl safe?

Hello, just a quick question...
 Has anyone out there successfully deployed HTML::Parser in an Apache
1.3.x / mod_perl / HTML::Mason environment (dynamically parsing pages)?
 I realize that the module itself is kind of crunky, and additionally an XS
module, so I'm left wondering.
 Basically, what I'm seeing is everything working as you'd expect on the
first load of the page that creates and uses an HTML::Parser object, but
on any subsequent load from that same Apache child, things are partially
broken. Specifically, during parsing, callbacks to text() don't seem to be
happening, while callbacks to start() and end() seem to work fine.
 I'm wondering if there's any way around this: that is, any way to
completely destroy whatever previous data HTML::Parser is letting linger
and causing the problem, and to reload the module. I'm not sure about the
feasibility of this, given that it's XS.
 If it is in fact just broken, can anyone recommend an HTML parsing module
that's pure Perl and/or one that .. um ... works? :)
HTML::TreeBuilder doesn't seem to be as robust....
 Thanks.
 Mike

Re: HTML::Parser not mod_perl safe?

Posted by Issac Goldstand <ig...@mirimar.net>.
Not sure if this is what people are running into, but if you use
variables outside the object, even lexicals scoped at the package (file)
level, in a subclass of HTML::Parser, they won't get reset when you call
new() on your class unless you override the default new() or otherwise
reset them.

For example (untested, but this is approximately what I recall doing on
my own):

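package Foo::Parser;
use HTML::Parser;
our @ISA = qw(HTML::Parser);

# File-scoped lexicals: shared by every Foo::Parser object in this process
my $foo;
my $bar;

sub text {
  # ...
  $foo = 1;
  $bar = 2;
}

package MyMain;

my $p = Foo::Parser->new;
# $foo and $bar are still undefined
...
$p->parse($html);   # $html is a placeholder for the markup being parsed
...
# $foo and $bar are now set
...
my $q = Foo::Parser->new;
# $foo and $bar are still set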

I run into this a lot even outside of mod_perl when using the same parser
twice...  It might apply to data stored in the blessed hashref too; I
don't recall.  In any case, I usually just add a sub reset() to my
HTML::Parser subclasses that resets all instance data, and I call it
every time I construct a parser, before calling parse().
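
A minimal sketch of that reset idea, assuming the state is moved out of
file-scoped lexicals and into the parser object itself. The class name
My::TextCollector, the method names reset_state() and chunks(), and the
my_chunks key are all made up for illustration; only new(), parse(), eof()
and the version 2 style text() callback come from HTML::Parser.

```perl
use strict;
use warnings;

package My::TextCollector;    # hypothetical subclass name
use parent 'HTML::Parser';

# Put all per-parse state back to a known starting point.
# The key is prefixed to avoid clashing with HTML::Parser's own keys.
sub reset_state {
    my ($self) = @_;
    $self->{my_chunks} = [];
    return $self;
}

# Override new() so every freshly constructed parser starts clean.
sub new {
    my ($class, @args) = @_;
    my $self = $class->SUPER::new(@args);
    return $self->reset_state;
}

# With no arguments to new(), HTML::Parser falls back to its version 2
# API and calls this method for each chunk of text it finds.
sub text {
    my ($self, $text) = @_;
    push @{ $self->{my_chunks} }, $text;
}

# Accessor for the collected text chunks.
sub chunks { @{ $_[0]{my_chunks} } }

package main;

my $p = My::TextCollector->new;
$p->parse('<p>first</p>');
$p->eof;

my $q = My::TextCollector->new;
$q->parse('<p>second</p>');
$q->eof;

# $q holds only the text from its own document, not from $p's.
print join('', $q->chunks), "\n";
```

Because the data lives in the object rather than in the package, nothing
carries over between parsers; reset_state() can also be called by hand
before reusing one parser for a second document.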

Again, not sure if this is what people have been running into, but
thought it might be worth mentioning.  Best of luck, people.

  Issac


-- 

  Yitzchak Goldstand
  Mirimar Networks
  www.mirimar.net


Re: HTML::Parser not mod_perl safe?

Posted by Mike Henderson <ni...@gmail.com>.
I think it's pretty safe to say there are definitely some issues with
HTML::Parser and mod_perl, at least when subclassing it.
 I managed to kludge around the problem by not doing that, i.e. not doing:

---
package PackageName;

use HTML::Parser;

@PackageName::ISA = qw(HTML::Parser);
---

I ended up using a somewhat different approach, something like:
---

package PackageName;

use HTML::Parser;

sub new {
    my $SELF_PackageName = bless {}, shift;
    $SELF_PackageName->{parser} = HTML::Parser->new(
        api_version => 3,
        start_h => [\&start, "self, tagname, attr, attrseq, text"],
        end_h   => [\&end,   "self, tagname, text"],
        text_h  => [\&text,  "self, text, is_cdata"],
    );
    return $SELF_PackageName;
}

sub parse_file { shift->{parser}->parse_file(@_); }

sub start { ... }
sub end { ... }
sub text { ... }
---
 It got a bit weird after that, as the HTML::Parser callbacks pass the
instance of the actual HTML::Parser object back to the PackageName routines,
so I actually end up storing all of my data in the HTML::Parser
namespace ... but it works! :) ... and this is why we love Perl.
 Thanks guys.
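
One possible way around that last wrinkle, sketched here as an assumption
rather than what was actually deployed: register closures as the handlers
so each callback is invoked on the wrapper object, and the data stays in
the wrapper. The package name My::Wrapper, the tags/texts keys and the
parse() helper are hypothetical; the handler argspecs ("tagname, attr",
"text") and weaken() from Scalar::Util are real.

```perl
use strict;
use warnings;

package My::Wrapper;    # hypothetical wrapper class
use HTML::Parser ();
use Scalar::Util qw(weaken);

sub new {
    my ($class) = @_;
    my $self = bless { tags => [], texts => [] }, $class;

    # The closures below refer to the wrapper. A weak copy avoids a
    # reference cycle (wrapper -> parser -> closure -> wrapper), which
    # would otherwise never be freed in a long-lived mod_perl child.
    my $weak = $self;
    weaken($weak);

    $self->{parser} = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub { $weak->start(@_) }, 'tagname, attr' ],
        text_h  => [ sub { $weak->text(@_) },  'text' ],
    );
    return $self;
}

# Both callbacks receive the wrapper as $self, not the HTML::Parser object.
sub start {
    my ($self, $tagname, $attr) = @_;
    push @{ $self->{tags} }, $tagname;
}

sub text {
    my ($self, $text) = @_;
    push @{ $self->{texts} }, $text;
}

# Parse a complete document held in a string.
sub parse {
    my ($self, $html) = @_;
    $self->{parser}->parse($html);
    $self->{parser}->eof;
    return $self;
}

package main;

my $w = My::Wrapper->new;
$w->parse('<p>Hello <b>world</b></p>');
print join(',', @{ $w->{tags} }),  "\n";    # p,b
print join('',  @{ $w->{texts} }), "\n";    # Hello world
```

Since a new wrapper (and a new HTML::Parser inside it) is built per
request, nothing from an earlier parse is left in package-level variables.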

Re: HTML::Parser not mod_perl safe?

Posted by Barry Hoggard <tr...@gmail.com>.
I have seen odd behavior using Netscape::Bookmarks (which uses
HTML::Parser to parse the file) under mod_perl 1.3.x and Mason.  I
thought it was maybe my code, but what you are saying reminds me that
we sometimes got garbage back from a parse.


Barry Hoggard