Posted to droids-dev@incubator.apache.org by Thorsten Scherler <th...@juntadeandalucia.es> on 2009/04/02 11:49:02 UTC

Re: ParseData to support custom data?

On Wed, 2009-04-01 at 02:55 +0800, Mingfai wrote:
> hi Thorsten ,
> 
> Thx. It's great that you spot the same issue, and I have exactly the same
> thought about the API.
> 
> public interface Parser
> 
> I'm thinking about what's the purpose of Parser, and what exact is the
> different between a Parser and a Handler, given that we could (and probably
> need to) parse the Content Entity input stream in the handler. It was
> actually mentioned in an JIRA issue (
> https://issues.apache.org/jira/browse/DROIDS-11) about separating the
> outlink parsing to an Extractor. afaik:
> 
>    - By default, the parser does nothing more than extracting the outlinks
>    with NekoHTML's SAX Parser. The SAX Parser is event-based and does not store
>    any parsed results. And when any handler need to access the content, it will
>    need to parse again.
> 
>    - for me, I use Jericho HTML Parser, which does do a parsing and then
>    store some parsed data. so in the Droids model, I expect I should implement
>    my parser with Jericho HTML, and store the parsed data. When there are
>    multiple handler, all of them could share the same parsed results.

I am not 100% sure whether it is advisable to store the parse result (more
so because you later talk about DOM). The problem I see is the
consumption of resources. I reckon in many cases parsing twice is faster
than parsing once and reusing the DOM object.


> 
>    in fact, if i have only one handler, there is no different for me to do
>    my parsing and handling in the handler.

The concept of the handler is to act on input, be it a stream from the
direct URI or an object we retrieve via parsing. Remember parsing is
optional! That means we should not have a fixed connection between parser
and handler. The parser stage is there to determine new tasks or to limit
the object that we pass on (e.g. extracting outlinks or certain information
for filtering purposes).

> 
>    And as I have implemented my own parsing anyway, the original outlink
>    extraction could be skipped and there won't be duplicated parsing.

Not sure about that.

> 
>    - For the original case, I wonder if the NekoHTML SAX Parser should be
>    stored in the parse(d) data without the link extraction content handler. So
>    the handler still need to call "parse()" again but it needs not to construct
>    a NekoHTML SAX Parser. If any one use DOM parser, for sure the original SAX
>    parsing logic should be skipped and the DOM tree could stored for the
>    handler.

Hmm, DOM is evil and storing DOM objects in a multithreaded environment
is a box killer. 

The idea is that we are using SAX handlers to do the parsing work and
extract the information you want in this stage. Have a look at
TikaHtmlParser. 

There you find:
EchoHandler data = new EchoHandler(charset); 
LinkExtractor extractor = new LinkExtractor(link, elements);
TeeContentHandler parallelHandler = new TeeContentHandler(data, extractor);
...
parser.parse(instream, parallelHandler, metadata);
ParseData parseData = new ParseData(extractor.getLinks());

What happens here is that we parse once and let both handlers act on the
same parse events in parallel. This is IMO the best approach to avoid
consuming too many resources.
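
To make the idea concrete without pulling in Tika, here is a minimal sketch
using only the JDK SAX API. LinkCollector, TextCollector and TeeHandler are
hypothetical stand-ins for LinkExtractor, EchoHandler and TeeContentHandler;
they are not the Droids or Tika classes, and the input is assumed to be
well-formed XHTML because the JDK parser is not an HTML parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for LinkExtractor: collects href values of <a> elements.
class LinkCollector extends DefaultHandler {
    final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String href = atts.getValue("href");
        if ("a".equalsIgnoreCase(qName) && href != null) {
            links.add(href);
        }
    }
}

// Hypothetical stand-in for EchoHandler: collects the character data.
class TextCollector extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
}

// Forwards the two event types used above to every delegate, so a single
// parsing pass feeds all of them. No DOM is built or stored.
class TeeHandler extends DefaultHandler {
    private final DefaultHandler[] delegates;

    TeeHandler(DefaultHandler... delegates) {
        this.delegates = delegates;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        for (DefaultHandler d : delegates) {
            d.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        for (DefaultHandler d : delegates) {
            d.characters(ch, start, length);
        }
    }
}

class TeeDemo {
    final List<String> links;
    final String text;

    private TeeDemo(List<String> links, String text) {
        this.links = links;
        this.text = text;
    }

    // Parses the input exactly once; both collectors see the same events.
    static TeeDemo parseOnce(String xhtml) {
        LinkCollector linkCollector = new LinkCollector();
        TextCollector textCollector = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xhtml)),
                new TeeHandler(linkCollector, textCollector));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return new TeeDemo(linkCollector.links, textCollector.text.toString());
    }
}
```

After TeeDemo.parseOnce(...) returns, the links and the text are both
available, and nothing but those two results is kept in memory.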

> 
>    - There are some minor comments to the API as follows:
>    - it's good to merge Parse and ParseData. The meaning of "Parse" isn't
>       too clear. ParseData is more meaningful. And ParsedData or ParseResult is
>       more clear to me.
>       - I suggest to write some lines in the class comment to mention the
>       design purpose of these classes.
>       - If the Parse/ParseData also store a reference of the Parser, for SAX
>       Parser, it could be re-used by the handler. (however, for DOM
> parser, it's
>       confusing, as it should store the parsed data only)

I strongly discourage DOM parsing/storing for droids. However, droids
allows you to do even that. I am not sure whether you really mean keeping a
reference to the parser or the parsed object. If you mean a reference to
the parser then I am not convinced. Holding a reference to an object blocks
that object from GC. We would need to clear all these references after all
handlers are finished.

>       - It seems to me Parser.getParse should be Paser.parse() as it is to
>       trigger an action rather than getting the parse definition. (or
>       Parse.getData() -> Parse.parse())

Sounds swell, but you mean Parser.parse(...), right?

>       - re. Object getParseObject(); , I suggest to call it Object getData
>       instead.
> 
> 
> btw, my understanding of Droids is largely come from the SimpleRuntime
> usage. I hope i didn't miss the big picture.

The SimpleRuntime is nice for showing the different setup of components;
however, it fails to show features like automatic extensibility of the
droid. I showed that in my presentation at ApacheCon, where I used the
droids-spring sample.

Thanks for your feedback mingfai.

salu2

> 
> 
> regards,
> mingfai
> 
> 
> 
> On Fri, Mar 27, 2009 at 11:17 PM, Thorsten Scherler <th...@apache.org>wrote:
> 
> > On Thu, 2009-03-26 at 22:40 +0800, Mingfai wrote:
> > > hi,
> > >
> > > Thanks for creating this very useful project.
> >
> > :)
> >
> > Thanks for this nice feedback.
> >
> > >
> > > I'm new to the droids, and have just learnt most of the concepts and able
> > to
> > > write custom parser, filter, handler etc. And I have encountered a use
> > case
> > > that i want to parse and store some custom data in the Parse/ParseData,
> > and
> > > have the custom data available in the handler.
> > >
> >
> > We actually discussed this before but I am not sure whether it was here
> > or still on the labs list. Bottom line is that we do need to rethink the API
> > around that. To begin with, the API has an import of an implementation
> > class (ParseData), which is just a bad idea. Further, like you pointed out,
> > it may make sense to allow Object so that custom objects can be passed.
> >
> > > Take an hypothetical example, assume I have a crawler that run on
> > Google's
> > > search result, the parser parse the a result page and extract 10 links
> > > together with the 10 cache links. In the Droids framework, there is no
> > way
> > > to pass the cache links to the handle, right?
> >
> > Actually since they are links and if they are not excluded in the
> > regex-urlfilter.txt they would enter as "normal" link/task
> >
> > > As a workaround, i could just
> > > use a singleton to store a map of data using the uri as the key, but it
> > > seems to me it is better if the ParseData could store more than the
> > outlinks
> > > but also some custom data that we use. What do you think?
> >
> > How about
> > public interface Parse {
> > Object getObject();
> > ParseData getParseData();
> > }
> >
> > would merge with ParseData like
> > public interface Parse {
> > Object getParseObject();
> > Collection<Link> getOutlinks() ;
> > }
> >
> > This way we can reduce the level of depth in the API and make the
> > relation clearer. We may even think about merging Parser and Parse too.
> >
> > WDYT?
> >
> > salu2
> >
> > >
> > > The implementation could be very simple, just store a Map
> > >
> > > Regards,
> > > mingfai
> > --
> > Thorsten Scherler <thorsten.at.apache.org>
> > Open Source <consulting, training and solutions>
> >
> >
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)

Re: ParseData to support custom data?

Posted by Thorsten Scherler <th...@juntadeandalucia.es>.

On Sat, 2009-04-04 at 21:10 +0800, Mingfai wrote:
> hi,
...
> > >
> > >    And as I have implemented my own parsing anyway, the original outlink
> > >    extraction could be skipped and there won't be duplicated parsing.
> >
> > Not sure about that.
> >
> 
> for my case, i have to use DOM for the handler anyway. The question is
> whether it is better to:
> 
>    1. use the SAX parsing in the parsing stage for creating the task. And do
>    the handler in my DOM way. or
>    2. replace the SAX Link Extractor with a DOM Link extractor, and store
>    the parsed DOM for the handler.
> 
> anyway, as Droids allows to store a custom data. I prefer to go for the 2nd
> approach first and consider to optimize it to 1 in the future.

IMO DOM parsing makes sense when you are using the page as is and just
adding some more tags to it. So if you need to use DOM and have only one
handler, the following may make even more sense:

protocol -> handler (here you create a DOM from the stream that
the protocol has open, then you extract the links at the same time as you
treat the stream)
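
A rough sketch of that single-handler variant, using only the JDK DOM API.
DomHandlerSketch and its handle(InputStream) method are hypothetical names,
not the Droids Handler interface, and the stream is assumed to carry
well-formed XHTML (real-world HTML would need NekoHTML, Jericho or similar).

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical handler: builds the DOM itself, so no separate parser stage
// has to store a tree for it.
class DomHandlerSketch {
    final List<String> links = new ArrayList<>();
    Document document;

    void handle(InputStream in) {
        try {
            // The DOM is created straight from the stream the protocol opened.
            document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(in);
        } catch (Exception e) {
            throw new IllegalStateException("could not build DOM", e);
        }
        NodeList anchors = document.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            if (anchor.hasAttribute("href")) {
                // Extract the link and treat the page in the same loop.
                links.add(anchor.getAttribute("href"));
                anchor.setAttribute("class", "seen");
            }
        }
    }
}
```

The DOM lives only as long as the handler instance, so nothing has to be
cleaned up in a shared ParseData afterwards.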

salu2

Re: ParseData to support custom data?

Posted by Mingfai <mi...@gmail.com>.

hi,

Let's not go into the SAX vs. DOM parsing discussion just yet. I agree with
you that SAX parsing performs better and consumes fewer resources. I picked
Jericho HTML Parser because its API is easier to use.

On Thu, Apr 2, 2009 at 7:49 PM, Thorsten Scherler <
thorsten.scherler.ext@juntadeandalucia.es> wrote:

> I am not 100% sure whether it is advisable to store the parse (more
> because you later on talk about DOM) the problem I see is the
> consumption of resources. I recon in may cases parsing 2 times is faster
> the parse once and reuse the DOM object.
>

For a SAX parser, there should be no need to store the parser. There should
not be much benefit from reuse.

The reuse scenario is only relevant if we use a DOM parser. I did a quick
test that calls HttpProtocol.load().obtainContent() and passes the input
stream to Jericho to create a DOM source (it's not exactly a DOM parser but
works similarly). My test fetches a 165k page from a popular portal 100
times. The whole process takes 9.1s and the parsing consumes 30%/2.7s. I
think *if* anyone uses DOM parsing, the parsing should be done only once.

>
> >
> >    in fact, if i have only one handler, there is no different for me to
> do
> >    my parsing and handling in the handler.
>
> The concept of the handler is to act on input. Being it a stream of the
> direct URI or an object we retrieve via parsing. Remember parsing is
> optional! Meaning we should not have a fixed connection between parser
> and handler. The parser stage is to determine new task or limit the
> object that we pass on (extracting e.g. outlinks or certain information
> for filtering purpose.
>
> >
> >    And as I have implemented my own parsing anyway, the original outlink
> >    extraction could be skipped and there won't be duplicated parsing.
>
> Not sure about that.
>

In my case, I have to use DOM for the handler anyway. The question is
whether it is better to:

   1. use SAX parsing in the parsing stage to create the tasks, and do
   the handling in my DOM way, or
   2. replace the SAX link extractor with a DOM link extractor, and store
   the parsed DOM for the handler.

Anyway, as Droids allows storing custom data, I prefer to go for the 2nd
approach first and consider optimizing it to the 1st in the future.

I did consider one more case: the handler may be executed on another
distributed node. So any custom data to be stored has to be serializable, and
it's preferable not to store anything.
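
To illustrate the serializable constraint, here is a minimal sketch of a
merged result type whose custom data must be Serializable. ParseSketch is a
hypothetical class, not the Droids API; plain strings stand in for Link, and
roundTrip() only simulates shipping the object to another node.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical merged Parse/ParseData: outlinks plus one custom payload.
class ParseSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    private final ArrayList<String> outlinks;
    // Typing the payload as Serializable rejects non-serializable data
    // (such as most DOM trees) at compile time.
    private final Serializable data;

    ParseSketch(Collection<String> outlinks, Serializable data) {
        this.outlinks = new ArrayList<>(outlinks);
        this.data = data;
    }

    List<String> getOutlinks() {
        return outlinks;
    }

    Serializable getData() {
        return data;
    }

    // Simulates a hop to another node: serialize, then deserialize.
    static ParseSketch roundTrip(ParseSketch parse) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(parse);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ParseSketch) in.readObject();
            }
        } catch (Exception e) {
            throw new IllegalStateException("payload is not serializable", e);
        }
    }
}
```

For the cache-link example, the payload could be a HashMap from result URI to
cache URI, which survives the round trip unchanged.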

>
> >
> >    - There are some minor comments to the API as follows:
> >    - it's good to merge Parse and ParseData. The meaning of "Parse" isn't
> >       too clear. ParseData is more meaningful. And ParsedData or
> ParseResult is
> >       more clear to me.
>

After some more thought, calling it Parse is just fine. For naming, the
shorter the better.

>
> >       - I suggest to write some lines in the class comment to mention the
> >       design purpose of these classes.
> >       - If the Parse/ParseData also store a reference of the Parser, for
> SAX
> >       Parser, it could be re-used by the handler. (however, for DOM
> > parser, it's
>
>       confusing, as it should store the parsed data only)
>
> I strongly discourage DOM parsing/storing for droids. However droids
> allows you even that. I am not sure whether you really mean keep a
> reference or the parsed object. If you mean a reference to the parser
> than I am not convince. Having references on a object blocks this object
> from GC. We would need to clean all this reference after all handler are
> finished.

I agree with you that it's not a good idea to pass a parser reference for now.

>
>
>
> sounds swell but you mean Paser.parse(...), right?

yes, it was a typo.

>
>
> >       - re. Object getParseObject(); , I suggest to call it Object
> getData
> >       instead.
> >
> >
> > btw, my understanding of Droids is largely come from the SimpleRuntime
> > usage. I hope i didn't miss the big picture.
>
> The simpleRuntime is nice to show the different set up of components
> however it lakes to show features like automatic extensibility of the
> droid. I have shown that in my presentation @apacheCon when I used the
> droids-spring sample.
>
> Thanks for your feedback mingfai.
>
> salu2
>
> >
>

Thanks for your comments.

regards,
mingfai