Posted to droids-dev@incubator.apache.org by Florent André <fl...@4sengines.com> on 2009/07/13 14:49:12 UTC

Need 1 :

Hi Droids list!

After a talk with Señor Thorsten during the Lenya meeting (Olé! :) ), I
would like to have more information about Droids.

I know that Droids is not only a web crawler (and I would like to use it
for other things), but my immediate need is about crawling...

So let's go:

I would like to pass Droids an XML document like this (just an example):
<article>
  <droids:url>http://example.com/test.html</droids:url>
  <title>
    <droids:xpath>html/body/div[@id='content']/div[@id='title']/h1</droids:xpath>
  </title>
  <firstparagraph>
    <droids:xpath>html/body/div[@id='content']/div[@id='article']/p[position()=1]</droids:xpath>
  </firstparagraph>
  <othertext>
    <droids:xpath>html/body/div[@id='content']/div[@id='article']/p[position()>1]</droids:xpath>
  </othertext>
</article>

and have Droids give me something like:
<article>
  <title> this is the article title </title>
  <firstparagraph> This article is about the....</firstparagraph>
  <othertext>bla bla bla bla bla...</othertext>
</article>
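
To make the mapping concrete, here is a minimal sketch of the extraction using only the XPath API that ships with the JDK (javax.xml.xpath). It is not Droids code: the class name ArticleXPathSketch, its helper methods, and the sample page are assumptions made up for illustration, and it presumes the page is already well-formed XHTML without a namespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Droids: applies the three XPaths from the
// example to a well-formed XHTML page and builds the <article> result.
class ArticleXPathSketch {

    private static final String CONTENT = "html/body/div[@id='content']";

    // Parses the page; malformed HTML fails here and would need tidying first.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Page is not well-formed XML", e);
        }
    }

    // Returns the text of every node matched by the expression, space-separated.
    static String extract(Document page, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, page, XPathConstants.NODESET);
            StringBuilder text = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                if (i > 0) {
                    text.append(' ');
                }
                text.append(nodes.item(i).getTextContent().trim());
            }
            return text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Bad XPath: " + expression, e);
        }
    }

    // Escapes the characters that would break the generated XML.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Builds the result document from the three XPaths of the example.
    static String toArticle(String xhtml) {
        Document page = parse(xhtml);
        return "<article>"
                + "<title>" + escape(extract(page, CONTENT + "/div[@id='title']/h1")) + "</title>"
                + "<firstparagraph>"
                + escape(extract(page, CONTENT + "/div[@id='article']/p[position()=1]"))
                + "</firstparagraph>"
                + "<othertext>"
                + escape(extract(page, CONTENT + "/div[@id='article']/p[position()>1]"))
                + "</othertext>"
                + "</article>";
    }

    public static void main(String[] args) {
        // Made-up sample page with the structure the XPaths expect.
        String page = "<html><body><div id='content'>"
                + "<div id='title'><h1>this is the article title</h1></div>"
                + "<div id='article'><p>This article is about the....</p>"
                + "<p>bla bla bla</p><p>bla bla...</p></div>"
                + "</div></body></html>";
        System.out.println(toArticle(page));
    }
}
```

A real page would first have to be fetched and, if it is not well-formed, cleaned up before parse() can accept it; the sketch leaves both steps out.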

So my questions are:

1) Is it possible?

2) If yes, will I have to (bear in mind that I'm not a Java superstar):
   a) install Droids, type two command lines, and off I go (1 hour of work)
   b) install Droids, really understand how Droids works, and code
      some classes (3 weeks of work)
   c) install Droids, create a class from an existing one, with some
      trial and error (4-5 days of work)
   d) ...

3) Is it difficult to plug Droids into a Lenya (Cocoon-based) app?

Thanks for your answer,

Regards

Re: Need 1 :

Posted by Florent André <fl...@4sengines.com>.

Thanks Mingfai and Thorsten for your answers, and sorry for the looooong
reaction time (I was in a bit of a "stack overflow").

This helps me understand Droids better.

I have already done something like what I tried to describe to you, in
Lenya. While doing it I ran into these problems:
- HTML pages that use framesets (the content of the frames is not retrieved)
- the encoding of the page
- malformed and faulty HTML (<p> with no </p>, etc.)
- pages that answer with a redirect (301 Moved Permanently)

When Droids gives me a page, is it 100% clean (X)HTML? ;)

And on another topic:
- how are Flash, images, .zip files, ... handled: just as a link, or as a
  download?
- how is JavaScript handled?
- and forms?

Thanks

On Tue, 14 Jul 2009 17:11:35 +0800, Mingfai <mi...@gmail.com> wrote:
> hi,
> 
> 
>> > So let's go:
>> >
>> > I would like to pass Droids an XML document like this (just an example):
>> > <article>
>> >   <droids:url>http://example.com/test.html</droids:url>
>>
>> In Droids crawling, the URL is the entry point of the processing. What
>> happens then is highly configurable, and Ming Fai is currently suggesting
>> some changes for the future. I will describe the possibilities that
>> Droids currently offers for the presented use case.
>>
>> As said, we start with the queue into which you inject the starting URLs.
>> This queue then calls a worker (which basically is the part of the
>> code where the real work is done). This worker may call a LinkExtractor
>> and/or a Parser to extract links and any other information about the
>> incoming page.
> 
> 
> 
> I think most crawlers (incl. Droids and any of my suggested changes) work
> in more or less the same way. We always have URLs as seeds that are put in
> a queue/list (TaskQueue in Droids), a main component to control
> multi-threading and execution (TaskMaster), components to fetch/retrieve
> the URL as an input stream/entity (Worker and Protocol), components to
> parse/process the input stream/entity (Parser), and components to extract
> outlinks (LinkExtractor) and put them back into the main queue/list
> (Worker). Droids also has a URLFilter that accepts/rejects outlinks, a
> TaskValidator to intercept at add-to-queue time (it works similarly to
> URLFilter for crawling, so maybe you could ignore this), and a DelayTimer
> to slow down the fetching. The above refers to the current Droids
> implementation. I think it covers most of the main concepts.
> 
> regards,
> mingfai

Re: Need 1 :

Posted by Mingfai <mi...@gmail.com>.

hi,


> > So let's go:
> >
> > I would like to pass Droids an XML document like this (just an example):
> > <article>
> >   <droids:url>http://example.com/test.html</droids:url>
>
> In Droids crawling, the URL is the entry point of the processing. What
> happens then is highly configurable, and Ming Fai is currently suggesting
> some changes for the future. I will describe the possibilities that
> Droids currently offers for the presented use case.
>
> As said, we start with the queue into which you inject the starting URLs.
> This queue then calls a worker (which basically is the part of the
> code where the real work is done). This worker may call a LinkExtractor
> and/or a Parser to extract links and any other information about the
> incoming page.



I think most crawlers (incl. Droids and any of my suggested changes) work
in more or less the same way. We always have URLs as seeds that are put in
a queue/list (TaskQueue in Droids), a main component to control
multi-threading and execution (TaskMaster), components to fetch/retrieve
the URL as an input stream/entity (Worker and Protocol), components to
parse/process the input stream/entity (Parser), and components to extract
outlinks (LinkExtractor) and put them back into the main queue/list
(Worker). Droids also has a URLFilter that accepts/rejects outlinks, a
TaskValidator to intercept at add-to-queue time (it works similarly to
URLFilter for crawling, so maybe you could ignore this), and a DelayTimer
to slow down the fetching. The above refers to the current Droids
implementation. I think it covers most of the main concepts.
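
The flow above can be sketched as a small single-threaded loop. This is only an illustration of the concepts and does not use the real Droids interfaces: the class CrawlLoopSketch, its method names, and the in-memory "pages" map are all assumptions, the fetcher function stands in for Worker and Protocol, the predicate plays the role of a URLFilter, and the multi-threading that TaskMaster provides is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the crawl loop; none of these names are Droids API.
class CrawlLoopSketch {

    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    // Link extractor stand-in: collects the targets of href="..." attributes.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    // Delay timer stand-in: waits between two fetches.
    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Takes URLs from the queue, fetches them, and queues accepted outlinks.
    // Returns the URLs that were actually fetched, in visiting order.
    static List<String> crawl(String seed, Function<String, String> fetcher,
                              Predicate<String> urlFilter, long delayMillis) {
        Deque<String> queue = new ArrayDeque<>();   // the task queue
        Set<String> seen = new HashSet<>();         // avoids queueing a URL twice
        List<String> visited = new ArrayList<>();
        queue.add(seed);
        seen.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String content = fetcher.apply(url);    // fetch step
            if (content == null) {
                continue;                            // nothing could be fetched
            }
            visited.add(url);
            for (String link : extractLinks(content)) {
                if (urlFilter.test(link) && seen.add(link)) {
                    queue.add(link);                 // outlink goes back into the queue
                }
            }
            pause(delayMillis);
        }
        return visited;
    }

    public static void main(String[] args) {
        // Made-up pages keyed by URL, so the sketch runs without network access.
        Map<String, String> pages = new HashMap<>();
        pages.put("http://example.com/a",
                "<a href=\"http://example.com/b\">b</a> <a href=\"http://other.org/x\">x</a>");
        pages.put("http://example.com/b", "<a href=\"http://example.com/a\">a</a>");
        System.out.println(crawl("http://example.com/a", pages::get,
                url -> url.startsWith("http://example.com/"), 100));
    }
}
```

In this run the filter rejects the other.org link, and the "seen" set keeps the loop from revisiting the first page when the second page links back to it.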

regards,
mingfai

Another proposed idea ? (was Re: Need 1 :)

Posted by Florent André <fl...@4sengines.com>.

Here I will try to explain my idea better:

- In my working days as a webmaster, I have many repetitive "click actions"
to do. Hmmmm, a little boring, so I went off to play with:
-- Ruby (http://en.wikipedia.org/wiki/Ruby_(programming_language) )
-- Mechanize http://mechanize.rubyforge.org/mechanize/
-- Hpricot (XML parser)

A few lines of code later... and I'm a happy webmaster.

But not really, in fact. Now I would like to write less code and more "just
instructions". Passing instructions as XML could be very nice.

Consider this use case:
- I have the "enterprise web yellow pages" (nearly an LDAP) and my
enterprise CMS (no "dev" solutions possible, JUST clicks), and I have to
pass some information from the yellow pages to the CMS.

- So in a cool "Droids world", I would like to do something like this:

- write a droid-configuration.xml: set which worker to use, configure the
link depth to follow, set the DelayTimer in seconds, ...

- write a droids-job.xml: go to this page, fill in this form, select the
links in {xpath}, follow this link, extract the {xpath} and save it, go to
this page and fill in the form with the saved information.

... With that, a really happy webmaster! :)


What do you think about that?


See you later

On Tue, 14 Jul 2009 09:56:33 +0200, Thorsten Scherler
<th...@juntadeandalucia.es> wrote:
> On Mon, 2009-07-13 at 16:49 +0200, Florent André wrote:
>> Hi Droids list!
>> 
>> After a talk with Señor Thorsten during the Lenya meeting (Olé! :) ), I
>> would like to have more information about Droids.
> 
> Hello Mr. Florent, welcome to Droids. ;)
> 
>> I know that Droids is not only a web crawler (and I would like to use it
>> for other things), but my immediate need is about crawling...
> 
> What comes now as an XML document I will try to put in terms of Droids.
> I guess putting it in our wiki http://cwiki.apache.org/DROIDS/ will be
> helpful for future reference.
> 
>> So let's go:
>> 
>> I would like to pass Droids an XML document like this (just an example):
>> <article>
>>   <droids:url>http://example.com/test.html</droids:url>
> 
> In Droids crawling, the URL is the entry point of the processing. What
> happens then is highly configurable, and Ming Fai is currently suggesting
> some changes for the future. I will describe the possibilities that
> Droids currently offers for the presented use case.
> 
> As said, we start with the queue into which you inject the starting URLs.
> This queue then calls a worker (which basically is the part of the
> code where the real work is done). This worker may call a LinkExtractor
> and/or a Parser to extract links and any other information about the
> incoming page.
> 
>>   <title>
>>     <droids:xpath>html/body/div[@id='content']/div[@id='title']/h1</droids:xpath>
>>   </title>
>>   <firstparagraph>
>>     <droids:xpath>html/body/div[@id='content']/div[@id='article']/p[position()=1]</droids:xpath>
>>   </firstparagraph>
>>   <othertext>
>>     <droids:xpath>html/body/div[@id='content']/div[@id='article']/p[position()>1]</droids:xpath>
>>   </othertext>
>> </article>
>> 
>> and have Droids give me something like:
>> <article>
>>   <title> this is the article title </title>
>>   <firstparagraph> This article is about the....</firstparagraph>
>>   <othertext>bla bla bla bla bla...</othertext>
>> </article>
> 
> You could use a simple XSL transformation for that. You can develop the
> XSL stylesheet (basically the XPaths) to extract the info with Lenya as
> usual. Just use a generator to get the source and then add the
> transformer, which will return the above doc. You would then copy this
> stylesheet to your Droids plugin and use it to generate a result output
> stream. You would pass this stream to the save handler of Droids, which
> then saves the stream to the location you want.
> 
>> So my questions are:
>> 
>> 1) Is it possible?
> 
> Yes, certainly.
> 
>> 
>> 2) If yes, will I have to (bear in mind that I'm not a Java superstar):
>>    a) install Droids, type two command lines, and off I go (1 hour of work)
> 
> No, Droids is a very loose framework and we do not have the specific use
> case you ask for in our code base (maybe afterwards). ;)
> 
>>    b) install Droids, really understand how Droids works, and code
>>       some classes (3 weeks of work)
> 
> Hehe, that is most valuable, but for your use case it should not be
> necessary.
> 
>>    c) install Droids, create a class from an existing one, with some
>>       trial and error (4-5 days of work)
> 
> Yeah, I guess that is realistic with testing and so on.
> 
>>    d) ...
>> 
>> 3) Is it difficult to plug Droids into a Lenya (Cocoon-based) app?
> 
> Actually, not at all. I recommend first coding your bot in Droids, then
> generating the jar and copying it to your Lenya module. Do not forget the
> dependencies that your droid may have; add them to the lib dir of
> your module.
> 
> HTH to give you the general idea.
> 
> Cheers
> 
>> 
>> Thanks for your answer,
>> 
>> Regards

Re: Need 1 :

Posted by Thorsten Scherler <th...@juntadeandalucia.es>.

On Mon, 2009-07-13 at 16:49 +0200, Florent André wrote:
> Hi Droids list!
> 
> After a talk with Señor Thorsten during the Lenya meeting (Olé! :) ), I
> would like to have more information about Droids.

Hello Mr. Florent, welcome to Droids. ;)

> I know that Droids is not only a web crawler (and I would like to use it
> for other things), but my immediate need is about crawling...

What comes now as an XML document I will try to put in terms of Droids.
I guess putting it in our wiki http://cwiki.apache.org/DROIDS/ will be
helpful for future reference.

> So let's go:
> 
> I would like to pass Droids an XML document like this (just an example):
> <article>
>   <droids:url>http://example.com/test.html</droids:url>

In Droids crawling, the URL is the entry point of the processing. What
happens then is highly configurable, and Ming Fai is currently suggesting
some changes for the future. I will describe the possibilities that
Droids currently offers for the presented use case.

As said, we start with the queue into which you inject the starting URLs.
This queue then calls a worker (which basically is the part of the
code where the real work is done). This worker may call a LinkExtractor
and/or a Parser to extract links and any other information about the
incoming page.

>   <title>
>     <droids:xpath>html/body/div[@id='content']/div[@id='title']/h1</droids:xpath>
>   </title>
>   <firstparagraph>
>     <droids:xpath>html/body/div[@id='content']/div[@id='article']/p[position()=1]</droids:xpath>
>   </firstparagraph>
>   <othertext>
>     <droids:xpath>html/body/div[@id='content']/div[@id='article']/p[position()>1]</droids:xpath>
>   </othertext>
> </article>
> 
> and have Droids give me something like:
> <article>
>   <title> this is the article title </title>
>   <firstparagraph> This article is about the....</firstparagraph>
>   <othertext>bla bla bla bla bla...</othertext>
> </article>

You could use a simple XSL transformation for that. You can develop the
XSL stylesheet (basically the XPaths) to extract the info with Lenya as
usual. Just use a generator to get the source and then add the
transformer, which will return the above doc. You would then copy this
stylesheet to your Droids plugin and use it to generate a result output
stream. You would pass this stream to the save handler of Droids, which
then saves the stream to the location you want.
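
As a rough sketch of that transformation step, the stylesheet can be applied with the JDK's own javax.xml.transform API. Everything specific here is an assumption for illustration: the class name XslArticleSketch, the inline stylesheet, and the sample page are made up, the input must already be well-formed XHTML without a namespace, and the result is simply returned as a string instead of being handed to a Droids save handler.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example, not Droids or Lenya code: turns a page into the
// <article> document with an XSLT 1.0 stylesheet built from the XPaths above.
class XslArticleSketch {

    private static final String ARTICLE = "html/body/div[@id='content']/div[@id='article']";

    private static final String STYLESHEET =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
            + "<xsl:template match='/'>"
            + "<article>"
            + "<title><xsl:value-of"
            + " select=\"html/body/div[@id='content']/div[@id='title']/h1\"/></title>"
            + "<firstparagraph><xsl:value-of"
            + " select=\"" + ARTICLE + "/p[position()=1]\"/></firstparagraph>"
            + "<othertext>"
            + "<xsl:for-each select=\"" + ARTICLE + "/p[position() &gt; 1]\">"
            // Put a space between paragraphs, but not before the first one.
            + "<xsl:if test='position() &gt; 1'><xsl:text> </xsl:text></xsl:if>"
            + "<xsl:value-of select='.'/>"
            + "</xsl:for-each>"
            + "</othertext>"
            + "</article>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    // Applies the stylesheet to the page and returns the serialized result.
    static String transform(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter result = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                    new StreamResult(result));
            return result.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page with the structure the stylesheet expects.
        String page = "<html><body><div id='content'>"
                + "<div id='title'><h1>this is the article title</h1></div>"
                + "<div id='article'><p>This article is about the....</p>"
                + "<p>bla bla bla</p><p>bla bla...</p></div>"
                + "</div></body></html>";
        System.out.println(transform(page));
    }
}
```

Keeping the XPaths in a stylesheet rather than in Java code means the same file can be developed and tested in a Lenya pipeline first and then copied over unchanged.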

> So my questions are:
> 
> 1) Is it possible?

Yes, certainly.

> 
> 2) If yes, will I have to (bear in mind that I'm not a Java superstar):
>    a) install Droids, type two command lines, and off I go (1 hour of work)

No, Droids is a very loose framework and we do not have the specific use
case you ask for in our code base (maybe afterwards). ;)

>    b) install Droids, really understand how Droids works, and code
>       some classes (3 weeks of work)

Hehe, that is most valuable, but for your use case it should not be
necessary.

>    c) install Droids, create a class from an existing one, with some
>       trial and error (4-5 days of work)

Yeah, I guess that is realistic with testing and so on.

>    d) ...
> 
> 3) Is it difficult to plug Droids into a Lenya (Cocoon-based) app?

Actually, not at all. I recommend first coding your bot in Droids, then
generating the jar and copying it to your Lenya module. Do not forget the
dependencies that your droid may have; add them to the lib dir of
your module.

HTH to give you the general idea.

Cheers

> 
> Thanks for your answer,
> 
> Regards
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)