Posted to common-user@hadoop.apache.org by Tarjei Huse <ta...@nu.no> on 2008/09/07 20:54:36 UTC

Basic code organization questions + scheduling

Hi, I'm planning to use Hadoop for a set of typical crawler/indexer
tasks. The basic flow is:

input:    array of urls
actions:          |
1.              get pages
                       |
2.          extract new urls from pages -> start new job
             extract text  -> index / filter (as new jobs)

What I'm considering is how to structure this application to fit the 
map/reduce model. I'm thinking that steps 1 and 2 should be separate 
map/reduce jobs that pipe their output on to the next step.
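To make step 1 concrete, here is roughly what I have in mind, sketched 
against the old org.apache.hadoop.mapred API (a map-only job; the class 
name and the bare-bones fetching are just placeholders, and politeness, 
timeouts and error handling are left out):

import java.io.*;
import java.net.URL;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Step 1: map-only job that turns a text file of URLs (one per line)
// into (url, page content) pairs.
public class FetchMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String url = value.toString().trim();
    if (url.length() == 0)
      return;
    StringBuilder page = new StringBuilder();
    BufferedReader in = new BufferedReader(
        new InputStreamReader(new URL(url).openStream()));
    try {
      String line;
      while ((line = in.readLine()) != null)
        page.append(line).append(' ');   // keep one output record per URL
    } finally {
      in.close();
    }
    output.collect(new Text(url), new Text(page.toString()));
  }
}

Step 2 would then be a second job of the same shape, whose mapper parses 
the fetched pages and emits the extracted links and text.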

This is where I am a bit at a loss: how is it best to organize the 
code into logical units, and how should new jobs be spawned when an 
old one finishes?

Is the usual way to control the flow of a set of jobs to have an 
external application that listens for job completion via the 
endNotificationUri and then spawns new jobs, or should the job itself 
contain code to create new jobs? Would it be a good idea to use 
Cascading here?
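For concreteness, the simplest driver-style variant I can picture looks 
roughly like this (it just runs the jobs in sequence instead of reacting 
to endNotificationUri callbacks; the mapper classes and paths are made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Driver that chains the fetch and extract jobs sequentially.
public class CrawlDriver {
  public static void main(String[] args) throws Exception {
    // Job 1: fetch pages for the current list of URLs.
    JobConf fetch = new JobConf(CrawlDriver.class);
    fetch.setJobName("fetch");
    fetch.setMapperClass(FetchMapper.class);   // hypothetical mapper from step 1
    fetch.setNumReduceTasks(0);
    fetch.setOutputKeyClass(Text.class);
    fetch.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(fetch, new Path("crawl/urls"));
    FileOutputFormat.setOutputPath(fetch, new Path("crawl/fetched"));
    JobClient.runJob(fetch);                   // blocks until the job completes

    // Job 2: extract new URLs and text from the fetched pages.
    JobConf extract = new JobConf(CrawlDriver.class);
    extract.setJobName("extract");
    extract.setMapperClass(ExtractMapper.class);   // hypothetical
    extract.setOutputKeyClass(Text.class);
    extract.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(extract, new Path("crawl/fetched"));
    FileOutputFormat.setOutputPath(extract, new Path("crawl/extracted"));
    JobClient.runJob(extract);
  }
}

Since JobClient.runJob() blocks until the submitted job finishes, a plain 
driver like this can sequence the whole flow without any callback machinery.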

I'm also considering how I should do job scheduling (I have a lot of 
recurring tasks). Has anyone found a good framework for controlling 
recurring jobs, or should I plan to build my own using Quartz?
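If I did end up building it on Quartz, I imagine the scheduling side 
would look something like this (Quartz 1.x-style API; the job class and 
the cron expression are just placeholders):

import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

public class CrawlScheduler {

  // Recurring Quartz job that kicks off one round of the crawl pipeline.
  public static class CrawlPipelineJob implements Job {
    public void execute(JobExecutionContext context) throws JobExecutionException {
      // e.g. run the chained Hadoop jobs from the driver sketch above
    }
  }

  public static void main(String[] args) throws Exception {
    Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
    JobDetail detail =
        new JobDetail("crawl", Scheduler.DEFAULT_GROUP, CrawlPipelineJob.class);
    CronTrigger trigger =
        new CronTrigger("crawl-trigger", Scheduler.DEFAULT_GROUP, "0 0 3 * * ?"); // 03:00 nightly
    scheduler.scheduleJob(detail, trigger);
    scheduler.start();   // triggers begin firing once the scheduler is started
  }
}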

Any tips/best practices with regard to the issues described above are 
most welcome. Feel free to ask further questions if you find my 
descriptions of the issues lacking.

Kind regards,
Tarjei



Re: Basic code organization questions + scheduling

Posted by Chris K Wensel <ch...@wensel.net>.
If you wrote a simple URL fetcher function for Cascading, you would  
have a very powerful web crawler that would dwarf Nutch in flexibility.
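Such a function could look roughly like the sketch below (the field  
names and the naive fetching are only illustrative; a real fetcher needs  
timeouts, politeness, redirect handling, etc.):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;

// Cascading Function that takes a url argument and emits (url, content).
public class FetchUrl extends BaseOperation implements Function {

  public FetchUrl() {
    super(1, new Fields("url", "content"));   // one argument, two result fields
  }

  public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
    String url = functionCall.getArguments().getTuple().getString(0);
    try {
      BufferedReader in = new BufferedReader(
          new InputStreamReader(new URL(url).openStream()));
      StringBuilder page = new StringBuilder();
      String line;
      while ((line = in.readLine()) != null)
        page.append(line).append(' ');
      in.close();
      functionCall.getOutputCollector().add(new Tuple(url, page.toString()));
    } catch (IOException e) {
      // drop the url on failure; a real crawler would retry or record the error
    }
  }
}

Wired into an assembly it is just another pipe element, e.g.  
new Each(pipe, new Fields("url"), new FetchUrl(), Fields.RESULTS), and the  
downstream link extraction, text extraction and indexing become more pipes.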

That said, Nutch is optimized for storage, has supporting tools and  
ranking algorithms, and has been up against some nasty HTML and other  
document types. Building a really robust crawler is non-trivial.

If I were just starting out and needed to implement a proprietary  
process, I would use Nutch for fetching raw content and refreshing it,  
then use Cascading for parsing, indexing, etc.

cheers,
chris

On Sep 8, 2008, at 12:42 AM, tarjei wrote:

>
> Hi Alex (and others).
>
>> You should take a look at Nutch. It's a search engine built on
>> Lucene, though it can be set up on top of Hadoop. Take a look:
> This didn't help me much. Although the description I gave of the app's
> basic flow seems close to what Nutch does (and I have been looking at
> the Nutch code), my questions are more general: they are not about
> indexing as such, but about how to organize the code. If anyone has
> more input on those, feel free to add it.
>
>
>> On Mon, Sep 8, 2008 at 2:54 AM, Tarjei Huse <ta...@nu.no> wrote:
>>
>>> Hi, I'm planning to use Hadoop for a set of typical crawler/indexer
>>> tasks. The basic flow is:
>>>
>>> input:    array of urls
>>> actions:          |
>>> 1.              get pages
>>>                     |
>>> 2.          extract new urls from pages -> start new job
>>>           extract text  -> index / filter (as new jobs)
>>>
>>> What I'm considering is how to structure this application to fit the
>>> map/reduce model. I'm thinking that steps 1 and 2 should be separate
>>> map/reduce jobs that pipe their output on to the next step.
>>>
>>> This is where I am a bit at a loss: how is it best to organize the
>>> code into logical units, and how should new jobs be spawned when an
>>> old one finishes?
>>>
>>> Is the usual way to control the flow of a set of jobs to have an
>>> external application that listens for job completion via the
>>> endNotificationUri and then spawns new jobs, or should the job itself
>>> contain code to create new jobs? Would it be a good idea to use
>>> Cascading here?
>>>
>>> I'm also considering how I should do job scheduling (I have a lot of
>>> recurring tasks). Has anyone found a good framework for controlling
>>> recurring jobs, or should I plan to build my own using Quartz?
>>>
>>> Any tips/best practices with regard to the issues described above
>>> are most welcome. Feel free to ask further questions if you find my
>>> descriptions of the issues lacking.
>
> Kind regards,
> Tarjei
>
>>>
>>> Kind regards,
>>> Tarjei
>>>
>>>
>>>
>>
>

--
Chris K Wensel
chris@wensel.net
http://chris.wensel.net/
http://www.cascading.org/


Re: Basic code organization questions + scheduling

Posted by tarjei <ta...@nu.no>.

Hi Alex (and others).

> You should take a look at Nutch. It's a search engine built on Lucene,
> though it can be set up on top of Hadoop. Take a look:
This didn't help me much. Although the description I gave of the app's
basic flow seems close to what Nutch does (and I have been looking at
the Nutch code), my questions are more general: they are not about
indexing as such, but about how to organize the code. If anyone has
more input on those, feel free to add it.


 > On Mon, Sep 8, 2008 at 2:54 AM, Tarjei Huse <ta...@nu.no> wrote:
> 
>> Hi, I'm planning to use Hadoop for a set of typical crawler/indexer
>> tasks. The basic flow is:
>>
>> input:    array of urls
>> actions:          |
>> 1.              get pages
>>                      |
>> 2.          extract new urls from pages -> start new job
>>            extract text  -> index / filter (as new jobs)
>>
>> What I'm considering is how to structure this application to fit the
>> map/reduce model. I'm thinking that steps 1 and 2 should be separate
>> map/reduce jobs that pipe their output on to the next step.
>>
>> This is where I am a bit at a loss: how is it best to organize the
>> code into logical units, and how should new jobs be spawned when an
>> old one finishes?
>>
>> Is the usual way to control the flow of a set of jobs to have an
>> external application that listens for job completion via the
>> endNotificationUri and then spawns new jobs, or should the job itself
>> contain code to create new jobs? Would it be a good idea to use
>> Cascading here?
>>
>> I'm also considering how I should do job scheduling (I have a lot of
>> recurring tasks). Has anyone found a good framework for controlling
>> recurring jobs, or should I plan to build my own using Quartz?
>>
>> Any tips/best practices with regard to the issues described above are most
>> welcome. Feel free to ask further questions if you find my descriptions of
>> the issues lacking.

Kind regards,
Tarjei

>>
>> Kind regards,
>> Tarjei
>>
>>
>>
> 


Re: Basic code organization questions + scheduling

Posted by Alex Loddengaard <al...@google.com>.
Hi Tarjei,

You should take a look at Nutch. It's a search engine built on Lucene,
though it can be set up on top of Hadoop. Take a look:

<http://lucene.apache.org/nutch/>
-and-
<http://wiki.apache.org/nutch/NutchHadoopTutorial>

Hope this helps!

Alex

On Mon, Sep 8, 2008 at 2:54 AM, Tarjei Huse <ta...@nu.no> wrote:

> Hi, I'm planning to use Hadoop for a set of typical crawler/indexer
> tasks. The basic flow is:
>
> input:    array of urls
> actions:          |
> 1.              get pages
>                      |
> 2.          extract new urls from pages -> start new job
>            extract text  -> index / filter (as new jobs)
>
> What I'm considering is how to structure this application to fit the
> map/reduce model. I'm thinking that steps 1 and 2 should be separate
> map/reduce jobs that pipe their output on to the next step.
>
> This is where I am a bit at a loss: how is it best to organize the
> code into logical units, and how should new jobs be spawned when an
> old one finishes?
>
> Is the usual way to control the flow of a set of jobs to have an
> external application that listens for job completion via the
> endNotificationUri and then spawns new jobs, or should the job itself
> contain code to create new jobs? Would it be a good idea to use
> Cascading here?
>
> I'm also considering how I should do job scheduling (I have a lot of
> recurring tasks). Has anyone found a good framework for controlling
> recurring jobs, or should I plan to build my own using Quartz?
>
> Any tips/best practices with regard to the issues described above are most
> welcome. Feel free to ask further questions if you find my descriptions of
> the issues lacking.
>
> Kind regards,
> Tarjei
>
>
>