Posted to dev@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2012/10/29 17:22:42 UTC

NUTCH-1370

Hi,

So I thought I'd take this one on tonight and see if I can resolve.
Basically, my high level question is as follows...
Is each line of a text file (seed file) which we attempt to inject
into the webdb considered as an individual map task?
The idea is to establish a counter for the successfully injected URLs
(and possibly a counter for unsuccessful ones as well) so that the number
of URLs which are (or should be) present within the webdb can be
determined after bootstrapping Nutch via the inject command.
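
For the sake of discussion, something along these lines is what I have in
mind - a rough sketch only, and the class and counter names below are made
up for illustration, not the existing Nutch code:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Hypothetical mapper over seed file lines, counting inject outcomes. */
public class UrlInjectMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String url = value.toString().trim();
    if (url.isEmpty() || url.startsWith("#")) {
      return; // skip blank lines and comments in the seed file
    }
    // filtering/normalisation would happen here; count the outcome either way
    if (looksValid(url)) {
      context.getCounter("injector", "urls_injected").increment(1);
      context.write(new Text(url), NullWritable.get());
    } else {
      context.getCounter("injector", "urls_rejected").increment(1);
    }
  }

  private boolean looksValid(String url) {
    // stand-in for the real URLNormalizers/URLFilters checks
    return url.startsWith("http://") || url.startsWith("https://");
  }
}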

Thanks all

Lewis

-- 
Lewis

Re: NUTCH-1370

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Again,

Thanks Julien, I will also make this method public in the patch for 2.x.

This is actually getting quite interesting now as I've found out that using
the o.a.hadoop.mapreduce.Job#Counters API can actually lead to security
issues when attempting to obtain counters from map and reduce jobs.

For my own interest I'm heading over to mapreduce-user@ to get to the
bottom of this one. What is really interesting is that an issue was filed
[0] to deal with exactly this task so maybe I can chip in over there... we
will see :0)

Thanks for the info Julien. The above aside, the patch for 2.x is nearly done.
I'll patch trunk in due course once I have the mapred specifics sorted out.

Lewis

[0] https://issues.apache.org/jira/browse/MAPREDUCE-3520

On Tue, Oct 30, 2012 at 8:27 AM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> Hi,
>
> Sounds pretty harmless to have that method public IMHO
>
> Julien
>
>
>> On 29 October 2012 16:57, Lewis John Mcgibbney <le...@gmail.com> wrote:
>
>> Hi Julien,
>>
>> Thanks for the comments. Any additional ones regarding the accessibility
>> of the getDataStoreClass?
>>
>> Thanks again
>>
>> Lewis
>>
>>
>> On Mon, Oct 29, 2012 at 4:52 PM, Julien Nioche <
>> lists.digitalpebble@gmail.com> wrote:
>>
>>> Hi Lewis
>>>
>>> see comments below
>>>
>>>>
>>>> So I thought I'd take this one on tonight and see if I can resolve.
>>>> Basically, my high level question is as follows...
>>>> Is each line of a text file (seed file) which we attempt to inject
>>>> into the webdb considered as an individual map task?
>>>>
>>>
>>> no - each file in a map task
>>>
>>>
>>>> The idea is to establish a counter for the successfully injected URLS
>>>> (and possibly a counter for unsuccessful ones as well) so determining
>>>> how many URLs are (or should be) present within the webdb can be
>>>> determined after bootstrapping Nutch via the inject command.
>>>>
>>> you get this information from the Hadoop Mapreduce Admin - the number
>>> of seeds is the Map input records of the first job, the number post
>>> filtering and normalisation is in Map output records as for the final
>>> number of urls in the crawldb post merging with whatever is in the Reduce
>>> Output Record.
>>>
>>> Just get the values from the counters of these 2 jobs to display a user
>>> friendly message in the log
>>>
>>> In general I would advise anyone to use the pseudo distributed mode
>>> instead of the local one as you get a lot more info from the Hadoop admin
>>> screen and won't have to trawl through the log files.
>>>
>>> HTH
>>>
>>> Julien
>>>
>>>
>>> --
>>> Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> http://twitter.com/digitalpebble
>>>
>>>
>>
>>
>> --
>> *Lewis*
>>
>>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
>


-- 
*Lewis*

Re: NUTCH-1370

Posted by Julien Nioche <li...@gmail.com>.
Hi,

Sounds pretty harmless to have that method public IMHO

Julien

On 29 October 2012 16:57, Lewis John Mcgibbney <le...@gmail.com> wrote:

> Hi Julien,
>
> Thanks for the comments. Any additional ones regarding the accessibility
> of the getDataStoreClass?
>
> Thanks again
>
> Lewis
>
>
> On Mon, Oct 29, 2012 at 4:52 PM, Julien Nioche <
> lists.digitalpebble@gmail.com> wrote:
>
>> Hi Lewis
>>
>> see comments below
>>
>>>
>>> So I thought I'd take this one on tonight and see if I can resolve.
>>> Basically, my high level question is as follows...
>>> Is each line of a text file (seed file) which we attempt to inject
>>> into the webdb considered as an individual map task?
>>>
>>
>> no - each file in a map task
>>
>>
>>> The idea is to establish a counter for the successfully injected URLS
>>> (and possibly a counter for unsuccessful ones as well) so determining
>>> how many URLs are (or should be) present within the webdb can be
>>> determined after bootstrapping Nutch via the inject command.
>>>
>> you get this information from the Hadoop Mapreduce Admin - the number of
>> seeds is the Map input records of the first job, the number post
>> filtering and normalisation is in Map output records as for the final
>> number of urls in the crawldb post merging with whatever is in the Reduce
>> Output Record.
>>
>> Just get the values from the counters of these 2 jobs to display a user
>> friendly message in the log
>>
>> In general I would advise anyone to use the pseudo distributed mode
>> instead of the local one as you get a lot more info from the Hadoop admin
>> screen and won't have to trawl through the log files.
>>
>> HTH
>>
>> Julien
>>
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>>
>>
>
>
> --
> *Lewis*
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: NUTCH-1370

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Julien,

Thanks for the comments. Any additional ones regarding the accessibility of
the getDataStoreClass?

Thanks again

Lewis

On Mon, Oct 29, 2012 at 4:52 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> Hi Lewis
>
> see comments below
>
>>
>> So I thought I'd take this one on tonight and see if I can resolve.
>> Basically, my high level question is as follows...
>> Is each line of a text file (seed file) which we attempt to inject
>> into the webdb considered as an individual map task?
>>
>
> no - each file in a map task
>
>
>> The idea is to establish a counter for the successfully injected URLS
>> (and possibly a counter for unsuccessful ones as well) so determining
>> how many URLs are (or should be) present within the webdb can be
>> determined after bootstrapping Nutch via the inject command.
>>
> you get this information from the Hadoop Mapreduce Admin - the number of
> seeds is the Map input records of the first job, the number post
> filtering and normalisation is in Map output records as for the final
> number of urls in the crawldb post merging with whatever is in the Reduce
> Output Record.
>
> Just get the values from the counters of these 2 jobs to display a user
> friendly message in the log
>
> In general I would advise anyone to use the pseudo distributed mode
> instead of the local one as you get a lot more info from the Hadoop admin
> screen and won't have to trawl through the log files.
>
> HTH
>
> Julien
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
>


-- 
*Lewis*

Re: NUTCH-1370

Posted by Julien Nioche <li...@gmail.com>.
Hi Lewis

see comments below

>
> So I thought I'd take this one on tonight and see if I can resolve.
> Basically, my high level question is as follows...
> Is each line of a text file (seed file) which we attempt to inject
> into the webdb considered as an individual map task?
>

no - each file is a map task, not each line


> The idea is to establish a counter for the successfully injected URLS
> (and possibly a counter for unsuccessful ones as well) so determining
> how many URLs are (or should be) present within the webdb can be
> determined after bootstrapping Nutch via the inject command.
>
you get this information from the Hadoop MapReduce admin screen - the number of
seeds is the Map input records of the first job, the number post filtering
and normalisation is in the Map output records, and the final number of URLs
in the crawldb, post merging with whatever is already in there, is in the
Reduce output records.

Just get the values from the counters of these two jobs to display a
user-friendly message in the log.
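
Something along these lines in the driver once the two jobs have completed,
for example - a rough sketch only, assuming the new mapreduce API and that
the two Job handles are still around (TaskCounter is the Hadoop 2.x name for
the built-in task counters; on older versions the equivalent enum lives in
the mapred package, and the helper/parameter names here are illustrative):

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Illustrative helper, not existing Nutch code. */
public class InjectReport {

  private static final Logger LOG = LoggerFactory.getLogger(InjectReport.class);

  /** Log a user-friendly summary from the counters of the two inject jobs. */
  public static void report(Job firstJob, Job mergeJob) throws Exception {
    Counters first = firstJob.getCounters();
    long seeds = first.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
    long kept = first.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();

    Counters merge = mergeJob.getCounters();
    long total = merge.findCounter(TaskCounter.REDUCE_OUTPUT_RECORDS).getValue();

    LOG.info("Injector: " + seeds + " seed URLs read, " + kept
        + " kept after filtering/normalisation, " + total
        + " URLs in the crawldb after merging.");
  }
}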

In general I would advise anyone to use the pseudo-distributed mode instead
of the local one, as you get a lot more info from the Hadoop admin screen
and won't have to trawl through the log files.

HTH

Julien


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: NUTCH-1370

Posted by Lewis John Mcgibbney <le...@gmail.com>.
In addition to this, can someone please explain why
StorageUtils#getDataStoreClass [0] is a private method? The reason I ask
is that it would be nice to be able to log which Gora class is being used
to persist the injected URLs.
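
What I'd like to end up with is roughly the following (just a sketch of the
intent, assuming the method gets opened up - I haven't checked the exact
signature or generics yet, and the wrapper class name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.storage.StorageUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Illustrative only - the real logging would live in InjectorJob itself. */
public class InjectStorageLogger {

  private static final Logger LOG =
      LoggerFactory.getLogger(InjectStorageLogger.class);

  /** Log which Gora DataStore implementation will persist the injected URLs. */
  public static void logDataStore(Configuration conf)
      throws ClassNotFoundException {
    Class<?> dataStoreClass = StorageUtils.getDataStoreClass(conf);
    LOG.info("InjectorJob: using " + dataStoreClass.getName()
        + " as the Gora storage class.");
  }
}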

Are there any security risks associated with making this method public
and accessible?

Thanks

Lewis

[0] https://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/storage/StorageUtils.java

On Mon, Oct 29, 2012 at 4:22 PM, Lewis John Mcgibbney
<le...@gmail.com> wrote:
> Hi,
>
> So I thought I'd take this one on tonight and see if I can resolve.
> Basically, my high level question is as follows...
> Is each line of a text file (seed file) which we attempt to inject
> into the webdb considered as an individual map task?
> The idea is to establish a counter for the successfully injected URLS
> (and possibly a counter for unsuccessful ones as well) so determining
> how many URLs are (or should be) present within the webdb can be
> determined after bootstrapping Nutch via the inject command.
>
> Thanks all
>
> Lewis
>
> --
> Lewis



-- 
Lewis