Posted to user@nutch.apache.org by Nayanish Hinge <na...@gmail.com> on 2010/09/02 11:33:28 UTC

depth information not being available in crawl datum

Hi,
I have a specific use case where I need to know at which level (depth) I
fetched the current URL.
Currently the depth can be figured out from the for-loop index in
Crawl.java.
But my use case necessitates having this information stored in the
crawl datum. Currently Nutch does not keep any record, for a given URL, of
the level at which it was reached.
Has anyone used a technique to solve this problem?

I tried to set a custom property, "crawl-depth", in the Nutch configuration
within the for loop before the fetcher is called.
But if any intermediate failure occurs during the crawl, I lose track of
the depth.

I read about the distributed cache, but it is generally used for read-only
data.
For my use case, I would either keep this inside the crawl-datum object or
have a map of <url, depth> stored somewhere accessible to all map-reduce
jobs in an efficient manner.
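
The distinction driving this request can be shown with a minimal, Nutch-free sketch (the `DepthField` class and its byte layout are illustrative, not Nutch API): a depth that is serialized alongside each record survives job boundaries and intermediate failures, unlike a value stashed in the job configuration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch: a depth field written and read with the record itself, mimicking
// how a metadata entry on a crawl datum would be serialized by Hadoop.
public class DepthField {

    /** Serialize url + depth, read it back, and return the recovered depth. */
    static int roundTrip(String url, int depth) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeUTF(url);   // the record key
                out.writeInt(depth); // depth travels with the record
            }
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                in.readUTF();
                return in.readInt(); // still there after the "job boundary"
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```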

Thanks
-- 
Nayanish
Hyderabad

Re: depth information not being available in crawl datum

Posted by Nayanish Hinge <na...@gmail.com>.
Could somebody reply here, please?
Thanks
Nayn

-- 
Nayanish
Software Development Engineer
Amazon
Hyderabad

Re: depth information not being available in crawl datum

Posted by Nayanish Hinge <na...@gmail.com>.
Hi Julien,
Thanks for the info; however, I am still unable to see how to apply it to
my use case.
I want this depth information to be available in the HtmlParser code, which
has the 'Content' and nothing else.
I gather the scoring filter attaches metadata to all outlinks, which would
operate at a coarser level.

A 'Next Page' sequence gives out pages at the same depth, from the
perspective of pulling data out of them.
That is, a Next Page sequence yields many pages similar to the current
page, and we wish to treat all such outlinks as being at the same depth as
the current page.
However, some outlinks (non-next-page) might be treated as one level deeper.

Hope I am not being too unclear.

Thanks
Nayn





-- 
Nayanish
Software Development Engineer
Amazon
Hyderabad

Re: depth information not being available in crawl datum

Posted by Jitendra <je...@gmail.com>.
Hi,

Can separate metadata exist for each outlink? I need to attach a different
depth to each outlink after the outlinks are extracted, but the Outlink
class itself seems to have only two fields (url and anchor).

Can you please explain in a little more detail how we can achieve the above
goal using scoring filters?
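
For what it's worth, in Nutch 1.x a ScoringFilter receives each outlink paired with its own CrawlDatum (in `distributeScoreToOutlinks`), so per-outlink state can ride in that CrawlDatum's metadata map even though `Outlink` itself only carries url and anchor. A self-contained sketch of that idea, using plain `Map<String, String>` as a stand-in for the metadata `MapWritable`; the `_depth_` key and the next-page flag are hypothetical names, not Nutch API:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

// Each outlink gets its own metadata map, mirroring the Entry<Text, CrawlDatum>
// pairs a ScoringFilter sees, so different outlinks can carry different depths.
public class OutlinkDepth {

    static final String DEPTH_KEY = "_depth_"; // hypothetical metadata key

    /** Attach a depth to each outlink; "next page" links keep the parent's depth. */
    static List<Entry<String, Map<String, String>>> tagOutlinks(
            int parentDepth, Map<String, Boolean> outlinkIsNextPage) {
        List<Entry<String, Map<String, String>>> targets = new ArrayList<>();
        for (Entry<String, Boolean> e : outlinkIsNextPage.entrySet()) {
            int d = e.getValue() ? parentDepth : parentDepth + 1;
            Map<String, String> meta = new HashMap<>();
            meta.put(DEPTH_KEY, Integer.toString(d));
            targets.add(new SimpleEntry<>(e.getKey(), meta));
        }
        return targets;
    }

    /** Look up the depth recorded for one outlink, or -1 if absent. */
    static int depthOf(List<Entry<String, Map<String, String>>> targets, String url) {
        for (Entry<String, Map<String, String>> t : targets)
            if (t.getKey().equals(url))
                return Integer.parseInt(t.getValue().get(DEPTH_KEY));
        return -1;
    }
}
```

This also matches the next-page use case discussed above: next-page outlinks inherit the parent depth, while other outlinks go one level deeper.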

Thanks a lot !



-- 
Thanks and regards

Jitendra Singh


Re: depth information not being available in crawl datum

Posted by Julien Nioche <li...@gmail.com>.
Hi,

You could track the depth of a URL from the seeds by implementing a custom
ScoringFilter. ScoringFilters are called at various points of the workflow,
including when outlinks have been found for a page. The logic would be to
simply increment the depth of the current page and attach it as metadata to
the outlinks. You can then use this to prioritize pages with a lower depth
value during generation, or to limit the crawl to a specific depth.

HTH

Julien
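
Stripped of the Nutch plumbing, the bookkeeping described above is a breadth-first depth assignment with a min-merge when the same URL is reached again (which is what the crawldb update step would do). A runnable sketch, with illustrative class and method names rather than Nutch API:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Sketch of the depth bookkeeping a custom ScoringFilter would do: every
// outlink is recorded at parentDepth + 1, and when a URL is reached via
// several paths, the smaller depth wins on merge.
public class DepthCrawl {

    /** BFS over a link graph, returning the depth at which each URL is first reachable. */
    static Map<String, Integer> depths(Map<String, List<String>> outlinks,
                                       List<String> seeds, int maxDepth) {
        Map<String, Integer> depth = new HashMap<>();
        Queue<String> frontier = new ArrayDeque<>();
        for (String seed : seeds) {          // injected URLs start at depth 0
            depth.put(seed, 0);
            frontier.add(seed);
        }
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            int d = depth.get(url);
            if (d >= maxDepth) continue;     // limit the crawl to a specific depth
            for (String out : outlinks.getOrDefault(url, List.of())) {
                Integer known = depth.get(out);
                int candidate = d + 1;       // the increment described above
                if (known == null || candidate < known) { // min-merge on update
                    depth.put(out, candidate);
                    frontier.add(out);
                }
            }
        }
        return depth;
    }

    /** Fixed sample graph: "b" is reachable at depth 1 (via seed) and depth 2 (via "a"). */
    static Map<String, Integer> demo() {
        Map<String, List<String>> g = Map.of(
            "seed", List.of("a", "b"),
            "a", List.of("b", "c"));
        return depths(g, List.of("seed"), 10);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // "b" keeps depth 1, the shorter of its two paths
    }
}
```

In an actual plugin, the seed case would live in `injectedScore`, the increment in the outlink-distribution hook, and the min-merge in the crawldb-update hook; the map here stands in for metadata persisted on each record.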

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
