Posted to user@nutch.apache.org by Bayu Widyasanyata <bw...@gmail.com> on 2013/01/12 02:35:14 UTC

How segments is created?

Hi,

When "nutch generate" is executed, new segments are sometimes created and
sometimes not?
This seems to be the case when "Segment already parsed" is reported, for example:

ParseSegment: segment: crawl/segments/20130106091814
Exception in thread "main" java.io.IOException: Segment already parsed!

My question is: how are new segments created, and how does Nutch know that a
page has been updated?
Is that handled by the fetching process, which knows when a page is updated?

Is my analysis above correct?

For now, I use a "trick" to force the generation of segments by passing the
-adddays option to the nutch generate command.

Thanks,

-- 
wassalam,
[bayu]

Re: How segments is created?

Posted by Bayu Widyasanyata <bw...@gmail.com>.

On Sun, Jan 13, 2013 at 5:50 PM, Markus Jelsma <ma...@openindex.io> wrote:

> No, you can plug in another FetchSchedule that supports adjusting the
> interval based on whether a record is modified. See the
> AdaptiveFetchSchedule for an example.
>

Hi,

Thanks for pointing me to that, since I'm new to Nutch & Solr :)
Sadly, this doc [0] is not available yet.
But this [1] is very helpful to get started.

Thanks.

[0] http://wiki.apache.org/nutch/AdaptiveFetchSchedule
[1] http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/

-- 
wassalam,
[bayu]

RE: How segments is created?

Posted by Markus Jelsma <ma...@openindex.io>.

-----Original message-----
> From:Bayu Widyasanyata <bw...@gmail.com>
> Sent: Sun 13-Jan-2013 07:34
> To: user@nutch.apache.org
> Subject: Re: How segments is created?
> 
> On Sun, Jan 13, 2013 at 12:47 PM, Tejas Patil <te...@gmail.com> wrote:
> 
> > Well, if you know that the front page is updated frequently, set
> > "db.fetch.interval.default" to a lower value so that URLs become eligible
> > for re-fetch sooner. By default, if a URL is fetched successfully, it
> > becomes eligible for re-fetching after 30 days.
> 
> Very clear!
> In summary: Nutch cannot tell whether a page has been updated, so if a page
> is updated frequently we should set "db.fetch.interval.default" to a lower
> value so that the page is re-fetched.

No, you can plug in another FetchSchedule that supports adjusting the interval based on whether a record is modified. See the AdaptiveFetchSchedule for an example.
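
To make the idea concrete, here is a minimal Python sketch of an adaptive schedule. It is not Nutch's Java implementation: it only shows the principle that the interval shrinks when a page was found modified and grows when it was not, clamped between a minimum and a maximum. The function name, the rates and the bounds are illustrative assumptions.

```python
# Sketch of an adaptive re-fetch interval (illustrative only; this is not
# the AdaptiveFetchSchedule source). All names and constants are assumptions.

DAY = 24 * 60 * 60  # seconds in a day


def next_interval(interval, modified, inc_rate=0.4, dec_rate=0.2,
                  min_interval=60, max_interval=365 * DAY):
    """Return the next fetch interval in seconds.

    modified=True  -> the page changed since the last fetch: re-fetch sooner.
    modified=False -> the page is unchanged: wait longer next time.
    """
    if modified:
        interval = interval * (1.0 - dec_rate)
    else:
        interval = interval * (1.0 + inc_rate)
    # Keep the interval inside the configured bounds.
    return max(min_interval, min(max_interval, interval))


if __name__ == "__main__":
    interval = 30 * DAY
    for changed in (True, True, False):
        interval = next_interval(interval, changed)
        print(round(interval / DAY, 2), "days until next fetch")
```

With a schedule like this, a frequently changing front page drifts towards short intervals on its own, while rarely changing pages are visited less often.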


Re: How segments is created?

Posted by Bayu Widyasanyata <bw...@gmail.com>.

On Sun, Jan 13, 2013 at 12:47 PM, Tejas Patil <te...@gmail.com> wrote:

> Well, if you know that the front page is updated frequently, set
> "db.fetch.interval.default" to a lower value so that URLs become eligible
> for re-fetch sooner. By default, if a URL is fetched successfully, it
> becomes eligible for re-fetching after 30 days.

Very clear!
In summary: Nutch cannot tell whether a page has been updated, so if a page
is updated frequently we should set "db.fetch.interval.default" to a lower
value so that the page is re-fetched.

Thanks so much!
-- 
wassalam,
[bayu]

Re: How segments is created?

Posted by Tejas Patil <te...@gmail.com>.

Hi Bayu,

On Sat, Jan 12, 2013 at 9:15 PM, Bayu Widyasanyata <bw...@gmail.com> wrote:

> Hi Tejas,
> Sorry if my questions are confusing :)
>
It's ok :)

>
> I have read your post on StackOverflow, and it made things clearer for me.
>
> What I still don't understand is how Nutch knows when it should not parse a
> segment (as in the "Segment already parsed" message)?
>

When Nutch parses a segment, it creates the parse_text, parse_data and
crawl_parse sub-directories inside the segment's directory. These store the
output of the parse command. The next time you try to run the parse command on
the same segment, it finds that these sub-directories are already present
and thus logs a message indicating that the segment was already parsed.
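
The check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual Nutch code (which is Java): the directory names come from the explanation above, while the function name and the choice to treat any one of the directories as "already parsed" are assumptions.

```python
# Sketch of the "already parsed" check (illustrative; not Nutch source code).
# Directory names are taken from the message; the function is hypothetical.
import os

PARSE_DIRS = ("parse_text", "parse_data", "crawl_parse")


def is_already_parsed(segment_dir):
    """Return True if any parse output sub-directory exists in the segment."""
    return any(os.path.isdir(os.path.join(segment_dir, name))
               for name in PARSE_DIRS)


if __name__ == "__main__":
    segment = "crawl/segments/20130106091814"
    if is_already_parsed(segment):
        print("Segment already parsed:", segment)
    else:
        print("Segment not parsed yet:", segment)
```

A wrapper script could use a check like this to skip the parse step for segments that already contain parse output instead of running into the exception.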


> Sometimes I have to do it more than twice to get a document (a URL) and its
> outlinks fetched and parsed by Nutch (to get more depth).


Didn't get what you wanted to convey.

>
> Back to my question.
> A simple example is the front page of an online newspaper website.
> If they add 1 (one) news item to the front page, will Nutch create a new
> segment inside the crawl/segments directory (e.g. in YYYYMMDDHHMMSS format)?
>

Segments are created for every individual round. They are not generated for
individual URLs.

>
> Hence, if Nutch cannot identify whether a page has actually been updated (in
> the example above, the front page of an online newspaper adds 1 news item /
> 1 outlink), should we force Nutch to re-fetch the URL? Is that correct?
> Or should we add the -adddays option periodically to ensure that we have an
> up-to-date database?
>

Well, if you know that the front page is updated frequently, set
"db.fetch.interval.default" to a lower value so that URLs become eligible
for re-fetch sooner. By default, if a URL is fetched successfully, it
becomes eligible for re-fetching after 30 days.
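
As a rough illustration of that rule, the Python sketch below models when a URL becomes due again. It is not Nutch code: the function and parameter names are made up for the example, and -adddays is modelled as simply shifting "now" forward by the given number of days.

```python
# Sketch of re-fetch eligibility (illustrative; not Nutch source code).
from datetime import datetime, timedelta

DEFAULT_INTERVAL = timedelta(days=30)  # the 30-day default mentioned above


def is_due(last_fetch, now, interval=DEFAULT_INTERVAL, add_days=0):
    """Return True if a URL fetched at last_fetch is eligible again at now.

    add_days shifts "now" forward; this is the assumed effect of -adddays.
    """
    return last_fetch + interval <= now + timedelta(days=add_days)


if __name__ == "__main__":
    fetched = datetime(2013, 1, 6, 9, 18, 14)
    ten_days_later = fetched + timedelta(days=10)
    print(is_due(fetched, ten_days_later))                              # default interval
    print(is_due(fetched, ten_days_later, interval=timedelta(days=7)))  # lower interval
    print(is_due(fetched, ten_days_later, add_days=20))                 # shifted "now"
```

Lowering the interval changes the schedule for every later round, whereas shifting the clock only affects the one generate run it is passed to.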


Re: How segments is created?

Posted by Bayu Widyasanyata <bw...@gmail.com>.

Hi Tejas,
Sorry if my questions are confusing :)

I have read your post on StackOverflow, and it made things clearer for me.

What I still don't understand is how Nutch knows when it should not parse a
segment (as in the "Segment already parsed" message)?
Sometimes I have to do it more than twice to get a document (a URL) and its
outlinks fetched and parsed by Nutch (to get more depth).

Back to my question.
A simple example is the front page of an online newspaper website.
If they add 1 (one) news item to the front page, will Nutch create a new
segment inside the crawl/segments directory (e.g. in YYYYMMDDHHMMSS format)?

Hence, if Nutch cannot identify whether a page has actually been updated (in
the example above, the front page of an online newspaper adds 1 news item /
1 outlink), should we force Nutch to re-fetch the URL? Is that correct?
Or should we add the -adddays option periodically to ensure that we have an
up-to-date database?

Thanks.-

-- 
wassalam,
[bayu]

Re: How segments is created?

Posted by Tejas Patil <te...@gmail.com>.

Hi Bayu,

I did not understand your question properly, but I will try to address your
questions as far as I can.

The generate phase creates a segment that contains only the fetch list (this
is stored in the "crawl_generate" directory inside the segment). If no URLs
in the crawldb are eligible for fetching at that point, it ends up creating
an empty directory.

It is during the fetch and parse phases that the actual data is populated
inside the segment. ([0] is a shameless plug of my answer on StackOverflow,
which describes the subdirectories inside the segments dir.) During generate
or fetch, Nutch cannot identify whether a page has actually been updated at
the content owner's end. It will have to re-fetch the corresponding URL.
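
One way to see how far a segment has progressed is to look at which sub-directories it contains. The Python sketch below does that. It is a hypothetical helper, not part of Nutch: "crawl_generate" and the three parse directories are named in this thread, while "crawl_fetch" and "content" as the fetch outputs are assumed here from the usual Nutch 1.x segment layout.

```python
# Sketch: infer which phases have run on a segment from its sub-directories
# (illustrative helper; not part of Nutch).
import os

# Sub-directories each phase is expected to leave behind. The fetch entries
# are an assumption based on the usual Nutch 1.x segment layout.
PHASE_DIRS = {
    "generate": ("crawl_generate",),
    "fetch": ("crawl_fetch", "content"),
    "parse": ("crawl_parse", "parse_data", "parse_text"),
}


def completed_phases(segment_dir):
    """Return the phases whose output directories are all present."""
    if not os.path.isdir(segment_dir):
        return []
    present = set(os.listdir(segment_dir))
    return [phase for phase, dirs in PHASE_DIRS.items()
            if all(d in present for d in dirs)]


if __name__ == "__main__":
    print(completed_phases("crawl/segments/20130106091814"))
```

A freshly generated segment would report only "generate"; after fetching and parsing, the other phases show up as their directories appear.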

Does that answer your question?

[0]: http://stackoverflow.com/questions/10225239/what-the-outputs-exactly-are-when-integrating-nutch1-4-and-solr/10262243

Thanks,
Tejas Patil
