Posted to user@nutch.apache.org by prateek sachdeva <pr...@gmail.com> on 2020/07/20 21:38:23 UTC

Apache Nutch 1.16 Fetcher reducers?

Hi Guys,

As per Apache Nutch 1.16 Fetcher class implementation here -
https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java,
this is a map-only job. I don't see any reducer set in the Job. So my
question is: why not set job.setNumReduceTasks(0) and save time by
outputting directly to HDFS?
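For illustration only (this is not Nutch or Hadoop API code): the practical
difference between a map-only job and one with a reduce phase is that the
reduce phase shuffles and sorts records by key, while a map-only job writes
them in whatever order the mappers emit them. A schematic Python sketch:

```python
# Schematic simulation of map-only vs. map+reduce output ordering.
# Not Hadoop/Nutch API code - just an illustration of what the
# reduce step buys you even when no reducer logic runs.

def map_only(records):
    """With 0 reducers, output keeps the mappers' emission order."""
    return list(records)

def map_reduce(records):
    """A reduce phase shuffles and sorts records by key before writing."""
    return sorted(records, key=lambda kv: kv[0])

# Fetch completion order is effectively arbitrary, not URL order:
fetched = [("http://b.example/", "..."), ("http://a.example/", "...")]

print(map_only(fetched)[0][0])    # http://b.example/  (unsorted)
print(map_reduce(fetched)[0][0])  # http://a.example/  (sorted by URL key)
```

The thread below explains why that sorted order matters for Nutch segments.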

Regards
Prateek

Re: Apache Nutch 1.16 Fetcher reducers?

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
> might have to create my own custom FetcherOutputFormat to allow out of
> order writes. I will check how I can do that.

Just replace the MapFile.Writer with a SequenceFile.Writer.
Possibly, this will require further changes.
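The contract difference, schematically (plain-Python illustration of the
ordering requirement, not the actual Hadoop MapFile/SequenceFile API):

```python
# A MapFile-style writer requires keys in sorted order (so it can build
# an index for random access); a SequenceFile-style writer accepts keys
# in any order. Simplified stand-ins, not the Hadoop classes.

class SequenceFileLikeWriter:
    def __init__(self):
        self.records = []

    def append(self, key, value):
        self.records.append((key, value))  # any order is fine

class MapFileLikeWriter(SequenceFileLikeWriter):
    def append(self, key, value):
        if self.records and key < self.records[-1][0]:
            raise IOError("key out of order: %r" % key)
        super().append(key, value)

seq = SequenceFileLikeWriter()
seq.append("http://b.example/", "...")
seq.append("http://a.example/", "...")     # accepted

mf = MapFileLikeWriter()
mf.append("http://b.example/", "...")
try:
    mf.append("http://a.example/", "...")  # rejected
except IOError as e:
    print(e)  # key out of order: 'http://a.example/'
```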

> I have also concluded this discussion here -
> https://stackoverflow.com/questions/63003881/apache-nutch-1-16-fetcher-reducers/.

Thanks for updating the discussion there!

On 7/22/20 4:09 PM, prateek sachdeva wrote:
> Thanks a lot Sebastian. Yes, after checking the logs I saw "key out of
> order exception" and realized that MapFile expects entries to be in order
> and MapFile is used in FetcherOutputFormat while writing data to HDFS. I
> might have to create my own custom FetcherOutputFormat to allow out of
> order writes. I will check how I can do that.
> 
> I will also try to merge parsing and avro conversion into the fetch Job
> directly to see if there are any improvements.
> 
> I have also concluded this discussion here -
> https://stackoverflow.com/questions/63003881/apache-nutch-1-16-fetcher-reducers/.
> So if you want to add something here, please feel free to do so.
> 
> Regards
> Prateek
> 
> On Tue, Jul 21, 2020 at 7:50 PM Sebastian Nagel
> <wa...@googlemail.com.invalid> wrote:
> 
>> Hi Prateek,
>>
>>> if I do 0 reducers in
>>> the Fetch phase, I am not getting all the urls in output that I seeded in
>>> input. Looks like only a few of them made it to the final output.
>>
>> There should be error messages in the task logs caused by output not sorted
>> by URL (used as key in map files).
>>
>>
>>>> Final clarification - If I do fetcher.store.content=true and
>>>> fetcher.parse=true, I don't need that Parse Job in my workflow and
>> parsing
>>>> will be done as part of fetcher flow only?
>>
>> Yes, parsing is then done in the fetcher and the parse output is written to
>> crawl_parse, parse_text and parse_data.
>>
>> Best,
>> Sebastian
>>
>> On 7/21/20 3:42 PM, prateek sachdeva wrote:
>>> Correcting my statement below. I just realized that if I do 0 reducers in
>>> the Fetch phase, I am not getting all the urls in output that I seeded in
>>> input. Looks like only a few of them made it to the final output.
>>> So something is not working as expected if we use 0 reducers in the Fetch
>>> phase.
>>>
>>> Regards
>>> Prateek
>>>
>>> On Tue, Jul 21, 2020 at 2:13 PM prateek sachdeva <pr...@gmail.com>
>>> wrote:
>>>
>>>> Makes complete sense. Agreed that 0 reducers in apache nutch fetcher
>> won't
>>>> make sense because of tooling that's built around it.
>>>> Answering your questions - No, we have not made any changes to
>>>> FetcherOutputFormat. Infact, the whole fetcher and parse job is the
>> same as
>>>> that of apache nutch 1.16(Fetcher.java and ParseSegment.java). We have
>>>> built wrappers around these classes to run using Azkaban (
>>>> https://azkaban.github.io/). And still it works if I assign 0 reducers
>> in
>>>> the Fetch phase.
>>>>
>>>> Final clarification - If I do fetcher.store.content=true and
>>>> fetcher.parse=true, I don't need that Parse Job in my workflow and
>> parsing
>>>> will be done as part of fetcher flow only?
>>>> Also, I agree with your point that if I modify FetcherOutputFormat to
>>>> include avro conversion step, I might get rid of that as well. This will
>>>> save some time for sure since Fetcher will be directly creating the
>> final
>>>> avro format that I need. So the only question remains is that if I do
>>>> fetcher.parse=true, can I get rid of parse Job as a separate step
>>>> completely.
>>>>
>>>> Regards
>>>> Prateek
>>>>
>>>> On Tue, Jul 21, 2020 at 1:26 PM Sebastian Nagel
>>>> <wa...@googlemail.com.invalid> wrote:
>>>>
>>>>> Hi Prateek,
>>>>>
>>>>> (regarding 1.)
>>>>>
>>>>> It's also possible to combine fetcher.store.content=true and
>>>>> fetcher.parse=true.
>>>>> You might save some time unless the fetch job is CPU-bound - it usually
>>>>> is limited by network and RAM for buffering content.
>>>>>
>>>>>> which code are you referring to?
>>>>>
>>>>> Maybe it isn't "a lot". The SegmentReader is assuming map files, and
>>>>> there are probably
>>>>> some more tools which also do.  If nothing is used in your workflow,
>>>>> that's fine.
>>>>> But if a fetcher without the reduce step should become the default for
>>>>> Nutch, we'd
>>>>> need to take care for all tools and also ensure backward-compatibility.
>>>>>
>>>>>
>>>>>> FYI- I tried running with 0 reducers
>>>>>
>>>>> I assume you've also adapted FetcherOutputFormat ?
>>>>>
>>>>> Btw., you could think about inlining the "avroConversion" (or parts of
>>>>> it) into FetcherOutputFormat which also could remove the need to
>>>>> store the content.
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>>
>>>>> On 7/21/20 11:28 AM, prateek sachdeva wrote:
>>>>>> Hi Sebastian,
>>>>>>
>>>>>> Thanks for your reply. Couple of questions -
>>>>>>
>>>>>> 1. We have customized apache nutch jobs a bit like this. We have a
>>>>> separate parse job (ParseSegment.java) after fetch job (Fetcher.java).
>> So
>>>>>> as suggested above, if I use fetcher.store.content=false, I am
>> assuming
>>>>> the "content" folder will not be created and hence our parse job
>>>>>> won't work because it takes the content folder as an input file. Also,
>>>>> we have added an additional step "avroConversion" which takes input
>>>>>> as "parse_data", "parse_text", "content" and "crawl_fetch" and
>> converts
>>>>> into a specific avro schema defined by us. So I think, I will end up
>>>>>> breaking a lot of things if I add fetcher.store.content=false and do
>>>>> parsing in the fetch phase only (fetcher.parse=true)
>>>>>>
>>>>>> 2. In your earlier email, you said "a lot of code accessing the
>>>>> segments still assumes map files", which code are you referring to? In
>> my
>>>>>> use case above, we are not sending the crawled output to any indexers.
>>>>> In the avro conversion step, we just convert data into avro schema
>>>>>> and dump to HDFS. Do you think we still need reducers in the fetch
>>>>> phase? FYI- I tried running with 0 reducers and don't see any impact as
>>>>>> such.
>>>>>>
>>>>>> Appreciate your help.
>>>>>>
>>>>>> Regards
>>>>>> Prateek
>>>>>>
>>>>>> On Tue, Jul 21, 2020 at 9:06 AM Sebastian Nagel <
>>>>> wastl.nagel@googlemail.com.invalid> wrote:
>>>>>>
>>>>>>     Hi Prateek,
>>>>>>
>>>>>>     you're right there is no specific reducer used but without a
>> reduce
>>>>> step
>>>>>>     the segment data isn't (re)partitioned and the data isn't sorted.
>>>>>>     This was a strong requirement once Nutch was a complete search
>>>>> engine
>>>>>>     and the "content" subdir of a segment was used as page cache.
>>>>>>     Getting the content from a segment is fast if the segment is
>>>>> partitioned
>>>>>>     in a predictable way (hash partitioning) and map files are used.
>>>>>>
>>>>>>     Well, this isn't a strong requirement anymore, since Nutch uses
>>>>> Solr,
>>>>>>     Elasticsearch or other index services. But a lot of code accessing
>>>>>>     the segments still assumes map files. Removing the reduce step
>> from
>>>>>>     the fetcher would also mean a lot of work in code and tools
>>>>> accessing
>>>>>>     the segments, esp. to ensure backward compatibility.
>>>>>>
>>>>>>     Have you tried to run the fetcher with
>>>>>>      fetcher.parse=true
>>>>>>      fetcher.store.content=false ?
>>>>>>     This will save a lot of time and without the need to write the
>> large
>>>>>>     raw content the reduce phase should be fast, only a small fraction
>>>>>>     (5-10%) of the fetcher map phase.
>>>>>>
>>>>>>     Best,
>>>>>>     Sebastian
>>>>>>
>>>>>>
>>>>>>     On 7/20/20 11:38 PM, prateek sachdeva wrote:
>>>>>>     > Hi Guys,
>>>>>>     >
>>>>>>     > As per Apache Nutch 1.16 Fetcher class implementation here -
>>>>>>     >
>>>>>
>> https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java
>>>>> ,
>>>>>>     > this is a map only job. I don't see any reducer set in the Job.
>>>>> So my
>>>>>>     > question is why not set job.setNumreduceTasks(0) and save the
>>>>> time by
>>>>>>     > outputting directly to HDFS.
>>>>>>     >
>>>>>>     > Regards
>>>>>>     > Prateek
>>>>>>     >
>>>>>>
>>>>>
>>>>>
>>>
>>
>>
> 


Re: Apache Nutch 1.16 Fetcher reducers?

Posted by prateek sachdeva <pr...@gmail.com>.
Thanks a lot Sebastian. Yes, after checking the logs I saw "key out of
order exception" and realized that MapFile expects entries to be in order
and MapFile is used in FetcherOutputFormat while writing data to HDFS. I
might have to create my own custom FetcherOutputFormat to allow out of
order writes. I will check how I can do that.
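One possible approach, as a hypothetical sketch (not Nutch code): buffer
records as they arrive and sort them by key when the writer is closed, so the
underlying sorted writer never sees out-of-order keys:

```python
# Hypothetical sketch: accept out-of-order appends by buffering and
# sorting at close time. A real FetcherOutputFormat variant in Java
# would have to bound the buffer (or spill to disk) for large segments.

class BufferingSortedWriter:
    def __init__(self, sink):
        self.sink = sink    # callable receiving (key, value) in sorted order
        self.buffer = []

    def append(self, key, value):
        self.buffer.append((key, value))   # any order accepted here

    def close(self):
        for key, value in sorted(self.buffer, key=lambda kv: kv[0]):
            self.sink(key, value)

out = []
w = BufferingSortedWriter(lambda k, v: out.append(k))
w.append("http://b.example/", "...")
w.append("http://a.example/", "...")
w.close()
print(out)  # ['http://a.example/', 'http://b.example/']
```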

I will also try to merge parsing and avro conversion into the fetch Job
directly to see if there are any improvements.

I have also summarized this discussion here -
https://stackoverflow.com/questions/63003881/apache-nutch-1-16-fetcher-reducers/.
So if you want to add something there, please feel free to do so.

Regards
Prateek

On Tue, Jul 21, 2020 at 7:50 PM Sebastian Nagel
<wa...@googlemail.com.invalid> wrote:

> Hi Prateek,
>
> > if I do 0 reducers in
> > the Fetch phase, I am not getting all the urls in output that I seeded in
> > input. Looks like only a few of them made it to the final output.
>
> There should be error messages in the task logs caused by output not sorted
> by URL (used as key in map files).
>
>
> >> Final clarification - If I do fetcher.store.content=true and
> >> fetcher.parse=true, I don't need that Parse Job in my workflow and
> parsing
> >> will be done as part of fetcher flow only?
>
> Yes, parsing is then done in the fetcher and the parse output is written to
> crawl_parse, parse_text and parse_data.
>
> Best,
> Sebastian
>
> On 7/21/20 3:42 PM, prateek sachdeva wrote:
> > Correcting my statement below. I just realized that if I do 0 reducers in
> > the Fetch phase, I am not getting all the urls in output that I seeded in
> > input. Looks like only a few of them made it to the final output.
> > So something is not working as expected if we use 0 reducers in the Fetch
> > phase.
> >
> > Regards
> > Prateek
> >
> > On Tue, Jul 21, 2020 at 2:13 PM prateek sachdeva <pr...@gmail.com>
> > wrote:
> >
> >> Makes complete sense. Agreed that 0 reducers in apache nutch fetcher
> won't
> >> make sense because of tooling that's built around it.
> >> Answering your questions - No, we have not made any changes to
> >> FetcherOutputFormat. Infact, the whole fetcher and parse job is the
> same as
> >> that of apache nutch 1.16(Fetcher.java and ParseSegment.java). We have
> >> built wrappers around these classes to run using Azkaban (
> >> https://azkaban.github.io/). And still it works if I assign 0 reducers
> in
> >> the Fetch phase.
> >>
> >> Final clarification - If I do fetcher.store.content=true and
> >> fetcher.parse=true, I don't need that Parse Job in my workflow and
> parsing
> >> will be done as part of fetcher flow only?
> >> Also, I agree with your point that if I modify FetcherOutputFormat to
> >> include avro conversion step, I might get rid of that as well. This will
> >> save some time for sure since Fetcher will be directly creating the
> final
> >> avro format that I need. So the only question remains is that if I do
> >> fetcher.parse=true, can I get rid of parse Job as a separate step
> >> completely.
> >>
> >> Regards
> >> Prateek
> >>
> >> On Tue, Jul 21, 2020 at 1:26 PM Sebastian Nagel
> >> <wa...@googlemail.com.invalid> wrote:
> >>
> >>> Hi Prateek,
> >>>
> >>> (regarding 1.)
> >>>
> >>> It's also possible to combine fetcher.store.content=true and
> >>> fetcher.parse=true.
> >>> You might save some time unless the fetch job is CPU-bound - it usually
> >>> is limited by network and RAM for buffering content.
> >>>
> >>>> which code are you referring to?
> >>>
> >>> Maybe it isn't "a lot". The SegmentReader is assuming map files, and
> >>> there are probably
> >>> some more tools which also do.  If nothing is used in your workflow,
> >>> that's fine.
> >>> But if a fetcher without the reduce step should become the default for
> >>> Nutch, we'd
> >>> need to take care for all tools and also ensure backward-compatibility.
> >>>
> >>>
> >>>> FYI- I tried running with 0 reducers
> >>>
> >>> I assume you've also adapted FetcherOutputFormat ?
> >>>
> >>> Btw., you could think about inlining the "avroConversion" (or parts of
> >>> it) into FetcherOutputFormat which also could remove the need to
> >>> store the content.
> >>>
> >>> Best,
> >>> Sebastian
> >>>
> >>>
> >>> On 7/21/20 11:28 AM, prateek sachdeva wrote:
> >>>> Hi Sebastian,
> >>>>
> >>>> Thanks for your reply. Couple of questions -
> >>>>
> >>>> 1. We have customized apache nutch jobs a bit like this. We have a
> >>> separate parse job (ParseSegment.java) after fetch job (Fetcher.java).
> So
> >>>> as suggested above, if I use fetcher.store.content=false, I am
> assuming
> >>> the "content" folder will not be created and hence our parse job
> >>>> won't work because it takes the content folder as an input file. Also,
> >>> we have added an additional step "avroConversion" which takes input
> >>>> as "parse_data", "parse_text", "content" and "crawl_fetch" and
> converts
> >>> into a specific avro schema defined by us. So I think, I will end up
> >>>> breaking a lot of things if I add fetcher.store.content=false and do
> >>> parsing in the fetch phase only (fetcher.parse=true)
> >>>>
> >>>> 2. In your earlier email, you said "a lot of code accessing the
> >>> segments still assumes map files", which code are you referring to? In
> my
> >>>> use case above, we are not sending the crawled output to any indexers.
> >>> In the avro conversion step, we just convert data into avro schema
> >>>> and dump to HDFS. Do you think we still need reducers in the fetch
> >>> phase? FYI- I tried running with 0 reducers and don't see any impact as
> >>>> such.
> >>>>
> >>>> Appreciate your help.
> >>>>
> >>>> Regards
> >>>> Prateek
> >>>>
> >>>> On Tue, Jul 21, 2020 at 9:06 AM Sebastian Nagel <
> >>> wastl.nagel@googlemail.com.invalid> wrote:
> >>>>
> >>>>     Hi Prateek,
> >>>>
> >>>>     you're right there is no specific reducer used but without a
> reduce
> >>> step
> >>>>     the segment data isn't (re)partitioned and the data isn't sorted.
> >>>>     This was a strong requirement once Nutch was a complete search
> >>> engine
> >>>>     and the "content" subdir of a segment was used as page cache.
> >>>>     Getting the content from a segment is fast if the segment is
> >>> partitioned
> >>>>     in a predictable way (hash partitioning) and map files are used.
> >>>>
> >>>>     Well, this isn't a strong requirement anymore, since Nutch uses
> >>> Solr,
> >>>>     Elasticsearch or other index services. But a lot of code accessing
> >>>>     the segments still assumes map files. Removing the reduce step
> from
> >>>>     the fetcher would also mean a lot of work in code and tools
> >>> accessing
> >>>>     the segments, esp. to ensure backward compatibility.
> >>>>
> >>>>     Have you tried to run the fetcher with
> >>>>      fetcher.parse=true
> >>>>      fetcher.store.content=false ?
> >>>>     This will save a lot of time and without the need to write the
> large
> >>>>     raw content the reduce phase should be fast, only a small fraction
> >>>>     (5-10%) of the fetcher map phase.
> >>>>
> >>>>     Best,
> >>>>     Sebastian
> >>>>
> >>>>
> >>>>     On 7/20/20 11:38 PM, prateek sachdeva wrote:
> >>>>     > Hi Guys,
> >>>>     >
> >>>>     > As per Apache Nutch 1.16 Fetcher class implementation here -
> >>>>     >
> >>>
> https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java
> >>> ,
> >>>>     > this is a map only job. I don't see any reducer set in the Job.
> >>> So my
> >>>>     > question is why not set job.setNumreduceTasks(0) and save the
> >>> time by
> >>>>     > outputting directly to HDFS.
> >>>>     >
> >>>>     > Regards
> >>>>     > Prateek
> >>>>     >
> >>>>
> >>>
> >>>
> >
>
>

Re: Apache Nutch 1.16 Fetcher reducers?

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi Prateek,

> if I do 0 reducers in
> the Fetch phase, I am not getting all the urls in output that I seeded in
> input. Looks like only a few of them made it to the final output.

There should be error messages in the task logs, caused by the output not being
sorted by URL (which is used as the key in the map files).


>> Final clarification - If I do fetcher.store.content=true and
>> fetcher.parse=true, I don't need that Parse Job in my workflow and parsing
>> will be done as part of fetcher flow only?

Yes, parsing is then done in the fetcher and the parse output is written to
crawl_parse, parse_text and parse_data.
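In nutch-site.xml, that combination would look like the following (property
names as discussed above; shown here as a sketch of the overrides, with all
other settings left at their defaults):

```xml
<!-- Parse during the fetch job and skip storing raw content -->
<property>
  <name>fetcher.parse</name>
  <value>true</value>
</property>
<property>
  <name>fetcher.store.content</name>
  <value>false</value>
</property>
```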

Best,
Sebastian

On 7/21/20 3:42 PM, prateek sachdeva wrote:
> Correcting my statement below. I just realized that if I do 0 reducers in
> the Fetch phase, I am not getting all the urls in output that I seeded in
> input. Looks like only a few of them made it to the final output.
> So something is not working as expected if we use 0 reducers in the Fetch
> phase.
> 
> Regards
> Prateek
> 
> On Tue, Jul 21, 2020 at 2:13 PM prateek sachdeva <pr...@gmail.com>
> wrote:
> 
>> Makes complete sense. Agreed that 0 reducers in apache nutch fetcher won't
>> make sense because of tooling that's built around it.
>> Answering your questions - No, we have not made any changes to
>> FetcherOutputFormat. Infact, the whole fetcher and parse job is the same as
>> that of apache nutch 1.16(Fetcher.java and ParseSegment.java). We have
>> built wrappers around these classes to run using Azkaban (
>> https://azkaban.github.io/). And still it works if I assign 0 reducers in
>> the Fetch phase.
>>
>> Final clarification - If I do fetcher.store.content=true and
>> fetcher.parse=true, I don't need that Parse Job in my workflow and parsing
>> will be done as part of fetcher flow only?
>> Also, I agree with your point that if I modify FetcherOutputFormat to
>> include avro conversion step, I might get rid of that as well. This will
>> save some time for sure since Fetcher will be directly creating the final
>> avro format that I need. So the only question remains is that if I do
>> fetcher.parse=true, can I get rid of parse Job as a separate step
>> completely.
>>
>> Regards
>> Prateek
>>
>> On Tue, Jul 21, 2020 at 1:26 PM Sebastian Nagel
>> <wa...@googlemail.com.invalid> wrote:
>>
>>> Hi Prateek,
>>>
>>> (regarding 1.)
>>>
>>> It's also possible to combine fetcher.store.content=true and
>>> fetcher.parse=true.
>>> You might save some time unless the fetch job is CPU-bound - it usually
>>> is limited by network and RAM for buffering content.
>>>
>>>> which code are you referring to?
>>>
>>> Maybe it isn't "a lot". The SegmentReader is assuming map files, and
>>> there are probably
>>> some more tools which also do.  If nothing is used in your workflow,
>>> that's fine.
>>> But if a fetcher without the reduce step should become the default for
>>> Nutch, we'd
>>> need to take care for all tools and also ensure backward-compatibility.
>>>
>>>
>>>> FYI- I tried running with 0 reducers
>>>
>>> I assume you've also adapted FetcherOutputFormat ?
>>>
>>> Btw., you could think about inlining the "avroConversion" (or parts of
>>> it) into FetcherOutputFormat which also could remove the need to
>>> store the content.
>>>
>>> Best,
>>> Sebastian
>>>
>>>
>>> On 7/21/20 11:28 AM, prateek sachdeva wrote:
>>>> Hi Sebastian,
>>>>
>>>> Thanks for your reply. Couple of questions -
>>>>
>>>> 1. We have customized apache nutch jobs a bit like this. We have a
>>> separate parse job (ParseSegment.java) after fetch job (Fetcher.java). So
>>>> as suggested above, if I use fetcher.store.content=false, I am assuming
>>> the "content" folder will not be created and hence our parse job
>>>> won't work because it takes the content folder as an input file. Also,
>>> we have added an additional step "avroConversion" which takes input
>>>> as "parse_data", "parse_text", "content" and "crawl_fetch" and converts
>>> into a specific avro schema defined by us. So I think, I will end up
>>>> breaking a lot of things if I add fetcher.store.content=false and do
>>> parsing in the fetch phase only (fetcher.parse=true)
>>>>
>>>> 2. In your earlier email, you said "a lot of code accessing the
>>> segments still assumes map files", which code are you referring to? In my
>>>> use case above, we are not sending the crawled output to any indexers.
>>> In the avro conversion step, we just convert data into avro schema
>>>> and dump to HDFS. Do you think we still need reducers in the fetch
>>> phase? FYI- I tried running with 0 reducers and don't see any impact as
>>>> such.
>>>>
>>>> Appreciate your help.
>>>>
>>>> Regards
>>>> Prateek
>>>>
>>>> On Tue, Jul 21, 2020 at 9:06 AM Sebastian Nagel <
>>> wastl.nagel@googlemail.com.invalid> wrote:
>>>>
>>>>     Hi Prateek,
>>>>
>>>>     you're right there is no specific reducer used but without a reduce
>>> step
>>>>     the segment data isn't (re)partitioned and the data isn't sorted.
>>>>     This was a strong requirement once Nutch was a complete search
>>> engine
>>>>     and the "content" subdir of a segment was used as page cache.
>>>>     Getting the content from a segment is fast if the segment is
>>> partitioned
>>>>     in a predictable way (hash partitioning) and map files are used.
>>>>
>>>>     Well, this isn't a strong requirement anymore, since Nutch uses
>>> Solr,
>>>>     Elasticsearch or other index services. But a lot of code accessing
>>>>     the segments still assumes map files. Removing the reduce step from
>>>>     the fetcher would also mean a lot of work in code and tools
>>> accessing
>>>>     the segments, esp. to ensure backward compatibility.
>>>>
>>>>     Have you tried to run the fetcher with
>>>>      fetcher.parse=true
>>>>      fetcher.store.content=false ?
>>>>     This will save a lot of time and without the need to write the large
>>>>     raw content the reduce phase should be fast, only a small fraction
>>>>     (5-10%) of the fetcher map phase.
>>>>
>>>>     Best,
>>>>     Sebastian
>>>>
>>>>
>>>>     On 7/20/20 11:38 PM, prateek sachdeva wrote:
>>>>     > Hi Guys,
>>>>     >
>>>>     > As per Apache Nutch 1.16 Fetcher class implementation here -
>>>>     >
>>> https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java
>>> ,
>>>>     > this is a map only job. I don't see any reducer set in the Job.
>>> So my
>>>>     > question is why not set job.setNumreduceTasks(0) and save the
>>> time by
>>>>     > outputting directly to HDFS.
>>>>     >
>>>>     > Regards
>>>>     > Prateek
>>>>     >
>>>>
>>>
>>>
> 


Re: Apache Nutch 1.16 Fetcher reducers?

Posted by prateek sachdeva <pr...@gmail.com>.
Correcting my statement below. I just realized that if I use 0 reducers in
the Fetch phase, I am not getting all the URLs in the output that I seeded in
the input. It looks like only a few of them made it to the final output.
So something is not working as expected if we use 0 reducers in the Fetch
phase.

Regards
Prateek

On Tue, Jul 21, 2020 at 2:13 PM prateek sachdeva <pr...@gmail.com>
wrote:

> Makes complete sense. Agreed that 0 reducers in apache nutch fetcher won't
> make sense because of tooling that's built around it.
> Answering your questions - No, we have not made any changes to
> FetcherOutputFormat. Infact, the whole fetcher and parse job is the same as
> that of apache nutch 1.16(Fetcher.java and ParseSegment.java). We have
> built wrappers around these classes to run using Azkaban (
> https://azkaban.github.io/). And still it works if I assign 0 reducers in
> the Fetch phase.
>
> Final clarification - If I do fetcher.store.content=true and
> fetcher.parse=true, I don't need that Parse Job in my workflow and parsing
> will be done as part of fetcher flow only?
> Also, I agree with your point that if I modify FetcherOutputFormat to
> include avro conversion step, I might get rid of that as well. This will
> save some time for sure since Fetcher will be directly creating the final
> avro format that I need. So the only question remains is that if I do
> fetcher.parse=true, can I get rid of parse Job as a separate step
> completely.
>
> Regards
> Prateek
>
> On Tue, Jul 21, 2020 at 1:26 PM Sebastian Nagel
> <wa...@googlemail.com.invalid> wrote:
>
>> Hi Prateek,
>>
>> (regarding 1.)
>>
>> It's also possible to combine fetcher.store.content=true and
>> fetcher.parse=true.
>> You might save some time unless the fetch job is CPU-bound - it usually
>> is limited by network and RAM for buffering content.
>>
>> > which code are you referring to?
>>
>> Maybe it isn't "a lot". The SegmentReader is assuming map files, and
>> there are probably
>> some more tools which also do.  If nothing is used in your workflow,
>> that's fine.
>> But if a fetcher without the reduce step should become the default for
>> Nutch, we'd
>> need to take care for all tools and also ensure backward-compatibility.
>>
>>
>> > FYI- I tried running with 0 reducers
>>
>> I assume you've also adapted FetcherOutputFormat ?
>>
>> Btw., you could think about inlining the "avroConversion" (or parts of
>> it) into FetcherOutputFormat which also could remove the need to
>> store the content.
>>
>> Best,
>> Sebastian
>>
>>
>> On 7/21/20 11:28 AM, prateek sachdeva wrote:
>> > Hi Sebastian,
>> >
>> > Thanks for your reply. Couple of questions -
>> >
>> > 1. We have customized apache nutch jobs a bit like this. We have a
>> separate parse job (ParseSegment.java) after fetch job (Fetcher.java). So
>> > as suggested above, if I use fetcher.store.content=false, I am assuming
>> the "content" folder will not be created and hence our parse job
>> > won't work because it takes the content folder as an input file. Also,
>> we have added an additional step "avroConversion" which takes input
>> > as "parse_data", "parse_text", "content" and "crawl_fetch" and converts
>> into a specific avro schema defined by us. So I think, I will end up
>> > breaking a lot of things if I add fetcher.store.content=false and do
>> parsing in the fetch phase only (fetcher.parse=true)
>> >
>> > 2. In your earlier email, you said "a lot of code accessing the
>> segments still assumes map files", which code are you referring to? In my
>> > use case above, we are not sending the crawled output to any indexers.
>> In the avro conversion step, we just convert data into avro schema
>> > and dump to HDFS. Do you think we still need reducers in the fetch
>> phase? FYI- I tried running with 0 reducers and don't see any impact as
>> > such.
>> >
>> > Appreciate your help.
>> >
>> > Regards
>> > Prateek
>> >
>> > On Tue, Jul 21, 2020 at 9:06 AM Sebastian Nagel <
>> wastl.nagel@googlemail.com.invalid> wrote:
>> >
>> >     Hi Prateek,
>> >
>> >     you're right there is no specific reducer used but without a reduce
>> step
>> >     the segment data isn't (re)partitioned and the data isn't sorted.
>> >     This was a strong requirement once Nutch was a complete search
>> engine
>> >     and the "content" subdir of a segment was used as page cache.
>> >     Getting the content from a segment is fast if the segment is
>> partitioned
>> >     in a predictable way (hash partitioning) and map files are used.
>> >
>> >     Well, this isn't a strong requirement anymore, since Nutch uses
>> Solr,
>> >     Elasticsearch or other index services. But a lot of code accessing
>> >     the segments still assumes map files. Removing the reduce step from
>> >     the fetcher would also mean a lot of work in code and tools
>> accessing
>> >     the segments, esp. to ensure backward compatibility.
>> >
>> >     Have you tried to run the fetcher with
>> >      fetcher.parse=true
>> >      fetcher.store.content=false ?
>> >     This will save a lot of time and without the need to write the large
>> >     raw content the reduce phase should be fast, only a small fraction
>> >     (5-10%) of the fetcher map phase.
>> >
>> >     Best,
>> >     Sebastian
>> >
>> >
>> >     On 7/20/20 11:38 PM, prateek sachdeva wrote:
>> >     > Hi Guys,
>> >     >
>> >     > As per Apache Nutch 1.16 Fetcher class implementation here -
>> >     >
>> https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java
>> ,
>> >     > this is a map only job. I don't see any reducer set in the Job.
>> So my
>> >     > question is why not set job.setNumreduceTasks(0) and save the
>> time by
>> >     > outputting directly to HDFS.
>> >     >
>> >     > Regards
>> >     > Prateek
>> >     >
>> >
>>
>>

Re: Apache Nutch 1.16 Fetcher reducers?

Posted by prateek sachdeva <pr...@gmail.com>.
Makes complete sense. Agreed that 0 reducers in the Apache Nutch fetcher won't
make sense because of the tooling that's built around it.
Answering your questions - No, we have not made any changes to
FetcherOutputFormat. In fact, the whole fetcher and parse job is the same as
that of Apache Nutch 1.16 (Fetcher.java and ParseSegment.java). We have
built wrappers around these classes to run using Azkaban (
https://azkaban.github.io/). And it still works if I assign 0 reducers in
the Fetch phase.

Final clarification - If I do fetcher.store.content=true and
fetcher.parse=true, I don't need that Parse Job in my workflow and parsing
will be done as part of the fetcher flow only?
Also, I agree with your point that if I modify FetcherOutputFormat to
include the avro conversion step, I might get rid of that as well. This will
save some time for sure since the Fetcher will be directly creating the final
avro format that I need. So the only question that remains is whether, if I do
fetcher.parse=true, I can get rid of the parse Job as a separate step
completely.
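Schematically, "inlining the conversion" means the output format applies the
record transformation as each record is written, instead of a separate job
re-reading the segment afterwards. A plain-Python illustration (the real
change would be a custom FetcherOutputFormat writing Avro in Java; the
function and field names below are hypothetical, and JSON lines stand in for
the Avro container):

```python
import json

# Illustration: apply the schema conversion at write time, so no
# separate "avroConversion" job has to re-read crawl_fetch and the
# parse output. Field names here are made up for the example.

def to_target_schema(url, fetch_record, parse_record):
    # Hypothetical target schema combining fetch and parse data.
    return {"url": url,
            "status": fetch_record["status"],
            "text": parse_record["text"]}

def write_records(records, out_lines):
    for url, fetch_rec, parse_rec in records:
        out_lines.append(json.dumps(to_target_schema(url, fetch_rec, parse_rec)))

lines = []
write_records([("http://a.example/", {"status": 200}, {"text": "hello"})], lines)
print(lines[0])
```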

Regards
Prateek

On Tue, Jul 21, 2020 at 1:26 PM Sebastian Nagel
<wa...@googlemail.com.invalid> wrote:

> Hi Prateek,
>
> (regarding 1.)
>
> It's also possible to combine fetcher.store.content=true and
> fetcher.parse=true.
> You might save some time unless the fetch job is CPU-bound - it usually is
> limited by network and RAM for buffering content.
>
> > which code are you referring to?
>
> Maybe it isn't "a lot". The SegmentReader is assuming map files, and there
> are probably
> some more tools which also do.  If nothing is used in your workflow,
> that's fine.
> But if a fetcher without the reduce step should become the default for
> Nutch, we'd
> need to take care for all tools and also ensure backward-compatibility.
>
>
> > FYI- I tried running with 0 reducers
>
> I assume you've also adapted FetcherOutputFormat ?
>
> Btw., you could think about inlining the "avroConversion" (or parts of it)
> into FetcherOutputFormat which also could remove the need to
> store the content.
>
> Best,
> Sebastian
>
>
> On 7/21/20 11:28 AM, prateek sachdeva wrote:
> > Hi Sebastian,
> >
> > Thanks for your reply. Couple of questions -
> >
> > 1. We have customized apache nutch jobs a bit like this. We have a
> separate parse job (ParseSegment.java) after fetch job (Fetcher.java). So
> > as suggested above, if I use fetcher.store.content=false, I am assuming
> the "content" folder will not be created and hence our parse job
> > won't work because it takes the content folder as an input file. Also,
> we have added an additional step "avroConversion" which takes input
> > as "parse_data", "parse_text", "content" and "crawl_fetch" and converts
> into a specific avro schema defined by us. So I think, I will end up
> > breaking a lot of things if I add fetcher.store.content=false and do
> parsing in the fetch phase only (fetcher.parse=true)
> >
> >
> > 2. In your earlier email, you said "a lot of code accessing the segments
> still assumes map files", which code are you referring to? In my
> > use case above, we are not sending the crawled output to any indexers.
> In the avro conversion step, we just convert data into avro schema
> > and dump to HDFS. Do you think we still need reducers in the fetch
> phase? FYI- I tried running with 0 reducers and don't see any impact as
> > such.
> >
> > Appreciate your help.
> >
> > Regards
> > Prateek
> >
> > On Tue, Jul 21, 2020 at 9:06 AM Sebastian Nagel <
> wastl.nagel@googlemail.com.invalid> wrote:
> >
> >     Hi Prateek,
> >
> >     you're right there is no specific reducer used but without a reduce
> step
> >     the segment data isn't (re)partitioned and the data isn't sorted.
> >     This was a strong requirement once Nutch was a complete search engine
> >     and the "content" subdir of a segment was used as page cache.
> >     Getting the content from a segment is fast if the segment is
> partitioned
> >     in a predictable way (hash partitioning) and map files are used.
> >
> >     Well, this isn't a strong requirement anymore, since Nutch uses Solr,
> >     Elasticsearch or other index services. But a lot of code accessing
> >     the segments still assumes map files. Removing the reduce step from
> >     the fetcher would also mean a lot of work in code and tools accessing
> >     the segments, esp. to ensure backward compatibility.
> >
> >     Have you tried to run the fetcher with
> >      fetcher.parse=true
> >      fetcher.store.content=false ?
> >     This will save a lot of time and without the need to write the large
> >     raw content the reduce phase should be fast, only a small fraction
> >     (5-10%) of the fetcher map phase.
> >
> >     Best,
> >     Sebastian
> >
> >
> >     On 7/20/20 11:38 PM, prateek sachdeva wrote:
> >     > Hi Guys,
> >     >
> >     > As per Apache Nutch 1.16 Fetcher class implementation here -
> >     >
> https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java
> ,
> >     > this is a map only job. I don't see any reducer set in the Job. So
> my
> >     > question is why not set job.setNumreduceTasks(0) and save the time
> by
> >     > outputting directly to HDFS.
> >     >
> >     > Regards
> >     > Prateek
> >     >
> >
>
>

Re: Apache Nutch 1.16 Fetcher reducers?

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi Prateek,

(regarding 1.)

It's also possible to combine fetcher.store.content=true and fetcher.parse=true.
You might save some time unless the fetch job is CPU-bound; usually it is limited by network bandwidth and by the RAM needed for buffering content.

> which code are you referring to?

Maybe it isn't "a lot". The SegmentReader assumes map files, and there are probably
some more tools that do as well. If none of them are used in your workflow, that's fine.
But if a fetcher without the reduce step should become the default for Nutch, we'd
need to take care of all the tools and also ensure backward compatibility.


> FYI- I tried running with 0 reducers

I assume you've also adapted FetcherOutputFormat?

Btw., you could think about inlining the "avroConversion" (or parts of it) into FetcherOutputFormat which also could remove the need to
store the content.
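
A minimal, Hadoop-free sketch of that idea (all class and field names here
are hypothetical; the real implementation would extend Nutch's
FetcherOutputFormat and build Avro GenericRecords against your schema):

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch only, not the real Hadoop/Nutch API: each fetched
// record is converted to the target (avro-style) form at write time,
// removing the need to store raw content for a later conversion job.
public class ConvertingWriterSketch {

    // Stand-in for one fetched page.
    static class FetchedPage {
        final String url;
        final byte[] rawContent;
        final String parseText;
        FetchedPage(String url, byte[] rawContent, String parseText) {
            this.url = url;
            this.rawContent = rawContent;
            this.parseText = parseText;
        }
    }

    // Writer that performs the conversion inline: only the converted
    // representation reaches the sink; the raw content is dropped here.
    static class AvroConvertingWriter {
        final List<String> sink = new ArrayList<>(); // stand-in for the HDFS output

        void write(String key, FetchedPage page) {
            // Real code would append an Avro GenericRecord matching the
            // custom schema; a tab-separated line keeps the sketch runnable.
            sink.add(page.url + "\t" + page.parseText);
        }
    }

    public static void main(String[] args) {
        AvroConvertingWriter writer = new AvroConvertingWriter();
        writer.write("https://example.org/",
                new FetchedPage("https://example.org/", new byte[] {1, 2, 3}, "hello world"));
        System.out.println(writer.sink.get(0));
    }
}
```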

Best,
Sebastian


On 7/21/20 11:28 AM, prateek sachdeva wrote:
> Hi Sebastian,
> 
> Thanks for your reply. Couple of questions -
> 
> 1. We have customized apache nutch jobs a bit like this. We have a separate parse job (ParseSegment.java) after fetch job (Fetcher.java). So
> as suggested above, if I use fetcher.store.content=false, I am assuming the "content" folder will not be created and hence our parse job
> won't work because it takes the content folder as an input file. Also, we have added an additional step "avroConversion" which takes input
> as "parse_data", "parse_text", "content" and "crawl_fetch" and converts into a specific avro schema defined by us. So I think, I will end up
> breaking a lot of things if I add fetcher.store.content=false and do parsing in the fetch phase only (fetcher.parse=true)
> 
> 
> 2. In your earlier email, you said "a lot of code accessing the segments still assumes map files", which code are you referring to? In my
> use case above, we are not sending the crawled output to any indexers. In the avro conversion step, we just convert data into avro schema
> and dump to HDFS. Do you think we still need reducers in the fetch phase? FYI- I tried running with 0 reducers and don't see any impact as
> such.
> 
> Appreciate your help.
> 
> Regards
> Prateek
> 
> On Tue, Jul 21, 2020 at 9:06 AM Sebastian Nagel <wa...@googlemail.com.invalid> wrote:
> 
>     Hi Prateek,
> 
>     you're right there is no specific reducer used but without a reduce step
>     the segment data isn't (re)partitioned and the data isn't sorted.
>     This was a strong requirement once Nutch was a complete search engine
>     and the "content" subdir of a segment was used as page cache.
>     Getting the content from a segment is fast if the segment is partitioned
>     in a predictable way (hash partitioning) and map files are used.
> 
>     Well, this isn't a strong requirement anymore, since Nutch uses Solr,
>     Elasticsearch or other index services. But a lot of code accessing
>     the segments still assumes map files. Removing the reduce step from
>     the fetcher would also mean a lot of work in code and tools accessing
>     the segments, esp. to ensure backward compatibility.
> 
>     Have you tried to run the fetcher with
>      fetcher.parse=true
>      fetcher.store.content=false ?
>     This will save a lot of time and without the need to write the large
>     raw content the reduce phase should be fast, only a small fraction
>     (5-10%) of the fetcher map phase.
> 
>     Best,
>     Sebastian
> 
> 
>     On 7/20/20 11:38 PM, prateek sachdeva wrote:
>     > Hi Guys,
>     >
>     > As per Apache Nutch 1.16 Fetcher class implementation here -
>     > https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java,
>     > this is a map only job. I don't see any reducer set in the Job. So my
>     > question is why not set job.setNumreduceTasks(0) and save the time by
>     > outputting directly to HDFS.
>     >
>     > Regards
>     > Prateek
>     >
> 


Re: Apache Nutch 1.16 Fetcher reducers?

Posted by prateek sachdeva <pr...@gmail.com>.
Hi Sebastian,

Thanks for your reply. Couple of questions -

1. We have customized the Apache Nutch jobs a bit: we run a separate
parse job (ParseSegment.java) after the fetch job (Fetcher.java). So, as
suggested above, if I use fetcher.store.content=false, I am assuming the
"content" folder will not be created and hence our parse job won't work,
because it takes the content folder as input. Also, we have added an
additional step, "avroConversion", which takes "parse_data", "parse_text",
"content" and "crawl_fetch" as input and converts them into a specific avro
schema defined by us. So I think I will end up breaking a lot of things if
I add fetcher.store.content=false and do parsing in the fetch phase only
(fetcher.parse=true).


2. In your earlier email you said "a lot of code accessing the segments
still assumes map files"; which code are you referring to? In my use case
above, we are not sending the crawled output to any indexers. In the avro
conversion step, we just convert the data into our avro schema and dump it
to HDFS. Do you think we still need reducers in the fetch phase? FYI, I
tried running with 0 reducers and don't see any impact as such.

Appreciate your help.

Regards
Prateek

On Tue, Jul 21, 2020 at 9:06 AM Sebastian Nagel
<wa...@googlemail.com.invalid> wrote:

> Hi Prateek,
>
> you're right there is no specific reducer used but without a reduce step
> the segment data isn't (re)partitioned and the data isn't sorted.
> This was a strong requirement once Nutch was a complete search engine
> and the "content" subdir of a segment was used as page cache.
> Getting the content from a segment is fast if the segment is partitioned
> in a predictable way (hash partitioning) and map files are used.
>
> Well, this isn't a strong requirement anymore, since Nutch uses Solr,
> Elasticsearch or other index services. But a lot of code accessing
> the segments still assumes map files. Removing the reduce step from
> the fetcher would also mean a lot of work in code and tools accessing
> the segments, esp. to ensure backward compatibility.
>
> Have you tried to run the fetcher with
>  fetcher.parse=true
>  fetcher.store.content=false ?
> This will save a lot of time and without the need to write the large
> raw content the reduce phase should be fast, only a small fraction
> (5-10%) of the fetcher map phase.
>
> Best,
> Sebastian
>
>
> On 7/20/20 11:38 PM, prateek sachdeva wrote:
> > Hi Guys,
> >
> > As per Apache Nutch 1.16 Fetcher class implementation here -
> >
> https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java
> ,
> > this is a map only job. I don't see any reducer set in the Job. So my
> > question is why not set job.setNumreduceTasks(0) and save the time by
> > outputting directly to HDFS.
> >
> > Regards
> > Prateek
> >
>
>

Re: Apache Nutch 1.16 Fetcher reducers?

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi Prateek,

You're right, there is no specific reducer set, but without a reduce step
the segment data isn't (re)partitioned and isn't sorted.
This was a strong requirement back when Nutch was a complete search engine
and the "content" subdir of a segment was used as a page cache.
Getting the content out of a segment is fast if the segment is partitioned
in a predictable way (hash partitioning) and map files are used.
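
The partitioning idea can be sketched with the formula Hadoop's default
HashPartitioner uses (standalone illustration, not Nutch code):

```java
// Hash partitioning as done by Hadoop's default HashPartitioner: the
// same key always maps to the same partition, so a reader can compute
// which part file holds a URL instead of scanning all of them.
public class HashPartitionSketch {

    static int getPartition(String key, int numPartitions) {
        // Mask off the sign bit so the result is non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String url = "https://example.org/page";
        // Deterministic: the writer at fetch time and a later reader
        // agree on the partition without any extra index.
        System.out.println(getPartition(url, 8) == getPartition(url, 8));
        System.out.println(getPartition(url, 8));
    }
}
```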

Well, this isn't a strong requirement anymore, since Nutch uses Solr,
Elasticsearch or other index services. But a lot of code accessing
the segments still assumes map files. Removing the reduce step from
the fetcher would also mean a lot of work in code and tools accessing
the segments, esp. to ensure backward compatibility.

Have you tried to run the fetcher with
 fetcher.parse=true
 fetcher.store.content=false ?
This will save a lot of time and without the need to write the large
raw content the reduce phase should be fast, only a small fraction
(5-10%) of the fetcher map phase.
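
For reference, those two switches go into conf/nutch-site.xml in the usual
property format:

```xml
<!-- conf/nutch-site.xml: parse inside the fetcher, skip storing raw content -->
<property>
  <name>fetcher.parse</name>
  <value>true</value>
</property>
<property>
  <name>fetcher.store.content</name>
  <value>false</value>
</property>
```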

Best,
Sebastian


On 7/20/20 11:38 PM, prateek sachdeva wrote:
> Hi Guys,
> 
> As per Apache Nutch 1.16 Fetcher class implementation here -
> https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java,
> this is a map only job. I don't see any reducer set in the Job. So my
> question is why not set job.setNumreduceTasks(0) and save the time by
> outputting directly to HDFS.
> 
> Regards
> Prateek
>