Posted to dev@nutch.apache.org by Michael Chen <yi...@u.northwestern.edu> on 2017/07/31 19:21:39 UTC

Question on 2.x sitemap functionality

Dear fellow Nutch developers,

I've been trying to use the Nutch 2 sitemap functionality to crawl and
index all pages listed in sitemap indices. It seems that integration with
the CommonCrawler sitemap tools only exists in the 2.x branch. But after I
got it working with HBase 1.2.3, it didn't fetch, parse, or index the
sitemap indices and sitemaps at all.

I also looked into the code a bit and everything seems to make sense,
except that I couldn't trace the data flow any further than ToolRunner.run()
in the FetchReducer. I'm testing it on Linux with the "crawl" script in
/bin, so I'm not sure how I can debug this. Please let me know if there's
any further information I can provide to help troubleshoot this issue.
Thanks in advance!
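
In case it helps, the rough shape of the invocation I'm testing is below;
the seed directory, crawl id, and number of rounds are placeholders rather
than my exact values:

  # placeholder invocation of the bundled crawl script from runtime/local
  # (seed dir, crawl id, and rounds are illustrative only)
  bin/crawl urls/ sitemapTest 2

  # in local mode the per-phase job output is appended to logs/hadoop.log
  tail -f logs/hadoop.log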

Best regards,

Michael



Re: Question on 2.x sitemap functionality

Posted by kenneth mcfarland <ke...@gmail.com>.
Please know the inquiry is simply to understand how I and others can
document the code better. Thank you for your response.

Kenneth

On Aug 1, 2017 5:45 PM, "Michael Chen" <yi...@u.northwestern.edu>
wrote:

> Hi Kenneth,
>
> Thanks for following up! Besides the fact that there is almost no Javadoc
> available for the sitemap classes and for a lot of the main job classes... I
> was mainly using the GSoC project page and the lifecycle PDF as reference.
> The Nutch 2 lifecycle PDF says that sitemap detection is done during
> injection, but I just found it to happen within fetching, via the -stmDetect
> flag. Looking at the code also confirms that fetch is the only process that
> uses the CommonCrawler sitemap features. In addition, the sitemap feature
> wiki page contains only a link to the GSoC project for Nutch 2.x, which is
> what I'm using.
>
> Specifically, I'm running Nutch 2.x on Ubuntu 16.04 after failing to get it
> working on Windows (Hadoop binary file-related problems, despite extensive
> troubleshooting). Let me know if there's any additional information I can
> provide.
>
> I completely understand that documentation for a community project can be
> difficult, and I'll be more than happy to add/fix some if I can. But right
> now I'm still trying to verify/falsify some of the claims in the
> documentation...
>
> Thanks!
>
> Michael
>
> On 08/01/2017 05:30 PM, kenneth mcfarland wrote:
>
> Can you please be more specific about your environment and what you have
> found to be out of date?
>
> On Aug 1, 2017 5:28 PM, "Michael Chen" <yi...@u.northwestern.edu>
> wrote:
>
>> Problem resolved. The crawl script and web documentation are out of date.
>> The nutch script works fine.
>>
>> Might be a good idea to update the sitemap-related documentation at some
>> point... it takes quite a bit of speculation and experimentation right now...
>>
>> Thanks!
>>
>> Michael
>>
>>
>> On 07/31/2017 12:21 PM, Michael Chen wrote:
>>
>>> Dear fellow Nutch developers,
>>>
>>> I've been trying to use the Nutch 2 sitemap functionality to crawl and
>>> index all pages listed in sitemap indices. It seems that integration with
>>> the CommonCrawler sitemap tools only exists in the 2.x branch. But after I
>>> got it working with HBase 1.2.3, it didn't fetch, parse, or index the
>>> sitemap indices and sitemaps at all.
>>>
>>> I also looked into the code a bit and everything seems to make sense,
>>> except that I couldn't trace the data flow any further than
>>> ToolRunner.run() in the FetchReducer. I'm testing it on Linux with the
>>> "crawl" script in /bin, so I'm not sure how I can debug this. Please let
>>> me know if there's any further information I can provide to help
>>> troubleshoot this issue. Thanks in advance!
>>>
>>> Best regards,
>>>
>>> Michael
>>>
>>>
>>>
>>
>

Re: Question on 2.x sitemap functionality

Posted by Michael Chen <yi...@u.northwestern.edu>.
Hi Kenneth,

Thanks for following up! Besides the fact that there is almost no
Javadoc available for the sitemap classes and for a lot of the main job
classes... I was mainly using the GSoC project page and the lifecycle PDF
as reference. The Nutch 2 lifecycle PDF says that sitemap detection is
done during injection, but I just found it to happen within fetching, via
the -stmDetect flag. Looking at the code also confirms that fetch is the
only process that uses the CommonCrawler sitemap features. In addition,
the sitemap feature wiki page contains only a link to the GSoC project
for Nutch 2.x, which is what I'm using.
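
To make that concrete, the kind of fetch invocation this seems to imply
looks roughly like the one below; the crawl id and thread count are
placeholders, and I haven't confirmed that all of these options compose,
so treat it as a sketch:

  # fetch a generated batch and ask the fetcher to detect sitemaps
  # (crawl id and thread count are placeholders)
  bin/nutch fetch -all -crawlId sitemapTest -threads 10 -stmDetect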

Specifically, I'm running Nutch 2.x on Ubuntu 16.04 after failing to get
it working on Windows (Hadoop binary file-related problems, despite
extensive troubleshooting). Let me know if there's any additional
information I can provide.

I completely understand that documentation for a community project can 
be difficult, and I'll be more than happy to add/fix some if I can. But 
right now I'm still trying to verify/falsify some of the claims in the 
documentation...

Thanks!

Michael


On 08/01/2017 05:30 PM, kenneth mcfarland wrote:
> Can you please be more specific about your environment and what you
> have found to be out of date?
>
> On Aug 1, 2017 5:28 PM, "Michael Chen"
> <yiningchen2020@u.northwestern.edu> wrote:
>
>     Problem resolved. The crawl script and web documentation are out
>     of date. The nutch script works fine.
>
>     Might be a good idea to update the sitemap-related documentation at
>     some point... it takes quite a bit of speculation and experimentation
>     right now...
>
>     Thanks!
>
>     Michael
>
>
>     On 07/31/2017 12:21 PM, Michael Chen wrote:
>
>         Dear fellow Nutch developers,
>
>         I've been trying to use the Nutch 2 sitemap functionality to
>         crawl and index all pages listed in sitemap indices. It seems
>         that integration with the CommonCrawler sitemap tools only
>         exists in the 2.x branch. But after I got it working with HBase
>         1.2.3, it didn't fetch, parse, or index the sitemap indices and
>         sitemaps at all.
>
>         I also looked into the code a bit and everything seems to make
>         sense, except that I couldn't trace the data flow any further
>         than ToolRunner.run() in the FetchReducer. I'm testing it on
>         Linux with the "crawl" script in /bin, so I'm not sure how I can
>         debug this. Please let me know if there's any further
>         information I can provide to help troubleshoot this issue.
>         Thanks in advance!
>
>         Best regards,
>
>         Michael
>
>
>


Re: Question on 2.x sitemap functionality

Posted by kenneth mcfarland <ke...@gmail.com>.
Can you please be more specific about your environment and what you have
found to be out of date?

On Aug 1, 2017 5:28 PM, "Michael Chen" <yi...@u.northwestern.edu>
wrote:

> Problem resolved. The crawl script and web documentation are out of date.
> The nutch script works fine.
>
> Might be a good idea to update the sitemap-related documentation at some
> point... it takes quite a bit of speculation and experimentation right now...
>
> Thanks!
>
> Michael
>
>
> On 07/31/2017 12:21 PM, Michael Chen wrote:
>
>> Dear fellow Nutch developers,
>>
>> I've been trying to use the Nutch 2 sitemap functionality to crawl and
>> index all pages listed in sitemap indices. It seems that integration with
>> the CommonCrawler sitemap tools only exists in the 2.x branch. But after I
>> got it working with HBase 1.2.3, it didn't fetch, parse, or index the
>> sitemap indices and sitemaps at all.
>>
>> I also looked into the code a bit and everything seems to make sense,
>> except that I couldn't trace the data flow any further than
>> ToolRunner.run() in the FetchReducer. I'm testing it on Linux with the
>> "crawl" script in /bin, so I'm not sure how I can debug this. Please let
>> me know if there's any further information I can provide to help
>> troubleshoot this issue. Thanks in advance!
>>
>> Best regards,
>>
>> Michael
>>
>>
>>
>

Re: Question on 2.x sitemap functionality

Posted by Michael Chen <yi...@u.northwestern.edu>.
Problem resolved. The crawl script and web documentation are out of
date. The nutch script works fine.
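
Concretely, the route that works is driving the individual phases through
bin/nutch rather than bin/crawl. The sequence below is only a rough sketch
(seed directory, crawl id, and options are placeholders, and the indexing
step is omitted), so check each command's usage output for the exact flags
in your version:

  # rough sketch of running the 2.x phases directly via bin/nutch
  # (seed dir and crawl id are placeholders; indexing step omitted)
  bin/nutch inject urls/ -crawlId sitemapTest
  bin/nutch generate -topN 1000 -crawlId sitemapTest
  bin/nutch fetch -all -crawlId sitemapTest -stmDetect
  bin/nutch parse -all -crawlId sitemapTest
  bin/nutch updatedb -all -crawlId sitemapTest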

Might be a good idea to update the sitemap-related documentation at some
point... it takes quite a bit of speculation and experimentation right now...

Thanks!

Michael


On 07/31/2017 12:21 PM, Michael Chen wrote:
> Dear fellow Nutch developers,
>
> I've been trying to use the Nutch 2 sitemap functionality to crawl and
> index all pages listed in sitemap indices. It seems that integration with
> the CommonCrawler sitemap tools only exists in the 2.x branch. But after
> I got it working with HBase 1.2.3, it didn't fetch, parse, or index the
> sitemap indices and sitemaps at all.
>
> I also looked into the code a bit and everything seems to make sense,
> except that I couldn't trace the data flow any further than
> ToolRunner.run() in the FetchReducer. I'm testing it on Linux with the
> "crawl" script in /bin, so I'm not sure how I can debug this. Please let
> me know if there's any further information I can provide to help
> troubleshoot this issue. Thanks in advance!
>
> Best regards,
>
> Michael
>
>