Posted to dev@nutch.apache.org by Michael Chen <yi...@u.northwestern.edu> on 2017/07/31 19:21:39 UTC
Question on 2.x sitemap functionality
Dear fellow Nutch developers,
I've been trying to use Nutch 2's sitemap functionality to crawl and index all
pages listed in sitemap indices. Integration with the crawler-commons sitemap
tools seems to exist only in the 2.x branch. But after I got it working with
HBase 1.2.3, it didn't fetch, parse, or index the sitemap indices and sitemaps
at all.
I also looked into the code a bit, and everything seems to make sense,
except that I couldn't trace the data flow beyond ToolRunner.run() in
the FetchReducer. I'm testing it on Linux with the "crawl" script in
/bin, so I'm not sure how I can debug this. Please let me know if
there's any further information I can provide to help troubleshoot this
issue. Thanks in advance!
Best regards,
Michael
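For context, a sketch of how the 2.x crawl script is typically invoked; the seed directory, crawl id, and round count below are example values, and the usage string printed by bin/crawl in your own checkout is authoritative:

```shell
# Nutch 2.x crawl script: one-shot driver for the
# inject/generate/fetch/parse/updatedb cycle.
# Usage (2.x): bin/crawl <seedDir> <crawlId> <numberOfRounds>
bin/crawl urls/ sitemapTest 2
```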
Re: Question on 2.x sitemap functionality
Posted by kenneth mcfarland <ke...@gmail.com>.
Please know the inquiry is simply to understand how I and others can
document the code better. Thank you for your response.
Kenneth
Re: Question on 2.x sitemap functionality
Posted by Michael Chen <yi...@u.northwestern.edu>.
Hi Kenneth,
Thanks for following up! Besides the fact that there is almost no
javadoc available for the sitemap classes and many of the main job
classes... I was mainly relying on the GSOC project page and the
lifecycle PDF as references. The Nutch 2 lifecycle PDF says that sitemap
detection is done during injection, but I found it actually happens
during fetching, enabled with the -stmDetect flag. Reading the code also
confirms that fetch is the only phase that uses the crawler-commons
sitemap features. In addition, the sitemap feature wiki page contains
only a link to the GSOC project for Nutch 2.x, which is what I'm using.
Specifically, I'm running Nutch 2.x on Ubuntu 16.04, after failing to
get it working on Windows (Hadoop native binary problems, despite
extensive troubleshooting). Let me know if there's any additional
information I can provide.
I completely understand that documentation for a community project can
be difficult, and I'll be more than happy to add or fix some if I can.
But right now I'm still trying to verify or falsify some of the claims
in the documentation...
Thanks!
Michael
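To make the flag concrete: assuming the -stmDetect spelling above and the stock 2.x FetcherJob options, sitemap detection would be enabled from the fetch step with something like the following (the crawl id is an example, and the exact option set may differ in your build):

```shell
# Fetch all generated batches, asking the fetcher to also detect
# sitemap URLs (typically advertised via robots.txt) as it goes;
# -stmDetect comes from the GSOC sitemap work on the 2.x branch.
bin/nutch fetch -all -crawlId sitemapTest -stmDetect
```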
Re: Question on 2.x sitemap functionality
Posted by kenneth mcfarland <ke...@gmail.com>.
Can you please be more specific about your environment and about what
you have found to be out of date?
Re: Question on 2.x sitemap functionality
Posted by Michael Chen <yi...@u.northwestern.edu>.
Problem resolved. The crawl script and the web documentation are out of
date; the nutch script works fine.
It might be a good idea to update the sitemap-related documentation at
some point... it takes quite a bit of speculation and experimentation
right now...
Thanks!
Michael
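For anyone who hits the same wall: since the nutch script works where the crawl script does not, one round of the 2.x lifecycle can be driven step by step with bin/nutch directly. The commands below are a sketch; the crawl id is an example value and the sitemap flag is the one mentioned earlier in the thread:

```shell
# One crawl round via bin/nutch instead of the out-of-date bin/crawl.
bin/nutch inject urls/ -crawlId sitemapTest           # seed the webtable
bin/nutch generate -topN 1000 -crawlId sitemapTest    # select a fetch batch
bin/nutch fetch -all -crawlId sitemapTest -stmDetect  # fetch, detecting sitemaps
bin/nutch parse -all -crawlId sitemapTest             # parse fetched content
bin/nutch updatedb -all -crawlId sitemapTest          # fold results back in
```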