Posted to user@nutch.apache.org by Cam Bazz <ca...@gmail.com> on 2011/07/08 18:06:18 UTC
skipping invalid segments
Hello,
I tried to crawl manually, only a list of urls. I have issued the
following commands:
bin/nutch inject /home/crawl/crawldb /home/urls
bin/nutch generate /home/crawl/crawldb /home/crawl/segments
bin/nutch fetch /home/crawl/segments/123456789
bin/nutch updatedb /home/crawl/crawldb /home/crawl/segments/123456789 -noAdditions
However, the last command skips segment 123456789, saying it is an
invalid segment.
The -noAdditions flag is exactly what I need, but updatedb will not
run. What might I have done wrong?
Best Regards,
-C.B.
Re: skipping invalid segments
Posted by Cam Bazz <ca...@gmail.com>.
Hello,
It appears that I omitted -dir when writing my previous message,
although I did type -dir in my console.
However, I have found out that I need to run bin/nutch parse
/home/crawl/segments/12345 before updating the crawldb.
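For anyone hitting the same message, here is a minimal sketch of a guard to run before updatedb. It assumes, as appears to be the case in Nutch 1.x, that updatedb skips any segment lacking either the crawl_fetch or the crawl_parse subdirectory. The function name segment_ready_for_updatedb is hypothetical and not part of Nutch.

```shell
# Hypothetical helper (not part of Nutch): succeeds only if the segment
# directory contains both crawl_fetch and crawl_parse.
# Assumption: updatedb treats a segment without both as "invalid".
segment_ready_for_updatedb() {
  seg="$1"
  [ -d "$seg/crawl_fetch" ] && [ -d "$seg/crawl_parse" ]
}

# Example use (paths are the ones from this thread):
#   seg=/home/crawl/segments/123456789
#   segment_ready_for_updatedb "$seg" || bin/nutch parse "$seg"
#   bin/nutch updatedb /home/crawl/crawldb "$seg" -noAdditions
```

The check only looks at the local filesystem; for a crawl stored on HDFS the same test would have to go through the Hadoop filesystem tools instead.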
By the way, what exactly is a segment, and how is data stored under
it? I think it is a Hadoop format.
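As a rough illustration of the question above: to my understanding, a Nutch 1.x segment is a timestamped directory whose subdirectories are written by the generate, fetch and parse steps and hold Hadoop SequenceFile/MapFile data. The sketch below reports which of the usual subdirectories exist. The subdirectory names are the ones I believe Nutch 1.x uses, and describe_segment is a hypothetical helper, not a Nutch command.

```shell
# Hypothetical helper (not part of Nutch): report which of the usual
# Nutch 1.x segment subdirectories are present.
# Assumed layout: crawl_generate (from generate), crawl_fetch and content
# (from fetch), crawl_parse, parse_data and parse_text (from parse).
describe_segment() {
  seg="$1"
  for part in crawl_generate crawl_fetch content crawl_parse parse_data parse_text; do
    if [ -d "$seg/$part" ]; then
      echo "$part: present"
    else
      echo "$part: missing"
    fi
  done
}

# Example use:
#   describe_segment /home/crawl/segments/123456789
```

To look at the records themselves rather than the directory layout, bin/nutch readseg (for example with -list or -dump) should be the tool to try.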
Best Regards,
-C.B.
On Fri, Jul 8, 2011 at 11:00 PM, lewis john mcgibbney
<le...@gmail.com> wrote:
> Hi C.B.,
>
> It looks like you may have simply missed the '-dir' when you were specifying
> your crawldb directory to be updated from the fetched segment. Have a look
> here [1]
>
> Can you please try that and post your results?
>
> [1] http://wiki.apache.org/nutch/bin/nutch_updatedb
>
> --
> *Lewis*
>
Re: skipping invalid segments
Posted by lewis john mcgibbney <le...@gmail.com>.
Hi C.B.,
It looks like you may have simply missed the '-dir' when you were specifying
your crawldb directory to be updated from the fetched segment. Have a look
here [1]
Can you please try that and post your results?
[1] http://wiki.apache.org/nutch/bin/nutch_updatedb
--
*Lewis*