Posted to user@nutch.apache.org by Cam Bazz <ca...@gmail.com> on 2011/07/08 18:06:18 UTC

skipping invalid segments

Hello,

I am trying to crawl only a fixed list of URLs by running each step
manually. I issued the following commands:

bin/nutch inject /home/crawl/crawldb /home/urls

bin/nutch generate /home/crawl/crawldb /home/crawl/segments

bin/nutch fetch /home/crawl/segments/123456789

bin/nutch updatedb /home/crawl/crawldb /home/crawl/segments/123456789 -noAdditions

However, the last command skips segment 123456789, reporting that it
is an invalid segment.

The -noAdditions flag is exactly what I need, but updatedb will not
process the segment. What might I have done wrong?

Best Regards,
-C.B.

Re: skipping invalid segments

Posted by Cam Bazz <ca...@gmail.com>.

Hello,

It appears that I omitted -dir when writing my previous message,
although I had actually typed -dir in my console.

However, I have found out that I need to run bin/nutch parse
/home/crawl/segments/12345 before updating the db.
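
For reference, the per-segment steps from this thread, with the parse
step added, can be sketched as below. This is a minimal sketch, not a
definitive recipe: the paths are the ones from the first message, and
run, process_segment and DRY_RUN are hypothetical names introduced here
so that the commands are only printed by default.

```shell
#!/bin/sh
# Sketch of the per-segment part of one manual crawl cycle.
# Paths come from the thread; run(), process_segment() and DRY_RUN
# are illustrative helpers, not part of Nutch.
CRAWLDB=/home/crawl/crawldb
DRY_RUN="${DRY_RUN:-1}"

# Print the command when DRY_RUN=1, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Fetch, parse, then update the crawldb from a single segment.
process_segment() {
  segment="$1"
  run bin/nutch fetch "$segment"
  run bin/nutch parse "$segment"
  run bin/nutch updatedb "$CRAWLDB" "$segment" -noAdditions
}

process_segment /home/crawl/segments/123456789
```

With DRY_RUN=0 the same function would execute the three commands in
order, so updatedb only runs once the segment has been parsed.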

By the way, what exactly is a segment, and how is data stored under
it? I think it is a Hadoop format.
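
On the segment question: as far as I know, a Nutch 1.x segment is a
directory of Hadoop-format files split into subdirectories such as
crawl_generate, crawl_fetch, content, crawl_parse, parse_data and
parse_text, and updatedb reads crawl_fetch and crawl_parse. That would
explain why an unparsed segment is reported as invalid. The sketch
below checks for those two subdirectories; segment_ready is a
hypothetical helper name, and the check assumes a local filesystem
rather than HDFS.

```shell
#!/bin/sh
# segment_ready is an illustrative helper, not a Nutch command.
# Assumption: updatedb needs both crawl_fetch and crawl_parse in a segment.
segment_ready() {
  segment="$1"
  for sub in crawl_fetch crawl_parse; do
    if [ ! -d "$segment/$sub" ]; then
      echo "missing $sub in $segment"
      return 1
    fi
  done
  echo "ok $segment"
}
```

It could be used as a guard before updatedb, for example:
segment_ready /home/crawl/segments/12345 || bin/nutch parse /home/crawl/segments/12345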

Best Regards,
-C.B.

On Fri, Jul 8, 2011 at 11:00 PM, lewis john mcgibbney
<le...@gmail.com> wrote:
> Hi C.B.,
>
> It looks like you may have simply missed the '-dir' option when you were
> specifying your crawldb directory to be updated from the fetched segment.
> Have a look here [1].
>
> Can you please try that and post your results?
>
> [1] http://wiki.apache.org/nutch/bin/nutch_updatedb

Re: skipping invalid segments

Posted by lewis john mcgibbney <le...@gmail.com>.

Hi C.B.,

It looks like you may have simply missed the '-dir' option when you were
specifying your crawldb directory to be updated from the fetched segment.
Have a look here [1].

Can you please try that and post your results?

[1] http://wiki.apache.org/nutch/bin/nutch_updatedb
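
As an illustration of the two ways to point updatedb at segments, here
is a small sketch. It assumes the Nutch 1.x usage
"updatedb <crawldb> (-dir <segments> | <seg1> <seg2> ...)", where -dir
takes the parent directory holding the segments rather than a single
segment. updatedb_cmd is a hypothetical helper that only prints the
command line; it does not run Nutch.

```shell
#!/bin/sh
# updatedb_cmd is an illustrative helper that only prints the command line.
# Assumed usage (Nutch 1.x):
#   updatedb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-noAdditions]
updatedb_cmd() {
  crawldb="$1"
  shift
  echo bin/nutch updatedb "$crawldb" "$@" -noAdditions
}

# Form 1: name one or more segments explicitly.
updatedb_cmd /home/crawl/crawldb /home/crawl/segments/123456789

# Form 2: -dir takes the parent directory that holds the segments.
updatedb_cmd /home/crawl/crawldb -dir /home/crawl/segments
```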



-- 
*Lewis*