Posted to user@nutch.apache.org by King Kong <ch...@hotmail.com> on 2006/08/23 10:13:47 UTC
How does Nutch-0.7.2 data upgrade to 0.8?
I have fetched about 3 GB of pages with Nutch 0.7.2.
Now I want to move this data to Nutch 0.8. How can I do it?
Any suggestions are appreciated.
--
View this message in context: http://www.nabble.com/How-does-Nutch-0.7.2-data-upgrade-to-0.8--tf2151013.html#a5940027
Sent from the Nutch - User forum at Nabble.com.
Re: How does Nutch-0.7.2 data upgrade to 0.8?
Posted by Andrzej Bialecki <ab...@getopt.org>.
King Kong wrote:
> Andrzej, how can I dump a 0.7 webdb into a text file that can be
> injected into the 0.8 crawldb?
>
bin/nutch readdb webdb -dumppageurl | awk '$1 ~ /^URL:/ {print $2}' > urls.txt
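
To see what the awk filter in the command above does, here is a minimal sketch run against made-up input. The sample lines are only an assumption about the dump format (lines beginning with "URL:" followed by the address), inferred from the filter itself; the real readdb output may contain other fields. The function name extract_urls is hypothetical.

```shell
#!/bin/sh
# extract_urls (hypothetical helper): keep only lines whose first field
# starts with "URL:" and print the second field of each such line.
# This is the same awk program as in the command above.
extract_urls() {
    awk '$1 ~ /^URL:/ {print $2}'
}

# Made-up sample resembling a page dump; only the "URL:" lines matter.
printf '%s\n' \
    'Page 1: Version: 4' \
    'URL: http://example.com/' \
    'Score: 1.0' \
    'URL: http://example.org/a.html' | extract_urls
```

Run as written, this prints the two example URLs, one per line. Against a real webdb you would pipe the readdb output into the function instead of the sample lines.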
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: How does Nutch-0.7.2 data upgrade to 0.8?
Posted by King Kong <ch...@hotmail.com>.
Andrzej, how can I dump a 0.7 webdb into a text file that can be
injected into the 0.8 crawldb?
Andrzej Bialecki wrote:
>
> King Kong wrote:
>> I have fetched about 3 GB of pages with Nutch 0.7.2.
>> Now I want to move this data to Nutch 0.8. How can I do it?
>>
>
> Unfortunately, the data is not portable between these versions. The only
> thing you could do to preserve your webdb is to dump it into a text
> file, and then inject into a 0.8 crawldb. As for the segments, you will
> have to refetch them.
>
Re: restarting fetch
Posted by Murat Ali Bayir <mu...@agmlab.com>.
You can delete all directories under the segment except 'crawl_generate' and
run the fetch command again.
There is no need to generate a new segment.
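
The cleanup described above could be sketched like this. The function name reset_segment is hypothetical, and the segment path in the usage comment is only an example; the sketch assumes the segment is a local directory that already contains crawl_generate.

```shell
#!/bin/sh
# reset_segment (hypothetical helper): remove every entry inside a segment
# directory except crawl_generate, so the fetch can be run again on it.
reset_segment() {
    seg=$1
    # Refuse to touch a directory that does not look like a generated segment.
    if [ ! -d "$seg/crawl_generate" ]; then
        echo "no crawl_generate in $seg" >&2
        return 1
    fi
    for entry in "$seg"/*; do
        # Keep the generated fetch list, delete everything else.
        if [ "$(basename "$entry")" != "crawl_generate" ]; then
            rm -rf "$entry"
        fi
    done
    return 0
}

# Usage (example path, adjust to your own segment):
#   reset_segment crawl/segments/20060823101347
```

After the cleanup, rerun the fetch on the same segment. Check the usage output of bin/nutch for the exact fetch arguments in your version.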
Richard Braman wrote:
>If you get a stop error in the middle of a fetch, should you refetch the
>segment, or just do another generate and fetch the newly generated segment?
>
>Likewise, if you have an error during index, can you just rerun the index
>command?
restarting fetch
Posted by Richard Braman <rb...@bramantax.com>.
If you get a stop error in the middle of a fetch, should you refetch the
segment, or just do another generate and fetch the newly generated segment?
Likewise, if you have an error during index, can you just rerun the index
command?
Re: How does Nutch-0.7.2 data upgrade to 0.8?
Posted by Ken Krugler <kk...@transpac.com>.
>> you could do a quick hack in 0.8 to
>> "fetch" the pages from your 0.7 crawl, using a modified fetcher.
>
> What do you mean? Do I have to modify the fetcher code myself?
Yes, you'd have to modify the 0.8 fetcher code (or rather create your
own plug-in) so that it uses a Nutch 0.7 search setup to get at all of the
previously fetched content.
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
Re: How does Nutch-0.7.2 data upgrade to 0.8?
Posted by King Kong <ch...@hotmail.com>.
>> you could do a quick hack in 0.8 to
>> "fetch" the pages from your 0.7 crawl, using a modified fetcher.
What do you mean? Do I have to modify the fetcher code myself?
Ken Krugler wrote:
>
>>It's really sad news for me. I will have to spend a lot of time fetching it
>>again.
>
> If it's only just HTML, then you could do a quick hack in 0.8 to
> "fetch" the pages from your 0.7 crawl, using a modified fetcher. You
> wouldn't have all of the header info, but if everything is text/html
> then you might be OK.
>
> -- Ken
Re: How does Nutch-0.7.2 data upgrade to 0.8?
Posted by Ken Krugler <kk...@transpac.com>.
>It's really sad news for me. I will have to spend a lot of time fetching it
>again.
If it's only just HTML, then you could do a quick hack in 0.8 to
"fetch" the pages from your 0.7 crawl, using a modified fetcher. You
wouldn't have all of the header info, but if everything is text/html
then you might be OK.
-- Ken
>Andrzej Bialecki wrote:
>>
>> King Kong wrote:
>>> I have fetched about 3 GB of pages with Nutch 0.7.2.
>>> Now I want to move this data to Nutch 0.8. How can I do it?
>>>
>>
>> Unfortunately, the data is not portable between these versions. The only
>> thing you could do to preserve your webdb is to dump it into a text
>> file, and then inject into a 0.8 crawldb. As for the segments, you will
>> have to refetch them.
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
Re: How does Nutch-0.7.2 data upgrade to 0.8?
Posted by King Kong <ch...@hotmail.com>.
It's really sad news for me. I will have to spend a lot of time fetching it
again.
However...
Andrzej, thanks for your help!
Andrzej Bialecki wrote:
>
> King Kong wrote:
>> I have fetched about 3 GB of pages with Nutch 0.7.2.
>> Now I want to move this data to Nutch 0.8. How can I do it?
>>
>
> Unfortunately, the data is not portable between these versions. The only
> thing you could do to preserve your webdb is to dump it into a text
> file, and then inject into a 0.8 crawldb. As for the segments, you will
> have to refetch them.
>
Re: How does Nutch-0.7.2 data upgrade to 0.8?
Posted by Howie Wang <ho...@hotmail.com>.
>>Is it just that no migration utility has been written? Is there something
>>about the structures in 0.8 that makes migrating the data impossible,
>>or extremely difficult?
>
>Hey, these are just bits and bytes on the disk, so nothing is impossible ;)
Thanks, Andrzej, it sounds non-trivial :-( For me, it's not
that the data in the segments is so precious that I can't
get it again. It's more that refetching would take a couple
of months for me.
Howie
Re: How does Nutch-0.7.2 data upgrade to 0.8?
Posted by Andrzej Bialecki <ab...@getopt.org>.
Howie Wang wrote:
>> Unfortunately, the data is not portable between these versions. The
>> only thing you could do to preserve your webdb is to dump it into a
>> text file, and then inject into a 0.8 crawldb. As for the segments,
>> you will have to refetch them.
>
> Is it just that no migration utility has been written? Is there something
> about the structures in 0.8 that makes migrating the data impossible,
> or extremely difficult?
Hey, these are just bits and bytes on the disk, so nothing is impossible ;)
Sure, given enough time and resources you could write a converter; all
I'm saying is that it would take a lot of tedious and error-prone
coding, so for all practical purposes it's better to dump/inject and
re-fetch. Unless your segment data is so precious that you are willing
to bear the price of writing a converter.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: How does Nutch-0.7.2 data upgrade to 0.8?
Posted by Howie Wang <ho...@hotmail.com>.
>Unfortunately, the data is not portable between these versions. The only
>thing you could do to preserve your webdb is to dump it into a text file,
>and then inject into a 0.8 crawldb. As for the segments, you will have to
>refetch them.
Is it just that no migration utility has been written? Is there something
about the structures in 0.8 that makes migrating the data impossible,
or extremely difficult?
Thanks,
Howie
Re: How does Nutch-0.7.2 data upgrade to 0.8?
Posted by Andrzej Bialecki <ab...@getopt.org>.
King Kong wrote:
> I have fetched about 3 GB of pages with Nutch 0.7.2.
> Now I want to move this data to Nutch 0.8. How can I do it?
>
Unfortunately, the data is not portable between these versions. The only
thing you could do to preserve your webdb is to dump it into a text
file, and then inject into a 0.8 crawldb. As for the segments, you will
have to refetch them.
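
As a rough sketch of the "inject" half of that advice: once the 0.7 URLs are in a text file, they can be cleaned up and placed in a seed directory for 0.8. The helper name make_seed_dir, the file name seed.txt, and the paths in the usage comment are all hypothetical. The inject invocation in the comment assumes the 0.8 form of a crawldb path followed by a directory of URL files, so check the bin/nutch usage output before relying on it.

```shell
#!/bin/sh
# make_seed_dir (hypothetical helper): copy a URL list into a seed
# directory, dropping blank lines and duplicate URLs along the way.
make_seed_dir() {
    list=$1
    dir=$2
    mkdir -p "$dir" || return 1
    # grep -v drops empty or whitespace-only lines; sort -u removes duplicates.
    grep -v '^[[:space:]]*$' "$list" | sort -u > "$dir/seed.txt"
}

# Usage (assumed paths; urls.txt is the text dump of the 0.7 webdb):
#   make_seed_dir urls.txt urls
#   bin/nutch inject crawl/crawldb urls
```

Deduplicating first is optional, but it keeps the seed list small when the dump contains repeated entries.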
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com