Posted to user@nutch.apache.org by King Kong <ch...@hotmail.com> on 2006/08/23 10:13:47 UTC

How does Nutch-0.7.2 data upgrade to 0.8?

I have fetched about 3 GB of pages with Nutch 0.7.2.
Now I want to move the data to Nutch 0.8. How can I do it?


Any suggestion is appreciated.
-- 
View this message in context: http://www.nabble.com/How-does-Nutch-0.7.2-data-upgrade-to-0.8--tf2151013.html#a5940027
Sent from the Nutch - User forum at Nabble.com.


Re: How does Nutch-0.7.2 data upgrade to 0.8?

Posted by Andrzej Bialecki <ab...@getopt.org>.
King Kong wrote:
> Andrzej, how can I dump a 0.7 webdb into a text file that can be
> injected into the 0.8 crawldb?
>   


bin/nutch readdb webdb -dumppageurl | awk '$1 ~ /^URL:/ {print $2}' > urls.txt
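
As a rough illustration of what the awk filter in the command above does, here is a minimal sketch run against made-up sample text. The "URL: <url>" line layout is an assumption inferred from the awk pattern itself, not captured from a real webdb dump, and the names sample_dump and extract_urls are hypothetical. The inject step in the trailing comments assumes the 0.8 usage "bin/nutch inject <crawldb> <url_dir>" and a crawl/crawldb path chosen only for illustration.

```shell
# Hypothetical sample of dump output; only the "URL: <url>" line shape matters here.
sample_dump='URL: http://example.com/
Title: Example
URL: http://example.org/page.html
Score: 1.0'

# Same filter as in the command above: keep lines whose first
# whitespace-separated field starts with "URL:" and print the second field.
extract_urls() {
  awk '$1 ~ /^URL:/ {print $2}'
}

urls=$(printf '%s\n' "$sample_dump" | extract_urls)
printf '%s\n' "$urls"

# Assumed follow-up on the 0.8 side (paths are illustrative):
#   mkdir urls && mv urls.txt urls/
#   bin/nutch inject crawl/crawldb urls
```

Only the URLs survive this step; fetch times, scores and link data from the 0.7 webdb are not carried over.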

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How does Nutch-0.7.2 data upgrade to 0.8?

Posted by King Kong <ch...@hotmail.com>.
Andrzej, how can I dump a 0.7 webdb into a text file that can be
injected into the 0.8 crawldb?



Andrzej Bialecki wrote:
> 
> King Kong wrote:
>> I have fetched about 3 GB of pages with Nutch 0.7.2.
>> Now I want to move the data to Nutch 0.8. How can I do it?
>>   
> 
> Unfortunately, the data is not portable between these versions. The only 
> thing you could do to preserve your webdb is to dump it into a text 
> file, and then inject into a 0.8 crawldb. As for the segments, you will 
> have to refetch them.


Re: restarting fetch

Posted by Murat Ali Bayir <mu...@agmlab.com>.
You can delete all directories under the segment except 'crawl_generate'
and run the fetch command again.
There is no need to generate a new segment.
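
A minimal sketch of that cleanup, assuming a 0.8-style segment directory. The helper name reset_segment is hypothetical, and the demonstration runs on a throwaway directory rather than a real segment; the subdirectory names other than crawl_generate are only illustrative.

```shell
# Remove every subdirectory of a segment except crawl_generate,
# so that the fetch can be re-run on the same segment.
reset_segment() {
  segment="$1"
  for entry in "$segment"/*; do
    # Skip anything that is not a directory.
    if [ ! -d "$entry" ]; then
      continue
    fi
    # Keep the generated fetch list.
    if [ "$(basename "$entry")" = "crawl_generate" ]; then
      continue
    fi
    rm -rf "$entry"
  done
}

# Demonstration on a temporary directory standing in for a segment.
demo=$(mktemp -d)
mkdir -p "$demo/crawl_generate" "$demo/crawl_fetch" "$demo/content"
reset_segment "$demo"
ls "$demo"

# Afterwards the fetch would be re-run on the real segment, e.g.:
#   bin/nutch fetch crawl/segments/<segment>
```

Since rm -rf is destructive, it is worth checking the segment path carefully before running anything like this on real data.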

Richard Braman wrote:

>If you get a stop error in the middle of a fetch, should you refetch the
>segment, or just do another generate and fetch the newly generated segment?
>
>Likewise, if you have an error during index, can you just rerun the index
>command?


restarting fetch

Posted by Richard Braman <rb...@bramantax.com>.
If you get a stop error in the middle of a fetch, should you refetch the
segment, or just do another generate and fetch the newly generated segment?

Likewise, if you have an error during index, can you just rerun the index
command?



Re: How does Nutch-0.7.2 data upgrade to 0.8?

Posted by Ken Krugler <kk...@transpac.com>.
>> you could do a quick hack in 0.8 to
>> "fetch" the pages from your 0.7 crawl, using a modified fetcher.
>
>What do you mean? Do I have to modify the fetcher code myself?

Yes, you'd have to modify the 0.8 fetcher code (or rather create your 
own plug-in) that uses a Nutch 0.7 search setup to get at all of the 
previously fetched content.

-- Ken


-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

Re: How does Nutch-0.7.2 data upgrade to 0.8?

Posted by King Kong <ch...@hotmail.com>.
>> you could do a quick hack in 0.8 to
>> "fetch" the pages from your 0.7 crawl, using a modified fetcher.

What do you mean? Do I have to modify the fetcher code myself?


Ken Krugler wrote:
> 
>>That's really sad news for me. I will have to spend a lot of time fetching it
>>again.
> 
> If it's only just HTML, then you could do a quick hack in 0.8 to 
> "fetch" the pages from your 0.7 crawl, using a modified fetcher. You 
> wouldn't have all of the header info, but if everything is text/html 
> then you might be OK.
> 
> -- Ken


Re: How does Nutch-0.7.2 data upgrade to 0.8?

Posted by Ken Krugler <kk...@transpac.com>.
>That's really sad news for me. I will have to spend a lot of time fetching it
>again.

If it's only just HTML, then you could do a quick hack in 0.8 to 
"fetch" the pages from your 0.7 crawl, using a modified fetcher. You 
wouldn't have all of the header info, but if everything is text/html 
then you might be OK.

-- Ken



Re: How does Nutch-0.7.2 data upgrade to 0.8?

Posted by King Kong <ch...@hotmail.com>.
That's really sad news for me. I will have to spend a lot of time fetching it
again.

However...

Andrzej,thanks for your help!



Andrzej Bialecki wrote:
> 
> King Kong wrote:
>> I have fetched about 3 GB of pages with Nutch 0.7.2.
>> Now I want to move the data to Nutch 0.8. How can I do it?
>>   
> 
> Unfortunately, the data is not portable between these versions. The only 
> thing you could do to preserve your webdb is to dump it into a text 
> file, and then inject into a 0.8 crawldb. As for the segments, you will 
> have to refetch them.


Re: How does Nutch-0.7.2 data upgrade to 0.8?

Posted by Howie Wang <ho...@hotmail.com>.
>>Is it just that no migration utility has been written? Is there something
>>about the structures in 0.8 that makes migrating the data impossible,
>>or extremely difficult?
>
>Hey, these are just bits and bytes on the disk, so nothing is impossible ;)

Thanks, Andrzej, it sounds non-trivial :-(  For me, it's not
that the data in the segments is so precious that I can't
get it again. It's more that refetching would take a couple
of months for me.

Howie



Re: How does Nutch-0.7.2 data upgrade to 0.8?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Howie Wang wrote:
>> Unfortunately, the data is not portable between these versions. The 
>> only thing you could do to preserve your webdb is to dump it into a 
>> text file, and then inject into a 0.8 crawldb. As for the segments, 
>> you will have to refetch them.
>
> Is it just that no migration utility has been written? Is there something
> about the structures in 0.8 that makes migrating the data impossible,
> or extremely difficult?

Hey, these are just bits and bytes on the disk, so nothing is impossible ;)

Sure, given enough time and resources you could write a converter; all 
I'm saying is that it would take a lot of tedious and error-prone 
coding, so for all practical purposes it's better to dump/inject and 
re-fetch, unless your segment data is so precious that you are willing 
to pay the price of writing a converter.




Re: How does Nutch-0.7.2 data upgrade to 0.8?

Posted by Howie Wang <ho...@hotmail.com>.
>Unfortunately, the data is not portable between these versions. The only 
>thing you could do to preserve your webdb is to dump it into a text file, 
>and then inject into a 0.8 crawldb. As for the segments, you will have to 
>refetch them.

Is it just that no migration utility has been written? Is there something
about the structures in 0.8 that makes migrating the data impossible,
or extremely difficult?

Thanks,
Howie



Re: How does Nutch-0.7.2 data upgrade to 0.8?

Posted by Andrzej Bialecki <ab...@getopt.org>.
King Kong wrote:
> I have fetched about 3 GB of pages with Nutch 0.7.2.
> Now I want to move the data to Nutch 0.8. How can I do it?
>   

Unfortunately, the data is not portable between these versions. The only 
thing you could do to preserve your webdb is to dump it into a text 
file, and then inject into a 0.8 crawldb. As for the segments, you will 
have to refetch them.
