Posted to user@nutch.apache.org by kaveh minooie <ka...@plutoz.com> on 2013/02/27 02:03:12 UTC

migrating from 1.x to 2.x

me again,

is there any way that I can import my existing crawldb from a Nutch 1.4
install, which has about 2.5 B (with a B) links in it and currently
resides in an HDFS file system, into the webpage table in HBase?


and what happened to the linkdb in Nutch 2.x?

thanks,

Re: migrating from 1.x to 2.x

Posted by Lewis John McGibbney <le...@gmail.com>.
Hi kaveh,
The size of the crawl database is not an issue for migration between
Nutch versions; it is the compatibility of the data formats that you
need to be concerned about.
There are no tools currently available in Nutch (as far as I know) to
read URLs from HDFS and import/inject your crawl data into your HBase
cluster. This is mostly due to the direction in which Nutch is moving,
which is to do just crawling, at scale, quickly. We don't have any
immediate need or enthusiasm to maintain legacy tools in the codebase,
and we have been trying to reduce that part of it. That doesn't change
much here, though, as there was never a tool for this specific purpose
anyway (as far as I know).
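
Just to sketch what such a one-off tool might look like: the crawldb is
a set of SequenceFiles of <url, CrawlDatum> pairs under crawldb/current,
so a map-only MapReduce job could read those and emit HBase Puts against
the 2.x webpage table directly. This is only a rough sketch, not a
working migration tool: the family and qualifier names below ("f", "st",
"ts") are placeholders I made up, and the real schema is whatever your
conf/gora-hbase-mapping.xml defines, so check there first.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.TableUtil;

public class CrawlDbToHBase {

  // Map-only job: each <url, CrawlDatum> pair becomes one Put against
  // the 2.x "webpage" table. No reducer is needed.
  public static class ImportMapper
      extends Mapper<Text, CrawlDatum, ImmutableBytesWritable, Put> {
    @Override
    protected void map(Text url, CrawlDatum datum, Context ctx)
        throws IOException, InterruptedException {
      // 2.x keys rows by the reversed URL, e.g. "com.example:http/".
      // TableUtil is the 2.x helper, CrawlDatum is the 1.x class, so
      // both versions' jars would need to be on the job classpath.
      byte[] row = Bytes.toBytes(TableUtil.reverseUrl(url.toString()));
      Put put = new Put(row);
      // "f"/"st"/"ts" are placeholders; also note the 1.x CrawlDatum
      // status constants differ from the 2.x CrawlStatus ones, so a
      // real import would translate them rather than copy them raw.
      put.add(Bytes.toBytes("f"), Bytes.toBytes("st"),
          Bytes.toBytes((int) datum.getStatus()));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("ts"),
          Bytes.toBytes(datum.getFetchTime()));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "crawldb-to-webpage");
    job.setJarByClass(CrawlDbToHBase.class);
    // crawldb/current/part-*/data holds the <url, CrawlDatum> pairs
    job.setInputFormatClass(SequenceFileInputFormat.class);
    SequenceFileInputFormat.addInputPath(job, new Path(args[0]));
    job.setMapperClass(ImportMapper.class);
    job.setNumReduceTasks(0);
    // Wires up TableOutputFormat against the "webpage" table
    TableMapReduceUtil.initTableReducerJob("webpage", null, job);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

At 2.5 B rows you would also want to pre-split the webpage table, and
probably write HFiles for bulk load rather than issuing individual Puts.
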
It is, however, becoming something I am getting interested in (the
notion of obtaining lots of data from various data stores and
bootstrapping Nutch with it). I would really like to read the data with
Gora and map it somewhere. I am interested in the Nutch inject code and
would be happy to extend it, or write new code, to solve this issue.
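
If I were doing it through Gora instead, the per-record conversion would
look roughly like the following. This is untested: the DataStoreFactory
call and the WebPage setters are from memory, so double-check them
against the Gora and Nutch 2.x javadocs before relying on any of it.

import java.io.IOException;
import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.util.TableUtil;

public class GoraImportSketch {

  // Writes one converted 1.x record through the Gora DataStore.
  public static void importOne(DataStore<String, WebPage> store,
      String url, CrawlDatum datum) throws IOException {
    WebPage page = new WebPage();
    page.setFetchTime(datum.getFetchTime());
    page.setScore(datum.getScore());
    // As above: 1.x CrawlDatum and 2.x CrawlStatus use different
    // status constants, so a real import would translate them here.
    store.put(TableUtil.reverseUrl(url), page);
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Uses the default store from gora.properties, which should point
    // at HBaseStore so the hbase mapping handles the physical schema.
    DataStore<String, WebPage> store =
        DataStoreFactory.getDataStore(String.class, WebPage.class, conf);
    // ... loop over the old crawldb records, calling importOne(...) ...
    store.flush();
    store.close();
  }
}

The appeal of going through Gora is that gora-hbase-mapping.xml handles
the physical schema for you, so the code never needs to know anything
about column families or qualifiers.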


-- 
*Lewis*