Posted to user@nutch.apache.org by adfel70 <ad...@gmail.com> on 2013/02/26 18:18:03 UTC

nutch-2.1 with hbase - any good tool for querying results?

Is anybody using a good tool for performing queries on the crawl results
directly from HBase?
Some of the queries I want to make are: get all the URLs that failed
fetching, get all the URLs that failed parsing.

Querying HBase directly seems more convenient than running readdb, waiting
for results, and then parsing the readdb output to get the required
information.
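
As a rough illustration of the kind of query described above, here is a
minimal Python sketch of filtering scanned rows by a status column. It is a
sketch under stated assumptions, not a tested recipe for any particular
cluster: the column name "f:st", the 4-byte big-endian integer encoding, the
sample rows, and the status code 3 are all assumptions for illustration, so
check the gora-hbase-mapping.xml and status constants of your own setup. The
function only expects an iterable of (row key, {column: bytes}) pairs, which
is the shape a client such as happybase yields from a table scan.

```python
import struct

# Assumed column name for the fetch status; verify against your own
# gora-hbase-mapping.xml before relying on it.
FETCH_STATUS_COL = b"f:st"


def decode_int(raw):
    """Decode a 4-byte big-endian signed integer (assumed storage format)."""
    return struct.unpack(">i", raw)[0]


def urls_with_status(rows, column, wanted):
    """Yield row keys whose integer value in `column` is in `wanted`.

    `rows` is any iterable of (row_key, {column: raw_bytes}) pairs.
    Rows that lack the column, or hold a value of another length, are skipped.
    """
    for key, data in rows:
        raw = data.get(column)
        if raw is not None and len(raw) == 4 and decode_int(raw) in wanted:
            yield key.decode("utf-8")


# Made-up sample rows standing in for a real scan of the crawl table.
sample_rows = [
    (b"com.example.www:http/", {b"f:st": struct.pack(">i", 2)}),
    (b"com.example.www:http/missing", {b"f:st": struct.pack(">i", 3)}),
    (b"org.example:http/", {}),
]

# The status code 3 is a placeholder for "whatever your setup uses for failure".
failed = list(urls_with_status(sample_rows, FETCH_STATUS_COL, wanted={3}))
print(failed)
```

The same function could be pointed at a parse-status column by passing a
different column name and set of codes, assuming that column is stored in
the same integer format.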

thanks.



--
View this message in context: http://lucene.472066.n3.nabble.com/nutch-2-1-with-hbase-any-good-tool-for-querying-results-tp4043109.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: nutch-2.1 with hbase - any good tool for querying results?

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.

Hi there,

There are two options, and the choice depends on how you would like to
access the data.
If you would like to access it using a data flow (Pig directly), you
can find tons of information, e.g. [1]; if you want JDBC-like access,
you can use Gora ;)
Let us know what you feel like using.


Renato M.

[1] http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/backend/hadoop/hbase/HBaseStorage.html

2013/2/27 Lewis John Mcgibbney <le...@gmail.com>:
> What for? What do you want?
> We are discussing (in the Gora community) making a gora-pig module so that
> there is a unified mechanism for doing pig driven inference of the data you
> hold in gora-* stores. Are you interested in engaging in that conversation?
> In all honesty (although indirectly linked) the Nutch list is not the
> appropriate platform for us to take this forward. We need to use Gora or
> Pig lists...
> I am interested in this.
>
> On Wednesday, February 27, 2013, adfel70 <ad...@gmail.com> wrote:
>> Can anyone share pig scripts for querying nutch data?
>
> --
> *Lewis*

Re: nutch-2.1 with hbase - any good tool for querying results?

Posted by Lewis John Mcgibbney <le...@gmail.com>.

What for? What do you want?
We are discussing (in the Gora community) making a gora-pig module so that
there is a unified mechanism for doing pig driven inference of the data you
hold in gora-* stores. Are you interested in engaging in that conversation?
In all honesty (although indirectly linked) the Nutch list is not the
appropriate platform for us to take this forward. We need to use Gora or
Pig lists...
I am interested in this.

On Wednesday, February 27, 2013, adfel70 <ad...@gmail.com> wrote:
> Can anyone share pig scripts for querying nutch data?

-- 
*Lewis*

Re: nutch-2.1 with hbase - any good tool for querying results?

Posted by adfel70 <ad...@gmail.com>.

Can anyone share pig scripts for querying nutch data?





Re: migrating from 1.x to 2.x

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi kaveh,
The size of the crawl database is not the issue when migrating between
Nutch versions; compatibility is what you need to be concerned about.
There are no tools currently available in Nutch (as far as I know) to read
URLs from HDFS and import/inject your crawl data into your HBase cluster.
This is mostly due to the direction in which Nutch is moving, which is to
do just crawling, at scale, quickly. We don't have an immediate necessity
or passion to maintain legacy tools within the codebase and have been
trying to reduce this aspect of the codebase. That doesn't help here,
however, as there was never a tool for this specific purpose anyway (as far
as I know).
It is, however, becoming something I am getting interested in (the notion
of obtaining lots of data from various data stores and bootstrapping Nutch
with it). I would really like to read the data with Gora and map it
somewhere. I am interested in the Nutch injecting code and would be
interested in extending it or writing new code to solve this issue.
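
One detail any such import code would have to handle is the row key. The
following is a minimal Python sketch, assuming the reversed-host key layout
produced by Nutch 2.x's TableUtil.reverseUrl (host labels reversed, then ':'
and the protocol, an optional ':port', then the path and query). The name
reverse_url is illustrative only; this is not the Nutch implementation, and
it does not read anything from HDFS or write anything to HBase.

```python
from urllib.parse import urlsplit


def reverse_url(url):
    """Build a reversed-host row key from a URL (assumed Nutch 2.x layout)."""
    parts = urlsplit(url)
    # Note: urlsplit lowercases the hostname, which may differ from Java's URL.
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))

    key = reversed_host + ":" + parts.scheme
    # The port is only included when the URL states it explicitly.
    if parts.port is not None:
        key += ":" + str(parts.port)

    # Path plus query string, mirroring what java.net.URL calls the "file".
    path = parts.path
    if parts.query:
        path += "?" + parts.query
    if path and not path.startswith("/"):
        path = "/" + path
    return key + path


print(reverse_url("http://bar.foo.com:8983/to/index.html?a=b"))
```

Keys built this way keep pages from the same domain adjacent in the table,
which is the usual reason given for reversing the host.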

On Tue, Feb 26, 2013 at 5:03 PM, kaveh minooie <ka...@plutoz.com> wrote:

> Me again,
>
> Is there any way I can import my existing crawldb from a Nutch 1.4
> install, which has about 2.5 B (with a B) links in it and currently
> resides in an HDFS file system, into the webpages table in HBase?
>
>
> And what happened to the linkdb in Nutch 2.x?
>
> thanks,
>

-- 
*Lewis*

migrating from 1.x to 2.x

Posted by kaveh minooie <ka...@plutoz.com>.

Me again,

Is there any way I can import my existing crawldb from a Nutch 1.4
install, which has about 2.5 B (with a B) links in it and currently
resides in an HDFS file system, into the webpages table in HBase?


And what happened to the linkdb in Nutch 2.x?

thanks,

Re: nutch-2.1 with hbase - any good tool for querying results?

Posted by Lewis John Mcgibbney <le...@gmail.com>.

We will be working on better support (gora-pig adapter) for this
functionality in Apache Gora > 0.3.
For now Kiran's suggestion is by far the best.
Thank you
Lewis

On Tue, Feb 26, 2013 at 10:17 AM, kiran chitturi
<ch...@gmail.com> wrote:

> I found Apache Pig [1] convenient to use with HBase for querying and
> filtering.
>
> [1] http://pig.apache.org/
>
>
>
>
> On Tue, Feb 26, 2013 at 12:18 PM, adfel70 <ad...@gmail.com> wrote:
>
> > Is anybody using a good tool for performing queries on the crawl results
> > directly from HBase?
> > Some of the queries I want to make are: get all the URLs that failed
> > fetching, get all the URLs that failed parsing.
> >
> > Querying HBase directly seems more convenient than running readdb, waiting
> > for results, and then parsing the readdb output to get the required
> > information.
> >
> > thanks.
>
> --
> Kiran Chitturi
>



-- 
*Lewis*

Re: nutch-2.1 with hbase - any good tool for querying results?

Posted by kiran chitturi <ch...@gmail.com>.

I found Apache Pig [1] convenient to use with HBase for querying and
filtering.

[1] http://pig.apache.org/




On Tue, Feb 26, 2013 at 12:18 PM, adfel70 <ad...@gmail.com> wrote:

> Is anybody using a good tool for performing queries on the crawl results
> directly from HBase?
> Some of the queries I want to make are: get all the URLs that failed
> fetching, get all the URLs that failed parsing.
>
> Querying HBase directly seems more convenient than running readdb, waiting
> for results, and then parsing the readdb output to get the required
> information.
>
> thanks.



-- 
Kiran Chitturi