Posted to user@nutch.apache.org by "McGibbney, Lewis John" <Le...@gcu.ac.uk> on 2011/01/14 14:00:50 UTC

Database data storage question

Hello List,

I am gathering information on the above topic as I intend to integrate a database to store fetched data. Before doing so, I would like community input on any experiences with different database implementations, e.g. a comparison between HBase and MySQL.

Thank you

Lewis


Glasgow Caledonian University is a registered Scottish charity, number SC021474

Winner: Times Higher Education's Widening Participation Initiative of the Year 2009 and Herald Society's Education Initiative of the Year 2009
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html

Re: Database data storage question

Posted by Alexis <al...@gmail.com>.
>> NoSQL technology scales better, but for a "reasonable" volume MySQL
>> will do the job fine and faster.

Sorry, to correct myself: MySQL was not working that well in my tests
with the Gora code as is, because of the broad SELECT statement. The
issue is described here:
http://techvineyard.blogspot.com/2010/12/build-nutch-20.html#Gora

and reported in the JIRA: https://issues.apache.org/jira/browse/GORA-23

Alexis

Re: Database data storage question

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 1/16/11 7:40 AM, Alexis wrote:
> Dear Otis and Lewis,
>
> According to the few tests I made, MySQL has the best
> performance compared to HSQL and HBase. HSQL is slower and takes up
> a lot of disk space. HBase uses more resources. Under HBase, I couldn't
> get the Fetch job to complete when holding 5000 pages buffered in
> memory without my laptop becoming extremely slow. It finally
> worked when flushing to the store every 2500 pages. Under
> MySQL, it ran smoothly with a value of 10000.

...and this is of course nowhere near the level of scalability that 1.x 
releases had, as they would easily crawl a hundred million pages. 
There's a lot of remaining work on Gora and its integration with Nutch 
that affects this situation.

Eventually I expect HBase will be the best choice for large scale 
crawling, with MySQL backend suitable for small to medium scale, and 
HSQL being used only for tests or really small crawls < 1000 pages.

>
> NoSQL technology scales better, but for a "reasonable" volume MySQL
> will do the job fine and faster.
>
> It would be nice to test Cassandra as a Gora backend. Write operations
> are allegedly faster than with HBase. Haven't tried yet.

There are some concurrency limitations in the Cassandra client - OTOH 
that's maybe where Gora needs to improve.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Database data storage question

Posted by Alexis <al...@gmail.com>.
Dear Otis and Lewis,

According to the few tests I made, MySQL has the best
performance compared to HSQL and HBase. HSQL is slower and takes up
a lot of disk space. HBase uses more resources. Under HBase, I couldn't
get the Fetch job to complete when holding 5000 pages buffered in
memory without my laptop becoming extremely slow. It finally
worked when flushing to the store every 2500 pages. Under
MySQL, it ran smoothly with a value of 10000.
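
To illustrate what the flushing frequency above controls, here is a
minimal sketch in Python. It is not Gora or Nutch code: BufferedStore,
backend_write, flush_every and crawl are hypothetical names made up for
this example, and the "backend" is just an in-memory list. It only shows
the general idea of holding pages in memory and writing them to the
store in batches.

```python
class BufferedStore:
    """Holds records in memory and writes them to a backend in batches."""

    def __init__(self, backend_write, flush_every):
        if flush_every < 1:
            raise ValueError("flush_every must be at least 1")
        # backend_write is a hypothetical callable that takes a list of records.
        self._backend_write = backend_write
        self._flush_every = flush_every
        self._buffer = []
        self.flush_count = 0

    def put(self, record):
        # Buffer the record; flush once the buffer reaches the limit.
        self._buffer.append(record)
        if len(self._buffer) >= self._flush_every:
            self.flush()

    def flush(self):
        # Write all buffered records in one batch, then empty the buffer.
        if not self._buffer:
            return
        self._backend_write(list(self._buffer))
        self._buffer.clear()
        self.flush_count += 1


def crawl(pages, flush_every):
    """Feed pages through a BufferedStore backed by an in-memory list."""
    written = []
    store = BufferedStore(written.extend, flush_every)
    for page in pages:
        store.put(page)
    store.flush()  # write whatever is left in the buffer
    return written, store.flush_count
```

A smaller flush_every keeps fewer pages in memory at once at the cost of
more writes to the store, which is the trade-off described above.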

NoSQL technology scales better, but for a "reasonable" volume MySQL
will do the job fine and faster.

It would be nice to test Cassandra as a Gora backend. Write operations
are allegedly faster than with HBase. Haven't tried yet.

Alexis


On Sun, Jan 16, 2011 at 12:57 PM, McGibbney, Lewis John
<Le...@gcu.ac.uk> wrote:
> Hi Otis,
>
> Thank you for this. From reading various posts on this list and the roadmap for Nutch 2.0, I had gathered that HBase was probably the best-supported option within the community.
>
> Lewis

RE: Database data storage question

Posted by "McGibbney, Lewis John" <Le...@gcu.ac.uk>.
Hi Otis,

Thank you for this. From reading various posts on this list and the roadmap for Nutch 2.0, I had gathered that HBase was probably the best-supported option within the community.

Lewis

________________________________________
From: Otis Gospodnetic [ogjunk-nutch@yahoo.com]
Sent: 16 January 2011 10:45
To: user@nutch.apache.org
Subject: Re: Database data storage question

There are lots of factors to consider, so one can't give a good general answer,
but:

Nutch already uses HBase (trunk), so that's +1 for HBase.  HBase makes it easy
to scale and has built-in replication thanks to being built on top of HDFS.

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




Re: Database data storage question

Posted by Otis Gospodnetic <og...@yahoo.com>.
There are lots of factors to consider, so one can't give a good general answer, 
but:

Nutch already uses HBase (trunk), so that's +1 for HBase.  HBase makes it easy 
to scale and has built-in replication thanks to being built on top of HDFS.

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


