Posted to user@hadoop.apache.org by Alec Taylor <al...@gmail.com> on 2015/01/20 12:40:08 UTC

Low-latency queries, HDFS exclusively or should I go, e.g.: MongoDB?

I am architecting a platform incorporating: recommender systems,
information retrieval (ML), sequence mining, and Natural Language
Processing.

Additionally I have the generic CRUD and authentication components,
with everything exposed RESTfully.

For the storage layer(s), there are a few options which immediately
present themselves:

Generic CRUD layer (high speed needed here, though I suppose I could use Redis…)

- Hadoop with HBase, perhaps with Phoenix for an elastic loose-schema
SQL layer atop
- Apache Spark (perhaps piping to HDFS)… ¿maybe?
- MongoDB (or a similar document-store), a graph-database, or even
something like Postgres
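
For the HBase route, low-latency point reads come down mostly to row-key design; one common pattern is salting keys so sequential IDs don't hotspot a single region. A minimal Python sketch (the bucket count, key layout, and function name are my own illustrative assumptions, not anything HBase prescribes):

```python
import hashlib

SALT_BUCKETS = 16  # assumed bucket count; would be tuned to the cluster


def salted_row_key(user_id):
    """Prefix the key with a hash-derived salt so sequential user IDs
    spread across regions instead of piling onto one region server."""
    salt = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % SALT_BUCKETS
    return "{:02d}|{}".format(salt, user_id).encode()
```

A point read recomputes the same salt from the ID, so lookups stay single-row; only scans need to fan out across all buckets.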

Analytics layer (to enable Big Data / Data-intensive computing features)

- Apache Spark
- Hadoop with MapReduce and/or utilising some other Apache /
non-Apache project with integration
- Disco (from Nokia)

________________________________

Should I prefer one layer—e.g. on HDFS—over multiple disparate
layers? The advantage here is obvious, but I am certain there are
disadvantages. (And yes, I know there are various ways, automated and
manual, to push data from non HDFS-backed stores to HDFS.)

Also, as a bonus answer, which stack would you recommend for this
user-network I'm building?

Re: Low-latency queries, HDFS exclusively or should I go, e.g.: MongoDB?

Posted by daemeon reiydelle <da...@gmail.com>.
At the end of the day, the more data that is pulled from multiple physical
nodes, the (relatively) slower your response to queries. Until you reach a
point where that response time exceeds your business requirements, keep it
simple. As volumes grow, with distributed data sources feeding the queries,
you will need to begin considering a relational or pseudo-relational
architecture "on top of" Hadoop.

The driving questions tend to be:

- Does the mix of queries access the entire range of base data across the
cluster?
- How much latency is permissible between receipt of (raw) new data,
processing of that data into the SQL/NotOnlySQL repository, and delivery of
the full mix of results to your spectrum of users?
- Can you move some mix of queries to a somewhat out-of-date repository
that is refreshed, e.g., daily?
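
That last question amounts to a routing decision the serving layer can make explicitly: tag each query with the staleness it can tolerate and send it to whichever store fits. A minimal Python sketch; the store names and the nightly-refresh setup are invented for illustration, not anything from the thread:

```python
from datetime import datetime, timedelta


def route_query(staleness_budget, replica_refreshed_at, now):
    """Route to the cheap batch-refreshed replica whenever its age fits
    inside the caller's staleness budget; otherwise hit the (hypothetical)
    low-latency operational store."""
    replica_age = now - replica_refreshed_at
    return "replica" if replica_age <= staleness_budget else "operational"


# A nightly-loaded replica can serve day-old analytics, but not fresh reads:
now = datetime(2015, 1, 21, 12, 0)
refreshed = datetime(2015, 1, 21, 2, 0)
assert route_query(timedelta(days=1), refreshed, now) == "replica"
assert route_query(timedelta(minutes=5), refreshed, now) == "operational"
```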

At the end of the day, complexity of your business data requirements and
the complexity of the process drive one to more layers and complex
solutions.




*.......*

*“Life should not be a journey to the grave with the intention of arriving
safely in a pretty and well preserved body, but rather to skid in broadside
in a cloud of smoke, thoroughly used up, totally worn out, and loudly
proclaiming ‘Wow! What a Ride!’” - Hunter Thompson*

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Tue, Jan 20, 2015 at 9:12 PM, Ted Yu <yu...@gmail.com> wrote:

> bq. Is Apache Spark good as a general database
>
> I don't think Spark itself is a general database though there're
> connectors to various NoSQL databases, including HBase.
>
> bq. using their graph database features?
>
> Sure. Take a look at http://spark.apache.org/graphx/
>
> Cheers
>
> On Tue, Jan 20, 2015 at 9:02 PM, Alec Taylor <al...@gmail.com>
> wrote:
>
>> Small amounts in a one node cluster (at first).
>>
>> As it scales I'll be looking at running various O(nk) algorithms,
>> where n is the number of distinct users and k is the number of
>> overlapping features I want to consider.
>>
>> Is Apache Spark good as a general database, as well as for its fancier
>> features? E.g., considering I'm building a network, maybe using
>> its graph database features?
>>
>> On Wed, Jan 21, 2015 at 2:27 AM, Ted Yu <yu...@gmail.com> wrote:
>> > Apache Spark supports integration with HBase (which has REST API).
>> >
>> > What's the amount of data you want to store in this system ?
>> >
>> > Cheers
>> >
>> > On Tue, Jan 20, 2015 at 3:40 AM, Alec Taylor <al...@gmail.com>
>> wrote:
>> >>
>> >> I am architecting a platform incorporating: recommender systems,
>> >> information retrieval (ML), sequence mining, and Natural Language
>> >> Processing.
>> >>
>> >> Additionally I have the generic CRUD and authentication components,
>> >> with everything exposed RESTfully.
>> >>
>> >> For the storage layer(s), there are a few options which immediately
>> >> present themselves:
>> >>
>> >> Generic CRUD layer (high speed needed here, though I suppose I could
>> use
>> >> Redis…)
>> >>
>> >> - Hadoop with HBase, perhaps with Phoenix for an elastic loose-schema
>> >> SQL layer atop
>> >> - Apache Spark (perhaps piping to HDFS)… ¿maybe?
>> >> - MongoDB (or a similar document-store), a graph-database, or even
>> >> something like Postgres
>> >>
>> >> Analytics layer (to enable Big Data / Data-intensive computing
>> features)
>> >>
>> >> - Apache Spark
>> >> - Hadoop with MapReduce and/or utilising some other Apache /
>> >> non-Apache project with integration
>> >> - Disco (from Nokia)
>> >>
>> >> ________________________________
>> >>
>> >> Should I prefer one layer—e.g. on HDFS—over multiple disparate
>> >> layers? The advantage here is obvious, but I am certain there are
>> >> disadvantages. (And yes, I know there are various ways, automated and
>> >> manual, to push data from non HDFS-backed stores to HDFS.)
>> >>
>> >> Also, as a bonus answer, which stack would you recommend for this
>> >> user-network I'm building?
>> >
>> >
>>
>
>

Re: Low-latency queries, HDFS exclusively or should I go, e.g.: MongoDB?

Posted by Ted Yu <yu...@gmail.com>.
bq. Is Apache Spark good as a general database

I don't think Spark itself is a general database, though there are
connectors to various NoSQL databases, including HBase.

bq. using their graph database features?

Sure. Take a look at http://spark.apache.org/graphx/
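
For a sense of what graph features buy you on a user network: operations like connected components (which GraphX ships in distributed form) find friend clusters directly from the edge list. A single-machine Python sketch of the same idea, with invented data; a real GraphX version would run on a Spark cluster in Scala:

```python
def connected_components(edges):
    """Union-find over an edge list; returns {node: component_root}.
    GraphX exposes the same computation distributed over RDDs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    return {n: find(n) for n in parent}


# Invented mini user-network with two friend clusters:
comps = connected_components([("alice", "bob"), ("bob", "carol"),
                              ("dave", "erin")])
assert comps["alice"] == comps["carol"]
assert comps["alice"] != comps["dave"]
```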

Cheers

On Tue, Jan 20, 2015 at 9:02 PM, Alec Taylor <al...@gmail.com> wrote:

> Small amounts in a one node cluster (at first).
>
> As it scales I'll be looking at running various O(nk) algorithms,
> where n is the number of distinct users and k is the number of
> overlapping features I want to consider.
>
> Is Apache Spark good as a general database, as well as for its fancier
> features? E.g., considering I'm building a network, maybe using
> its graph database features?
>
> On Wed, Jan 21, 2015 at 2:27 AM, Ted Yu <yu...@gmail.com> wrote:
> > Apache Spark supports integration with HBase (which has REST API).
> >
> > What's the amount of data you want to store in this system ?
> >
> > Cheers
> >
> > On Tue, Jan 20, 2015 at 3:40 AM, Alec Taylor <al...@gmail.com>
> wrote:
> >>
> >> I am architecting a platform incorporating: recommender systems,
> >> information retrieval (ML), sequence mining, and Natural Language
> >> Processing.
> >>
> >> Additionally I have the generic CRUD and authentication components,
> >> with everything exposed RESTfully.
> >>
> >> For the storage layer(s), there are a few options which immediately
> >> present themselves:
> >>
> >> Generic CRUD layer (high speed needed here, though I suppose I could use
> >> Redis…)
> >>
> >> - Hadoop with HBase, perhaps with Phoenix for an elastic loose-schema
> >> SQL layer atop
> >> - Apache Spark (perhaps piping to HDFS)… ¿maybe?
> >> - MongoDB (or a similar document-store), a graph-database, or even
> >> something like Postgres
> >>
> >> Analytics layer (to enable Big Data / Data-intensive computing features)
> >>
> >> - Apache Spark
> >> - Hadoop with MapReduce and/or utilising some other Apache /
> >> non-Apache project with integration
> >> - Disco (from Nokia)
> >>
> >> ________________________________
> >>
> >> Should I prefer one layer—e.g. on HDFS—over multiple disparate
> >> layers? The advantage here is obvious, but I am certain there are
> >> disadvantages. (And yes, I know there are various ways, automated and
> >> manual, to push data from non HDFS-backed stores to HDFS.)
> >>
> >> Also, as a bonus answer, which stack would you recommend for this
> >> user-network I'm building?
> >
> >
>

Re: Low-latency queries, HDFS exclusively or should I go, e.g.: MongoDB?

Posted by Alec Taylor <al...@gmail.com>.
Small amounts in a one node cluster (at first).

As it scales I'll be looking at running various O(nk) algorithms,
where n is the number of distinct users and k is the number of
overlapping features I want to consider.
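
An O(nk) pass of that kind can be organised around an inverted index from feature to users, so each user/feature pair is touched a constant number of times while the index is built. A single-machine Python sketch with an invented data layout (note the pair-counting step afterwards additionally grows with how many users share each feature):

```python
from collections import defaultdict


def overlap_scores(user_features):
    """Count shared features per user pair. `user_features` maps each
    user to a set of feature labels; building the feature->users index
    below is the O(n*k) part."""
    index = defaultdict(list)
    for user, feats in user_features.items():
        for f in feats:
            index[f].append(user)
    scores = defaultdict(int)
    for users in index.values():
        for i in range(len(users)):
            for j in range(i + 1, len(users)):
                scores[frozenset((users[i], users[j]))] += 1
    return dict(scores)
```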

Is Apache Spark good as a general database, as well as for its fancier
features? E.g., considering I'm building a network, maybe using
its graph database features?

On Wed, Jan 21, 2015 at 2:27 AM, Ted Yu <yu...@gmail.com> wrote:
> Apache Spark supports integration with HBase (which has REST API).
>
> What's the amount of data you want to store in this system ?
>
> Cheers
>
> On Tue, Jan 20, 2015 at 3:40 AM, Alec Taylor <al...@gmail.com> wrote:
>>
>> I am architecting a platform incorporating: recommender systems,
>> information retrieval (ML), sequence mining, and Natural Language
>> Processing.
>>
>> Additionally I have the generic CRUD and authentication components,
>> with everything exposed RESTfully.
>>
>> For the storage layer(s), there are a few options which immediately
>> present themselves:
>>
>> Generic CRUD layer (high speed needed here, though I suppose I could use
>> Redis…)
>>
>> - Hadoop with HBase, perhaps with Phoenix for an elastic loose-schema
>> SQL layer atop
>> - Apache Spark (perhaps piping to HDFS)… ¿maybe?
>> - MongoDB (or a similar document-store), a graph-database, or even
>> something like Postgres
>>
>> Analytics layer (to enable Big Data / Data-intensive computing features)
>>
>> - Apache Spark
>> - Hadoop with MapReduce and/or utilising some other Apache /
>> non-Apache project with integration
>> - Disco (from Nokia)
>>
>> ________________________________
>>
>> Should I prefer one layer—e.g. on HDFS—over multiple disparate
>> layers? The advantage here is obvious, but I am certain there are
>> disadvantages. (And yes, I know there are various ways, automated and
>> manual, to push data from non HDFS-backed stores to HDFS.)
>>
>> Also, as a bonus answer, which stack would you recommend for this
>> user-network I'm building?
>
>

Re: Low-latency queries, HDFS exclusively or should I go, e.g.: MongoDB?

Posted by Alec Taylor <al...@gmail.com>.
Small amounts in a one node cluster (at first).

As it scales I'll be looking at running various O(nk) algorithms,
where n is the number of distinct users and k are the overlapping
features I want to consider.

Is Apache Spark good as a general database as well as it's more fancy
features? - E.g.: considering I'm building a network, maybe using
their graph database features?

On Wed, Jan 21, 2015 at 2:27 AM, Ted Yu <yu...@gmail.com> wrote:
> Apache Spark supports integration with HBase (which has REST API).
>
> What's the amount of data you want to store in this system ?
>
> Cheers
>
> On Tue, Jan 20, 2015 at 3:40 AM, Alec Taylor <al...@gmail.com> wrote:
>>
>> I am architecting a platform incorporating: recommender systems,
>> information retrieval (ML), sequence mining, and Natural Language
>> Processing.
>>
>> Additionally I have the generic CRUD and authentication components,
>> with everything exposed RESTfully.
>>
>> For the storage layer(s), there are a few options which immediately
>> present themselves:
>>
>> Generic CRUD layer (high speed needed here, though I suppose I could use
>> Redis…)
>>
>> - Hadoop with HBase, perhaps with Phoenix for an elastic loose-schema
>> SQL layer atop
>> - Apache Spark (perhaps piping to HDFS)… ¿maybe?
>> - MongoDB (or a similar document-store), a graph-database, or even
>> something like Postgres
>>
>> Analytics layer (to enable Big Data / Data-intensive computing features)
>>
>> - Apache Spark
>> - Hadoop with MapReduce and/or utilising some other Apache /
>> non-Apache project with integration
>> - Disco (from Nokia)
>>
>> ________________________________
>>
>> Should I prefer one layer—e.g.: on HDFS—over multiple disparate
>> layers? - The advantage here is obvious, but I am certain there are
>> disadvantages. (and yes, I know there are various ways; automated and
>> manual; to push data from non HDFS-backed stores to HDFS)
>>
>> Also, as a bonus answer, which stack would you recommend for this
>> user-network I'm building?
>
>

Re: Low-latency queries, HDFS exclusively or should I go, e.g.: MongoDB?

Posted by Ted Yu <yu...@gmail.com>.
Apache Spark supports integration with HBase (which has a REST API).
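
For the REST side mentioned above, HBase ships a REST gateway
(Stargate) where a single cell is addressed as
/<table>/<row>/<columnfamily:qualifier>. A minimal sketch using only
the standard library follows; the host, port, table, row, and column
names are illustrative assumptions, and the actual fetch needs a
running HBase REST server.

```python
# Hypothetical sketch of reading one cell through the HBase REST
# gateway (Stargate). Endpoint names below are placeholders.
import urllib.request

def hbase_cell_url(host, port, table, row, column):
    """Build the Stargate path: GET /<table>/<row>/<cf:qualifier>."""
    return f"http://{host}:{port}/{table}/{row}/{column}"

url = hbase_cell_url("localhost", 8080, "users", "alice", "info:email")
req = urllib.request.Request(url, headers={"Accept": "application/json"})
# urllib.request.urlopen(req)  # uncomment against a live REST server
print(url)
```

Spark jobs can then read or write the same tables through the HBase
client APIs while other services hit the REST gateway.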

What's the amount of data you want to store in this system ?

Cheers

On Tue, Jan 20, 2015 at 3:40 AM, Alec Taylor <al...@gmail.com> wrote:

> I am architecting a platform incorporating: recommender systems,
> information retrieval (ML), sequence mining, and Natural Language
> Processing.
>
> Additionally I have the generic CRUD and authentication components,
> with everything exposed RESTfully.
>
> For the storage layer(s), there are a few options which immediately
> present themselves:
>
> Generic CRUD layer (high speed needed here, though I suppose I could use
> Redis…)
>
> - Hadoop with HBase, perhaps with Phoenix for an elastic loose-schema
> SQL layer atop
> - Apache Spark (perhaps piping to HDFS)… ¿maybe?
> - MongoDB (or a similar document-store), a graph-database, or even
> something like Postgres
>
> Analytics layer (to enable Big Data / Data-intensive computing features)
>
> - Apache Spark
> - Hadoop with MapReduce and/or utilising some other Apache /
> non-Apache project with integration
> - Disco (from Nokia)
>
> ________________________________
>
> Should I prefer one layer—e.g.: on HDFS—over multiple disparate
> layers? - The advantage here is obvious, but I am certain there are
> disadvantages. (and yes, I know there are various ways; automated and
> manual; to push data from non HDFS-backed stores to HDFS)
>
> Also, as a bonus answer, which stack would you recommend for this
> user-network I'm building?
>
