Posted to mapreduce-user@hadoop.apache.org by mani kandan <ma...@gmail.com> on 2014/10/01 17:24:10 UTC

Planning to propose Hadoop initiative to company. Need some inputs please.

First off, I'm a mainframe developer, so I don't know much about Java or
Web technology.

I work at an insurance company as a software developer. We have a website
to get quotes, raise and view claims and such.

I came to know about Hadoop and that I might be able to
leverage it to the advantage of my company.

I'm planning to transition myself into Hadoop and propose a Hadoop
initiative to my company to leverage our web usage data.

Here are my questions:

1) How much web usage data will a typical website like ours collect on a
daily basis? (I know I can ask our IT department, but I would like to
gather some background idea before talking to them.)
2) What is the minimum size of data that is recommended for a Hadoop system?
3) How many clusters/nodes would I need to run a web usage analytics
system?
4) What are the ways for me to use our data? (One use case I'm thinking of
is to analyze the error messages log for each page on quote process to
redesign the UI. Is this possible?)
5) How long would it take for me to set up and start such a system?
6) What would be the cost to my company to maintain such a system?
7) What kind of savings can I expect in return?

I'm sorry if some/all of these questions are unanswerable. I just want to
discuss my thoughts and get an idea of what I can achieve by going
the way of Hadoop.

Thanks!
Mani

Re: Planning to propose Hadoop initiative to company. Need some inputs please.

Posted by Ted Yu <yu...@gmail.com>.
Adding hbase user.

On Wed, Oct 1, 2014 at 11:02 AM, Wilm Schumacher <wilm.schumacher@cawoom.com> wrote:

> Hi,
>
> first: I think HBase is what you are looking for. If I understand
> correctly, you want to show customers their data very quickly and
> let them manipulate it. So you need something like a data
> warehouse system. Thus, HBase is the method of choice for you (and I
> think for your kind of data, HBase is a better choice than Cassandra or
> MongoDB). But of course you need a running Hadoop system to run HBase,
> so it's not an either/or ;)
>
> (My answers are about HBase, as I think it's what you are looking for. If
> you are not interested, just ignore the following text. Sorry to all for
> writing about HBase on this list ;).)
>
> Am 01.10.2014 um 17:24 schrieb mani kandan:
> > 1) How much web usage data will a typical website like ours collect on a
> > daily basis? (I know I can ask our IT department, but I would like to
> > gather some background idea before talking to them.)
> Well, if you have the option to ask your IT department, you should do
> that, because everyone here would have to guess. You would have to
> explain in detail what you need to do before we could even guess. If you,
> e.g., want to track what each user has clicked, perhaps to show
> personalized ads, then you have to save more data. So you should ask
> the people who have the data right away rather than guessing.
>
> > 3) How many clusters/nodes would I need to run a web usage analytics
> > system?
> In the book "HBase in Action" there are some recommendations in the form
> of "case studies" (part IV "deploying hbase"). There are some thoughts on
> the number of nodes, and how to use them, depending on the size of your
> data.
>
> > 4) What are the ways for me to use our data? (One use case I'm thinking
> > of is to analyze the error messages log for each page on quote process
> > to redesign the UI. Is this possible?)
> Sure, and this should be very easy. I would pump the error log into an
> HBase table. That way you could read the messages directly from
> the HBase shell (if there are few enough of them). Or you could use Hive
> to query your log in a more SQL-like way and produce statistics very easily.
>
> > 5) How long would it take for me to set up and start such a system?
> For a novice doing it for the first time: for a standalone
> HBase system, perhaps 2 hours. For a complete distributed test cluster
> ... perhaps a day. For the real production system, with all security
> features ... a little longer ;).
>
> > I'm sorry if some/all of these questions are unanswerable. I just want
> > to discuss my thoughts, and get an idea of what things can I achieve by
> > going the way of Hadoop.
> Well, I think (but I could be wrong) that you think of Hadoop (or HBase)
> in a way that you can just change the "database backend" from "SQL" to
> "HBase/Hadoop" and everything would run right away. It will not be
> that easy. You would have to change the code of your web application in
> a very fundamental way. You have to rethink all the table designs etc.,
> so this could be more complicated than you think right now.
>
> However, HBase/Hadoop has some advantages which are very interesting for
> you. First, it is distributed, which enables your company to grow
> almost without limit, or to collect more data about your customers so you
> can get more information (and sell more stuff). And MapReduce is a
> wonderful tool for producing really fancy "statistics", which is very
> interesting for an insurance company. Your mathematical economists will
> REALLY love it ;).
>
> Hope this helped.
>
> best wishes
>
> Wilm
>
>
>
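
As a rough illustration of the error-log idea in answer 4 above: the sketch
below counts how often each error message appears on each page, which is the
kind of per-page statistic a Hive query or a MapReduce job would produce at
larger scale. It is plain Python, not HBase or Hive code. The log format
(timestamp, page, and message separated by tabs), the sample lines, and the
function names are all assumptions made up for this example; a real web log
will look different.

```python
from collections import Counter


def parse_line(line):
    """Split one log line of the assumed form 'timestamp<TAB>page<TAB>message'.

    Returns a (page, message) tuple, or None if the line is malformed.
    """
    parts = line.rstrip("\n").split("\t", 2)
    if len(parts) != 3:
        return None
    _timestamp, page, message = parts
    return page, message


def count_errors(lines):
    """Count occurrences of each (page, message) pair, skipping bad lines."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            counts[parsed] += 1
    return counts


# Hypothetical sample data, invented for illustration only.
sample_log = [
    "2014-10-01T17:24:10\t/quote/step1\tInvalid ZIP code",
    "2014-10-01T17:25:02\t/quote/step1\tInvalid ZIP code",
    "2014-10-01T17:26:40\t/quote/step2\tMissing date of birth",
    "garbled line",
]

if __name__ == "__main__":
    # Most frequent page/message combinations first.
    for (page, message), n in count_errors(sample_log).most_common():
        print(f"{n}\t{page}\t{message}")
```

Pages whose error counts stand out would be the first candidates for a UI
redesign. On a single day of logs a script like this may be all that is
needed; the distributed tools become interesting once the logs no longer fit
comfortably on one machine.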

Re: Planning to propose Hadoop initiative to company. Need some inputs please.

Posted by Demai Ni <ni...@gmail.com>.
hi,

glad to see another person moving from the mainframe world to the 'big' data
one. I was in the same boat a few years back after working on mainframes for
10+ years.

Wilm has covered the main points already. I'd just like to chime in a bit
from the mainframe side.

Website usage is a very good example of a big data workload as opposed to a
mainframe one, because the mainframe is very expensive: you pay for the
reliability that mission-critical workloads need. One approach is to look at
which applications currently run on the mainframe, or which ones your team is
considering implementing there. For a website usage case, the cost to
implement and run it on Hadoop/HBase would be only 1/10 of that on the
mainframe. And the mainframe is probably not able to scale up if the data
grows to terabytes.

Second, be careful: Hadoop is not for all your cases. I am pretty sure
that your IT department is handling some mission-critical workloads, like
payroll, employee info, customer payments, etc. Leave those workloads on
the mainframe, because 1) HBase/Hadoop are not designed for such RDBMS
workloads, and 2) moving from one database to another is way too much risk
unless the top boss forces you to do so... :-)

Demai



Re: Planning to propose Hadoop initiative to company. Need some inputs please.

Posted by Demai Ni <ni...@gmail.com>.
hi,

glad to see another person moving from mainframe world to the 'big' data
one. I was in the same boat a few years back after working on mainframe for
10+ years.

Wilm got to the pointers already. I'd like to just chime in a bit from
mainframe side.

The example of website usage is a very good one for bigdata comparing to
mainframe, as mainframe is very expensive to provide reliability for
mission-critical workload. One approach is to look at what the current
application running on mainframe or your guys are considering to implement
on mainframe. For a website usage case, the cost to implement and running
would be only 1/10 if on hadoop/hbase, comparing to mainframe. And
mainframe probably not able to scale up if the data goes to TB.

2nd, be careful that Hadoop is not for all your cases. I am pretty such
that your IT department is handling some mission-critical workloads, like
payroll, employee info, customer-payment, etc. Leaving those workloads on
mainframe. for 1) hbase/hadoop are not design for such RDMS workload; for
2) moving from one database to another is way too much risk unless the top
boss force you do so... :-)

Demai


On Wed, Oct 1, 2014 at 11:02 AM, Wilm Schumacher <wilm.schumacher@cawoom.com
> wrote:

> Hi,
>
> first: I think hbase is what you are looking for. If I understand
> correctly you want to show the customer his or her data very fast and
> let them manipulate their data. So you need something like a data
> warehouse system. Thus, hbase is the method of choice for you (and I
> think for your kind of data, hbase is a better choice than cassandra or
> mongoDB). But of course you need a running hadoop system to run a hbase.
> So it's not an either/or ;)
>
> (my answers are for hbase, as I think it's what you are looking for. If
> you are not interested, just ignore the following text. Sry @all by
> writing about hbase on this list ;).)
>
> Am 01.10.2014 um 17:24 schrieb mani kandan:
> > 1) How much web usage data will a typical website like ours collect on a
> > daily basis? (I know I can ask our IT department, but I would like to
> > gather some background idea before talking to them.)
> well, if you have the option to ask your IT department you should do
> that, because everyone here would have to guess. You would have to
> explain very detailed what you have to do to let us guess. If you e.g.
> want to track the user on what he or she has clicked, perhaps to make
> personalized ads, than you have to save more data. So, you should ask
> the persons who have the data right away without guessing.
>
> > 3) How many clusters/nodes would I need to ​run a web usage analytics
> > system?
> in the book "hbase in action" there are some recommendations for some
> "case studies" (part IV "deploying hbase"). There are some thoughts on
> the number of nodes, and how to use them, depending on the size of your
> data
>
> > 4) What are the ways for me to use our data? (One use case I'm thinking
> > of is to analyze the error messages log for each page on quote process
> > to redesign the UI. Is this possible?)
> sure. And this should be very easy. I would pump the error log into a
> hbase table. By this method you could read the messages directly from
> the hbase shell (if they are few enough). Or you could use hive to query
> your log a little more "sql like" and make statistics very easy.
>
> > 5) How long would it take for me to set up and start such a system?
> for a novice who have to do it for the first time: for the stand alone
> hbase system perhaps 2 hours. For a complete distributed test cluster
> ... perhaps a day. For the real producing system, with all security
> features ... a little longer ;).
>
> > I'm sorry if some/all of these questions are unanswerable. I just want
> > to discuss my thoughts, and get an idea of what things can I achieve by
> > going the way of Hadoop.
> well, I think, but I could err, that you think of hadoop (or hbase) in a
> way that you just can change the "database backend" from "SQL" to
> "hbase/hadoop" and everything would run right away. This will not be
> that easy. You would have to change the code of your web application in
> a very fundamental way. You have to rethink all the table designs etc.,
> so this could be more complicate than you think right know.
>
> However, hbase/hadoop hase some advantages which are very interesing for
> you. Well first, it is distributed, which enables your company to grow
> almost limitless, or to collect more data about your customers so you
> can get more informations (and sell more stuff). And map reduce is a
> wonderful tool for making real fancy "statistics", which is very
> interesting for an insurance company. Your mathematical economist will
> REALLY love it ;).
>
> Hope this helped.
>
> best wishes
>
> Wilm
>
>
>

Re: Planning to propose Hadoop initiative to company. Need some inputs please.

Posted by Demai Ni <ni...@gmail.com>.
hi,

glad to see another person moving from mainframe world to the 'big' data
one. I was in the same boat a few years back after working on mainframe for
10+ years.

Wilm got to the pointers already. I'd like to just chime in a bit from
mainframe side.

The example of website usage is a very good one for bigdata comparing to
mainframe, as mainframe is very expensive to provide reliability for
mission-critical workload. One approach is to look at what the current
application running on mainframe or your guys are considering to implement
on mainframe. For a website usage case, the cost to implement and running
would be only 1/10 if on hadoop/hbase, comparing to mainframe. And
mainframe probably not able to scale up if the data goes to TB.

2nd, be careful that Hadoop is not for all your cases. I am pretty such
that your IT department is handling some mission-critical workloads, like
payroll, employee info, customer-payment, etc. Leaving those workloads on
mainframe. for 1) hbase/hadoop are not design for such RDMS workload; for
2) moving from one database to another is way too much risk unless the top
boss force you do so... :-)

Demai


On Wed, Oct 1, 2014 at 11:02 AM, Wilm Schumacher <wilm.schumacher@cawoom.com
> wrote:

> Hi,
>
> first: I think hbase is what you are looking for. If I understand
> correctly you want to show the customer his or her data very fast and
> let them manipulate their data. So you need something like a data
> warehouse system. Thus, hbase is the method of choice for you (and I
> think for your kind of data, hbase is a better choice than cassandra or
> mongoDB). But of course you need a running hadoop system to run a hbase.
> So it's not an either/or ;)
>
> (my answers are for hbase, as I think it's what you are looking for. If
> you are not interested, just ignore the following text. Sry @all by
> writing about hbase on this list ;).)
>
> Am 01.10.2014 um 17:24 schrieb mani kandan:
> > 1) How much web usage data will a typical website like ours collect on a
> > daily basis? (I know I can ask our IT department, but I would like to
> > gather some background idea before talking to them.)
> well, if you have the option to ask your IT department you should do
> that, because everyone here would have to guess. You would have to
> explain very detailed what you have to do to let us guess. If you e.g.
> want to track the user on what he or she has clicked, perhaps to make
> personalized ads, than you have to save more data. So, you should ask
> the persons who have the data right away without guessing.
>
> > 3) How many clusters/nodes would I need to ​run a web usage analytics
> > system?
> in the book "hbase in action" there are some recommendations for some
> "case studies" (part IV "deploying hbase"). There are some thoughts on
> the number of nodes, and how to use them, depending on the size of your
> data
>
> > 4) What are the ways for me to use our data? (One use case I'm thinking
> > of is to analyze the error messages log for each page on quote process
> > to redesign the UI. Is this possible?)
> sure. And this should be very easy. I would pump the error log into a
> hbase table. By this method you could read the messages directly from
> the hbase shell (if they are few enough). Or you could use hive to query
> your log a little more "sql like" and make statistics very easy.
>
> > 5) How long would it take for me to set up and start such a system?
> for a novice who have to do it for the first time: for the stand alone
> hbase system perhaps 2 hours. For a complete distributed test cluster
> ... perhaps a day. For the real producing system, with all security
> features ... a little longer ;).
>
> > I'm sorry if some/all of these questions are unanswerable. I just want
> > to discuss my thoughts, and get an idea of what things can I achieve by
> > going the way of Hadoop.
> well, I think, but I could err, that you think of hadoop (or hbase) in a
> way that you just can change the "database backend" from "SQL" to
> "hbase/hadoop" and everything would run right away. This will not be
> that easy. You would have to change the code of your web application in
> a very fundamental way. You have to rethink all the table designs etc.,
> so this could be more complicate than you think right know.
>
> However, hbase/hadoop hase some advantages which are very interesing for
> you. Well first, it is distributed, which enables your company to grow
> almost limitless, or to collect more data about your customers so you
> can get more informations (and sell more stuff). And map reduce is a
> wonderful tool for making real fancy "statistics", which is very
> interesting for an insurance company. Your mathematical economist will
> REALLY love it ;).
>
> Hope this helped.
>
> best wishes
>
> Wilm
>
>
>

Re: Planning to propose Hadoop initiative to company. Need some inputs please.

Posted by Ted Yu <yu...@gmail.com>.
Adding hbase user.

On Wed, Oct 1, 2014 at 11:02 AM, Wilm Schumacher <wilm.schumacher@cawoom.com
> wrote:

> Hi,
>
> first: I think hbase is what you are looking for. If I understand
> correctly you want to show the customer his or her data very fast and
> let them manipulate their data. So you need something like a data
> warehouse system. Thus, hbase is the method of choice for you (and I
> think for your kind of data, hbase is a better choice than cassandra or
> mongoDB). But of course you need a running hadoop system to run a hbase.
> So it's not an either/or ;)
>
> (my answers are for hbase, as I think it's what you are looking for. If
> you are not interested, just ignore the following text. Sry @all by
> writing about hbase on this list ;).)
>
> Am 01.10.2014 um 17:24 schrieb mani kandan:
> > 1) How much web usage data will a typical website like ours collect on a
> > daily basis? (I know I can ask our IT department, but I would like to
> > gather some background idea before talking to them.)
> well, if you have the option to ask your IT department you should do
> that, because everyone here would have to guess. You would have to
> explain very detailed what you have to do to let us guess. If you e.g.
> want to track the user on what he or she has clicked, perhaps to make
> personalized ads, than you have to save more data. So, you should ask
> the persons who have the data right away without guessing.
>
> > 3) How many clusters/nodes would I need to ​run a web usage analytics
> > system?
> in the book "hbase in action" there are some recommendations for some
> "case studies" (part IV "deploying hbase"). There are some thoughts on
> the number of nodes, and how to use them, depending on the size of your
> data
>
> > 4) What are the ways for me to use our data? (One use case I'm thinking
> > of is to analyze the error messages log for each page on quote process
> > to redesign the UI. Is this possible?)
> sure. And this should be very easy. I would pump the error log into a
> hbase table. By this method you could read the messages directly from
> the hbase shell (if they are few enough). Or you could use hive to query
> your log a little more "sql like" and make statistics very easy.
>
> > 5) How long would it take for me to set up and start such a system?
> for a novice who have to do it for the first time: for the stand alone
> hbase system perhaps 2 hours. For a complete distributed test cluster
> ... perhaps a day. For the real producing system, with all security
> features ... a little longer ;).
>
> > I'm sorry if some/all of these questions are unanswerable. I just want
> > to discuss my thoughts, and get an idea of what things can I achieve by
> > going the way of Hadoop.
> well, I think, but I could err, that you think of hadoop (or hbase) in a
> way that you just can change the "database backend" from "SQL" to
> "hbase/hadoop" and everything would run right away. This will not be
> that easy. You would have to change the code of your web application in
> a very fundamental way. You have to rethink all the table designs etc.,
> so this could be more complicate than you think right know.
>
> However, hbase/hadoop hase some advantages which are very interesing for
> you. Well first, it is distributed, which enables your company to grow
> almost limitless, or to collect more data about your customers so you
> can get more informations (and sell more stuff). And map reduce is a
> wonderful tool for making real fancy "statistics", which is very
> interesting for an insurance company. Your mathematical economist will
> REALLY love it ;).
>
> Hope this helped.
>
> best wishes
>
> Wilm
>
>
>

Re: Planning to propose Hadoop initiative to company. Need some inputs please.

Posted by Demai Ni <ni...@gmail.com>.
hi,

glad to see another person moving from mainframe world to the 'big' data
one. I was in the same boat a few years back after working on mainframe for
10+ years.

Wilm got to the pointers already. I'd like to just chime in a bit from
mainframe side.

The example of website usage is a very good one for bigdata comparing to
mainframe, as mainframe is very expensive to provide reliability for
mission-critical workload. One approach is to look at what the current
application running on mainframe or your guys are considering to implement
on mainframe. For a website usage case, the cost to implement and running
would be only 1/10 if on hadoop/hbase, comparing to mainframe. And
mainframe probably not able to scale up if the data goes to TB.

2nd, be careful that Hadoop is not for all your cases. I am pretty such
that your IT department is handling some mission-critical workloads, like
payroll, employee info, customer-payment, etc. Leaving those workloads on
mainframe. for 1) hbase/hadoop are not design for such RDMS workload; for
2) moving from one database to another is way too much risk unless the top
boss force you do so... :-)

Demai


On Wed, Oct 1, 2014 at 11:02 AM, Wilm Schumacher <wilm.schumacher@cawoom.com
> wrote:

> Hi,
>
> first: I think hbase is what you are looking for. If I understand
> correctly you want to show the customer his or her data very fast and
> let them manipulate their data. So you need something like a data
> warehouse system. Thus, hbase is the method of choice for you (and I
> think for your kind of data, hbase is a better choice than cassandra or
> mongoDB). But of course you need a running hadoop system to run a hbase.
> So it's not an either/or ;)
>
> (my answers are for hbase, as I think it's what you are looking for. If
> you are not interested, just ignore the following text. Sry @all by
> writing about hbase on this list ;).)
>
> Am 01.10.2014 um 17:24 schrieb mani kandan:
> > 1) How much web usage data will a typical website like ours collect on a
> > daily basis? (I know I can ask our IT department, but I would like to
> > gather some background idea before talking to them.)

Re: Planning to propose Hadoop initiative to company. Need some inputs please.

Posted by Wilm Schumacher <wi...@cawoom.com>.
Hi,

first: I think hbase is what you are looking for. If I understand
correctly, you want to show the customer his or her data very fast and
let them manipulate their data. So you need something like a data
warehouse system. Thus, hbase is the method of choice for you (and I
think for your kind of data, hbase is a better choice than cassandra or
mongoDB). But of course you need a running hadoop system to run hbase.
So it's not an either/or ;)

(My answers are about hbase, as I think it's what you are looking for.
If you are not interested, just ignore the following text. Sorry to all
for writing about hbase on this list ;).)

On 01.10.2014 at 17:24, mani kandan wrote:
> 1) How much web usage data will a typical website like ours collect on a
> daily basis? (I know I can ask our IT department, but I would like to
> gather some background idea before talking to them.)
well, if you have the option to ask your IT department, you should do
that, because everyone here would have to guess. You would have to
explain in great detail what you have to do to let us guess. If you
e.g. want to track what each user has clicked, perhaps to show
personalized ads, then you have to save more data. So you should ask
the people who have the data right away instead of guessing.
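
To make the "more tracking means more data" point concrete, here is a
back-of-the-envelope sketch. Every number in it is a made-up placeholder
(page views per day, events per view, bytes per event); nothing here
comes from a real site, so swap in the figures your IT department gives
you.

```python
def daily_log_bytes(page_views_per_day, events_per_view, bytes_per_event):
    """Rough raw size of one day's usage log, in bytes."""
    return page_views_per_day * events_per_view * bytes_per_event


# Placeholder inputs only; replace them with real figures from IT.
basic = daily_log_bytes(100_000, 1, 500)    # one log line per page view
clicks = daily_log_bytes(100_000, 20, 500)  # every click is logged as well

print(f"page views only:     {basic / 1e6:.0f} MB/day")
print(f"with click tracking: {clicks / 1e6:.0f} MB/day")
```

The only point of the sketch is that the volume scales linearly with how
much you decide to record per visit.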

> 3) How many clusters/nodes would I need to run a web usage analytics
> system?
in the book "hbase in action" there are some recommendations for some
"case studies" (part IV "deploying hbase"). There are some thoughts on
the number of nodes, and how to use them, depending on the size of your data.

> 4) What are the ways for me to use our data? (One use case I'm thinking
> of is to analyze the error messages log for each page on quote process
> to redesign the UI. Is this possible?)
sure. And this should be very easy. I would pump the error log into an
hbase table. That way you could read the messages directly from the
hbase shell (if there are few enough of them). Or you could use hive to
query your log in a more "SQL-like" way and produce statistics very easily.
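
As a rough illustration of the kind of statistic meant here, the sketch
below counts errors per page and error code, which is what a GROUP BY
query in hive would give you. It is plain Python, not hbase or hive
code, and the log line layout (timestamp, page, error code, message,
separated by spaces) is an assumption, as are the sample lines.

```python
from collections import Counter


def parse_line(line):
    """Split one log line of the assumed form 'timestamp page code message'."""
    timestamp, page, code, message = line.rstrip("\n").split(" ", 3)
    return {"timestamp": timestamp, "page": page, "code": code, "message": message}


def errors_per_page(lines):
    """Count occurrences of each (page, error code) pair."""
    counts = Counter()
    for line in lines:
        if line.strip():  # skip empty lines
            record = parse_line(line)
            counts[(record["page"], record["code"])] += 1
    return counts


# Hypothetical sample data, only to show the shape of the result.
sample = [
    "2014-10-01T17:24:10 /quote/step2 E1042 Invalid postal code",
    "2014-10-01T17:25:31 /quote/step2 E1042 Invalid postal code",
    "2014-10-01T17:26:02 /quote/step3 E2001 Date of birth missing",
]

for (page, code), n in errors_per_page(sample).most_common():
    print(page, code, n)
```

A page that shows up at the top of such a list again and again is a
natural candidate for a UI redesign.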

> 5) How long would it take for me to set up and start such a system?
for a novice who has to do it for the first time: for a standalone
hbase system perhaps 2 hours. For a complete distributed test cluster
... perhaps a day. For the real production system, with all security
features ... a little longer ;).

> I'm sorry if some/all of these questions are unanswerable. I just want
> to discuss my thoughts, and get an idea of what things can I achieve by
> going the way of Hadoop.
well, I think (but I could be wrong) that you think of hadoop (or
hbase) as something where you can just change the "database backend"
from "SQL" to "hbase/hadoop" and everything would run right away. It
will not be that easy. You would have to change the code of your web
application in a very fundamental way. You have to rethink all the
table designs etc., so this could be more complicated than you think
right now.
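
One example of what "rethinking the table design" can mean: hbase keeps
rows sorted by row key and reads them by key or key range, so the key
has to be built around the question you want to ask. The sketch below
is only an illustration in plain Python (no hbase client involved). The
key layout, page name plus a reversed timestamp so that the newest entry
for a page sorts first, is one common pattern and an assumption on my
part, not a recommendation for your schema.

```python
MAX_MILLIS = 2**63 - 1  # largest signed 64-bit value, used to reverse timestamps


def row_key(page, epoch_millis):
    """Build a key of the form b'<page>|<reversed timestamp>'.

    The reversed timestamp is zero-padded to 19 digits so that byte-wise
    ordering matches numeric ordering; newer events get smaller values.
    """
    reversed_ts = MAX_MILLIS - epoch_millis
    return f"{page}|{reversed_ts:019d}".encode("utf-8")


# Hypothetical events for one page, given as milliseconds since the epoch.
keys = [row_key("/quote/step2", ts) for ts in (1000, 3000, 2000)]

# Sorting the keys mimics the order in which a scan would return the rows.
for key in sorted(keys):
    print(key)
```

In an SQL database you would just add an ORDER BY; here the ordering is
decided when you design the key, which is why the design work comes first.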

However, hbase/hadoop has some advantages which are very interesting for
you. First, it is distributed, which enables your company to grow almost
without limit, or to collect more data about your customers so you can
get more information (and sell more stuff). And map reduce is a
wonderful tool for making really fancy "statistics", which is very
interesting for an insurance company. Your mathematical economists will
REALLY love it ;).
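
For readers who have not seen map reduce before, here is a toy version
of the idea in plain Python. It runs in a single process and does not
use the hadoop API at all; the record layout (page, response time in
milliseconds) and the sample values are made up for the example. The
map step emits (key, value) pairs, the pairs are grouped by key, and the
reduce step turns each group into one result, here the mean per page.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (page, (response_millis, 1)) for one input record."""
    page, millis = record
    yield page, (millis, 1)


def reduce_phase(page, values):
    """Combine all (sum, count) pairs of one page into a mean."""
    total = sum(v[0] for v in values)
    count = sum(v[1] for v in values)
    return page, total / count


def run(records):
    """Single-process stand-in for the map, shuffle and reduce steps."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)  # "shuffle": group values by key
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Hypothetical (page, response time in ms) records.
records = [
    ("/quote/step1", 120),
    ("/quote/step2", 300),
    ("/quote/step1", 80),
    ("/quote/step2", 500),
]

print(run(records))
```

Emitting (sum, count) pairs instead of plain values is a deliberate
choice: partial results from different machines can be added together
before the final division, which is what lets the work be spread across
many nodes.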

Hope this helped.

best wishes

Wilm


