Posted to user@mahout.apache.org by hdev ml <hd...@gmail.com> on 2010/08/31 23:21:54 UTC

Question about data warehousing and mining through Mahout

Hi all,

I am currently trying to find out what frameworks/software/product will
support data warehousing/data mining the best.

We get around 1.5+ TB of log data every month and we want to do some
reporting on top of that and later on move on to data mining.

I am a total newbie in this world, coming from an RDBMS background, and wanted
to get your opinion on what is the best approach to take in this regard.

I looked around the Hadoop movement and the corresponding subprojects.

I found that Hive, as a framework, can support and scale to data this large.

So the first phase, reporting, can be done using Hive. But can I reuse the
same data for data mining through the Mahout project?

Can somebody please guide me regarding this?

Thanks for your help.

HDev.

Re: Question about data warehousing and mining through Mahout

Posted by Ted Dunning <te...@gmail.com>.

Yes.

Mahout can support this.

On Tue, Aug 31, 2010 at 2:55 PM, hdev ml <hd...@gmail.com> wrote:

> But we also want to mine this data to get some predictive capabilities like
> what is the likelihood that the user will use the same device again or if we
> get sales/marketing data (on the roadmap for future), we want to possibly
> predict which region to put more marketing/sales efforts. What is the
> pattern for growth of user base, in which geographical regions etc. What is
> the pattern of user requests failing and a number of requirements like these
> from the business.
>
> Does that fit the data mining bill? or am I looking in the wrong place.
>

Re: Question about data warehousing and mining through Mahout

Posted by hdev ml <hd...@gmail.com>.

Thanks Lance. Will take a look at KNIME also.

On Wed, Sep 1, 2010 at 7:37 PM, Lance Norskog <go...@gmail.com> wrote:

> The KNIME program ("nime") from KNime.org is a great way to get your
> feet wet in data mining. It has some machine learning stuff as well.
> It lets you poke around your data and prototype ways to tease out
> facts. It has a bunch of machine learning tools and just plain
> data-shuffling tools. It's a visual graph programming language, so buy
> a very big monitor. And it wraps Weka and R.
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: Question about data warehousing and mining through Mahout

Posted by Lance Norskog <go...@gmail.com>.

The KNIME program ("nime") from KNime.org is a great way to get your
feet wet in data mining. It has some machine learning stuff as well.
It lets you poke around your data and prototype ways to tease out
facts. It has a bunch of machine learning tools and just plain
data-shuffling tools. It's a visual graph programming language, so buy
a very big monitor. And it wraps Weka and R.

On Wed, Sep 1, 2010 at 10:48 AM, hdev ml <hd...@gmail.com> wrote:
> I agree with you that there is preparation needed for Mahout processing.
>
> I was just trying to save on that effort by re-using the data in hive
> instead of double processing it.
>
> I may have some more questions when I actually dive into the mining part.
> (possibly a couple of months down the line).
>
> Thanks for your inputs.



-- 
Lance Norskog
goksron@gmail.com

Re: Question about data warehousing and mining through Mahout

Posted by hdev ml <hd...@gmail.com>.

I agree with you that there is preparation needed for Mahout processing.

I was just trying to save on that effort by re-using the data in hive
instead of double processing it.

I may have some more questions when I actually dive into the mining part.
(possibly a couple of months down the line).

Thanks for your inputs.

On Wed, Sep 1, 2010 at 12:58 AM, Sean Owen <sr...@gmail.com> wrote:

> Hive does something fairly unrelated to Mahout. It's an indexing and
> query system. Both might start from the same source data, but to do
> different things. There is no common format, no. Mahout generally
> operates on text files or "Vectors" in SequenceFiles. So there's some
> translation there at least.
>
> But I think a message here is that there's more preparation and
> thought necessary to start data mining. It's not like you point a data
> mining tool at some data and answers start flowing automatically.
> You'd have to be deliberately extracting and preparing data anyhow.

Re: Question about data warehousing and mining through Mahout

Posted by Sean Owen <sr...@gmail.com>.

Hive does something fairly unrelated to Mahout. It's an indexing and
query system. Both might start from the same source data, but to do
different things. There is no common format, no. Mahout generally
operates on text files or "Vectors" in SequenceFiles. So there's some
translation there at least.

But I think a message here is that there's more preparation and
thought necessary to start data mining. It's not like you point a data
mining tool at some data and answers start flowing automatically.
You'd have to be deliberately extracting and preparing data anyhow.
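
To make the translation step concrete, here is a minimal sketch in Python,
not anything Mahout itself provides. It assumes rows of categorical fields
(say device and region, already exported from the warehouse) and one-hot
encodes them into numeric vectors. build_vocabulary and encode are
hypothetical helper names, and the exact on-disk format a given Mahout job
expects would still need checking.

```python
def build_vocabulary(rows):
    """Assign a column index to every (field position, value) pair seen."""
    vocab = {}
    for row in rows:
        for pos, value in enumerate(row):
            vocab.setdefault((pos, value), len(vocab))
    return vocab


def encode(row, vocab):
    """One-hot encode a row as a dense list of 0.0/1.0 values."""
    vec = [0.0] * len(vocab)
    for pos, value in enumerate(row):
        idx = vocab.get((pos, value))
        if idx is not None:  # values never seen while building vocab are skipped
            vec[idx] = 1.0
    return vec


# Hypothetical sample rows: (device, region)
rows = [("iphone", "US"), ("android", "US"), ("iphone", "FR")]
vocab = build_vocabulary(rows)
vectors = [encode(r, vocab) for r in rows]
```

Each distinct value in each field gets its own column, so the vectors grow
with the number of distinct values; that is one reason the extraction has to
be deliberate rather than automatic.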

On Tue, Aug 31, 2010 at 11:41 PM, hdev ml <hd...@gmail.com> wrote:
> Thanks Sean for the answers. Thanks to Ted for the validation.
>
> Now my question is, since I want to do both reporting on large data and
> data warehousing, let's assume I choose Hive for that.
>
> Now can Mahout integrate with Hive to make use of this data for learning,
> mining etc.? Or do I have to export the Hive data into text files which can
> be hosted by Hadoop/HDFS, which later on Mahout can use for data mining?
>
> In short, can the data warehousing part be done by Hive and then can the data
> mining part be done by Mahout on this Hive data?
>
> -H

Re: Question about data warehousing and mining through Mahout

Posted by hdev ml <hd...@gmail.com>.

Thanks Ted. That again validates my path.

Thanks Ted, Chris and Sean for your valuable inputs.

Community Rocks!!!! Off the topic - A few years back, I was on the JavaCC
mailing list. There were 2 guys - one from New Zealand and the other one
from France - replying to my problems. I was literally getting
round-the-clock support. More power to the community!!!!

-H

On Tue, Aug 31, 2010 at 3:48 PM, Ted Dunning <te...@gmail.com> wrote:

> For categorization, there are several different answers to the integration
> problem, but text export of a sampled and curated data file is pretty
> typical as a data path.
>
> The on-line sequential classifiers are a bit more flexible and would allow
> different input formats at the cost of coding on your part.
>
> Keep in mind that Hive is keeping flat files in HDFS anyway.  Adding an
> additional format so that you don't have to copy a Hive output file one
> extra time isn't hard, but neither is it hard to have Hive pop out
> something like comma separated values.

Re: Question about data warehousing and mining through Mahout

Posted by Ted Dunning <te...@gmail.com>.

For categorization, there are several different answers to the integration
problem, but text export of a sampled and curated data file is pretty
typical as a data path.

The on-line sequential classifiers are a bit more flexible and would allow
different input formats at the cost of coding on your part.

Keep in mind that Hive is keeping flat files in HDFS anyway.  Adding an
additional format so that you don't have to copy a Hive output file one
extra time isn't hard, but neither is it hard to have Hive pop out
something like comma separated values.
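
For the comma-separated route, a minimal Python sketch might look like the
following. It assumes the Hive table is stored as plain text with Hive's
default Ctrl-A field delimiter and \N for NULL; a table declared with other
delimiters or a binary format would need different handling.
hive_text_to_csv is a hypothetical helper, not part of Hive or Mahout.

```python
import csv
import io

HIVE_FIELD_DELIMITER = "\x01"  # Hive's default field separator for text tables (Ctrl-A)
HIVE_NULL = "\\N"              # Hive's default text representation of NULL


def hive_text_to_csv(lines, out):
    """Rewrite Ctrl-A-delimited rows as CSV rows; returns the number of rows written."""
    writer = csv.writer(out)
    count = 0
    for line in lines:
        fields = line.rstrip("\n").split(HIVE_FIELD_DELIMITER)
        # Map Hive NULLs to empty CSV cells (an assumption about what downstream wants).
        writer.writerow(["" if f == HIVE_NULL else f for f in fields])
        count += 1
    return count


# Two made-up rows standing in for lines read from a Hive data file.
sample = ["u1\x01iphone\x01US\n", "u2\x01\\N\x01FR\n"]
buf = io.StringIO()
rows_written = hive_text_to_csv(sample, buf)
csv_text = buf.getvalue()
```

The csv module takes care of quoting fields that themselves contain commas,
which a plain string replace of the delimiter would not.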

On Tue, Aug 31, 2010 at 3:41 PM, hdev ml <hd...@gmail.com> wrote:

> Now can Mahout integrate with Hive to make use of this data for learning,
> mining etc.? Or do I have to export the Hive data into text files which can
> be hosted by Hadoop/HDFS, which later on Mahout can use for data mining?
>
> In short, can the data warehousing part be done by Hive and then can the data
> mining part be done by Mahout on this Hive data?
>

Re: Question about data warehousing and mining through Mahout

Posted by hdev ml <hd...@gmail.com>.

Thanks Sean for the answers. Thanks to Ted for the validation.

Now my question is, since I want to do both reporting on large data and
data warehousing, let's assume I choose Hive for that.

Now can Mahout integrate with Hive to make use of this data for learning,
mining etc.? Or do I have to export the Hive data into text files which can
be hosted by Hadoop/HDFS, which later on Mahout can use for data mining?

In short, can the data warehousing part be done by Hive and then can the data
mining part be done by Mahout on this Hive data?

-H

On Tue, Aug 31, 2010 at 3:03 PM, Sean Owen <sr...@gmail.com> wrote:

> On Tue, Aug 31, 2010 at 10:55 PM, hdev ml <hd...@gmail.com> wrote:
> > Per my understanding of hive, we can do some statistical reporting, like
> > frequency of user sessions, which geographical region, which device he is
> > using the most etc.
>
> Yes that's about what Hive is good for, if you're looking for some
> open-source libraries along those lines.
>
> >
> > But we also want to mine this data to get some predictive capabilities like
> > what is the likelihood that the user will use the same device again or if we
> > get sales/marketing data (on the roadmap for future), we want to possibly
> > predict which region to put more marketing/sales efforts. What is the
> > pattern for growth of user base, in which geographical regions etc. What is
> > the pattern of user requests failing and a number of requirements like these
> > from the business.
>
> This is pretty broad but I can try to give you the names of problems
> this sounds like, to guide your search.
>
> Predicting user usage of device sounds like a classification problem,
> like developing a probabilistic model of behavior.
>
> Deciding where to put marketing dollars sounds like a business
> problem, not machine learning. I don't think a computer can tell you
> that. Some techniques might help you identify trends in sales, but
> this is simple regression, not really machine learning.
>
> Looking for patterns in failure sounds a bit like frequent pattern
> mining -- trying to find events that go together unusually often.
>

Re: Question about data warehousing and mining through Mahout

Posted by Ted Dunning <te...@gmail.com>.

The Manning book on Mahout will have caught up with you by then and will
have a killer section on building classification models to go with the
recommendations and clustering sections that are already available.

On Tue, Aug 31, 2010 at 7:00 PM, hdev ml <hd...@gmail.com> wrote:

> I think after a couple of months when things settle down with Hadoop and
> Hive, I will take the Mahout course.
>

Re: Question about data warehousing and mining through Mahout

Posted by hdev ml <hd...@gmail.com>.

I see what you are saying. Yes, I am going in the Hive direction for now.
Currently, installing/configuring the packages.

I think after a couple of months when things settle down with Hadoop and
Hive, I will take the Mahout course.

Thanks

-H

On Tue, Aug 31, 2010 at 5:34 PM, Ted Dunning <te...@gmail.com> wrote:

> I think that Chris was actually recommending stuff that is too simple to
> call data-mining.
>
> Basically this stuff is simpler than any machine learning algorithm so
> there isn't anything really to write.
>
> An example for recommendations is to simply recommend the most popular
> items to everybody, possibly with a bit of dithering so it doesn't look so
> static.  This *is* actually a recommendation algorithm just like random
> selection is.  Both of these provide interesting baseline levels for
> clicks and engagement.  You *might* want to use Mahout to implement these,
> but it is probably better to get the rest of the framework in place first.

Re: Question about data warehousing and mining through Mahout

Posted by Ted Dunning <te...@gmail.com>.

I think that Chris was actually recommending stuff that is too simple to
call data-mining.

Basically this stuff is simpler than any machine learning algorithm so
there isn't anything really to write.

An example for recommendations is to simply recommend the most popular
items to everybody, possibly with a bit of dithering so it doesn't look so
static.  This *is* actually a recommendation algorithm just like random
selection is.  Both of these provide interesting baseline levels for
clicks and engagement.  You *might* want to use Mahout to implement these,
but it is probably better to get the rest of the framework in place first.
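
A minimal Python sketch of that baseline, under assumptions: events is a
hypothetical list of item ids, one per click, and the dithering shown
(Gaussian noise added to the log of the rank, then a re-sort) is just one
possible way to shuffle the list a little. The function names are made up
for illustration and nothing here uses Mahout.

```python
import math
import random
from collections import Counter


def most_popular(events, n):
    """Return the n most frequently seen item ids, most popular first."""
    return [item for item, _ in Counter(events).most_common(n)]


def dither(ranked, epsilon, rng):
    """Reorder a ranked list by adding Gaussian noise to log(rank) and re-sorting."""
    noisy = [(math.log(rank) + rng.gauss(0.0, epsilon), item)
             for rank, item in enumerate(ranked, start=1)]
    return [item for _, item in sorted(noisy)]


# Made-up click stream: "a" seen 4 times, "b" 3, "c" 2, "d" once.
events = list("aaaabbbccd")
top = most_popular(events, 3)
shown = dither(top, epsilon=0.5, rng=random.Random(42))
```

With epsilon set to 0 the list comes back in plain popularity order; larger
values let lower-ranked items surface more often, so the page looks less
static.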

On Tue, Aug 31, 2010 at 4:03 PM, hdev ml <hd...@gmail.com> wrote:

> 3. Hhhmm..That seems like a very good suggestion. I am not averse to the
> idea of writing my own implementation of mining algorithms. I am just
> worried about their accuracy and stability. So summary is basically do the
> transformation and statistical part first. When it comes to data mining,
> write your own algorithms or use Mahout (if at all hive integration is
> possible, or maybe reuse the raw text files or output dump of Hive tables)
>

Re: Question about data warehousing and mining through Mahout

Posted by hdev ml <hd...@gmail.com>.

Thanks Chris for the answers.

1. The data is just going to grow. This 1.5TB of data is just from one
module. There are other modules, which may have a similar kind of data in
their logs. After discarding the data we don't need, 3.0TB becomes 1.5TB. It
is still huge. Since this is just a month's data, and we would want to make
use of at least the past 6-12 months, the data size goes into the 10TB-20TB
area. So I am guessing Hadoop is the right answer. I am just not sure which
sub-project to use in this case.

2. When you say querying with Hive, note that I want to use the same Hive
data for future data mining, so my question was -- can that be done with
Mahout integrating with the Hive layer, instead of the Hadoop layer directly?
Or maybe we can use the Hive data directly, if I can reverse engineer the
data format that Hive uses internally. Hopefully it does not have compressed
data.

3. Hhhmm..That seems like a very good suggestion. I am not averse to the
idea of writing my own implementation of mining algorithms. I am just
worried about their accuracy and stability. So summary is basically do the
transformation and statistical part first. When it comes to data mining,
write your own algorithms or use Mahout (if at all hive integration is
possible, or maybe reuse the raw text files or output dump of Hive tables)

On Tue, Aug 31, 2010 at 3:38 PM, Chris Bates <
christopher.andrew.bates@gmail.com> wrote:

> From my experience (merging machine learning with business goals), I'll
> offer a few pieces of advice that may help guide you.
>
> 1.  First determine what data you have (and how much of it), and how you
> want to store/ query it.
> -  If you have 1.5 TB of log data, you are in the realm of Hadoop.  If you
> find however that you only need to operate on a subset of this data
> (~100 MB), you may just want to stick with loading it up in memory and using
> something like Octave, R, Matlab, Python to run algorithms against it.
> Probably the easiest.  In fact, I'd say do that first before you go
> whole-hog on the distributed system.
>
> 2.  Second, come up with questions about your data that you want to answer
> (or have someone give those questions to you).  Make those questions as
> specific as possible.
> - The type of question will tell you what tool you need to use. Sometimes
> this means querying with Hive (i.e. How many unique users viewed this type
> of page?) if the data is too much/too sparse to put into MySQL.  Sometimes
> this means just writing a Python/Ruby script with a few regexes and hunting
> through the data.  If the questions are predictive in nature, you may need
> to use some machine learning tools.
>
> 3.  Simple techniques often will get you 80% of the way to your goal.
>  Machine Learning gets you the other 20% (or sometimes only 5%!).
> - I would say to use machine learning once you know the domain of the
> problem you're trying to solve extremely well, because it will take effort
> and you should be immediately skeptical of any result you get back.  It's a
> black box that you should really know the inner workings of, so my advice
> is to exhaust all non-machine learning options first, then go for that
> extra accuracy if it's warranted.
>
> Good luck!

Re: Question about data warehousing and mining through Mahout

Posted by Chris Bates <ch...@gmail.com>.

From my experience (merging machine learning with business goals), I'll
offer a few pieces of advice that may help guide you.

1.  First determine what data you have (and how much of it), and how you
want to store/ query it.
-  If you have 1.5 TB of log data, you are in the realm of Hadoop.  If you
find however that you only need to operate on a subset of this data
(~100 MB), you may just want to stick with loading it up in memory and using
something like Octave, R, Matlab, Python to run algorithms against it.
Probably the easiest.  In fact, I'd say do that first before you go
whole-hog on the distributed system.

2.  Second, come up with questions about your data that you want to answer
(or have someone give those questions to you).  Make those questions as
specific as possible.
- The type of question will tell you what tool you need to use. Sometimes
this means querying with Hive (i.e. How many unique users viewed this type of
page?) if the data is too much/too sparse to put into MySQL.  Sometimes this
means just writing a Python/Ruby script with a few regexes and hunting
through the data.  If the questions are predictive in nature, you may need
to use some machine learning tools.

3.  Simple techniques often will get you 80% of the way to your goal.
 Machine Learning gets you the other 20% (or sometimes only 5%!).
- I would say to use machine learning once you know the domain of the
problem you're trying to solve extremely well, because it will take effort
and you should be immediately skeptical of any result you get back.  It's a
black box that you should really know the inner workings of, so my advice is
to exhaust all non-machine learning options first, then go for that extra
accuracy if it's warranted.
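
As an example of the script-with-a-few-regexes approach from point 2, here
is a minimal Python sketch that answers "how many unique users viewed each
type of page?". The log layout (user=... page=...) is made up for
illustration; a real log would need its own pattern, and
unique_users_by_page is a hypothetical helper name.

```python
import re

# Hypothetical log layout: "<timestamp> user=<id> page=<type> status=<code>"
LINE_RE = re.compile(r"user=(?P<user>\S+)\s+page=(?P<page>\S+)")


def unique_users_by_page(lines):
    """Count distinct user ids per page type, skipping lines that do not match."""
    users = {}
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue
        users.setdefault(m.group("page"), set()).add(m.group("user"))
    return {page: len(ids) for page, ids in users.items()}


# A few made-up log lines, including a repeat visit and an unparseable line.
log = [
    "2010-08-31T23:21:54 user=u1 page=home status=200",
    "2010-08-31T23:22:10 user=u2 page=home status=200",
    "2010-08-31T23:23:02 user=u1 page=home status=200",
    "2010-08-31T23:24:45 user=u1 page=cart status=500",
    "malformed line without the expected fields",
]
counts = unique_users_by_page(log)
```

This keeps every user id in memory, which is fine for a sample of the data
but is exactly the point at which a query in Hive becomes the better tool.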

Good luck!

On Tue, Aug 31, 2010 at 6:03 PM, Sean Owen <sr...@gmail.com> wrote:

> On Tue, Aug 31, 2010 at 10:55 PM, hdev ml <hd...@gmail.com> wrote:
> > Per my understanding of hive, we can do some statistical reporting, like
> > frequency of user sessions, which geographical region, which device he is
> > using the most etc.
>
> Yes that's about what Hive is good for, if you're looking for some
> open-source libraries along those lines.
>
> >
> > But we also want to mine this data to get some predictive capabilities
> like
> > what is the likelihood that the user will use the same device again or if
> we
> > get sales/marketing data (on the roadmap for future), we want to possibly
> > predict which region to put more marketing/sales efforts. What is the
> > pattern for growth of user base, in which geographical regions etc. What
> is
> > the pattern of user requests failing and a number of requirements like
> these
> > from the business.
>
> This is pretty broad but I can try to give you the names of problems
> this sounds like, to guide your search.
>
> Predicting user usage of device sounds like a classification problem,
> like developing a probabilistic model of behavior.
>
> Deciding where to put marketing dollars sounds like a business
> problem, not machine learning. I don't think a computer can tell you
> that. Some techniques might help you identify trends in sales, but
> this is simple regression, not really machine learning.
>
> Looking for patterns in failure sounds a bit like frequent pattern
> mining -- trying to find events that go together unusually often.
>

Re: Question about data warehousing and mining through Mahout

Posted by Sean Owen <sr...@gmail.com>.

On Tue, Aug 31, 2010 at 10:55 PM, hdev ml <hd...@gmail.com> wrote:
> Per my understanding of hive, we can do some statistical reporting, like
> frequency of user sessions, which geographical region, which device he is
> using the most etc.

Yes that's about what Hive is good for, if you're looking for some
open-source libraries along those lines.

>
> But we also want to mine this data to get some predictive capabilities like
> what is the likelihood that the user will use the same device again or if we
> get sales/marketing data (on the roadmap for future), we want to possibly
> predict which region to put more marketing/sales efforts. What is the
> pattern for growth of user base, in which geographical regions etc. What is
> the pattern of user requests failing and a number of requirements like these
> from the business.

This is pretty broad but I can try to give you the names of problems
this sounds like, to guide your search.

Predicting user usage of device sounds like a classification problem,
like developing a probabilistic model of behavior.

Deciding where to put marketing dollars sounds like a business
problem, not machine learning. I don't think a computer can tell you
that. Some techniques might help you identify trends in sales, but
this is simple regression, not really machine learning.
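
A minimal sketch of what "simple regression" on a sales trend could look like,
assuming made-up monthly figures and a hypothetical helper named fit_line.  It
is an ordinary least-squares fit of a straight line in plain Python, not
anything specific to Mahout or Hive.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: month index and sales for one region.
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(months, sales)
forecast_month_6 = slope * 6 + intercept
print(slope, intercept, forecast_month_6)
```

Fitting one such line per region gives a rough comparison of growth rates, but
deciding what to do with those numbers remains a business decision.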

Looking for patterns in failure sounds a bit like frequent pattern
mining -- trying to find events that go together unusually often.
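
As a rough illustration of the frequent pattern idea, the sketch below counts
how often pairs of events occur in the same session and keeps the pairs that
reach a minimum count.  The event names, the sessions and the function name
frequent_pairs are hypothetical, and plain pair counting is a simplification
of real frequent pattern mining algorithms, which also handle larger itemsets
and much larger inputs.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(sessions, min_support):
    """Return {(event_a, event_b): count} for pairs seen together in at
    least min_support sessions."""
    counts = Counter()
    for events in sessions:
        # Deduplicate and sort so each pair is counted once per session
        # and always appears in the same order.
        for pair in combinations(sorted(set(events)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical sessions, each a list of events seen in that session.
sessions = [
    ["timeout", "retry", "error_500"],
    ["timeout", "error_500"],
    ["login", "retry"],
    ["timeout", "error_500", "login"],
]

print(frequent_pairs(sessions, min_support=3))
```

A raw count favors events that are common everywhere, so in practice the
counts are usually compared against how often each event occurs on its own
before calling a pair "unusually" frequent.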

Re: Question about data warehousing and mining through Mahout

Posted by hdev ml <hd...@gmail.com>.

Hi Sean,

I may not be able to divulge a lot of information about the business because
of confidentiality and since I am a new employee here :), but

the log data has

1. different types of user requests - Different types of requests and their
related data.
2. different session parameters - When the user session started, what the
stickiness of the user is, etc.
3. different context parameters, such as user location, where he is going
etc.

Per my understanding of hive, we can do some statistical reporting, like
frequency of user sessions, which geographical region, which device he is
using the most etc.

But we also want to mine this data to get some predictive capabilities like
what is the likelihood that the user will use the same device again or if we
get sales/marketing data (on the roadmap for future), we want to possibly
predict which region to put more marketing/sales efforts. What is the
pattern for growth of user base, in which geographical regions etc. What is
the pattern of user requests failing and a number of requirements like these
from the business.

Does that fit the data mining bill? or am I looking in the wrong place.

Again thanks for your time and help.

HDev

On Tue, Aug 31, 2010 at 2:40 PM, Sean Owen <sr...@gmail.com> wrote:

> I think you'd have to begin to define what you want to do with the
> logs? What do you mean when you say "data mining"?
>
> On Tue, Aug 31, 2010 at 10:21 PM, hdev ml <hd...@gmail.com> wrote:
> > Hi all,
> >
> > I am currently trying to find out what frameworks/software/product will
> > support data warehousing/data mining the best.
> >
> > We get around 1.5+ TB of log data every month and we want to do some
> > reporting on top of that and later on move on to data mining.
> >
> > I am a total newbie in this world, coming from a RDBMS background and
> wanted
> > to get your opinion on what is the best approach to take in this regard.
> >
> > I looked around the hadoop movement and the corresponding sub projects.
> >
> > I found Hive as a framework can support and scale for this large data.
> >
> > So first of phase of reporting can be done using hive. But can I reuse
> the
> > same data for data mining through the Mahout project?
> >
> > Can somebody please guide me regarding this?
> >
> > Thanks for your help.
> >
> > HDev.
> >
>

Re: Question about data warehousing and mining through Mahout

Posted by Sean Owen <sr...@gmail.com>.

I think you'd have to begin to define what you want to do with the
logs? What do you mean when you say "data mining"?

On Tue, Aug 31, 2010 at 10:21 PM, hdev ml <hd...@gmail.com> wrote:
> Hi all,
>
> I am currently trying to find out what frameworks/software/product will
> support data warehousing/data mining the best.
>
> We get around 1.5+ TB of log data every month and we want to do some
> reporting on top of that and later on move on to data mining.
>
> I am a total newbie in this world, coming from a RDBMS background and wanted
> to get your opinion on what is the best approach to take in this regard.
>
> I looked around the hadoop movement and the corresponding sub projects.
>
> I found Hive as a framework can support and scale for this large data.
>
> So first of phase of reporting can be done using hive. But can I reuse the
> same data for data mining through the Mahout project?
>
> Can somebody please guide me regarding this?
>
> Thanks for your help.
>
> HDev.
>