Posted to user@mahout.apache.org by Tim Hughes <th...@troglobyte.com> on 2009/08/22 20:50:21 UTC

Custom Algorithm (C/C++) ?

I'm working on a project that is considering the Apache Lucene/Solr/Mahout
tech stack for data mining & machine learning.

The issue of Java algorithm performance vs C/C++ has come up, and I would
like to know if it is possible to create custom algorithms in C/C++ and
use them within the Mahout framework. I have been unable to find
information on this.
-- 
View this message in context: http://www.nabble.com/Custom-Algorithm-%28C-C%2B%2B%29---tp25096676p25096676.html
Sent from the Mahout User List mailing list archive at Nabble.com.


Re: Using Mahout for optimizing ad delivery

Posted by Sean Owen <sr...@gmail.com>.
There are many tools you could apply to this problem. Since I know
about recommenders, I can tell you about how the Mahout recommender
might apply. But this is just one approach.

It is fairly easy to map your input and output to a normal recommender
problem. The challenge is scale. I would:

- Your 'items' to recommend here are advertisements
- Ignore ad impressions; they are not going to be useful data points
- Use ad clicks. A click establishes a connection between a user and an ad
- Since you typically have a click or no click at all, I would not
attempt to create some kind of 'rating' or preference value between
the user and the ad. You simply have an association, or you don't. In
Mahout CF, this means using the 'Boolean' data model, algorithms, etc.
We can discuss this more later.
- To start, I'd ignore category and demographics
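As a concrete illustration of the 'Boolean' idea -- a from-scratch sketch for discussion, not Mahout's actual data model classes -- click data reduces to a user-to-clicked-ads mapping with no preference values at all:

```java
import java.util.*;

// Illustrative sketch only, not Mahout's API. The 'Boolean' model means
// an association between a user and an ad either exists or it doesn't;
// no rating value is invented.
public class BooleanClickModel {
    private final Map<Long, Set<Long>> clicksByUser = new HashMap<>();

    // Record that a user clicked an ad; repeated clicks are idempotent.
    public void addClick(long userID, long adID) {
        clicksByUser.computeIfAbsent(userID, k -> new HashSet<>()).add(adID);
    }

    // Boolean "preference": present or absent, nothing in between.
    public boolean hasClicked(long userID, long adID) {
        Set<Long> ads = clicksByUser.get(userID);
        return ads != null && ads.contains(adID);
    }

    public Set<Long> adsClickedBy(long userID) {
        return clicksByUser.getOrDefault(userID, Collections.emptySet());
    }
}
```

In Mahout itself you would use its Boolean-preference data model and recommender variants rather than rolling your own; the point here is only that no ratings need to be fabricated from click data.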

The only things that make this non-trivial are:

1. Your 'items' are very transient. While you may want to use a lot of
historical click data to compute similarities between users, you never
recommend an ad that isn't part of a currently running campaign. To
address this:

- Use a user-based recommender
- *Precompute* user-user similarities based on a similarity metric
like LogLikelihoodSimilarity -- don't do this at runtime; it'll be too
slow. Use all available data to perform these computations
- At runtime, create a data model which contains only current ads as
items. Feed it these precomputed similarities.
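A from-scratch sketch of that runtime step, assuming an offline job has already produced a user-to-(user, similarity) map -- illustrative only; Mahout's real Taste classes look different:

```java
import java.util.*;

// Illustrative sketch, not Mahout's classes. Similarities come from an
// offline job over all historical clicks; at runtime we only ever rank
// ads belonging to currently running campaigns.
public class AdRecommender {
    private final Map<Long, Map<Long, Double>> userSimilarity; // userID -> (neighborID -> similarity)
    private final Map<Long, Set<Long>> clicksOnCurrentAds;     // userID -> current ads clicked

    public AdRecommender(Map<Long, Map<Long, Double>> userSimilarity,
                         Map<Long, Set<Long>> clicksOnCurrentAds) {
        this.userSimilarity = userSimilarity;
        this.clicksOnCurrentAds = clicksOnCurrentAds;
    }

    // Score each current ad by the summed similarity of the neighbors who
    // clicked it, skipping ads this user has already clicked.
    public List<Long> recommend(long userID, int howMany) {
        Map<Long, Double> neighbors = userSimilarity.getOrDefault(userID, Map.of());
        Set<Long> alreadyClicked = clicksOnCurrentAds.getOrDefault(userID, Set.of());
        Map<Long, Double> scores = new HashMap<>();
        for (Map.Entry<Long, Double> n : neighbors.entrySet()) {
            for (long adID : clicksOnCurrentAds.getOrDefault(n.getKey(), Set.of())) {
                if (!alreadyClicked.contains(adID)) {
                    scores.merge(adID, n.getValue(), Double::sum);
                }
            }
        }
        return scores.entrySet().stream()
                .sorted(Map.Entry.<Long, Double>comparingByValue().reversed())
                .limit(howMany)
                .map(Map.Entry::getKey)
                .collect(java.util.stream.Collectors.toList());
    }
}
```

The offline Hadoop job would refill `userSimilarity` nightly; `clicksOnCurrentAds` is rebuilt whenever campaigns start or stop.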

2. Large scale

- I would heavily filter your data. Obviously, don't consider users
with no clicks, few clicks, or who aren't active. I find that,
typically, most data available to a recommender computation is noise
-- it doesn't help the results. So achieving scalability usually
involves knowing what little data is worth keeping.
- Because you kind of need to pre-compute all those user-user
similarities, and they don't necessarily change often, this is
something you can parallelize using Hadoop. Mahout does not offer a
pre-built job to compute user-user similarities, but you can easily
create a job that calls Mahout classes to do this
- Unfortunately, I don't think much more than this can be done offline
on distributed systems. You have to produce very fresh
recommendations in this system and can't simply recompute
recommendations every night, or even every hour, I'd guess.
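The filtering step is simple enough to sketch directly -- a toy illustration, where the minimum-click threshold is an assumption you would tune against your own data:

```java
import java.util.*;

// Illustrative sketch: drop low-activity users before the expensive
// offline similarity computation, since their data is mostly noise.
public class ClickFilter {
    public static Map<Long, Set<Long>> filterInactive(Map<Long, Set<Long>> clicksByUser,
                                                      int minClicks) {
        Map<Long, Set<Long>> kept = new HashMap<>();
        for (Map.Entry<Long, Set<Long>> e : clicksByUser.entrySet()) {
            if (e.getValue().size() >= minClicks) {  // keep only active users
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```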

You wanted some more technical detail. I won't bother discussing the
particular code or classes to use. At a high level I think you'd have:

- A job of some kind that periodically reads all your click data,
throws out the data which is probably not useful to this computation,
and stores the data in an HDFS cluster (Hadoop's storage system).
- A Hadoop cluster running a nightly job that recomputes all user-user
similarities
- Another job which stores these results in a database that is
available to your online systems
- Some database table which has, at least, the currently running, active
ads and which users have clicked them (that is, not all click data,
just active-ad click data)
- A server application which actually makes recommendations in
realtime. I imagine speed will be vital in an ad system, so I would
plan to read all the user-user similarities into memory. This server
will probably need lots of RAM. You can scale this by adding more
servers.

That's roughly it, to get you started.


On Mon, Aug 24, 2009 at 2:32 PM, Benjamin
Dageroth<Be...@webtrekk.com> wrote:
> Hi,
>
> I got the demo up and running and am now trying to figure out how to go forward with a few tries of my own to determine whether we can actually use Mahout. We are getting a lot of data on many users and would like to use this data to provide more relevant ads - relevant not only according to the content of the site, but to the interests of the user and what he liked in the past. So I know, e.g., the type of site he is on (twenty categories), the types of sites he has visited in the past, the ads he saw, and the ads he clicked, including the category to which each ad belongs.
> Furthermore I'd like to build a profile of interests and, if I can, I'd gather some demographic data for a number of sites - this should enable me to use naïve Bayes to deduce gender and age with some probability, depending on the recorded history of sites someone visited within the ad network.
>
> All this information I'd like to use to make recommendations about which ad to deliver, either because similar users clicked it, or because a user clicked an ad that has often been clicked together with another ad (item-based or user-based, depending on which provides the better result). Other interesting data points would be time (are there specific times at which ads perform well or badly?), location, and the actual combinations of site and ad.
>
> I am not a very good programmer and am working more from the conceptual angle, looking for technologies we could use. So I am not sure how to store the data I collect (I created a database schema) to make it available to Mahout, as it seems to run on Hadoop and not with a normal database? I am still looking for more documentation, so if you could point me to something or have some idea how to proceed, I'd appreciate it.
>
> We definitely need something that scales, as the ad network is creating billions of ad impressions per month with millions of users, and Mahout seemed to be the only suitable option, although it is still pretty early in its development process.
>
>
> Thanks for any pointers and opinions,
>
> Benjamin
>
>

Using Mahout for optimizing ad delivery

Posted by Benjamin Dageroth <Be...@webtrekk.com>.
Hi,

I got the demo up and running and am now trying to figure out how to go forward with a few tries of my own to determine whether we can actually use Mahout. We are getting a lot of data on many users and would like to use this data to provide more relevant ads - relevant not only according to the content of the site, but to the interests of the user and what he liked in the past. So I know, e.g., the type of site he is on (twenty categories), the types of sites he has visited in the past, the ads he saw, and the ads he clicked, including the category to which each ad belongs.
Furthermore I'd like to build a profile of interests and, if I can, I'd gather some demographic data for a number of sites - this should enable me to use naïve Bayes to deduce gender and age with some probability, depending on the recorded history of sites someone visited within the ad network.

All this information I'd like to use to make recommendations about which ad to deliver, either because similar users clicked it, or because a user clicked an ad that has often been clicked together with another ad (item-based or user-based, depending on which provides the better result). Other interesting data points would be time (are there specific times at which ads perform well or badly?), location, and the actual combinations of site and ad.

I am not a very good programmer and am working more from the conceptual angle, looking for technologies we could use. So I am not sure how to store the data I collect (I created a database schema) to make it available to Mahout, as it seems to run on Hadoop and not with a normal database? I am still looking for more documentation, so if you could point me to something or have some idea how to proceed, I'd appreciate it.

We definitely need something that scales, as the ad network is creating billions of ad impressions per month with millions of users, and Mahout seemed to be the only suitable option, although it is still pretty early in its development process.


Thanks for any pointers and opinions,

Benjamin


_______________________________________
Benjamin Dageroth, Key Account Manager / Softwareentwickler
Webtrekk GmbH
Boxhagener Str. 76-78, 10245 Berlin
fon 030 - 755 415 - 360
fax 030 - 755 415 - 100
benjamin.dageroth@webtrekk.com
http://www.webtrekk.com
Amtsgericht Berlin, HRB 93435 B
Geschäftsführer Christian Sauer


_______________________________________

Re: Custom Algorithm (C/C++) ?

Posted by Grant Ingersoll <gs...@apache.org>.
You might also note that Solr 1.4 has Carrot2 integrated with it and
will eventually have Mahout support too.  Carrot2 is often appropriate
for smaller clustering jobs, as it uses an in-memory model.  See
http://wiki.apache.org/solr/ClusteringComponent. Also see the Carrot2
project at http://project.carrot2.org

-Grant

On Aug 22, 2009, at 4:52 PM, Sean Owen wrote:

> The good news is that this is a very small volume. Lucene and Mahout
> operate, broadly, in the realm of tens of millions of things or more.
> At this scale I think performance will not be an issue no matter what
> you choose, so choose based on your other requirements.
>
> On Aug 22, 2009 9:18 PM, "Tim Hughes" <th...@troglobyte.com> wrote:
>
>
> We are looking to do a query of documents & abstracts from a legacy
> system, then retrieve the docs for clustering & classification via
> Mahout. Expected volume is something on the order of 2,000 - 3,000
> documents.
>
> Ted Dunning wrote:
> > Can you say more about your application?
> > Mahout is a very young proj...

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Custom Algorithm (C/C++) ?

Posted by Sean Owen <sr...@gmail.com>.
The good news is that this is a very small volume. Lucene and Mahout operate,
broadly, in the realm of tens of millions of things or more. At this scale I
think performance will not be an issue no matter what you choose, so choose
based on your other requirements.

On Aug 22, 2009 9:18 PM, "Tim Hughes" <th...@troglobyte.com> wrote:


We are looking to do a query of documents & abstracts from a legacy system,
then retrieve the docs for clustering & classification via Mahout. Expected
volume is something on the order of 2,000 - 3,000 documents.

Ted Dunning wrote:
> Can you say more about your application?
> Mahout is a very young proj...

Re: Custom Algorithm (C/C++) ?

Posted by Tim Hughes <th...@troglobyte.com>.
We are looking to do a query of documents & abstracts from a legacy system,
then retrieve the docs for clustering & classification via Mahout. Expected
volume is something on the order of 2,000 - 3,000 documents.



Ted Dunning wrote:
> 
> Can you say more about your application?
> 
> Mahout is a very young project and is known to be sub-standard in a number
> of respects due to youth.  Depending on what you need, it might be
> excellent, or seriously deficient (at the moment).  The deficiencies
> will be addressed over time, but full disclosure now is important.
> 
> Depending on what you need, an on-line learning system like vowpal
> might be much better for you.
> 
> On Sat, Aug 22, 2009 at 12:59 PM, Tim Hughes <th...@troglobyte.com>
> wrote:
> 
>> We're looking specifically at Mahout (on top of the other supporting
>> Apache projects). One of the roadblocks to moving in that direction
>> is the concern about Java performance. We could not go the Mahout
>> direction if there was no way to use C/C++; since there is, we can
>> bypass the "premature optimization" and run Mahout as designed, yet
>> have the ability to fall back to custom C code if the user's
>> expectations are not met.
>>
> 
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve
> 
> 



Re: Custom Algorithm (C/C++) ?

Posted by Ted Dunning <te...@gmail.com>.
Can you say more about your application?

Mahout is a very young project and is known to be sub-standard in a number
of respects due to youth.  Depending on what you need, it might be
excellent, or seriously deficient (at the moment).  The deficiencies will be
addressed over time, but full disclosure now is important.

Depending on what you need, an on-line learning system like vowpal might be
much better for you.

On Sat, Aug 22, 2009 at 12:59 PM, Tim Hughes <th...@troglobyte.com> wrote:

> We're looking specifically at Mahout (on top of the other supporting Apache
> projects). One of the roadblocks to moving in that direction is the concern
> about Java performance. We could not go the Mahout direction if there was
> no way to use C/C++; since there is, we can bypass the "premature
> optimization" and run Mahout as designed, yet have the ability to fall
> back to custom C code if the user's expectations are not met.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Custom Algorithm (C/C++) ?

Posted by Tim Hughes <th...@troglobyte.com>.
Thanks! That's exactly the info I need.

I too think Java will perform as required for this system, but there is one
user who is raising performance as an issue. 

We're looking specifically at Mahout (on top of the other supporting Apache
projects). One of the roadblocks to moving in that direction is the concern
about Java performance. We could not go the Mahout direction if there was no
way to use C/C++; since there is, we can bypass the "premature optimization"
and run Mahout as designed, yet have the ability to fall back to custom C
code if the user's expectations are not met. 




srowen wrote:
> 
> Lucene came out on top over native code search solutions in this
> particular benchmark, for instance:
> http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/
> But that's just one test and one could quibble with how the tests
> were run.
> 
> If you're interested in Lucene, there is a native port in the works:
> http://lucene.apache.org/lucy/
> 
> I think the answer to your question is 'yes' in general, since the
> libraries are reasonably extensible, and Java allows native code
> invocation through JNI. What in particular are you considering?
> "Lucene" covers a lot of ground.
> 
> Very broadly speaking, with proper care and feeding and decent code,
> and a modern JVM, the native/Java performance gap is not significant.
> I would not begin with an assumption that native code is a must. I
> might suggest you try Lucene/Mahout. It may surprise you with
> performance. If not, ask the list for pointers -- these things
> inevitably need tuning to run optimally. *Then* think about writing a
> native code solution.
> 
> Sean
> 
> On Sat, Aug 22, 2009 at 7:50 PM, Tim Hughes<th...@troglobyte.com> wrote:
>>
>> I'm working on a project that is considering the Apache
>> Lucene/Solr/Mahout tech stack for data mining & machine learning.
>>
>> The issue of Java algorithm performance vs C/C++ has come up, and I
>> would like to know if it is possible to create custom algorithms in
>> C/C++ and use them within the Mahout framework. I have been unable
>> to find information on this.
> 
> 

-- 
View this message in context: http://www.nabble.com/Custom-Algorithm-%28C-C%2B%2B%29---tp25096676p25097210.html
Sent from the Mahout User List mailing list archive at Nabble.com.


Re: Custom Algorithm (C/C++) ?

Posted by Sean Owen <sr...@gmail.com>.
Lucene came out on top over native code search solutions in this
particular benchmark, for instance:
http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/
But that's just one test and one could quibble with how the tests were run.

If you're interested in Lucene, there is a native port in the works:
http://lucene.apache.org/lucy/

I think the answer to your question is 'yes' in general, since the
libraries are reasonably extensible, and Java allows native code
invocation through JNI. What in particular are you considering?
"Lucene" covers a lot of ground.
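The JNI route can be sketched roughly as follows. This is a hypothetical example, not an API from Lucene or Mahout: the library name "fastmath" and the cosine function are invented for illustration. Declaring a pure-Java equivalent alongside the native method means you can ship the Java implementation first and drop in native code later without changing callers:

```java
// Hypothetical JNI sketch. The native side would be implemented in C/C++
// against a header generated with javac -h (formerly javah).
public class NativeCosine {
    private static boolean nativeLoaded;
    static {
        try {
            System.loadLibrary("fastmath"); // looks for libfastmath.so / fastmath.dll
            nativeLoaded = true;
        } catch (UnsatisfiedLinkError e) {
            nativeLoaded = false;           // no native library: use pure Java
        }
    }

    // Implemented in C/C++; resolved only if the library loaded.
    private static native double cosineNative(double[] a, double[] b);

    // Pure-Java equivalent, used when the native library is absent.
    static double cosineJava(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Callers never know which implementation ran.
    public static double cosine(double[] a, double[] b) {
        return nativeLoaded ? cosineNative(a, b) : cosineJava(a, b);
    }
}
```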

Very broadly speaking, with proper care and feeding and decent code,
and a modern JVM, the native/Java performance gap is not significant.
I would not begin with an assumption that native code is a must. I
might suggest you try Lucene/Mahout. It may surprise you with
performance. If not, ask the list for pointers -- these things
inevitably need tuning to run optimally. *Then* think about writing a
native code solution.

Sean

On Sat, Aug 22, 2009 at 7:50 PM, Tim Hughes<th...@troglobyte.com> wrote:
>
> I'm working on a project that is considering the Apache Lucene/Solr/Mahout
> tech stack for data mining & machine learning.
>
> The issue of Java algorithm performance vs C/C++ has come up, and I would
> like to know if it is possible to create custom algorithms in C/C++ and use
> them within the Mahout framework. I have been unable to find information on
> this.
>
>

Re: Custom Algorithm (C/C++) ?

Posted by Ted Dunning <te...@gmail.com>.
To add my voice to Sean's: I have found over and over that any advantage of
C++ over Java in terms of performance is purely theoretical and is
completely overshadowed by the fact that with C++ you spend your time
chasing silly mistakes, whereas in Java you get to spend much more time on
optimization.

I just went through a major exercise where we replaced well over a hundred
thousand lines of "tuned" C++.  The result in Java is (a) 10x less code, (b)
much faster, (c) easier to test automatically, (d) easily scalable using
Hadoop and Katta, and (e) now works correctly.  Another major benefit has
been that I was able to leverage open source code to add massive amounts
of functionality.

Even in numerical code, there is no important difference:

http://blog.mikiobraun.de/2009/04/some-benchmark-numbers-for-jblas.html
http://www.mail-archive.com/dev@commons.apache.org/msg09945.html

What you see here is that if you compare Java versus pure C++, you get
essentially identical results.  If you compare pure Java against C plus
specially tuned assembler, you get as much as a 2x advantage for C.  Of
course, you can call the specially tuned assembler code from Java as well.

On Sat, Aug 22, 2009 at 11:50 AM, Tim Hughes <th...@troglobyte.com> wrote:

>
> I'm working on a project that is considering the Apache Lucene/Solr/Mahout
> tech stack for data mining & machine learning.
>
> The issue of Java algorithm performance vs C/C++ has come up, and I would
> like to know if it is possible to create custom algorithms in C/C++ and use
> them within the Mahout framework. I have been unable to find information on
> this.
>
>


-- 
Ted Dunning, CTO
DeepDyve