Posted to user@mahout.apache.org by Arian Pasquali <ar...@arianpasquali.com> on 2014/10/01 13:18:05 UTC
Re: word weights using BM25
Hey guys,
I think it is only fair to give you some feedback.
I managed to implement BM25+ <http://en.wikipedia.org/wiki/Okapi_BM25> term
scoring in Mahout.
It was straightforward, using the current TFIDF implementation as an example.
Basically, I implemented the org.apache.mahout.vectorizer.Weight interface and
created a BM25Converter and a BM25PartialVectorReducer, modeled on TFIDFConverter
<https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/vectorizer/tfidf/TFIDFConverter.html>
and TFIDFPartialVectorReducer
<https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/vectorizer/tfidf/TFIDFPartialVectorReducer.html>
respectively.
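For readers following along, here is a minimal sketch of what a BM25+ weight could look like. It is not the code described above, which was not posted to the list. The local Weight interface is a stand-in whose signature is assumed to mirror org.apache.mahout.vectorizer.Weight. The class name Bm25PlusWeight, its constructor parameters (k1, b, delta, avgDocLength), the reading of the `length` argument as the document length in terms, and the Lucene-style non-negative IDF are all assumptions made for illustration.

```java
// Stand-in for org.apache.mahout.vectorizer.Weight.
// ASSUMPTION: the real interface has this shape; check the Mahout javadoc.
interface Weight {
    double calculate(int tf, int df, int length, int numDocs);
}

// Hypothetical BM25+ weight; names and defaults are illustrative only.
final class Bm25PlusWeight implements Weight {
    private final double k1;           // term-frequency saturation
    private final double b;            // strength of length normalization
    private final double delta;        // BM25+ lower-bound constant
    private final double avgDocLength; // must be precomputed over the corpus

    Bm25PlusWeight(double k1, double b, double delta, double avgDocLength) {
        if (avgDocLength <= 0.0) {
            throw new IllegalArgumentException("avgDocLength must be positive");
        }
        this.k1 = k1;
        this.b = b;
        this.delta = delta;
        this.avgDocLength = avgDocLength;
    }

    @Override
    public double calculate(int tf, int df, int length, int numDocs) {
        // A term that does not occur contributes nothing.
        if (tf <= 0 || df <= 0) {
            return 0.0;
        }
        // ASSUMPTION: non-negative IDF variant, ln(1 + (N - df + 0.5) / (df + 0.5)).
        double idf = Math.log(1.0 + (numDocs - df + 0.5) / (df + 0.5));
        // ASSUMPTION: `length` is the document length in terms.
        double norm = k1 * (1.0 - b + b * length / avgDocLength);
        // BM25 saturating term frequency, plus the BM25+ delta.
        double tfPart = tf * (k1 + 1.0) / (tf + norm) + delta;
        return idf * tfPart;
    }
}
```

With commonly used values such as k1 = 1.2, b = 0.75 and delta = 1.0, a rarer term scores higher than a frequent one, and the same term scores lower in a longer document. Because calculate() has no slot for the average document length, this sketch passes it in through the constructor.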
cheers
Arian
Arian Pasquali
http://about.me/arianpasquali
2014-09-24 14:14 GMT+01:00 Arian Pasquali <ar...@arianpasquali.com>:
> Yes,
> I'm studying his work <http://nlp.uned.es/~jperezi/Lucene-BM25/> and
> Mahout's current TF-IDF code.
> I'm trying to understand how I would port that to MapReduce.
> I'll try to share something if I succeed.
>
> Arian Pasquali
> http://about.me/arianpasquali
>
> 2014-09-24 5:12 GMT+01:00 Suneel Marthi <su...@gmail.com>:
>
>> Lucene 4.x supports okapi-bm25. So it should be easy to implement.
>>
>> On Tue, Sep 23, 2014 at 11:57 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>>
>> > Should be pretty easy. I haven't heard of anyone doing it.
>> >
>> > Sent from my iPhone
>> >
>> > > On Sep 23, 2014, at 18:53, Arian Pasquali <ar...@arianpasquali.com>
>> > > wrote:
>> > >
>> > > Hi,
>> > > I was wondering if it would be possible to support BM25 term weighting
>> > > by extending Mahout's tf-idf implementation.
>> > >
>> > > I was curious to know if anyone here has already tried to do so.
>> > > If not, what would be your suggestion for such an implementation in
>> > > Mahout?
>> > >
>> > >
>> > > Arian Pasquali
>> > > http://about.me/arianpasquali
>> >
>>
>
>
Re: word weights using BM25
Posted by Pat Ferrel <pa...@occamsmachete.com>.
We are moving to higher-performance platforms than Hadoop MapReduce, such as Spark. You can still write map/reduce-style code, but Mahout is not taking new Hadoop MR code.
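As a rough illustration of map/reduce-style code outside Hadoop, the sketch below gathers the collection statistics a BM25 weighting needs (document frequency per term, number of documents, average document length) using plain Java streams. It is not Spark or Mahout API. CorpusStats and its fields are hypothetical names, and documents are assumed to be pre-tokenized lists of strings.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical holder for the corpus-level statistics BM25 needs.
final class CorpusStats {
    final Map<String, Long> documentFrequency; // term -> number of docs containing it
    final double avgDocLength;                 // mean number of tokens per document
    final int numDocs;

    private CorpusStats(Map<String, Long> documentFrequency, double avgDocLength, int numDocs) {
        this.documentFrequency = documentFrequency;
        this.avgDocLength = avgDocLength;
        this.numDocs = numDocs;
    }

    static CorpusStats of(List<List<String>> docs) {
        Map<String, Long> df = docs.stream()
            // "map" step: emit each distinct term once per document
            .flatMap(doc -> new HashSet<String>(doc).stream())
            // "reduce" step: count how many documents emitted each term
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        // Average document length; 0.0 for an empty corpus.
        double avg = docs.stream().mapToInt(List::size).average().orElse(0.0);
        return new CorpusStats(df, avg, docs.size());
    }
}
```

The same two steps (emit distinct terms per document, then count by key) map naturally onto a distributed engine, but the calls needed there would differ from the ones shown here.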
On Oct 1, 2014, at 6:30 AM, Arian Pasquali <ar...@arianpasquali.com> wrote:
Yes Suneel,
Indeed, it is in MR fashion.
What exactly do you mean when you say Mahout is not accepting any new
MapReduce code?
Do you mean for submitting a patch?
I'm sure there might be better ways to implement it, but I'm more
interested in the results right now.
What would be your suggestion?
best
Arian Pasquali
http://about.me/arianpasquali
2014-10-01 13:10 GMT+01:00 Suneel Marthi <sm...@apache.org>:
> How did you implement BM25PartialVectorReducer and BM25Converter? The
> present implementations of TFIDFConverter and the reducer are MR.
> Mahout is not accepting any new MapReduce code.
Re: word weights using BM25
Posted by Arian Pasquali <ar...@arianpasquali.com>.
Yes Suneel,
Indeed, it is in MR fashion.
What exactly do you mean when you say Mahout is not accepting any new
MapReduce code?
Do you mean for submitting a patch?
I'm sure there might be better ways to implement it, but I'm more
interested in the results right now.
What would be your suggestion?
best
Arian Pasquali
http://about.me/arianpasquali
2014-10-01 13:10 GMT+01:00 Suneel Marthi <sm...@apache.org>:
> How did you implement BM25PartialVectorReducer and BM25Converter? The
> present implementations of TFIDFConverter and the reducer are MR.
> Mahout is not accepting any new MapReduce code.
Re: word weights using BM25
Posted by Suneel Marthi <sm...@apache.org>.
How did you implement BM25PartialVectorReducer and BM25Converter? The
present implementations of TFIDFConverter and the reducer are MR.
Mahout is not accepting any new MapReduce code.
Re: word weights using BM25
Posted by Ted Dunning <te...@gmail.com>.
On Wed, Oct 1, 2014 at 7:52 AM, Arian Pasquali <ar...@arianpasquali.com>
wrote:
> My dataset is a collection of documents in German, and I can say that the
> scores seem better compared to my TFIDF scores. Results make more sense
> now, especially for my bi-grams.
>
OK.
I will take note.
Re: word weights using BM25
Posted by Arian Pasquali <ar...@arianpasquali.com>.
Hi Ted,
My dataset is a collection of documents in German, and I can say that the
scores seem better compared to my TFIDF scores. Results make more sense
now, especially for my bi-grams.
Arian Pasquali
http://about.me/arianpasquali
2014-10-01 13:09 GMT+01:00 Ted Dunning <te...@gmail.com>:
> Thanks so much for the feedback. Glad to hear it was straightforward.
>
>
> But the important question is ....
>
> how did BM25 work for you?
Re: word weights using BM25
Posted by Ted Dunning <te...@gmail.com>.
Thanks so much for the feedback. Glad to hear it was straightforward.
But the important question is ....
how did BM25 work for you?