Posted to user@mahout.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2015/05/09 02:03:48 UTC

Replacement for DefaultAnalyzer

Hi Folks,
I'm upgrading from Mahout 0.7 to 0.9.
I am experiencing the same problem as described in the following post [0].
Can someone please suggest what I should replace DefaultAnalyzer with? I am
aware that it was removed from the Mahout API in 0.8.
In the meantime I am going to test an implementation based on Lucene's base
implementation for the Lucene version matching Mahout 0.9.
Thanks in advance to anyone who has the context here.
Best
Lewis

[0] http://www.mail-archive.com/user%40mahout.apache.org/msg14344.html

-- 
*Lewis*

Re: Replacement for DefaultAnalyzer

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Suneel,
Just for context, I've implemented the following.

    @Override
    protected void map(Text key, BehemothDocument value, Context context)
            throws IOException, InterruptedException {
        String sContent = value.getText();
        if (sContent == null) {
            // no text available? skip
            context.getCounter("LuceneTokenizer", "BehemothDocWithoutText")
                    .increment(1);
            return;
        }
        analyzer = new StandardAnalyzer(matchVersion); // or any other analyzer
        TokenStream ts = analyzer.tokenStream(key.toString(),
                new StringReader(sContent));
        // The Analyzer class will construct the Tokenizer, TokenFilter(s),
        // and CharFilter(s), and pass the resulting Reader to the Tokenizer.
        @SuppressWarnings("unused")
        OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);

        CharTermAttribute termAtt = ts
                .addAttribute(CharTermAttribute.class);
        StringTuple document = new StringTuple();
        try {
            ts.reset(); // Resets this stream to the beginning. (Required)
            while (ts.incrementToken()) {
                if (termAtt.length() > 0) {
                    document.add(new String(termAtt.buffer(), 0,
                            termAtt.length()));
                }
            }
            ts.end();   // Perform end-of-stream operations, e.g. set the
                        // final offset.
        } finally {
            ts.close(); // Release resources associated with this stream.
        }
        context.write(key, document);
    }
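
The call order in the mapper above (reset, incrementToken, end, close) can be
illustrated without Lucene on the classpath. What follows is only a toy
sketch under stated assumptions: ToyTokenStream and ConsumeOrderSketch are
hypothetical names, not Lucene classes, the whitespace splitting and
lowercasing stand in for a real analysis chain, and the exception thrown when
reset() is skipped is an assumption modelled on the "(Required)" comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy stand-in for a token stream. Hypothetical; NOT a Lucene class. */
final class ToyTokenStream implements AutoCloseable {
    private enum State { CREATED, RESET, ENDED, CLOSED }

    private final String[] words;
    private State state = State.CREATED;
    private int next = 0;
    private String term = "";

    ToyTokenStream(String text) {
        String trimmed = text.trim();
        // Whitespace splitting is an assumption made for illustration only.
        this.words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Must be called exactly once before the first incrementToken(). */
    void reset() {
        if (state != State.CREATED) {
            throw new IllegalStateException("reset() must be called exactly once");
        }
        state = State.RESET;
    }

    /** Advances to the next term; returns false when the input is exhausted. */
    boolean incrementToken() {
        if (state != State.RESET) {
            throw new IllegalStateException("call reset() before incrementToken()");
        }
        if (next >= words.length) {
            return false;
        }
        term = words[next++].toLowerCase(Locale.ROOT);
        return true;
    }

    /** The term produced by the last successful incrementToken(). */
    String term() {
        return term;
    }

    /** End-of-stream bookkeeping; only valid after reset(). */
    void end() {
        if (state != State.RESET) {
            throw new IllegalStateException("end() is only valid after reset()");
        }
        state = State.ENDED;
    }

    @Override
    public void close() {
        state = State.CLOSED;
    }
}

final class ConsumeOrderSketch {
    /** Consumes a stream in the same order as the mapper above. */
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        try (ToyTokenStream ts = new ToyTokenStream(text)) {
            ts.reset();
            while (ts.incrementToken()) {
                terms.add(ts.term());
            }
            ts.end();
        } // close() runs here, even if an exception was thrown
        return terms;
    }

    public static void main(String[] args) {
        // prints [replacement, for, defaultanalyzer]
        System.out.println(tokenize("Replacement for DefaultAnalyzer"));
    }
}
```

Using try-with-resources here plays the same role as the try/finally in the
mapper: the stream is closed whether or not tokenization succeeds.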

I'll be testing and will update if anything else comes up.
Thanks
Lewis


On Mon, May 11, 2015 at 2:12 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> I found Mike's blog post regarding Lucene 4.X from a while ago [0].
> In the 'Other Changes' section Mike states "Analyzers must always
> provide a reusable token stream, by implementing the
> Analyzer.createComponents method (reusableTokenStream has been removed
> and tokenStream is now final, in Analyzer)."
> This provides a good bit more context, so I'm going to continue down the
> createComponents route with the aim of implementing the newer 4.X Lucene
> API.
> In the meantime, if you have any updates or a code sample it would be
> very much appreciated.
> Thanks
> Lewis
>
> [0]
> http://blog.mikemccandless.com/2012/07/lucene-400-alpha-at-long-last.html



-- 
*Lewis*

Re: Replacement for DefaultAnalyzer

Posted by Lewis John Mcgibbney <le...@gmail.com>.
I found Mike's blog post regarding Lucene 4.X from a while ago [0].
In the 'Other Changes' section Mike states "Analyzers must always
provide a reusable token stream, by implementing the
Analyzer.createComponents method (reusableTokenStream has been removed and
tokenStream is now final, in Analyzer)."
This provides a good bit more context, so I'm going to continue down the
createComponents route with the aim of implementing the newer 4.X Lucene
API.
In the meantime, if you have any updates or a code sample it would be
very much appreciated.
Thanks
Lewis

[0]
http://blog.mikemccandless.com/2012/07/lucene-400-alpha-at-long-last.html
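
The createComponents point quoted above can be pictured as a template-method
pattern: tokenStream is final and hands back whatever createComponents built,
reusing it on later calls. The sketch below is a toy model in plain Java, not
Lucene code. ToyAnalyzer, ToyComponents, CountingAnalyzer and ReuseSketch are
hypothetical names, and the single cached instance is a simplifying
assumption (it is not thread-safe and is not meant to mirror Lucene's actual
reuse strategy).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for an analyzer's reusable components. */
final class ToyComponents {
    private String text = "";

    /** Points the existing components at new input instead of rebuilding them. */
    void setText(String text) {
        this.text = text;
    }

    /** Lowercased, whitespace-separated terms (an assumption for illustration). */
    List<String> tokens() {
        List<String> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }
}

/** Toy analyzer: subclasses only say how components are built. */
abstract class ToyAnalyzer {
    private ToyComponents cached; // simplification: one instance, no threading

    /** Called at most once per analyzer in this sketch. */
    protected abstract ToyComponents createComponents(String fieldName);

    /** Final, so subclasses cannot bypass the reuse logic. */
    public final ToyComponents tokenStream(String fieldName, String text) {
        if (cached == null) {
            cached = createComponents(fieldName);
        }
        cached.setText(text);
        return cached;
    }
}

/** Counts how often createComponents is invoked. */
final class CountingAnalyzer extends ToyAnalyzer {
    int created = 0;

    @Override
    protected ToyComponents createComponents(String fieldName) {
        created++;
        return new ToyComponents();
    }
}

final class ReuseSketch {
    public static void main(String[] args) {
        CountingAnalyzer analyzer = new CountingAnalyzer();
        System.out.println(analyzer.tokenStream("body", "Mahout Upgrade").tokens());
        System.out.println(analyzer.tokenStream("body", "Lucene Analyzer").tokens());
        // prints "createComponents calls: 1"
        System.out.println("createComponents calls: " + analyzer.created);
    }
}
```

The design choice this illustrates is that callers keep using tokenStream,
while a custom analyzer only overrides the construction step.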

On Mon, May 11, 2015 at 2:03 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Suneel,
>
> On Sat, May 9, 2015 at 11:21 AM, Suneel Marthi <sm...@apache.org> wrote:
>
>> Mahout 0.9 and 0.10.0 are using Lucene 4.6.1. There's been a change in the
>> TokenStream workflow in Lucene post-Lucene 4.5.
>>
>
> Yes I know that after looking into the codebase. Thanks for clarifying!
>
>
>>
>> What exactly are u trying to do and where is it u r stuck now? It would
>> help if u posted a code snippet or something.
>>
>>
> In particular I am working on the following implementation [0] which uses
> the following code
>
> TokenStream stream = analyzer.reusableTokenStream(key.toString(), new
> StringReader(sContent.toString()));
>
> Of note here is that the analyzer object is of type DefaultAnalyzer [1].
> The analyzer.reusableTokenStream API is also deprecated, as you've noted,
> so I am just wondering what the suggested API semantics are for the
> upgrade.
> Thanks in advance again for any input.
> Lewis
>
> [0]
> https://github.com/DigitalPebble/behemoth/blob/master/mahout/src/main/java/com/digitalpebble/behemoth/mahout/LuceneTokenizerMapper.java#L52-L53
> [1]
> http://svn.apache.org/repos/asf/mahout/tags/mahout-0.7/core/src/main/java/org/apache/mahout/vectorizer/DefaultAnalyzer.java
>
>
>



-- 
*Lewis*

Re: Replacement for DefaultAnalyzer

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Suneel,

On Sat, May 9, 2015 at 11:21 AM, Suneel Marthi <sm...@apache.org> wrote:

> Mahout 0.9 and 0.10.0 are using Lucene 4.6.1. There's been a change in the
> TokenStream workflow in Lucene post-Lucene 4.5.
>

Yes I know that after looking into the codebase. Thanks for clarifying!


>
> What exactly are u trying to do and where is it u r stuck now? It would
> help if u posted a code snippet or something.
>
>
In particular I am working on the following implementation [0] which uses
the following code

TokenStream stream = analyzer.reusableTokenStream(key.toString(), new
StringReader(sContent.toString()));

Of note here is that the analyzer object is of type DefaultAnalyzer [1].
The analyzer.reusableTokenStream API is also deprecated, as you've noted,
so I am just wondering what the suggested API semantics are for the
upgrade.
Thanks in advance again for any input.
Lewis

[0]
https://github.com/DigitalPebble/behemoth/blob/master/mahout/src/main/java/com/digitalpebble/behemoth/mahout/LuceneTokenizerMapper.java#L52-L53
[1]
http://svn.apache.org/repos/asf/mahout/tags/mahout-0.7/core/src/main/java/org/apache/mahout/vectorizer/DefaultAnalyzer.java

Re: Replacement for DefaultAnalyzer

Posted by Suneel Marthi <sm...@apache.org>.
Mahout 0.9 and 0.10.0 are using Lucene 4.6.1. There's been a change in the
TokenStream workflow in Lucene post-Lucene 4.5.

What exactly are u trying to do and where is it u r stuck now? It would
help if u posted a code snippet or something.

On Sat, May 9, 2015 at 10:51 AM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Suneel,
> Yes, this is true. It was dropped precisely because of the Lucene upgrade.
> I'm still working out what to use instead, as the underlying Analyzer API
> in Lucene has also dropped the reusableTokenStream method!
> Is there any other advice on what would be a suitable replacement when
> making the upgrade?
> The project I am working on uses Hadoop 1.2.X and will not be upgrading
> for a while. Mahout 0.9 would work perfectly well with this distro; however,
> the upgrade is slightly confusing right now because the API is broken as
> opposed to deprecated.
> Thanks again for any help.
> Lewis

Re: Replacement for DefaultAnalyzer

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Suneel,
Yes, this is true. It was dropped precisely because of the Lucene upgrade.
I'm still working out what to use instead, as the underlying Analyzer API
in Lucene has also dropped the reusableTokenStream method!
Is there any other advice on what would be a suitable replacement when
making the upgrade?
The project I am working on uses Hadoop 1.2.X and will not be upgrading
for a while. Mahout 0.9 would work perfectly well with this distro; however,
the upgrade is slightly confusing right now because the API is broken as
opposed to deprecated.
Thanks again for any help.
Lewis

On Saturday, May 9, 2015, Suneel Marthi <sm...@apache.org> wrote:

> Not sure how this was used in 0.7 (it's > 3 yrs legacy). But I am guessing
> this would have been required for Lucene 3x back then and must have been
> dropped for the Lucene 4x upgrade for 0.8 (circa late 2012).


-- 
*Lewis*

Re: Replacement for DefaultAnalyzer

Posted by Suneel Marthi <sm...@apache.org>.
Not sure how this was used in 0.7 (it's > 3 yrs legacy). But I am guessing
this would have been required for Lucene 3x back then and must have been
dropped for the Lucene 4x upgrade for 0.8 (circa late 2012).

On Fri, May 8, 2015 at 8:03 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Folks,
> I'm upgrading from Mahout 0.7 to 0.9.
> I am experiencing the same problem as described in the following post
> [0].
> Can someone please suggest what I should replace DefaultAnalyzer with? I am
> aware that it was removed from the Mahout API in 0.8.
> In the meantime I am going to test an implementation based on Lucene's base
> implementation for the Lucene version matching Mahout 0.9.
> Thanks in advance to anyone who has the context here.
> Best
> Lewis
>
> [0] http://www.mail-archive.com/user%40mahout.apache.org/msg14344.html
>
> --
> *Lewis*
>