Posted to java-user@lucene.apache.org by Manjula Wijewickrema <ma...@gmail.com> on 2014/06/11 08:23:43 UTC

ShingleAnalyzerWrapper question

Hi,

In my programme, I can index and search a document based on unigrams. I
modified the code as follows to obtain the results based on bigrams.
However, it did not give me the desired output.

*****************

public static void createIndex() throws CorruptIndexException,
        LockObtainFailedException, IOException {

    final String[] NEW_STOP_WORDS = {"a", "able", "about", "actually",
            "after", "allow", "almost", "already", "also", "although",
            "always", "am", "an", "and", "any", "anybody"};  // only a portion

    SnowballAnalyzer analyzer = new SnowballAnalyzer("English", NEW_STOP_WORDS);

    Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);

    ShingleAnalyzerWrapper sw = new ShingleAnalyzerWrapper(analyzer, 2);
    sw.setOutputUnigrams(false);

    IndexWriter w = new IndexWriter(INDEX_DIRECTORY, analyzer, true,
            IndexWriter.MaxFieldLength.UNLIMITED);

    File dir = new File(FILES_TO_INDEX_DIRECTORY);
    File[] files = dir.listFiles();

    for (File file : files) {
        Document doc = new Document();

        String text = "";
        doc.add(new Field("contents", text, Field.Store.YES,
                Field.Index.UN_TOKENIZED, Field.TermVector.YES));

        Reader reader = new FileReader(file);
        doc.add(new Field(FIELD_CONTENTS, reader));

        w.addDocument(doc);
    }

    w.optimize();
    w.close();
}


****************

The output still contains only unigrams:


{contents: /1, assist/1, fine/1, librari/1, librarian/1, main/1, manjula/3,
name/1, sabaragamuwa/1, univers/1}

*******************


If anybody can, please help me to obtain the correct output.


Thanks,


Manjula.

Re: ShingleAnalyzerWrapper question

Posted by Manjula Wijewickrema <ma...@gmail.com>.

Dear Steve,

It works. Thanks.




On Wed, Jun 11, 2014 at 6:18 PM, Steve Rowe <sa...@gmail.com> wrote:

> You should pass sw rather than analyzer to the IndexWriter constructor.
>
> Steve
> www.lucidworks.com

Re: ShingleAnalyzerWrapper question

Posted by Steve Rowe <sa...@gmail.com>.

You should pass sw rather than analyzer to the IndexWriter constructor.

Steve
www.lucidworks.com
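
For readers following the thread: in the posted code the change amounts to replacing analyzer with sw in the existing IndexWriter constructor call, so that the shingle wrapper is the analyzer actually used at index time. To show roughly what kind of terms a bigram-only index should then contain, here is a minimal plain-Java sketch. It is not Lucene's implementation; the class name ShingleSketch, the method bigrams, and the sample tokens are hypothetical. It assumes adjacent tokens are joined with a single space, which is ShingleFilter's default token separator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: mimics bigram shingling on an already-analyzed token list.
class ShingleSketch {

    // Joins each pair of adjacent tokens with a single space.
    // A list with fewer than two tokens yields no bigrams.
    static List<String> bigrams(List<String> tokens) {
        List<String> shingles = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            shingles.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return shingles;
    }

    public static void main(String[] args) {
        // Hypothetical stemmed tokens, not the contents of the indexed file.
        List<String> tokens = Arrays.asList("main", "univers", "librari");
        System.out.println(bigrams(tokens)); // prints [main univers, univers librari]
    }
}
```

With the wrapper in place and setOutputUnigrams(false), the term vector should list two-word terms of this shape instead of the single stemmed words shown in the question. The exact terms depend on the document and on how removed stop words are handled.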