Posted to java-user@lucene.apache.org by Manjula Wijewickrema <ma...@gmail.com> on 2014/06/11 08:23:43 UTC
ShingleAnalyzerWrapper question
Hi,
In my programme, I can index and search a document based on unigrams. I
modified the code as follows to obtain the results based on bigrams.
However, it did not give me the desired output.
*****************
public static void createIndex() throws CorruptIndexException,
        LockObtainFailedException, IOException {
    final String[] NEW_STOP_WORDS = {"a", "able", "about",
            "actually", "after", "allow", "almost", "already", "also", "although",
            "always", "am", "an", "and", "any", "anybody"}; // only a portion
    SnowballAnalyzer analyzer = new SnowballAnalyzer("English", NEW_STOP_WORDS);
    Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
    ShingleAnalyzerWrapper sw = new ShingleAnalyzerWrapper(analyzer, 2);
    sw.setOutputUnigrams(false);
    IndexWriter w = new IndexWriter(INDEX_DIRECTORY, analyzer,
            true, IndexWriter.MaxFieldLength.UNLIMITED);
    File dir = new File(FILES_TO_INDEX_DIRECTORY);
    File[] files = dir.listFiles();
    for (File file : files) {
        Document doc = new Document();
        String text = "";
        doc.add(new Field("contents", text, Field.Store.YES,
                Field.Index.UN_TOKENIZED, Field.TermVector.YES));
        Reader reader = new FileReader(file);
        doc.add(new Field(FIELD_CONTENTS, reader));
        w.addDocument(doc);
    }
    w.optimize();
    w.close();
}
****************
The output is still:
{contents: /1, assist/1, fine/1, librari/1, librarian/1, main/1, manjula/3,
name/1, sabaragamuwa/1, univers/1}
*******************
If anybody can, please help me to obtain the correct output.
Thanks,
Manjula.
Re: ShingleAnalyzerWrapper question
Posted by Manjula Wijewickrema <ma...@gmail.com>.
Dear Steve,
It works. Thanks.
On Wed, Jun 11, 2014 at 6:18 PM, Steve Rowe <sa...@gmail.com> wrote:
> You should give sw rather than analyzer in the IndexWriter constructor.
>
> Steve
> www.lucidworks.com
Re: ShingleAnalyzerWrapper question
Posted by Steve Rowe <sa...@gmail.com>.
You should give sw rather than analyzer in the IndexWriter constructor.
Steve
www.lucidworks.com
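
For readers following the thread: in the posted code, sw is created but never handed to the IndexWriter, so the index is still built from the plain SnowballAnalyzer. The fix is to pass sw as the analyzer argument. To illustrate roughly what kind of terms to expect once that is done, here is a minimal plain-Java sketch of two-word shingling. It does not use Lucene; the class ShingleSketch, the method bigrams, and the sample tokens are hypothetical, and Lucene's own shingle filter may treat positions left empty by removed stop words differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: joins adjacent tokens into bigrams.
class ShingleSketch {

    // Returns every pair of neighbouring tokens joined by a single space.
    // No unigrams are emitted, mirroring setOutputUnigrams(false).
    static List<String> bigrams(List<String> tokens) {
        List<String> shingles = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            shingles.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return shingles;
    }

    public static void main(String[] args) {
        // Sample stemmed tokens, loosely modelled on the terms in the question.
        List<String> tokens = Arrays.asList("name", "manjula", "assist", "librarian");
        // prints [name manjula, manjula assist, assist librarian]
        System.out.println(bigrams(tokens));
    }
}
```

With the shingle wrapper actually in use, the term list for the contents field should look like these two-word entries rather than the single stemmed words shown in the question.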