Posted to slide-user@jakarta.apache.org by Markus Maeder <Ma...@oase.ch> on 2004/11/01 13:00:15 UTC

Re: lucene, extractor and pdf - HOW?

Quoting Unico Hommes <un...@hippo.nl>:
> ...
> <contentindexer classname="org.apache.slide.index.TextContentIndexer">
>   <parameter name="indexpath">store/index</parameter>
>   <parameter
> name="analyzer">org.apache.lucene.analysis.de.GermanAnalyzer</parameter>
> </contentindexer>

If I look at the CVS source, I see

        IndexWriter writer =
            new IndexWriter(indexDb, new StandardAnalyzer(), false);

in LuceneIndexer.java

As I am no Java expert, I might misunderstand this. But IMHO this will always
use the StandardAnalyzer, whatever analyzer is configured...
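
For illustration only, here is a minimal sketch of how a configured class name
such as the "analyzer" parameter above could be turned into an instance, with a
fallback when the parameter is missing. The class AnalyzerLoader and its method
newInstance are hypothetical names, not part of Slide or Lucene, and whether
Slide's indexer reads the parameter this way is not confirmed here. The sketch
uses modern Java syntax and only the standard library.

```java
import java.util.function.Supplier;

// Hypothetical helper (not part of Slide or Lucene): creates an instance of
// the class named in a configuration parameter, or uses a fallback.
final class AnalyzerLoader {

    private AnalyzerLoader() {
    }

    // className: value of the configuration parameter, may be null or blank
    // type:      the type the configured class must extend or implement
    // fallback:  supplies the default instance when nothing is configured
    static <T> T newInstance(String className, Class<T> type, Supplier<T> fallback) {
        if (className == null || className.trim().isEmpty()) {
            return fallback.get();
        }
        try {
            // asSubclass rejects classes that are not of the expected type
            Class<? extends T> cls = Class.forName(className.trim()).asSubclass(type);
            // Requires a public no-argument constructor on the configured class
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot create " + className, e);
        }
    }
}
```

With Lucene on the classpath, a call along the lines of
AnalyzerLoader.newInstance(name, Analyzer.class, StandardAnalyzer::new) would
then hand the configured analyzer to the IndexWriter instead of a hard-coded
one, assuming the configured analyzer class has a no-argument constructor.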

The problem with the PDF files not getting indexed still exists. The content
gets extracted with PDFBox. I will trace LuceneIndexer now.


Regards
Markus


---------------------------------------------------------------------
To unsubscribe, e-mail: slide-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: slide-user-help@jakarta.apache.org


Re: lucene, extractor and pdf - HOW? - Problem Solved

Posted by Markus Maeder <Ma...@oase.ch>.
Sorry, I looked at code that is no longer used. :)

I have solved the PDF problem by upgrading to the new PDFBox version 0.6.7a.

Regards
Markus




