Posted to java-user@lucene.apache.org by blaplante <bl...@netwebapps.com> on 2003/05/06 17:43:05 UTC

Corrupt index problem

I am not sure whether I am doing something wrong, but searching the index
returns a summary field that looks like the following:

%PDF-1.2 %âãÏÓ 10 0 obj << /Length 11 0 R /Filter /FlateDecode >> stream
H?µWÛnÜHýÿC=&?X®«.oÛñeá
¾?ÝN0?_äî²­?ºå?Ôvü÷KÖM¥NË3?Ì"Ð??Å"yxxøy¾'?HDJT?Èüh?üÓ>?½?Jò$?ðß÷{4¡?sø¹
ûø[2øýB>Ü\ÍæÇGäúâdþ}vuLð?

I am using: 
lucene-1.2.jar
PDFBox-0.6.2.jar

I wrote a class that formulates a dynamic URL and uses HTTPS to connect via
the web and retrieve an InputStream using a GET_METHOD. I am passing the
InputStream off to the HTMLParser(InputStream stream) constructor, calling
parser.getReader() to store the content in the index, and calling
parser.getSummary() to store the summary in the index. When I used the
demo that came with Lucene and built a local index on my own machine, the
HTMLParser(File file) constructor worked just fine. If anyone has ideas as
to what I might be doing wrong, I would appreciate the heads up.

Thanks

Bryan LaPlante

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Corrupt index problem

Posted by Ben Litchfield <be...@csh.rit.edu>.
Are you using the org.pdfbox.searchengine.lucene.LucenePDFDocument class?  It
looks like you are using the HTMLParser to parse a PDF document, which will
not work.  So in your indexer class you need to have something to this
effect:

if( PDF Document )
{
    use LucenePDFDocument
}
else if( HTML Document )
{
    use HTMLParser
}

You can see a working example of this in the
org.pdfbox.searchengine.lucene.IndexFiles source file.
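
The summary quoted in the question starts with "%PDF-1.2", which is the
header of a PDF file, so one way to make the "PDF Document" test above
concrete is to peek at the first bytes of the stream before choosing a
parser. The following is only a minimal sketch under that assumption: the
class name PdfSniffer and both method names are hypothetical and are not
part of Lucene or PDFBox.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class PdfSniffer {
    // Every PDF file begins with the ASCII bytes "%PDF-".
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given leading bytes start with the PDF header. */
    public static boolean startsWithPdfHeader(byte[] head) {
        if (head == null || head.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (head[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Peeks at the start of the stream and rewinds it, so the caller can
     * still hand the complete stream to whichever parser it picks.
     */
    public static boolean looksLikePdf(BufferedInputStream in) throws IOException {
        in.mark(PDF_MAGIC.length);
        byte[] head = new byte[PDF_MAGIC.length];
        int read = 0;
        try {
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    break; // stream ended before five bytes were available
                }
                read += n;
            }
        } finally {
            in.reset(); // rewind to the marked position
        }
        return read == head.length && startsWithPdfHeader(head);
    }
}
```

Wrapping the downloaded InputStream in a BufferedInputStream and calling
looksLikePdf on it would tell the indexer whether to take the PDF branch or
the HTML branch. Checking the Content-Type header of the HTTP response is an
alternative, but it depends on the server reporting the type correctly.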

It would be nice if Lucene had some sort of front-end class that implemented
a command pattern, so you could feed it a URL or a File, it would determine
the MIME type, and it would call a parser configured for that MIME type.
It seems silly that everybody has all of these if statements in all of
their indexers to switch on the content type.  Maybe I will work on that
one of these days if there is some interest.
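
To illustrate the idea, here is a rough sketch of what such a front-end
lookup could look like. Nothing like this exists in Lucene as far as this
thread is concerned; ParserRegistry, register, and lookup are hypothetical
names, and the type parameter P stands for whatever parser or command
object an indexer wants to associate with a MIME type.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ParserRegistry<P> {
    // Maps a normalized MIME type such as "application/pdf" to its parser.
    private final Map<String, P> parsers = new HashMap<>();

    /** Associates a parser with a MIME type. */
    public void register(String mimeType, P parser) {
        parsers.put(normalize(mimeType), parser);
    }

    /** Finds the parser for a Content-Type value, or fails loudly. */
    public P lookup(String contentType) {
        P parser = parsers.get(normalize(contentType));
        if (parser == null) {
            throw new IllegalArgumentException(
                    "No parser registered for: " + contentType);
        }
        return parser;
    }

    // Drops parameters such as "; charset=UTF-8" and ignores case, so
    // "Text/HTML; charset=UTF-8" and "text/html" map to the same key.
    static String normalize(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }
}
```

An indexer would register one entry per supported type at startup and then
replace its chain of if statements with a single lookup on the content type
reported for each URL or file.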

Ben


On Tue, 6 May 2003, blaplante wrote:

> I am not sure if I might be doing something wrong but the result of
> searching the index produces a summary field that looks like the following.
>
> %PDF-1.2 %âãÏÓ 10 0 obj << /Length 11 0 R /Filter /FlateDecode >> stream
> H?µWÛnÜHýÿC=&?X®«.oÛñeá
> ¾?ÝN0?_äî²­?ºå?Ôvü÷KÖM¥NË3?Ì"Ð??Å"yxxøy¾'?HDJT?Èüh?üÓ>?½?Jò$?ðß÷{4¡?sø¹
> ûø[2øýB>Ü\ÍæÇGäúâdþ}vuLð?
>
> I am using:
> lucene-1.2.jar
> PDFBox-0.6.2.jar
>
> I wrote a class that formulates a dynamic url and uses https to connect via
> the web and retrieve an InputStream using a GET_METHOD. I am passing the
> InputStream off the HTMLParser(InputStream stream) constructor and calling
> parser.getReader() to store the content into the index and I am calling
> parser.getSummary() to store the summary into the index. When I used the
> demo that came with lucene and did a local index on my own machine the
> HTMLParser(File file) constructor worked just fine. If anyone has idea's as
> to what I might be doing wrong I would appreciate the heads up.
>
> Thanks
>
> Bryan LaPlante
