Posted to java-user@lucene.apache.org by neetika <ne...@iflexsolutions.com> on 2007/07/17 13:17:12 UTC

getting problem while indexing pdf files with pdfbox

http://www.nabble.com/file/p11647342/DRra0026.pdf DRra0026.pdf 

Hi all,

I am able to convert a PDF into text using PDFBox, but I am not able to
index it. This is the code that I used:

// code for parsing and making the index

    public class TestPDFParser {

        public Document getDocument(InputStream is) {
            // Parse the PDF into PDFBox's low-level COS document.
            COSDocument cosDoc = null;
            try {
                PDFParser parser = new PDFParser(is);
                parser.parse();
                cosDoc = parser.getDocument();
            } catch (IOException e) {
                e.printStackTrace();
            }

            // Extract the plain text from the parsed document.
            String docText = null;
            try {
                PDFTextStripper stripper = new PDFTextStripper();
                docText = stripper.getText(new PDDocument(cosDoc));
            } catch (IOException e) {
                e.printStackTrace();
            }

            // Put the text into a stored, tokenized "body" field.
            Document doc = new Document();
            if (docText != null) {
                doc.add(new Field("body", docText, Field.Store.YES,
                        Field.Index.TOKENIZED));
            }
            return doc;
        }

        public static void main(String[] args) throws Exception {
            TestPDFParser handler = new TestPDFParser();

            Document doc = handler.getDocument(
                    new FileInputStream(new File("D:\\lucenePdf\\DRra0026.pdf")));

            System.out.println(doc);

            // Following code is for making index
            IndexWriter f_writer = new IndexWriter("D:\\lucenePdf",
                    new StandardAnalyzer(), true);

            f_writer.addDocument(doc);
        }
    }
// code for searching a particular string

    public static void main(String[] args) throws Exception {
        String indexDir = "D:\\lucenePdf";
        String q = "RA0083";

        Directory fsDir = FSDirectory.getDirectory(indexDir);
        IndexSearcher is = new IndexSearcher(fsDir);

        Query query = new QueryParser("body", new StandardAnalyzer()).parse(q);

        Hits hits = is.search(query);
        System.out.println("Found " + hits.length()
                + " documents that matched query '" + q + "':");
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
        }
    }


When I run the above code, the indexer class prints the following output:

Document<stored/uncompressed,indexed,tokenized<body:000099062000061300000021000000100110468147201102006PAYOUT : RA0083
000099062000062000000021000000100220468148001102006PAYOUT : RA0083 
000099062000063000000021000000100330468153601102006PAYOUT : RA0083 
000099062000064700000021000000100440468155401102006PAYOUT : RA0083 
000099062000065700000021000000100550468156201102006PAYOUT : RA0083 

and the following files are generated in the specified path:

segments.gen
write.lock
segments_4


But when I run the search class, it gives this result:

Found 0 documents that matched query 'RA0083':

I am also attaching the corresponding PDF file for reference.
It seems as if the index is not getting created.

Please help me with your input; it will be very helpful.
-- 
View this message in context: http://www.nabble.com/getting-problem-while-indexing-pdf-files-with-pdfbox-tf4096205.html#a11647342
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: getting problem while indexing pdf files with pdfbox

Posted by neetika <ne...@iflexsolutions.com>.
Hi Erick,

I am getting the results fine now.
The problem was that I forgot to close the writer, so the index file (.cfs)
was not getting generated.

Thanks a lot for the timely help.

Regards,
Neetika





Re: getting problem while indexing pdf files with pdfbox

Posted by Erick Erickson <er...@gmail.com>.
You have NOT supplied an example of the text you extracted
from the document. But let's assume that the interesting
string is exactly what you expect.

Have you looked at your index with Luke to see if the data is there?
I *strongly* suggest you get a copy of Luke (google lucene luke) to
examine indexes with.

The existence of the write.lock file suggests that you haven't closed your
index writer prior to searching, although flushing it would probably work
too. Be aware that you cannot see changes to an index if the reader you use
was opened before the indexing operation. Also, there is some period of
time when the indexed data is buffered by the writer, and I doubt (though
I'm not sure) that it's available to searchers until it has been flushed.
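
To illustrate the buffering point with something that runs without Lucene
on the classpath, here is a minimal sketch that uses a plain BufferedWriter
as a stand-in for the index writer. It is an analogy only, not how Lucene
works internally: the class name BufferedWriteDemo and the helper
sizesBeforeAndAfterClose are made up for this example. It writes a short
string, checks the file size while the writer is still open, and checks
again after close().

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class: a BufferedWriter stands in for a buffering index writer.
final class BufferedWriteDemo {

    /** Returns { file size before close(), file size after close() } in bytes. */
    static long[] sizesBeforeAndAfterClose(String text) {
        try {
            Path file = Files.createTempFile("buffer-demo", ".txt");
            try {
                BufferedWriter writer =
                        Files.newBufferedWriter(file, StandardCharsets.UTF_8);
                long before;
                try {
                    // Text shorter than the writer's internal buffer stays in memory.
                    writer.write(text);
                    before = Files.size(file);
                } finally {
                    // close() flushes the buffered characters to disk.
                    writer.close();
                }
                long after = Files.size(file);
                return new long[] { before, after };
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] sizes = sizesBeforeAndAfterClose("RA0083");
        System.out.println("before close: " + sizes[0]
                + " bytes, after close: " + sizes[1] + " bytes");
    }
}
```

In the indexing code from this thread, the corresponding step would
presumably be a call to f_writer.close() after f_writer.addDocument(doc),
ideally in a finally block, before the IndexSearcher is opened.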

I suspect that your problem is not related to PDF, but rather to whether
you've properly indexed data and closed your index prior to searching it.

The other possibility is that your analyzer is parsing things
"interestingly". StandardAnalyzer() does some interesting things when
tokenizing, including lowercasing the input stream, although that shouldn't
be a problem here since you use the same analyzer for indexing and searching.
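
As a rough sketch of why using the same analyzer on both sides matters, the
toy class below lowercases text and splits it on non-alphanumeric
characters. It is a hypothetical stand-in written for this example
(ToyAnalyzer, analyze and matches are not Lucene APIs), and the real
StandardAnalyzer applies more rules than this. The point is only that the
indexed terms are the analyzed terms, so a query matches when it goes
through the same analysis.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical toy analyzer: lowercases and splits on non-alphanumeric characters.
final class ToyAnalyzer {

    /** Turns raw text into the list of terms that would be indexed or searched. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        for (String piece : text.split("[^A-Za-z0-9]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    /** True if every analyzed query term appears among the analyzed indexed terms. */
    static boolean matches(String indexedText, String queryText) {
        List<String> indexTerms = analyze(indexedText);
        for (String queryTerm : analyze(queryText)) {
            if (!indexTerms.contains(queryTerm)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String indexed = "PAYOUT : RA0083";
        System.out.println(analyze(indexed));                    // [payout, ra0083]
        System.out.println(matches(indexed, "RA0083"));          // true
        System.out.println(analyze(indexed).contains("RA0083")); // false
    }
}
```

The last line shows the failure mode to watch for: a raw, un-analyzed
"RA0083" is not among the lowercased terms, which is why a mismatch between
the indexing and searching analyzers can yield zero hits.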

Also, try query.toString() to see what is actually searched; that often
gives insights. Luke will also let you submit queries to the index and will
show you the actual query that was produced.

What are the file sizes of your index files?


Best
Erick


Re: getting problem while indexing pdf files with pdfbox

Posted by neetika <ne...@iflexsolutions.com>.
Hi Erick,

Before indexing, I printed the doc, and I included the output in my post.
It prints fine.
Please check my post again, at the following lines:

" System.out.println(doc);
                 //Following code is for making index"

and the corresponding output is:

Document<stored/uncompressed,indexed,tokenized<body:000099062000061300000021000000100110468147201102006PAYOUT: RA0083
 000099062000062000000021000000100220468148001102006PAYOUT : RA0083
 000099062000063000000021000000100330468153601102006PAYOUT : RA0083
 000099062000064700000021000000100440468155401102006PAYOUT : RA0083
 0000099062000065700000021000000100550468156201102006PAYOUT : RA0083
which is as expected. My problem is that the index file is not getting
generated.

Please help






Re: getting problem while indexing pdf files with pdfbox

Posted by Erick Erickson <er...@gmail.com>.
Offhand, I'd assume that your problem is with how you're using PDFBox. Have
you tried printing out the docText string you get back from

docText = stripper.getText(new PDDocument(cosDoc))?

I'd recommend you assure yourself that you get valid text back from
the PDF document before worrying about indexing it.

Best
Erick
