Posted to java-user@lucene.apache.org by ashwin kumar <gv...@gmail.com> on 2007/03/12 07:03:01 UTC

pdf box help

Hi all, I am able to convert a PDF into a text file using PDFBox. This
is the code I used:

import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class PDFConvert
{
    public static void main(String[] args)
    {
        String pdfFile = "D:\\ASHWIN\\res\\ashwin.pdf";
        PDDocument doc = null;
        try
        {
            // Load the PDF and extract its text content
            doc = PDDocument.load(pdfFile);
            PDFTextStripper strip = new PDFTextStripper();
            String content = strip.getText(doc);
            System.out.println(content);
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
        finally
        {
            // Release the resources held by the loaded document
            if (doc != null)
            {
                try { doc.close(); } catch (Exception ignored) { }
            }
        }
    }
}

Now I want to index this text with Lucene. What code is required
for that? Please help.

regards
ashwin

Re: pdf box help

Posted by karl wettin <ka...@gmail.com>.

On 12 Mar 2007, at 07:54, ashwin kumar wrote:

> Yes, sorry, got it. But that link only contains a program to index
> text. I have already successfully indexed .txt files; now I want to
> index a PDF.

You cannot index the PDF itself. You need to index the text you have
extracted from it.

>    content = strip.getText(doc);

So add content to a field of the document you index.
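
To make the advice above concrete: in Lucene 2.1 (the version linked elsewhere in this thread) this generally means creating a Document, adding a Field that holds the extracted string, and passing the Document to IndexWriter.addDocument. The sketch below deliberately does not use Lucene, so it runs without any jars; it only mimics the idea of a stored "path" field and a tokenized "contents" field in plain Java. The class FieldIndexSketch and its methods are hypothetical names made up for illustration, and the tokenization is far cruder than a real analyzer.

```java
import java.util.*;

public class FieldIndexSketch {
    // term -> ids of documents whose "contents" field contains that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document id -> stored "path" field
    private final List<String> paths = new ArrayList<>();

    /** Adds one document: a stored path plus tokenized, searchable contents. */
    public int addDocument(String path, String contents) {
        int id = paths.size();
        paths.add(path);
        // Crude tokenizer: lower-case, split on anything that is not a letter or digit
        for (String term : contents.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    /** Returns the stored paths of all documents whose contents include the term. */
    public List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        Set<Integer> ids = postings.getOrDefault(
                term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
        for (int id : ids) {
            hits.add(paths.get(id));
        }
        return hits;
    }

    public static void main(String[] args) {
        FieldIndexSketch index = new FieldIndexSketch();
        // In the program from this thread, this string would be the result of
        // strip.getText(doc); a literal stands in for it here.
        String content = "Text extracted from the PDF by PDFTextStripper";
        index.addDocument("ashwin.pdf", content);
        System.out.println(index.search("extracted")); // prints [ashwin.pdf]
    }
}
```

The point of the sketch is that the index only ever sees the extracted string; the PDF file itself is referenced by its stored path and nothing more.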

-- 
karl


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: pdf box help

Posted by ashwin kumar <gv...@gmail.com>.

Yes, sorry, got it. But that link only contains a program to index text.
I have already successfully indexed .txt files; now I want to index a PDF.


Re: pdf box help

Posted by karl wettin <ka...@gmail.com>.

On 12 Mar 2007, at 07:44, ashwin kumar wrote:

> It says that the requested URL is not found.

Compare the URL in your browser with the URL in the mail. Perhaps  
your mail client does not handle the line feed?




Re: pdf box help

Posted by ashwin kumar <gv...@gmail.com>.

It says that the requested URL is not found.


Re: pdf box help

Posted by karl wettin <ka...@gmail.com>.

On 12 Mar 2007, at 07:03, ashwin kumar wrote:

> Hi all, I am able to convert a PDF into a text file using PDFBox.
> This is the code I used:
> {
>
>    String pdfFile=new String ("D:\\ASHWIN\\res\\ashwin.pdf");
>    PDDocument doc = PDDocument.load(pdfFile);
>    PDFTextStripper strip = new PDFTextStripper();
>    content = strip.getText(doc);
>    System.out.println(content);
> }

> Now I want to index this text with Lucene. What code is required
> for that? Please help.

You might want to start here:

<http://lucene.apache.org/java/2_1_0/api/overview- 
summary.html#overview_description>

There are lots of tutorials out there. Try your favorite search engine.

-- 
karl


Re: pdf box help

Posted by Steven Rowe <sa...@syr.edu>.

This may help:

http://www.pdfbox.org/userguide/text_extraction.html#Lucene+Integration


