Posted to java-user@lucene.apache.org by ashwin kumar <gv...@gmail.com> on 2007/03/12 07:03:01 UTC
pdf box help
Hi all, I am able to convert a PDF into text using PDFBox, and this
is the code that I used:
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class PDFConvert
{
    public static void main(String[] args)
    {
        String content = null;
        try
        {
            String pdfFile = "D:\\ASHWIN\\res\\ashwin.pdf";
            // Load the PDF and extract its plain text.
            PDDocument doc = PDDocument.load(pdfFile);
            try
            {
                PDFTextStripper strip = new PDFTextStripper();
                content = strip.getText(doc);
            }
            finally
            {
                doc.close(); // release the resources held by the document
            }
            System.out.println(content);
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}
Now I want to index this text with Lucene. What code is required
for that? Please help.
regards
ashwin
Re: pdf box help
Posted by karl wettin <ka...@gmail.com>.
On 12 Mar 2007, at 07:54, ashwin kumar wrote:
> Yes, sorry, I got it, but that link only contains a program to index
> text. I have already successfully indexed .txt files; now I want to
> index a PDF.
You cannot index the PDF itself. You need to index the text you have
extracted from it.
> >> > content = strip.getText(doc);
So add content to a field.
--
karl
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
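Karl's point above (index the extracted text, not the PDF file, by adding
it to a field) can be illustrated with a small sketch. This is a toy
in-memory index written in plain Java, not Lucene's API: the class
ToyFieldIndex, its methods, and the field name "contents" are all
hypothetical names chosen for illustration. In Lucene itself the
corresponding step is to put the text in a Field on a Document and hand
that Document to an IndexWriter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy stand-in for a search index. NOT Lucene's API; all names are hypothetical.
class ToyFieldIndex {
    // Maps "field:term" to the ids of the documents whose field contains that term.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Indexes one document made of a single named text field and returns its id.
    int addDocument(String field, String text) {
        int docId = nextDocId++;
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Returns the ids of the documents whose field contains the given single term.
    Set<Integer> search(String field, String term) {
        String key = field + ":" + term.toLowerCase(Locale.ROOT);
        return postings.getOrDefault(key, new TreeSet<>());
    }

    // Lowercases the text and splits it on anything that is not a letter or digit.
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // In the thread, this string would come from strip.getText(doc).
        String content = "Text extracted from the PDF by PDFTextStripper";
        ToyFieldIndex index = new ToyFieldIndex();
        int id = index.addDocument("contents", content);
        System.out.println(index.search("contents", "pdf").contains(id));
    }
}
```

The sketch only shows the shape of the step: the index never sees the PDF
bytes, only the string produced by the text extractor, stored under a
field name that later queries refer to.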
Re: pdf box help
Posted by ashwin kumar <gv...@gmail.com>.
Yes, sorry, I got it, but that link only contains a program to index text. I have
already successfully indexed .txt files; now I want to index a PDF.
Re: pdf box help
Posted by karl wettin <ka...@gmail.com>.
On 12 Mar 2007, at 07:44, ashwin kumar wrote:
> It says that the requested URL is not found.
Compare the URL in your browser with the URL in the mail. Perhaps
your mail client does not handle the line feed?
Re: pdf box help
Posted by ashwin kumar <gv...@gmail.com>.
It says that the requested URL is not found.
Re: pdf box help
Posted by karl wettin <ka...@gmail.com>.
On 12 Mar 2007, at 07:03, ashwin kumar wrote:
> Now I want to index this text with Lucene. What code is required
> for that? Please help.
You might want to start here:
<http://lucene.apache.org/java/2_1_0/api/overview-summary.html#overview_description>
There are lots of tutorials out there. Try your favorite search engine.
--
karl
Re: pdf box help
Posted by Steven Rowe <sa...@syr.edu>.
This may help:
http://www.pdfbox.org/userguide/text_extraction.html#Lucene+Integration