Posted to java-user@lucene.apache.org by Pinky Iyer <pi...@yahoo.com> on 2003/02/25 22:49:55 UTC
xpdf parser usage for lucene
Hi!
I am trying to use xpdf as a PDF parser. The problem is that when I encounter a file with a .pdf extension, I call pdftotext to convert it to text, which uses the file system and leaves a file with the same name and a .txt extension in the same directory. How can I get the output as a stream and not use the file system at all? Also, how do I access the summary and title info? If anybody has done this before, please help!
Thanks!
Pinky Iyer
---------------------------------
Do you Yahoo!?
Yahoo! Tax Center - forms, calculators, tips, and more
Re: xpdf parser usage for lucene
Posted by Pinky Iyer <pi...@yahoo.com>.
This means that I have to use the HTML parser again on the converted document. Is that right? Also, is there a way to do this without using the file system, for example with streams?
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: xpdf parser usage for lucene
Posted by Michael Wechner <mi...@wyona.org>.
Pinky Iyer wrote:
>Hi !
> I am trying to use xpdf for pdf parser, the problem i encounter is when i encounter a file with .pdf extension, i call the pdftotext script to convert to text, which in turn uses the file system and leaves the same file with .txt extension in same dir. How can i get this as a stream and not use the file system at all. Also How do i access the summary and title info.
>
xpdf has an option to turn the PDF into HTML instead of plain text, which
allows you to use an HTMLParser to populate the fields.
Concerning the extension: when you create your Lucene document, you could
replace the .txt extension with the .pdf extension in the "uri" field.
HTH
Michael
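The extension swap suggested above for the "uri" field can be sketched as a small helper. This is only an illustration: the class and method names (UriField, toPdfUri) are hypothetical, and creating the Lucene field itself is left out because that API differs between Lucene versions.

```java
final class UriField {

    /**
     * Returns the path with a trailing ".txt" replaced by ".pdf".
     * Paths that do not end in ".txt" (case-insensitive) are returned unchanged.
     */
    static String toPdfUri(String txtPath) {
        String suffix = ".txt";
        int start = txtPath.length() - suffix.length();
        // regionMatches returns false when start is negative, so short names are safe.
        if (txtPath.regionMatches(true, start, suffix, 0, suffix.length())) {
            return txtPath.substring(0, start) + ".pdf";
        }
        return txtPath;
    }
}
```

The returned string is what would be stored in the "uri" field, so that search results point at the original PDF rather than at the converted text file.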
Re: xpdf parser usage for lucene
Posted by Pinky Iyer <pi...@yahoo.com>.
Thanks Bruce!
I don't know how I missed that! Thanks anyway! It works now, though I am still stuck on the title and summary.
P Iyer
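One possible route to the title, not confirmed by anyone in this thread: the xpdf distribution also ships a pdfinfo tool that prints document metadata to stdout as "Key: value" lines, and its output can be captured with the same Runtime.exec technique used for pdftotext. The sketch below only parses such output. The class name PdfInfoParser is hypothetical, and the sample text in the usage note is made up for illustration. PDF metadata has no field called "summary"; the Subject entry is probably the closest match, if the document sets it at all.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PdfInfoParser {

    /**
     * Parses "Key:   value" lines into an ordered map.
     * Lines without a colon, or starting with one, are skipped.
     */
    static Map<String, String> parse(String pdfinfoOutput) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : pdfinfoOutput.split("\\r?\\n")) {
            // Split on the first colon only, so values such as dates keep their colons.
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue;
            }
            String key = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            fields.put(key, value);
        }
        return fields;
    }
}
```

Given illustrative output such as "Title:          Annual Report" followed by "Subject:        Finance", parse(...).get("Title") yields the title string, which can then be stored in its own Lucene field.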
Re: xpdf parser usage for lucene
Posted by Bruce Ritchie <br...@jivesoftware.com>.
Pinky,
If you had actually read the documentation that came with pdftotext, you would know that if you pass
a - (dash) as the output filename, it streams the text to stdout. This is exactly what the
code Matt Tucker showed you before did; it is copied below. It's all there in his message.
As for the summary and title info, you'll probably have to use a PDF parsing library to get
that from the PDF.
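    // PATH_TO_XPDF is the path to the pdftotext executable.
    // The trailing "-" tells pdftotext to write the text to stdout.
    String[] cmd = new String[] {
        PATH_TO_XPDF, "-enc", "UTF-8", "-q", PDF_FILE_TO_PARSE, "-"};
    Process p = Runtime.getRuntime().exec(cmd);
    BufferedInputStream bis = new BufferedInputStream(p.getInputStream());
    InputStreamReader reader = new InputStreamReader(bis, "UTF-8");
    StringWriter out = new StringWriter();
    char[] buf = new char[512];
    int len;
    // Copy the converted text from the child's stdout into memory.
    while ((len = reader.read(buf)) >= 0) {
        out.write(buf, 0, len);
    }
    reader.close();
    p.waitFor();                    // wait for pdftotext to exit once its output is drained
    String text = out.toString();   // the extracted text, ready for indexing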
You should of course wrap this in a try/catch block, etc.
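For reference, here is one way the snippet might look once wrapped up with error handling. It is a sketch under assumptions, not the code from Matt Tucker's message: the class and method names (PdfTextExtractor, buildCommand, readAll, extractText) are hypothetical, it uses ProcessBuilder rather than Runtime.exec, and it assumes a Java version with try-with-resources and StandardCharsets (7 or later).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;

final class PdfTextExtractor {

    /** Builds the pdftotext command line; the trailing "-" sends the text to stdout. */
    static String[] buildCommand(String pdftotextPath, String pdfFile) {
        return new String[] {pdftotextPath, "-enc", "UTF-8", "-q", pdfFile, "-"};
    }

    /** Reads a stream to the end, decoding it as UTF-8, and closes it. */
    static String readAll(InputStream in) throws IOException {
        StringWriter out = new StringWriter();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int len;
            while ((len = reader.read(buf)) >= 0) {
                out.write(buf, 0, len);
            }
        }
        return out.toString();
    }

    /** Runs pdftotext on one file and returns the extracted text without touching disk. */
    static String extractText(String pdftotextPath, String pdfFile)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(pdftotextPath, pdfFile));
        // Pass the child's stderr through to ours so it cannot block on a full pipe.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process p = pb.start();
        try {
            String text = readAll(p.getInputStream());
            int exit = p.waitFor();
            if (exit != 0) {
                throw new IOException("pdftotext exited with status " + exit);
            }
            return text;
        } finally {
            p.destroy(); // harmless if the process has already exited
        }
    }
}
```

A caller would then do something like extractText("/usr/bin/pdftotext", "report.pdf"), where both paths are placeholders, and hand the returned string to the indexer.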
Regards,
Bruce Ritchie
Pinky Iyer wrote:
> Hi !
> I am trying to use xpdf for pdf parser, the problem i encounter is when
> i encounter a file with .pdf extension, i call the pdftotext script to convert
> to text, which in turn uses the file system and leaves the same file with
> .txt extension in same dir. How can i get this as a stream and not use
> the file system at all. Also How do i access the summary and title info.
> Anybody who has done this before, please help!
> Thanks!
> Pinky Iyer
--
AOL - bruceritchie101
ICQ - 9929791
MSN - bruce_ritchie101@hotmail.com
http://www.jivesoftware.com/