Posted to users@pdfbox.apache.org by Pramod Pradhan <pr...@gmail.com> on 2009/10/27 01:18:50 UTC

java.io.IOException: expected='startxref'

Hi All,

I am trying to write some simple code that just parses the text data from a
PDF file and prints it to the console. I am hitting the exception below:

java.io.IOException: expected='startxref' actual=''
org.pdfbox.io.PushBackInputStream@100ab23
at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:355)
at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
at PDFTextParser.pdftoText(PDFTextParser.java:49)
at PDFTextParser.main(PDFTextParser.java:93)
PDF to Text Conversion failed.

Can someone please help? I have attached the Java class file.
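
For anyone hitting the same message: a PDF normally ends with a
'startxref' keyword, a byte offset, and a '%%EOF' marker, so one possible
cause (an assumption, not a confirmed diagnosis for this file) is a
truncated or incomplete download. Below is a minimal sketch that checks the
end of a file for those markers. TrailerCheck, hasTrailer, readTail and the
2048-byte window are hypothetical names and values chosen for illustration;
none of this is PDFBox API.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: looks at the last bytes of a file
// and reports whether 'startxref' followed by '%%EOF' is present.
final class TrailerCheck {

    // Number of trailing bytes to scan; an arbitrary choice for this sketch.
    private static final int TAIL_BYTES = 2048;

    // True if 'startxref' occurs in the tail and '%%EOF' appears after it.
    static boolean hasTrailer(byte[] tail) {
        // ISO-8859-1 maps every byte to one char, so binary data decodes safely.
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        int xref = text.lastIndexOf("startxref");
        return xref >= 0 && text.indexOf("%%EOF", xref) > xref;
    }

    // Reads up to TAIL_BYTES bytes from the end of the file at 'path'.
    static byte[] readTail(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long length = file.length();
            int size = (int) Math.min(TAIL_BYTES, length);
            byte[] tail = new byte[size];
            file.seek(length - size);
            file.readFully(tail);
            return tail;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java TrailerCheck <file.pdf>");
            return;
        }
        boolean ok = hasTrailer(readTail(args[0]));
        System.out.println(ok ? "trailer found" : "no startxref/%%EOF near end of file");
    }
}
```

If the check fails, the file itself is worth re-fetching or opening in a
PDF viewer before looking further at the parsing code. If it passes, the
cause lies elsewhere and this sketch does not narrow it down.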

-- 
thanks,
Pramod Pradhan
(361)228-3989

Re: Paradox with Eclipse and PDFStripper.processPages

Posted by Shen Wang <fe...@gmail.com>.
And by the way, I am still not clear on how to get Eclipse to compile
the code.

Felix



Re: Paradox with Eclipse and PDFStripper.processPages

Posted by Shen Wang <fe...@gmail.com>.
Hi Andreas,

Thanks for your reply. But what I am looking for is a way to further
process the formatting information of the document rather than simply
extract the text. So basically, what I am trying to do is let the stripper
object know which document it is processing without calling the writeText
method. Do you have any idea how to do this?
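
The "not visible" error is consistent with processPages being a protected
method, meaning it is callable from subclasses but not from an unrelated
class such as PDF_Title. Under that assumption, one common arrangement is
to subclass the stripper, override a protected hook, and still enter
through the public writeText call so that the document gets set. The sketch
below uses stand-in classes (SketchStripper, LengthRecordingStripper,
processPage, and the List<String> "document" are all hypothetical) to show
the pattern only. It is not the PDFBox implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for a stripper-like base class; NOT the PDFBox implementation.
class SketchStripper {
    protected List<String> document;   // set by the public entry point
    protected StringBuilder output;

    // Public entry point: remembers the document, then drives the protected hooks.
    public void writeText(List<String> doc, StringBuilder out) {
        this.document = doc;
        this.output = out;
        processPages(doc);
    }

    // Protected: reachable from subclasses, not from unrelated classes
    // that live in a different package.
    protected void processPages(List<String> pages) {
        for (int i = 0; i < pages.size(); i++) {
            processPage(i + 1, pages.get(i));
        }
    }

    // Per-page hook that subclasses may override.
    protected void processPage(int number, String text) {
        output.append(text).append('\n');
    }
}

// Subclass that records extra per-page information while the text is written.
class LengthRecordingStripper extends SketchStripper {
    final List<String> notes = new ArrayList<>();

    @Override
    protected void processPage(int number, String text) {
        // 'document' is available here because writeText set it first.
        notes.add("page " + number + " of " + document.size()
                + ": " + text.length() + " chars");
        super.processPage(number, text);
    }
}

final class StripperDemo {
    public static void main(String[] args) {
        LengthRecordingStripper stripper = new LengthRecordingStripper();
        StringBuilder out = new StringBuilder();
        stripper.writeText(Arrays.asList("first page", "second"), out);
        System.out.println(stripper.notes);
    }
}
```

The point of the design is that the subclass never calls processPages
itself: the public method sets up the state and the overridden hook only
adds behaviour. Whether PDFTextStripper exposes a comparable hook for
formatting data should be checked against the javadoc of the PDFBox
version in use.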

Best,

Felix



Re: Paradox with Eclipse and PDFStripper.processPages

Posted by Andreas Lehmkühler <an...@lehmi.de>.
Hi,

Shen Wang wrote:
> Hi guys,
> 
> I got a weird thing that I don't know how to make it work. Here is the
> code:
> 
> import java.io.File;
> import java.io.IOException;
> import java.util.List;
> 
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.util.PDFText2HTML;
> import org.apache.pdfbox.util.PDFTextStripper;
> 
> 
> public class PDF_Title {
>      public PDF_Title() {
>          }
>      public static void main( String[] args ) throws IOException {
>        if ( args.length != 1 ) {
>            System.out.println( "bad input" );
>        }
>        String pdfFileName = args[ 0 ];
>        PDDocument document = PDDocument.load( pdfFileName );
>        PDFTextStripper stripper = null;
>        stripper = new PDFText2HTML("UTF-8");
>        List pages = document.getDocumentCatalog().getAllPages();
>        stripper.processPages(pages);
>    }
> }
> 
> The problem is in the last line, if I leave the parameter of
> processPages and blank, Eclipse will remind me that a pages list
> parameter is needed and asks me to fill in. However, when I fill the
> blank with the parameter, which is "pages" here, Eclipse will tell me
> that the method of processPages from the type PDFTextStripper is not
> visible and still refuses to compile. However, according to the javadoc,
> processPages is simply a method of PDFTextStripper and asks for a page
> list parameter. Could you guys help me point out where I made the
> mistake? Thanks.
Try to use stripper.writeText(document, outputStream) instead of
stripper.processPages(..)

BR
Andreas Lehmkühler


Paradox with Eclipse and PDFStripper.processPages

Posted by Shen Wang <fe...@gmail.com>.
Hi guys,

I have run into something odd that I cannot get to work. Here is the code:

import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFText2HTML;
import org.apache.pdfbox.util.PDFTextStripper;


public class PDF_Title {

    public PDF_Title() {
    }

    public static void main( String[] args ) throws IOException {
        if ( args.length != 1 ) {
            System.out.println( "bad input" );
            return; // stop here rather than reading args[ 0 ] below
        }
        String pdfFileName = args[ 0 ];
        PDDocument document = PDDocument.load( pdfFileName );
        PDFTextStripper stripper = new PDFText2HTML("UTF-8");
        List pages = document.getDocumentCatalog().getAllPages();
        // This is the line Eclipse rejects as "not visible".
        stripper.processPages(pages);
    }
}

The problem is in the last line. If I leave the argument of processPages
blank, Eclipse reminds me that a page list parameter is needed and asks me
to fill it in. However, when I pass the argument, which is "pages" here,
Eclipse tells me that the method processPages from the type
PDFTextStripper is not visible, and it still refuses to compile. According
to the javadoc, though, processPages is simply a method of PDFTextStripper
that takes a page list parameter. Could you help me find where I made the
mistake? Thanks.

Best,

Felix