Posted to users@pdfbox.apache.org by Mehmet Ali Abdulhayoglu <Me...@kuleuven.be> on 2014/05/14 16:31:37 UTC

Problem when extracting text from a pdf file

Dear all,

As part of my research, I am trying to convert PDF files to text files. I have tried both iText and PDFBox, but I run into the same issue with each.

Extracting text from dnm1.pdf (attached) works well with both libraries. For dnm2.pdf, however, both fail.

The text file I get is full of NULL values. Is this normal for PDFs that are built so differently, or am I missing something?

Thanks in advance.

Regards,
Mehmet


My code:

package retrievingfulltetxsfromweb;

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PdfBox {

    // Extract the text of pages 1 to 5 of a PDF document.
    // Returns null if the file does not exist or cannot be parsed.
    public static String pdftoText(String fileName) {
        File file = new File(fileName);
        if (!file.isFile()) {
            System.err.println("File " + fileName + " does not exist.");
            return null;
        }
        PDDocument pdDoc = null;
        try {
            // PDDocument.load parses the file, so no separate PDFParser is needed.
            pdDoc = PDDocument.load(file);
            PDFTextStripper pdfStripper = new PDFTextStripper();
            pdfStripper.setStartPage(1);
            pdfStripper.setEndPage(5);
            return pdfStripper.getText(pdDoc);
        } catch (IOException e) {
            System.err.println("An exception occurred while parsing the PDF document. "
                    + e.getMessage());
            return null;
        } finally {
            // Closing the PDDocument also releases the underlying COSDocument.
            if (pdDoc != null) {
                try {
                    pdDoc.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public static void main(String[] args) {
        String text = pdftoText("C:/dnm1.pdf");
        if (text != null) {
            System.out.println(text);
        }
    }

}
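
To see what the "NULL values" in the output really are, it can help to print the code points of the extracted string instead of the string itself. The following is only a rough sketch and is not part of PDFBox or iText: the class name CodePointDump and the method describe are made up for illustration, and the input is assumed to be the string returned by the text stripper.

```java
final class CodePointDump {

    // Formats the first `limit` code points of `text` as "U+XXXX" tokens
    // separated by single spaces, so invisible characters become readable.
    static String describe(String text, int limit) {
        StringBuilder out = new StringBuilder();
        int count = 0;
        for (int i = 0; i < text.length() && count < limit; ) {
            int cp = text.codePointAt(i);
            if (count > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // step over surrogate pairs correctly
            count++;
        }
        return out.toString();
    }
}
```

Calling describe on the first few hundred characters of the dnm2.pdf output would show whether the file really yields U+0000 or some other unmapped characters.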

RE: Problem when extracting text from a pdf file

Posted by Mehmet Ali Abdulhayoglu <Me...@kuleuven.be>.

Dear Maruan,

Thanks for getting round to checking. 

Best regards,
Mehmet 

-----Original Message-----
From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
Sent: Wednesday 21 May 2014 11:14 AM
To: users@pdfbox.apache.org
Subject: Re: Problem when extracting text from a pdf file

Hi Mehmet,

Sorry, now I see your issue. It's an encoding issue in the PDF: copying and pasting from Adobe Reader gives the same result. I don't think we can do very much about it, but I'll look into it in more detail.

BR

Maruan Sahyoun


Re: Problem when extracting text from a pdf file

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hi Mehmet,

Sorry, now I see your issue. It’s an encoding issue in the PDF: copying and pasting from Adobe Reader gives the same result. I don’t think we can do very much about it, but I’ll look into it in more detail.

BR

Maruan Sahyoun
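
When many files are processed in a batch, output damaged by an encoding problem like this can be flagged automatically before it is used. The following is a minimal sketch under stated assumptions, not a PDFBox feature: the class ExtractionCheck and its methods are hypothetical names, and the definition of a "suspicious" character (control characters including U+0000, the replacement character U+FFFD, private-use and unassigned code points) is a guess at what broken font encodings tend to produce.

```java
final class ExtractionCheck {

    // Assumption: these kinds of characters rarely occur in genuine prose.
    static boolean isSuspicious(int cp) {
        return Character.isISOControl(cp)                        // includes U+0000
                || cp == 0xFFFD                                  // replacement character
                || Character.getType(cp) == Character.PRIVATE_USE
                || !Character.isDefined(cp);                     // unassigned code point
    }

    // Fraction of non-whitespace code points that look suspicious.
    // Returns 0.0 for text without any non-whitespace characters.
    static double suspiciousRatio(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // spaces and line breaks are ignored
            }
            total++;
            if (isSuspicious(cp)) {
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }
}
```

A file whose ratio is well above zero could then be set aside for another approach, such as OCR. Any cutoff value would have to be tuned on real files.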

On 21.05.2014 at 11:06, Mehmet Ali Abdulhayoglu <Me...@kuleuven.be> wrote:

> Dear Maruan,
> 
> I have checked them again. I am sure they are the correct ones.
> 
> The PDF from the first link has the title "Olfactory Learning-Induced Increase in Spine Density Along the...Neurons". I can process this PDF.
> 
> The second one has the title "Relationship between intercepted radiation, net photosynthesis, respiration, and rate of ....densities". I could not handle this one.
> 
> Indeed, when I copy and paste some text from this PDF, what I get looks like this:
> 
> *
> 
>  
> 
> 
> 
> When you extracted the text from the second one, did you use the Java code that I sent in my first mail, or something else?
> 
> 
> Thanks for your attention.
> 
> Best regards,
> Mehmet 

RE: Problem when extracting text from a pdf file

Posted by Mehmet Ali Abdulhayoglu <Me...@kuleuven.be>.

Dear Maruan,

I have checked them again. I am sure they are the correct ones.

The PDF from the first link has the title "Olfactory Learning-Induced Increase in Spine Density Along the...Neurons". I can process this PDF.

The second one has the title "Relationship between intercepted radiation, net photosynthesis, respiration, and rate of ....densities". I could not handle this one.

Indeed, when I copy and paste some text from this PDF, what I get looks like this:

*

 



When you extracted the text from the second one, did you use the Java code that I sent in my first mail, or something else?


Thanks for your attention.

Best regards,
Mehmet 

-----Original Message-----
From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
Sent: Wednesday 21 May 2014 8:51 AM
To: users@pdfbox.apache.org
Subject: Re: Problem when extracting text from a pdf file

Dear Mehmet,

Did you supply the correct PDFs? I can manually copy and paste text from both, and I can also extract the text from both using PDFBox.

BR

Maruan Sahyoun


Re: Problem when extracting text from a pdf file

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Dear Mehmet,

Did you supply the correct PDFs? I can manually copy and paste text from both, and I can also extract the text from both using PDFBox.

BR

Maruan Sahyoun

On 20.05.2014 at 11:56, Mehmet Ali Abdulhayoglu <Me...@kuleuven.be> wrote:

> Dear Maruan,
> 
> Thanks for your reply. Below you can find the links to the PDF files. As you say, I can manually copy and paste text from the first PDF (dnm1), while this is not possible for the second one (dnm2), which suggests that the latter contains no real text.
> 
> Are there any other ways to extract text from PDFs like dnm2?
> 
> dnm1.pdf:
> http://www.researchgate.net/publication/8333207_Olfactory_learning-induced_increase_in_spine_density_along_the_apical_dendrites_of_CA1_hippocampal_neurons/file/79e41503b71b66dabb.pdf
> 
> dnm2.pdf:
> http://www.researchgate.net/publication/222569912_Relationship_between_intercepted_radiation_net_photosynthesis_respiration_and_rate_of_stem_volume_growth_of_Pinus_taeda_and_Pinus_elliottii_stands_of_different_densities/file/9fcfd5064592b3d098.pdf
> 
> Regards,
> Mehmet

RE: Problem when extracting text from a pdf file

Posted by Mehmet Ali Abdulhayoglu <Me...@kuleuven.be>.

Dear Maruan,

Thanks for your reply. Below you can find the links to the PDF files. As you say, I can manually copy and paste text from the first PDF (dnm1), while this is not possible for the second one (dnm2), which suggests that the latter contains no real text.

Are there any other ways to extract text from PDFs like dnm2?

dnm1.pdf:
http://www.researchgate.net/publication/8333207_Olfactory_learning-induced_increase_in_spine_density_along_the_apical_dendrites_of_CA1_hippocampal_neurons/file/79e41503b71b66dabb.pdf

dnm2.pdf:
http://www.researchgate.net/publication/222569912_Relationship_between_intercepted_radiation_net_photosynthesis_respiration_and_rate_of_stem_volume_growth_of_Pinus_taeda_and_Pinus_elliottii_stands_of_different_densities/file/9fcfd5064592b3d098.pdf

Regards,
Mehmet




-----Original Message-----
From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
Sent: Friday 16 May 2014 10:20 AM
To: users@pdfbox.apache.org
Subject: Re: Problem when extracting text from a pdf file

Hi Mehmet,

It could well be that text extraction works for one PDF and not for another: the second file might not contain real text, and what you see on screen may simply be drawn. The attachments didn't make it through because of restrictions on the mailing list. Could you upload them to a public location so that I can take a look at the files and give an answer more specific to your case?

BR

Maruan Sahyoun

Am 14.05.2014 um 16:31 schrieb Mehmet Ali Abdulhayoglu <Me...@kuleuven.be>:

> Dear all,
>  
> As part of my research, I am trying to convert pdf files to text files. I have applied both itext and pdfbox but I encounter the same issue.
>  
> When I try extracting text from dnm1.pdf file (attached) both approaches work well. However when applying them for dnm2.pdf they fail.
>  
> I retrieve a text file with full of NULL values. Is it normal for such differently shaped pdfs or am I missing something else?
>  
> Thanks in advance.
>  
> Regards,
> Mehmet
>  
>  
> My code:
>  
> package retrievingfulltetxsfromweb;
>  
> import connectingurl.PlacesApi;
>  
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.IOException;
> import org.apache.pdfbox.cos.COSDocument;
> import org.apache.pdfbox.pdfparser.PDFParser;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.util.PDFTextStripper;
>  
> public class PdfBox {
>    
>     // Extract text from PDF Document
>             public PdfBox(String fileName) {
>                     //PDFParser parser = new PDFParser();
>                     String parsedText = null;;
>                     PDFTextStripper pdfStripper = null;
>                     PDDocument pdDoc = null;
>                     COSDocument cosDoc = null;
>                     File file = new File(fileName);
>                     if (!file.isFile()) {
>                             System.err.println("File " + fileName + " does not exist.");
>                             //return null;
>                     }
>                     try {
>                             PDFParser parser = new PDFParser(new FileInputStream(file));
>                     } catch (IOException e) {
>                             System.err.println("Unable to open PDF Parser. " + e.getMessage());
>                             //return null;
>                     }
>                     try {
>                             PDFParser parser = new PDFParser(new FileInputStream(file));
>                             parser.parse();
>                             cosDoc = parser.getDocument();
>                             pdfStripper = new PDFTextStripper();
>                             pdDoc = new PDDocument(cosDoc);
>                             pdfStripper.setStartPage(1);
>                             pdfStripper.setEndPage(5);
>                             parsedText = pdfStripper.getText(pdDoc);
>                         System.out.println(parsedText);
>                     } catch (Exception e) {
>                             System.err.println("An exception occurred while parsing the PDF document: "
>                                             + e.getMessage());
>                     } finally {
>                             try {
>                                     if (cosDoc != null)
>                                             cosDoc.close();
>                                     if (pdDoc != null)
>                                             pdDoc.close();
>                             } catch (Exception e) {
>                                     e.printStackTrace();
>                             }
>                     }
>                     //return parsedText;
>             }
>             public static void main(String args[]){
>                    
>                 PdfBox pdf = new PdfBox("C:/dnm1.pdf");
>                    // System.out.println(pdftoText("C:/dnm1.pdf"));
>             }
>  
> }
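
Until the files can be inspected, one way to tell "extraction worked" apart from "extraction returned mostly NULL characters" is to measure the extracted string itself. The following is only a minimal sketch using plain Java: the class name ExtractionCheck, both method names and the 0.1 threshold are hypothetical choices for illustration and are not part of PDFBox or iText. The string passed in is assumed to be whatever pdfStripper.getText(pdDoc) returned in the quoted code.

```java
// Hypothetical helper, not part of PDFBox: flags extracted text that is
// empty or dominated by NUL and other control characters.
final class ExtractionCheck {

    // Fraction of characters that are control characters other than
    // whitespace (tabs and line breaks are normal in extracted text).
    static double controlCharRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int suspicious = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isISOControl(c) && !Character.isWhitespace(c)) {
                suspicious++;
            }
        }
        return (double) suspicious / text.length();
    }

    // True when the text is null, blank, or has more control characters
    // than the given threshold allows. Note that trim() strips every
    // character up to and including the space character, so a string
    // made only of NUL characters also counts as blank here.
    static boolean looksUnusable(String text, double threshold) {
        return text == null
                || text.trim().isEmpty()
                || controlCharRatio(text) > threshold;
    }

    public static void main(String[] args) {
        // Stand-in strings; in practice pass the result of getText(...).
        String good = "As part of my research\n";
        String bad = "\u0000\u0000\u0000\u0000";
        System.out.println("good unusable? " + looksUnusable(good, 0.1));
        System.out.println("bad unusable?  " + looksUnusable(bad, 0.1));
    }
}
```

A check like this only detects the symptom. It cannot say why a particular file yields no usable text, which still requires looking at the PDF itself.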
