
Posted to users@pdfbox.apache.org by 软件开发岗位夏志强 <15...@163.com> on 2024/02/05 08:56:18 UTC

Encountered a bug I cannot solve

public List<HashMap<String, Object>> readPdfString(File file, int pageNum) {
    List<HashMap<String, Object>> result = Collections.synchronizedList(new ArrayList<HashMap<String, Object>>());
    PDDocument doc = null;
    PDDocument originalDocument = null;
    try {
        // Create a new PDF document
        originalDocument = PDDocument.load(file);
        doc = new PDDocument();
        // Iterate over the original document's pages and copy them into the new document
        for (PDPage page : originalDocument.getPages()) {
            doc.addPage(page);
        }
        doc.save(file);
        doc = PDDocument.load(file);
        // 0 means read the data of all pages; greater than 0 means read the data of the specified page number
        pageNum = pageNum == 0 ? doc.getNumberOfPages() : pageNum;

        CountDownLatch latch = new CountDownLatch(pageNum);

        for (int i = 1; i <= pageNum; i++) {
            int finalI = i;
            PDDocument finalDoc = doc;
            executorService.execute(() -> {
                int attempts = 0;
                try {
                    HashMap<String, Object> map = new HashMap<>();
                    PDFTextStripper textStripper = new PDFTextStripper();
                    textStripper.setSortByPosition(true); // whether to sort by text position
                    textStripper.setStartPage(finalI); // start page
                    textStripper.setEndPage(finalI); // end page
                    // Extract text from the PDF document
                    String text = textStripper.getText(finalDoc);
                    int maxAttempts = 5; // maximum number of attempts
                    // "兹证明" ("This is to certify") is the phrase expected on page 1
                    while (attempts < maxAttempts && !text.contains("兹证明") && finalI == 1) {
                        text = textStripper.getText(finalDoc);
                        if (text.contains("兹证明")) {
                            break;
                        }
                        attempts++;
                    }
                    map.put("id", finalI);
                    map.put("data", text);
                    result.add(map);
                } catch (Exception e) {
                    log.error("Failed to read PDF content", e);
                    attempts++;
                } finally {
                    latch.countDown();
                }
            });
        }
        latch.await();
    } catch (Exception e) {
        log.error(e.getMessage(), e);
    } finally {
        IoUtil.close(doc);
        IoUtil.close(originalDocument);
    }
    return result;
}

IDEA reports an error at org.apache.fontbox.ttf.TTFParser, line 151 (parse), and the Chinese text then comes out garbled.

Re: Encountered a bug I cannot solve

Posted by Tilman Hausherr <TH...@t-online.de>.

Hello,

Please explain your problem in English and mention which PDFBox version
you are using. Apparently it's about text extraction, so read this first:

https://pdfbox.apache.org/3.0/faq.html#how-come-i-am-getting-gibberish(g38g43g36g51g5)-when-extracting-text%3F

Try extracting your text with Adobe Reader. Does it work? If not, then
we won't be able to either.

If there is an exception, please include the stack trace.

Also post a link (don't attach) to the PDF involved and explain what you 
expected and what you got instead.

Tilman
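
For the version request above, one possible way to find out which PDFBox version is actually on the classpath is to read the Implementation-Version entry of the jar manifest through the standard java.lang.Package API. This is only a sketch: the class name PdfBoxVersionCheck and the helper implementationVersion are made up for illustration, and it assumes the PDFBox jar in use carries that manifest entry. If it does not, the result is simply empty and the version has to be taken from the build file instead.

```java
import java.util.Optional;

// Hypothetical helper, not part of PDFBox: reports the manifest version of
// whichever jar provides the given class.
public class PdfBoxVersionCheck {

    // Returns the Implementation-Version of the package that contains className,
    // or an empty Optional if the class or the manifest entry is missing.
    public static Optional<String> implementationVersion(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class<?> cls = Class.forName(className, false,
                    PdfBoxVersionCheck.class.getClassLoader());
            Package pkg = cls.getPackage();
            if (pkg == null) {
                return Optional.empty();
            }
            return Optional.ofNullable(pkg.getImplementationVersion());
        } catch (ClassNotFoundException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // PDDocument is the class the posted code already uses.
        String pdfBoxClass = "org.apache.pdfbox.pdmodel.PDDocument";
        System.out.println("PDFBox version: "
                + implementationVersion(pdfBoxClass).orElse("unknown"));
        System.out.println("Java version: " + System.getProperty("java.version"));
    }
}
```

Running main inside the same project that shows the problem, and pasting both lines into the reply together with the full stack trace, should cover the version part of the request.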

