Posted to users@pdfbox.apache.org by Tim Allison <ta...@apache.org> on 2019/12/17 12:33:33 UTC

Re: Parsing order issue

PDFBox Colleagues,
  Any recommendations?

On Mon, Dec 16, 2019 at 7:05 AM Lu Sun <vi...@gmail.com> wrote:

> Dear Tika Dev Team,
>
>
>
> Hope this email finds you well.
>
>
>
> I have been actively using Tika to read PDF files. One issue I found is
> the parsing order. As shown in the attached image, the text of a PDF file
> is not extracted in order of its position on the page.
>
>
>
> As suggested in this github link
> <https://github.com/chrismattmann/tika-python/issues/266>, I used a
> customized config file (see attached), hoping to solve the issue. But this
> has not worked. If possible, could you please review this issue and
> provide any insights or solutions?
>
>
>
> Thanks so much in advance.
>
>
>
> Regards,
>
> Luke
>

Re: Parsing order issue

Posted by Tilman Hausherr <TH...@t-online.de>.

[2nd attempt]

From my understanding, when you want to use sortByPosition in Tika, you
need a segment like this:

...
         <parser class="org.apache.tika.parser.pdf.PDFParser">
             <params>
                 <param name="sortByPosition" type="bool">true</param>
             </params>
         </parser>
...

so your whole file would be like:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
   <parsers>
     <!-- Default parser for everything except PDF -->
     <parser class="org.apache.tika.parser.DefaultParser">
       <mime-exclude>application/pdf</mime-exclude>
     </parser>
     <!-- Use a different parser for PDF -->
     <parser class="org.apache.tika.parser.pdf.PDFParser">
       <mime>application/pdf</mime>
       <params>
         <param name="sortByPosition" type="bool">true</param>
       </params>
     </parser>
   </parsers>
</properties>


I just tried this file with tika-app. The default configuration didn't
sort; using this file did. I added "--config=config.xml" on the command line.
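
The difference between the two config files in this thread can be checked mechanically. The following is a minimal sketch that assumes only the JDK's built-in XML parser. SortConfigCheck and enablesSortByPosition are hypothetical names, not part of Tika or PDFBox. The check only tests whether a file has the shape reported as working above (sortByPosition given as a <param> under the PDFParser entry); it does not reproduce how Tika itself reads its configuration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; it inspects the XML text only.
final class SortConfigCheck {
    static final String PDF_PARSER = "org.apache.tika.parser.pdf.PDFParser";

    /**
     * Returns true if some <parser class="...PDFParser"> element contains
     * <param name="sortByPosition" ...>true</param>.
     */
    static boolean enablesSortByPosition(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("config is not parseable as XML", e);
        }
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            if (!PDF_PARSER.equals(parser.getAttribute("class"))) {
                continue; // the option must sit on the PDF parser entry
            }
            NodeList params = parser.getElementsByTagName("param");
            for (int j = 0; j < params.getLength(); j++) {
                Element param = (Element) params.item(j);
                if ("sortByPosition".equals(param.getAttribute("name"))
                        && "true".equals(param.getTextContent().trim())) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String working = "<properties><parsers>"
                + "<parser class=\"org.apache.tika.parser.pdf.PDFParser\">"
                + "<mime>application/pdf</mime>"
                + "<params><param name=\"sortByPosition\" type=\"bool\">true</param></params>"
                + "</parser></parsers></properties>";
        System.out.println(enablesSortByPosition(working)); // prints true
    }
}
```

A config that puts a <property name="sortByPosition" .../> element on a DefaultParser entry makes this check return false, because the option is neither a <param> nor attached to the PDFParser class.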

Tilman

On 07.01.2020 at 00:04, Lu Sun wrote:
> Dear PDFBox Dev Team,
>
> After searching through online
> <https://stackoverflow.com/search?page=5&tab=Relevance&q=pdfbox%20order>, I
> am certain that using setSortByPosition(true) would help. However, I am
> struggling to get the config file right. Can you please provide any advice
> on it?
>
> Thanks so much in advance. Regards, Luke
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Parsing order issue

Posted by Tilman Hausherr <TH...@t-online.de>.

Hi,

This is a known problem. When you have columns it is usually better to 
use the unsorted order. And even that may not work properly if the 
sequence in the PDF doesn't make sense.

Somebody with a lot of time should write software that identifies
"blocks" (these are often rectangular, but not always) and then extracts
them in sequence.
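
To make the "blocks" idea concrete, here is a minimal sketch in plain Java. It assumes that each text fragment is already available with the top-left corner of its bounding box, that columns are left-aligned, and that no heading spans several columns. Fragment, byPosition, byColumn and minGap are hypothetical names for illustration; none of this is PDFBox or Tika API, and real pages would need a more robust block detection.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; Fragment is an illustrative type, not a PDFBox class.
final class ColumnOrder {
    /** A piece of text with the top-left corner of its box (y grows downward). */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Plain position sort: top to bottom, then left to right. */
    static List<String> byPosition(List<Fragment> in) {
        List<Fragment> sorted = new ArrayList<>(in);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return texts(sorted);
    }

    /**
     * Column-aware order: start a new column wherever consecutive x starts
     * are more than minGap apart, then read each column top to bottom.
     */
    static List<String> byColumn(List<Fragment> in, double minGap) {
        List<Fragment> byX = new ArrayList<>(in);
        byX.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        List<List<Fragment>> columns = new ArrayList<>();
        double lastX = 0;
        for (Fragment f : byX) {
            if (columns.isEmpty() || f.x - lastX > minGap) {
                columns.add(new ArrayList<>());
            }
            columns.get(columns.size() - 1).add(f);
            lastX = f.x;
        }
        List<String> out = new ArrayList<>();
        for (List<Fragment> column : columns) {
            column.sort(Comparator.comparingDouble((Fragment f) -> f.y));
            out.addAll(texts(column));
        }
        return out;
    }

    private static List<String> texts(List<Fragment> fragments) {
        List<String> out = new ArrayList<>();
        for (Fragment f : fragments) {
            out.add(f.text);
        }
        return out;
    }

    public static void main(String[] args) {
        // Two columns with two lines each, listed in an arbitrary order.
        List<Fragment> page = List.of(
                new Fragment("R2", 300, 120), new Fragment("L1", 50, 100),
                new Fragment("R1", 300, 100), new Fragment("L2", 50, 120));
        System.out.println(byPosition(page));    // [L1, R1, L2, R2]
        System.out.println(byColumn(page, 100)); // [L1, L2, R1, R2]
    }
}
```

The position sort interleaves the two columns line by line, which is the behaviour reported for multi-column pages, while the column-aware variant reads the left column completely before moving to the right one.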

Tilman

On 10.01.2020 at 16:53, Lu Sun wrote:
> Hi Tilman,
>
> Sorry for the late response. I implemented the config file, and it 
> worked well. Sadly, I noticed that this "sortByPosition" cannot parse 
> well when PDF files have multiple columns in a page, as it follows an 
> order of left->right, and top->down.
>
>  Pls see attached images. Is it possible you can advise me on how to 
> deal with such case?
>
> Thanks so much.
> Luke
>
>
> On Tue, 7 Jan 2020 at 13:00, Tilman Hausherr <TH...@t-online.de> wrote:
>
>     hi,
>     I answered that one in the mailing lists. You need to subscribe,
>     or read the archives. I'll see if I can fwd it.
>     Tilman
>
>

Re: Re: Parsing order issue

Posted by Lu Sun <vi...@gmail.com>.

Hi Tim,

Hope you are all doing well.

I just want to bring this issue to your attention. As shown in the images,
"sortByPosition" does not produce a sensible reading order when a PDF page
has multiple columns.

Could you please suggest any solutions? By the way, I didn't see my last
email on the mailing list. Is that because I am not subscribed?

Thanks so much in advance.
Luke
[image: tesco_parsing_order.JPG][image: DeLaRue_parsing_order.JPG]

On Fri, 10 Jan 2020 at 15:53, Lu Sun <vi...@gmail.com> wrote:

> Hi Tilman,
>
> Sorry for the late response. I implemented the config file, and it worked
> well. Sadly, I noticed that this "sortByPosition" cannot parse well when
> PDF files have multiple columns in a page, as it follows an order of
> left->right, and top->down.
>
>  Pls see attached images. Is it possible you can advise me on how to deal
> with such case?
>
> Thanks so much.
> Luke
>
>

Re: Re: Parsing order issue

Posted by Lu Sun <vi...@gmail.com>.

Hi Tilman,

Sorry for the late response. I implemented the config file, and it worked
well. Sadly, I noticed that "sortByPosition" does not work well when a PDF
page has multiple columns, as it follows a left-to-right, top-to-bottom
order.

Please see the attached images. Could you advise me on how to deal with
such cases?

Thanks so much.
Luke


On Tue, 7 Jan 2020 at 13:00, Tilman Hausherr <TH...@t-online.de> wrote:

> hi,
> I answered that one in the mailing lists. You need to subscribe, or read
> the archives. I'll see if I can fwd it.
> Tilman
>
>

Re: Parsing order issue

Posted by Lu Sun <vi...@gmail.com>.

Dear PDFBox Dev Team,

After searching online
<https://stackoverflow.com/search?page=5&tab=Relevance&q=pdfbox%20order>, I
am certain that using setSortByPosition(true) would help. However, I am
struggling to get the config file right. Could you please provide any advice
on it?

Thanks so much in advance. Regards, Luke

On Fri, 20 Dec 2019 at 18:06, Lu Sun <vi...@gmail.com> wrote:

> Dear PDFBox Dev Team,
>
> Hope this message finds you well.
>
> Just wanted to raise this for your attention. Please can you provide any
> solutions on the parsing order issue? Attached is my config file, an
> example of pdf file and my parsing results.
>
> Thanks so much in advance. Wish you and your team a Merry Christmas and
> Happy New Year.
>
> Regards,
> Luke
>

Re: Parsing order issue

Posted by Tilman Hausherr <TH...@t-online.de>.

> I answered, asked to have a look at your file (upload to a 
> sharehoster), and mentioned that your config file is suspicious.

I found your file (it was in the moderation mail), and that is a typical 
case where the PDF order is different to the visual order. That is what 
the sort option in the PDF parser is for. Which brings us back to my 
theory that your config file is wrong.

Tilman


Re: Parsing order issue

Posted by Tilman Hausherr <TH...@t-online.de>.

I answered, asked to have a look at your file (please upload it to a
file-sharing host), and mentioned that your config file looks suspicious.

Tilman

On 20.12.2019 at 19:06, Lu Sun wrote:
> Dear PDFBox Dev Team,
>
> Hope this message finds you well.
>
> Just wanted to raise this for your attention. Please can you provide 
> any solutions on the parsing order issue? Attached is my config file, 
> an example of pdf file and my parsing results.
>
> Thanks so much in advance. Wish you and your team a Merry Christmas 
> and Happy New Year.
>
> Regards,
> Luke
>

Re: Parsing order issue

Posted by Lu Sun <vi...@gmail.com>.

Dear PDFBox Dev Team,

Hope this message finds you well.

Just wanted to raise this for your attention. Could you please suggest a
solution to the parsing order issue? Attached are my config file, an
example PDF file, and my parsing results.

Thanks so much in advance. Wish you and your team a Merry Christmas and
Happy New Year.

Regards,
Luke

On Tue, 17 Dec 2019 at 12:34, Tim Allison <ta...@apache.org> wrote:

> PDFBox Colleagues,
>   Any recommendations?
>

Re: Parsing order issue

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hi Tim,

Unfortunately the image didn't make it to the mailing list. What is the
issue here? Is the extracted text not in the right order?

The order in which text is stored in a PDF and the visual order of the
text on the page are not necessarily related.

BR
Maruan

 
> PDFBox Colleagues,
>   Any recommendations?
> 

Re: Parsing order issue

Posted by Tim Allison <ta...@apache.org>.

Tilman,
   That isn’t correct. I’ll find the link that might help...

On Tue, Dec 17, 2019 at 1:02 PM Tilman Hausherr <TH...@t-online.de>
wrote:

> I already answered... we need the PDF.
>
> But... about the config:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>    <parsers>
>      <!-- Default Parser for most things, except for 2 mime types, and
> never
>           use the Executable Parser -->
>      <parser class="org.apache.tika.parser.DefaultParser">
>        <mime-exclude>image/jpeg</mime-exclude>
>        <mime-exclude>application/pdf</mime-exclude>
>        <parser-exclude
> class="org.apache.tika.parser.executable.ExecutableParser"/>
>      </parser>
>
>      <!-- Use a different parser for PDF -->
>      <parser class="org.apache.tika.parser.DefaultParser">
>      <property name="sortByPosition" value="true"/>
>        <mime>application/pdf</mime>
>      </parser>
>    </parsers>
> </properties>
>
> Is this a correct setting for PDFs in tika? I notice that the same
> parser class is used twice.
>
> And the file was named "tika.config", shouldn't it be named
> "tika-config.xml"?
>
> Tilman
>

Re: Parsing order issue

Posted by Tilman Hausherr <TH...@t-online.de>.

I already answered... we need the PDF.

But... about the config:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
   <parsers>
     <!-- Default Parser for most things, except for 2 mime types, and never
          use the Executable Parser -->
     <parser class="org.apache.tika.parser.DefaultParser">
       <mime-exclude>image/jpeg</mime-exclude>
       <mime-exclude>application/pdf</mime-exclude>
       <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
     </parser>

     <!-- Use a different parser for PDF -->
     <parser class="org.apache.tika.parser.DefaultParser">
     <property name="sortByPosition" value="true"/>
       <mime>application/pdf</mime>
     </parser>
   </parsers>
</properties>

Is this a correct setting for PDFs in tika? I notice that the same 
parser class is used twice.

And the file was named "tika.config", shouldn't it be named 
"tika-config.xml"?

Tilman

On 17.12.2019 at 13:33, Tim Allison wrote:
> PDFBox Colleagues,
>    Any recommendations?
>
