Posted to dev@pdfbox.apache.org by Raimi Rufai <rr...@gmail.com> on 2011/11/08 16:50:22 UTC

Re: PDFTextStripper : can't change the default TextPositionComparator

Hi Sebastien,

It might be more flexible to inject an instance of the Comparator rather than
its class. Your current solution won't work for comparators whose constructors
take parameters. In other words, you would have:

private Comparator textPositionComparator = new TextPositionComparator();

public Comparator getTextPositionComparator() {
    return textPositionComparator;
}

public void setTextPositionComparator(Comparator comparator) {
    textPositionComparator = comparator;
}
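
To illustrate the point about comparators that take parameters, here is a minimal, self-contained sketch. It does not use PDFBox: Glyph, Stripper and RowBucketComparator are hypothetical stand-ins invented for this example, and the row-bucketing rule is only one assumed way of treating nearby y values as the same line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-ins for this sketch; none of these are PDFBox classes.
class ComparatorInjectionDemo {

    /** Simplified glyph: a piece of text plus its page coordinates. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Takes a constructor parameter, so it has no no-arg constructor. */
    static final class RowBucketComparator implements Comparator<Glyph> {
        private final float rowHeight;

        RowBucketComparator(float rowHeight) {
            this.rowHeight = rowHeight;
        }

        @Override
        public int compare(Glyph a, Glyph b) {
            // Glyphs whose y falls into the same bucket count as one line.
            int byRow = Long.compare(row(a), row(b));
            return byRow != 0 ? byRow : Float.compare(a.x, b.x);
        }

        private long row(Glyph g) {
            return (long) Math.floor(g.y / rowHeight);
        }
    }

    /** Stand-in for the stripper: it holds a comparator instance. */
    static final class Stripper {
        // Default: exact y first, then x.
        private Comparator<Glyph> comparator =
                Comparator.comparingDouble((Glyph g) -> g.y).thenComparingDouble(g -> g.x);

        Comparator<Glyph> getComparator() {
            return comparator;
        }

        void setComparator(Comparator<Glyph> comparator) {
            this.comparator = comparator;
        }

        String sortedText(List<Glyph> glyphs) {
            List<Glyph> copy = new ArrayList<>(glyphs);
            Collections.sort(copy, getComparator());
            StringBuilder sb = new StringBuilder();
            for (Glyph g : copy) {
                sb.append(g.text);
            }
            return sb.toString();
        }
    }

    static List<Glyph> sampleLine() {
        // Three glyphs on one visual line with slightly different y values.
        return Arrays.asList(
                new Glyph("a", 0f, 100.4f),
                new Glyph("b", 10f, 100.0f),
                new Glyph("c", 20f, 100.2f));
    }

    public static void main(String[] args) {
        Stripper stripper = new Stripper();
        System.out.println(stripper.sortedText(sampleLine())); // prints "bca"

        // A configured instance can be injected; a Class object could not carry the 5f.
        stripper.setComparator(new RowBucketComparator(5f));
        System.out.println(stripper.sortedText(sampleLine())); // prints "abc"
    }
}
```

Bucketing keeps the ordering consistent, but two glyphs on either side of a bucket boundary would still land on different rows, so a real comparator would probably need something smarter.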

What do you think?

Regards,

Raimi



On Tue, Nov 8, 2011 at 10:24 AM, Martinez, Mel - 1004 - MITLL <
m.martinez@ll.mit.edu> wrote:

> Sebastien,
>
> I totally agree that this would be a good change, having run into the same
> problem when working out my own mods to the text extraction some time ago.
>
> Please create a JIRA issue proposing this at:
> https://issues.apache.org/jira/browse/PDFBOX
>
> Mel
>
>
> -----Original Message-----
> From: Sébastien Dailly [mailto:sebastien@chimrod.com]
> Sent: Tuesday, November 08, 2011 4:27 AM
> To: dev@pdfbox.apache.org
> Subject: PDFTextStripper : can't change the default TextPositionComparator
>
> Hello,
>
> I'm trying to use the PDFTextStripper class, but sortByPosition does not
> seem to work correctly when the characters on the same line are not at
> exactly the same y position.
>
> There is no way to replace the TextPositionComparator used in the class
> with my own, even by subclassing PDFTextStripper (see the note below).
>
> One solution is to use a getter instead of hard-coding the comparator class:
>
> >             List<TextPosition> textList = charactersByArticle.get( i );
> >             if( getSortByPosition() )
> >             {
> >                 TextPositionComparator comparator = new TextPositionComparator();
> >                 Collections.sort( textList, comparator );
> >             }
>
> becomes:
>
> >             List<TextPosition> textList = charactersByArticle.get( i );
> >             if( getSortByPosition() )
> >             {
> >                 Comparator comparator = getTextPositionComparator();
> >                 Collections.sort( textList, comparator );
> >             }
>
> with getTextPositionComparator defined as follows:
>
> > private Class<? extends Comparator> textPositionComparator = TextPositionComparator.class;
>
> > […]
>
> >       /**
> >        *
> >        * @return The comparator for ordering text positions.
> >        */
> >       public Comparator getTextPositionComparator() {
> >               try {
> >                       return textPositionComparator.newInstance();
> >               } catch (final InstantiationException e) {
> >                       return null;
> >               } catch (final IllegalAccessException e) {
> >                       return null;
> >               }
> >       }
>
> (with the appropriate setter).
>
> Note:
>
> Although PDFTextStripper.writePage is protected, it calls the
> getTextPosition method of the PositionWrapper class, which is itself
> protected, without subclassing that class! This only works because both
> classes belong to the same package. (I think this can be considered a
> flaw in the project architecture.)
>
> >                 // Resets the average character width when we see a change in font
> >                 // or a change in the font size
> >                 if(lastPosition != null && ((position.getFont() != lastPosition.getTextPosition().getFont())
> >                         || (position.getFontSize() != lastPosition.getTextPosition().getFontSize())))
> >                 {
> >                     previousAveCharWidth = -1;
> >                 }
>
> Thank you,
>
> --
> Sébastien
>



-- 
«To develop software is to build a machine simply by describing it.»
(Michael A. Jackson -- not the singer)

«Développer un logiciel revient à construire une machine tout simplement en
le décrivant.» (Michael A. Jackson - pas le chanteur)

Re: PDFTextStripper : can't change the default TextPositionComparator

Posted by Sébastien Dailly <se...@chimrod.com>.
On 08/11/2011 16:53, Raimi Rufai wrote:
> With generics, it might look like this instead:
>
> private Comparator<TextPosition> textPositionComparator = new TextPositionComparator();
>
> public Comparator<TextPosition> getTextPositionComparator() {
>     return textPositionComparator;
> }
>
> public void setTextPositionComparator(Comparator<TextPosition> comparator) {
>     textPositionComparator = comparator;
> }
>
>
>

Hi Raimi,

I think your solution is better. I'm opening an issue in Jira for that.
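
One way to see why the instance-based setter is more robust than the class-based getter: the sketch below reproduces the reflective instantiation with two hypothetical comparator classes (invented for this example, not part of PDFBox). It uses getDeclaredConstructor().newInstance() because Class.newInstance() is deprecated in recent Java versions; the failure mode is the same, a null comparator whenever the class has no no-arg constructor.

```java
import java.util.Comparator;

// Hypothetical sketch; the comparator classes below are illustrations only.
class ReflectiveComparatorDemo {

    /** Has a no-arg constructor, so reflection can instantiate it. */
    static final class AscendingComparator implements Comparator<Float> {
        AscendingComparator() {
        }

        @Override
        public int compare(Float a, Float b) {
            return Float.compare(a, b);
        }
    }

    /** Needs a constructor argument, so there is no no-arg constructor. */
    static final class DirectionComparator implements Comparator<Float> {
        private final int sign;

        DirectionComparator(boolean ascending) {
            this.sign = ascending ? 1 : -1;
        }

        @Override
        public int compare(Float a, Float b) {
            return sign * Float.compare(a, b);
        }
    }

    /** Mirrors the class-based getter: returns null when instantiation fails. */
    static Comparator<Float> instantiate(Class<? extends Comparator<Float>> type) {
        try {
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(instantiate(AscendingComparator.class) != null); // true
        System.out.println(instantiate(DirectionComparator.class) == null); // true

        // An instance, by contrast, can be built with any arguments.
        Comparator<Float> descending = new DirectionComparator(false);
        System.out.println(descending.compare(1f, 2f) > 0); // true
    }
}
```

A null comparator is easy to miss: Collections.sort treats it as a request for natural ordering, which throws a ClassCastException for elements that do not implement Comparable.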

-- 
Sébastien

Re: PDFTextStripper : can't change the default TextPositionComparator

Posted by Raimi Rufai <rr...@gmail.com>.
With generics, it might look like this instead:

private Comparator<TextPosition> textPositionComparator = new TextPositionComparator();

public Comparator<TextPosition> getTextPositionComparator() {
    return textPositionComparator;
}

public void setTextPositionComparator(Comparator<TextPosition> comparator) {
    textPositionComparator = comparator;
}
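
As a rough sketch of what the typed accessors make possible on the caller's side, the example below composes a comparator from smaller pieces and passes it to the setter. Position and Stripper are hypothetical stand-ins rather than the PDFBox classes, and the Comparator factory methods used here assume Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-ins; Position and Stripper are not PDFBox classes.
class TypedComparatorDemo {

    static final class Position {
        private final String text;
        private final float x;
        private final float y;

        Position(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }

        String getText() { return text; }
        float getX() { return x; }
        float getY() { return y; }
    }

    static final class Stripper {
        // Default for this sketch only: ascending y.
        private Comparator<Position> textPositionComparator =
                Comparator.comparingDouble(Position::getY);

        Comparator<Position> getTextPositionComparator() {
            return textPositionComparator;
        }

        void setTextPositionComparator(Comparator<Position> comparator) {
            textPositionComparator = comparator;
        }

        String sortedText(List<Position> positions) {
            List<Position> copy = new ArrayList<>(positions);
            Collections.sort(copy, getTextPositionComparator());
            List<String> words = new ArrayList<>();
            for (Position p : copy) {
                words.add(p.getText());
            }
            return String.join(" ", words);
        }
    }

    static List<Position> sample() {
        return Arrays.asList(
                new Position("world", 50f, 700f),
                new Position("Hello", 10f, 700f),
                new Position("below", 10f, 680f));
    }

    public static void main(String[] args) {
        Stripper stripper = new Stripper();
        System.out.println(stripper.sortedText(sample())); // prints "below world Hello"

        // Typed composition: larger y first, then left to right.
        stripper.setTextPositionComparator(
                Comparator.comparingDouble(Position::getY).reversed()
                        .thenComparingDouble(Position::getX));
        System.out.println(stripper.sortedText(sample())); // prints "Hello world below"

        // A Comparator<String> such as String.CASE_INSENSITIVE_ORDER would be
        // rejected by the compiler here; with a raw setter it would compile.
    }
}
```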





-- 
«To develop software is to build a machine simply by describing it.»
(Michael A. Jackson -- not the singer)

«Développer un logiciel revient à construire une machine tout simplement en
le décrivant.» (Michael A. Jackson - pas le chanteur)