Posted to dev@pdfbox.apache.org by "Brian Carrier (JIRA)" <ji...@apache.org> on 2009/02/18 22:36:01 UTC

[jira] Resolved: (PDFBOX-430) Incorrect diacritic placement in text extraction

     [ https://issues.apache.org/jira/browse/PDFBOX-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Carrier resolved PDFBOX-430.
----------------------------------

    Resolution: Fixed

Fixed with patch by Ken Glidden that merges a single diacritic text chunk into the previous text chunk if they overlap.  Note that this will not solve problems where the diacritic comes much after the text chunk it overlays, but we have not observed PDF files like that.

Sending        trunk/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java
Sending        trunk/src/main/java/org/apache/pdfbox/util/TextPosition.java
Sending        trunk/test/input/Acrobat9.pdf-sorted.txt
Sending        trunk/test/input/Acrobat9.pdf.txt
Transmitting file data ....Committed revision 745665.
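
The merge described in the resolution note above can be illustrated with a
minimal sketch. This is not the code committed to PDFTextStripper.java or
TextPosition.java; the names Chunk, mergeDiacritics and mergeInto are
hypothetical, glyphs in a chunk are assumed to be equally wide, and the
diacritic is assumed to arrive as a single Unicode combining mark.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not the patch applied for PDFBOX-430.
final class DiacriticMergeSketch {

    /** Hypothetical positioned text chunk (not PDFBox's TextPosition). */
    static final class Chunk {
        final String text;
        final float x;      // left edge of the chunk
        final float width;  // total width of the chunk

        Chunk(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** True if the chunk consists of exactly one combining (non-spacing) mark. */
    static boolean isSingleDiacritic(Chunk c) {
        return c.text.length() == 1
                && Character.getType(c.text.charAt(0)) == Character.NON_SPACING_MARK;
    }

    /** True if the horizontal extents of the two chunks intersect. */
    static boolean overlaps(Chunk a, Chunk b) {
        return b.x < a.x + a.width && b.x + b.width > a.x;
    }

    /** Places the mark after the character whose span contains the mark's center. */
    static Chunk mergeInto(Chunk prev, Chunk mark) {
        // Assumption: every glyph in prev has the same width.
        float charWidth = prev.width / prev.text.length();
        float center = mark.x + mark.width / 2;
        int index = (int) ((center - prev.x) / charWidth);
        index = Math.max(0, Math.min(prev.text.length() - 1, index));
        String combined = prev.text.substring(0, index + 1)
                + mark.text
                + prev.text.substring(index + 1);
        // NFC composes base + mark into one precomposed character where one exists.
        String composed = Normalizer.normalize(combined, Normalizer.Form.NFC);
        return new Chunk(composed, prev.x, prev.width);
    }

    /** Merges each single-diacritic chunk into the chunk just before it if they overlap. */
    static List<Chunk> mergeDiacritics(List<Chunk> chunks) {
        List<Chunk> out = new ArrayList<>();
        for (Chunk c : chunks) {
            Chunk prev = out.isEmpty() ? null : out.get(out.size() - 1);
            if (prev != null && !prev.text.isEmpty()
                    && isSingleDiacritic(c) && overlaps(prev, c)) {
                out.set(out.size() - 1, mergeInto(prev, c));
            } else {
                out.add(c); // no overlap with the previous chunk: keep as-is
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Chunk> merged = mergeDiacritics(Arrays.asList(
                new Chunk("resume", 0f, 60f),
                new Chunk("\u0301", 12f, 6f),    // acute accent over the first 'e'
                new Chunk("\u0301", 52f, 6f)));  // acute accent over the last 'e'
        System.out.println(merged.get(0).text);
    }
}
```

In this sketch the three input chunks collapse into one chunk whose text is
"résumé". Because only the immediately preceding chunk is examined, a
diacritic that arrives much later than the text it overlays is left
unmerged, which matches the limitation stated in the resolution note.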



> Incorrect diacritic placement in text extraction
> ------------------------------------------------
>
>                 Key: PDFBOX-430
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-430
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Brian Carrier
>
> Some PDF files store diacritics (accents over characters) as separate text elements. The PDF files essentially have a chunk of text and then back up and place the diacritic over one of the characters in the chunk of text. With text extraction, the current design does not allow the diacritic to be placed over a character in the chunk and instead it is placed after the chunk.
> The debug-diac2.pdf file in PDFBOX-429 shows this problem. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: contributing to PDFBox (was: [jira] Resolved: (PDFBOX-430) Incorrect diacritic placement in text extraction)

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.
Hi Ken

Thanks a lot! I think that does it. Your ICLA has already been recorded.

On 05.03.2009 22:50:38 Ken Glidden wrote:
> Hi Jeremias,
> 
> I believe I've completed my action items:
> 1 - I emailed a signed ICLA to secretary@apache.org.
> 2 - I uploaded the relevant source files to PDFBOX-429 and PDFBOX-430 and checked the "grant license" box for both when I did so.
> 
> Let me know if there are any other open items.
> 
> Thanks,
> Ken
> 
> -----Original Message-----
> From: Jeremias Maerki [mailto:dev@jeremias-maerki.ch] 
> Sent: Monday, March 02, 2009 3:31 AM
> To: pdfbox-dev@incubator.apache.org
> Subject: Re: contributing to PDFBox (was: [jira] Resolved: (PDFBOX-430) Incorrect diacritic placement in text extraction)
> 
> Thanks for speaking up, Ken. It's a great thing you're contributing to
> PDFBox. But we actually do have legal issues to worry about here.
> 
> The way this happened, we don't have a legal trail to make sure that
> your contributions are actually intended for inclusion and under what
> license. Only Brian (hopefully) knows your intentions. When you attach a
> patch to a Jira issue, you have to tick a checkbox indicating that you
> intend this for inclusion:
> 
> "[ ] Grant license to ASF for inclusion in ASF works (as per the Apache
> License §5)
> Contributions intended for inclusion in ASF products (eg. patches, code)
> must be licensed to ASF under the terms of the Apache License. Other
> attachments (eg. log dumps, test cases) need not be."
> 
> With §5 of the ALv2 you explicitly give the ASF the same license for
> your changes as the ASF gives to its users. That is enough for smaller
> patches (bugfixes, small improvements). As soon as you contribute
> considerable new functionality or new files which have a certain
> "artistic" aspect, the §5 is considered insufficient at which point
> committers are expected to ask for a Contributor License Agreement to
> be filed with the ASF. Also, regular contributors should send in a CLA
> as it is also a precondition to becoming a committer. For even larger
> contributions (like whole new subsystems), a contribution may even have
> to go through IP clearance with an explicit separate license grant on
> the code submitted. So there are various levels. The lines are probably
> not always very clearly drawn. But the intent is to protect the users
> and the contributors (i.e. you) from legal harm [1]. That can only
> happen if we have a clean legal trail.
> 
> [1] http://apache.org/foundation/how-it-works.html#what
> (see especially the third point in the list)
> 
> I only noticed after this started that you and Justin LeFebvre are from
> the same company. Both of you have written more than one patch. So I
> would like to suggest that both of you send in an ICLA [2]. Please also
> check if the work contracts in your company make it necessary to send in
> a CCLA [2] in addition to the ICLAs.
> 
> [2] http://apache.org/licenses/#clas
> 
> A committer can always ask the PMC chair or an ASF member to check if a
> particular ICLA has been recorded yet.
> 
> Ken, can I ask you to attach the two (original) patches, that were
> processed via Brian, to the JIRA issues associated with them so the gaps
> are filled, even if that happens after the two patches were processed. I
> think that should be enough to correct the situation. In the future,
> please attach your patches to a new JIRA issue and take it from there.
> 
> There are other points also: by directly working with Brian, there is no
> discussion (if necessary) around this if anyone has any issues. Other
> committers can only react after everything has already happened. You're
> also not taking part in the community whose building is the most
> important task of PDFBox being in the Apache Incubator. And you're not
> getting the same visibility you'd get if you take part in discussions
> here. Only that way does the existing team have a chance to get to know
> you and to eventually vote you in as a committer if you turn out to be a
> regular contributor. That two employees of your company contribute
> to PDFBox means that it is important to you. Then it is all the more
> important that you participate in the project and jointly help evolve
> the project in directions that help you.
> 
> Everybody (especially Brian), don't feel bad about this! The Incubation
> phase is here for everybody to learn how we do things inside the Apache
> Software Foundation. There are a few rules that make the ASF so
> different from the ordinary SourceForge project. I know it's a lot of
> new stuff especially new committers have to learn. Hopefully, we mentors
> can help clear things up if there are questions or problems.
> 
> Thank you for your understanding!




Jeremias Maerki


Re: contributing to PDFBox (was: [jira] Resolved: (PDFBOX-430) Incorrect diacritic placement in text extraction)

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.
I can see that there is a CCLA from Basis Technology Corporation but it
lists neither Ken Glidden nor Justin LeFebvre in Schedule A. Are you also
an employee of that company? Anyway, the CCLA is always in addition to
an ICLA if a CCLA is necessary. Contributions to ASF projects are always
made by an individual, never by a company. That's why the ICLA is the more
important of the two. A company may have to officially sanction that
some of their employees contribute code to the ASF depending on the work
contracts. That's what the CCLA is for.

I'm sorry that I have to be difficult here but we have to make sure this
is all done right.

On 02.03.2009 15:11:47 Brian Carrier wrote:
> Hi Jeremias,
> 
> Sorry for the confusion. Ken did most of the patch and then had to  
> work on some other projects, so I did some final touches and  
> submitted it.  We already have a CCLA on file.
> 
> thanks,
> brian
> 
> 
> On Mar 2, 2009, at 3:30 AM, Jeremias Maerki wrote:
> 
> > On 01.03.2009 19:31:14 Ken Glidden wrote:
> >> I am said Ken Glidden.
> >> I'm VP of Engineering at Basis Technology and am working directly  
> >> with Brian on this.
> >> No legal issues to worry about.
> >> Cheers.
> >>
> >> -----Original Message-----
> >> From: Jeremias Maerki [mailto:dev@jeremias-maerki.ch]
> >> Sent: Saturday, February 28, 2009 12:26 PM
> >> To: pdfbox-dev@incubator.apache.org
> >> Subject: Re: [jira] Resolved: (PDFBOX-430) Incorrect diacritic  
> >> placement in text extraction
> >>
> >> Brian,
> >>
> >> you state here that you've applied a patch by one Ken Glidden. I  
> >> cannot
> >> find any post or submission from a person with that name on the  
> >> PDFBox
> >> mailing lists. So I'm concerned about the legal trail here. Can you
> >> explain that, please? Thank you.
> >>
> >> On 18.02.2009 22:36:01 Brian Carrier (JIRA) wrote:
> >>
> >>
> >>
> >>
> >> Jeremias Maerki
> >>
> >
> >
> >
> >
> > Jeremias Maerki
> >
> 




Jeremias Maerki


Re: contributing to PDFBox (was: [jira] Resolved: (PDFBOX-430) Incorrect diacritic placement in text extraction)

Posted by Brian Carrier <ca...@digital-evidence.org>.
Hi Jeremias,

Sorry for the confusion. Ken did most of the patch and then had to  
work on some other projects, so I did some final touches and  
submitted it.  We already have a CCLA on file.

thanks,
brian




RE: contributing to PDFBox (was: [jira] Resolved: (PDFBOX-430) Incorrect diacritic placement in text extraction)

Posted by Ken Glidden <Ke...@basistech.com>.
Hi Jeremias,

I believe I've completed my action items:
1 - I emailed a signed ICLA to secretary@apache.org.
2 - I uploaded the relevant source files to PDFBOX-429 and PDFBOX-430 and checked the "grant license" box for both when I did so.

Let me know if there are any other open items.

Thanks,
Ken


RE: contributing to PDFBox (was: [jira] Resolved: (PDFBOX-430) Incorrect diacritic placement in text extraction)

Posted by Ken Glidden <Ke...@basistech.com>.
Thanks,

I'll dot the i's and cross the t's as you suggest.  Thanks for the explanation.
We did check with our CTO.  He believes that the ICLAs are enough.


> > -- 
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> 
> 
> 
> 
> Jeremias Maerki
> 




Jeremias Maerki


RE: [jira] Resolved: (PDFBOX-430) Incorrect diacritic placement in text extraction

Posted by Ken Glidden <Ke...@basistech.com>.
I am said Ken Glidden.
I'm VP of Engineering at Basis Technology and am working directly with Brian on this.
No legal issues to worry about.
Cheers.

-----Original Message-----
From: Jeremias Maerki [mailto:dev@jeremias-maerki.ch] 
Sent: Saturday, February 28, 2009 12:26 PM
To: pdfbox-dev@incubator.apache.org
Subject: Re: [jira] Resolved: (PDFBOX-430) Incorrect diacritic placement in text extraction

Brian,

you state here that you've applied a patch by one Ken Glidden. I cannot
find any post or submission from a person with that name on the PDFBox
mailing lists. So I'm concerned about the legal trail here. Can you
explain that, please? Thank you.

On 18.02.2009 22:36:01 Brian Carrier (JIRA) wrote:
> 
>      [ https://issues.apache.org/jira/browse/PDFBOX-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> 
> Brian Carrier resolved PDFBOX-430.
> ----------------------------------
> 
>     Resolution: Fixed
> 
> Fixed with patch by Ken Glidden that merges a single diacritic text chunk into the previous text chunk if they overlap.  Note that this will not solve problems where the diacritic comes much after the text chunk it overlays, but we have not observed PDF files like that.
> 
> Sending        trunk/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java
> Sending        trunk/src/main/java/org/apache/pdfbox/util/TextPosition.java
> Sending        trunk/test/input/Acrobat9.pdf-sorted.txt
> Sending        trunk/test/input/Acrobat9.pdf.txt
> Transmitting file data ....Committed revision 745665.
> 
> 
> 
> > Incorrect diacritic placement in text extraction
> > ------------------------------------------------
> >
> >                 Key: PDFBOX-430
> >                 URL: https://issues.apache.org/jira/browse/PDFBOX-430
> >             Project: PDFBox
> >          Issue Type: Bug
> >            Reporter: Brian Carrier
> >
> > Some PDF files store diacritics (accents over characters) as separate text elements. The PDF files essentially have a chunk of text and then backup and place the diacritic over one of the characters in the chunk of text. With text extraction, the current design does not allow the diacritic to be placed over a character in the chunk and instead it is placed after the chunk. 
> > The debug-diac2.pdf file in PDFBOX-429 shows this problem. 
> 
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.




Jeremias Maerki


Re: [jira] Resolved: (PDFBOX-430) Incorrect diacritic placement in text extraction

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.
Brian,

you state here that you've applied a patch by one Ken Glidden. I cannot
find any post or submission from a person with that name on the PDFBox
mailing lists. So I'm concerned about the legal trail here. Can you
explain that, please? Thank you.

On 18.02.2009 22:36:01 Brian Carrier (JIRA) wrote:
> 
>      [ https://issues.apache.org/jira/browse/PDFBOX-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> 
> Brian Carrier resolved PDFBOX-430.
> ----------------------------------
> 
>     Resolution: Fixed
> 
> Fixed with patch by Ken Glidden that merges a single diacritic text chunk into the previous text chunk if they overlap.  Note that this will not solve problems where the diacritic comes much after the text chunk it overlays, but we have not observed PDF files like that.
> 
> Sending        trunk/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java
> Sending        trunk/src/main/java/org/apache/pdfbox/util/TextPosition.java
> Sending        trunk/test/input/Acrobat9.pdf-sorted.txt
> Sending        trunk/test/input/Acrobat9.pdf.txt
> Transmitting file data ....Committed revision 745665.
> 
> 
> 
> > Incorrect diacritic placement in text extraction
> > ------------------------------------------------
> >
> >                 Key: PDFBOX-430
> >                 URL: https://issues.apache.org/jira/browse/PDFBOX-430
> >             Project: PDFBox
> >          Issue Type: Bug
> >            Reporter: Brian Carrier
> >
> > Some PDF files store diacritics (accents over characters) as separate text elements. The PDF files essentially have a chunk of text and then backup and place the diacritic over one of the characters in the chunk of text. With text extraction, the current design does not allow the diacritic to be placed over a character in the chunk and instead it is placed after the chunk. 
> > The debug-diac2.pdf file in PDFBOX-429 shows this problem. 
> 
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.




Jeremias Maerki
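

Illustration (not part of the original thread): the resolution quoted
above describes the fix as merging a single diacritic text chunk into
the previous text chunk if the two overlap. The following is only a
rough sketch of that idea, not the patch committed in revision 745665.
The Chunk class, the helper names, and the small accent table are all
hypothetical, and Chunk is a stand-in rather than PDFBox's TextPosition.
The sketch assumes each chunk carries its left edge and one advance
width per character.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DiacriticMergeSketch {

    /** Hypothetical positioned text run; a stand-in, not PDFBox's TextPosition. */
    static final class Chunk {
        final String text;
        final float x;        // left edge of the chunk
        final float[] widths; // one advance width per char; 0 for combining marks

        Chunk(String text, float x, float[] widths) {
            this.text = text;
            this.x = x;
            this.widths = widths;
        }

        float right() {
            float r = x;
            for (float w : widths) {
                r += w;
            }
            return r;
        }
    }

    /** Maps a spacing accent to its combining form, or 0 if unknown (illustrative subset). */
    static char toCombining(char c) {
        switch (c) {
            case '\u00B4': return '\u0301'; // acute
            case '`':      return '\u0300'; // grave
            case '^':      return '\u0302'; // circumflex
            case '~':      return '\u0303'; // tilde
            case '\u00A8': return '\u0308'; // diaeresis
            default:       return 0;
        }
    }

    static boolean isSingleDiacritic(Chunk c) {
        return c.text.length() == 1 && toCombining(c.text.charAt(0)) != 0;
    }

    static boolean overlaps(Chunk base, Chunk mark) {
        return mark.x < base.right() && mark.right() > base.x;
    }

    /** Inserts the combining form right after the character under the mark's centre. */
    static Chunk attach(Chunk base, Chunk mark) {
        float centre = (mark.x + mark.right()) / 2f;
        int index = base.text.length() - 1; // fallback: last character
        float left = base.x;
        for (int i = 0; i < base.text.length(); i++) {
            float right = left + base.widths[i];
            if (centre >= left && centre < right) {
                index = i;
                break;
            }
            left = right;
        }
        int at = index + 1;
        String text = base.text.substring(0, at)
                + toCombining(mark.text.charAt(0))
                + base.text.substring(at);
        float[] widths = new float[base.widths.length + 1];
        System.arraycopy(base.widths, 0, widths, 0, at);
        widths[at] = 0f; // the combining mark takes no horizontal space
        System.arraycopy(base.widths, at, widths, at + 1, base.widths.length - at);
        return new Chunk(text, base.x, widths);
    }

    /** Merges each single-diacritic chunk into the chunk just before it when they overlap. */
    static List<Chunk> merge(List<Chunk> chunks) {
        List<Chunk> out = new ArrayList<>();
        for (Chunk c : chunks) {
            int last = out.size() - 1;
            if (last >= 0 && isSingleDiacritic(c) && overlaps(out.get(last), c)) {
                out.set(last, attach(out.get(last), c));
            } else {
                out.add(c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Chunk> merged = merge(Arrays.asList(
                new Chunk("cafe", 0f, new float[] {5f, 5f, 5f, 5f}),
                new Chunk("\u00B4", 15.5f, new float[] {3f})));
        for (Chunk c : merged) {
            StringBuilder sb = new StringBuilder();
            for (char ch : c.text.toCharArray()) {
                sb.append(String.format("U+%04X ", (int) ch));
            }
            System.out.println(sb.toString().trim());
        }
    }
}
```

Because the sketch only ever looks at the immediately preceding chunk,
it shares the limitation noted in the resolution: a diacritic that
arrives much later than the text it overlays is left as its own chunk.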