Posted to dev@pdfbox.apache.org by "Rubesh MX (JIRA)" <ji...@apache.org> on 2011/09/22 08:03:26 UTC

[jira] [Created] (PDFBOX-1123) Not able to read field values from a PDF File if the field contains special characters.

Not able to read field values from a PDF File if the field contains special characters.
---------------------------------------------------------------------------------------

                 Key: PDFBOX-1123
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1123
             Project: PDFBox
          Issue Type: Bug
            Reporter: Rubesh MX
            Priority: Critical


Hi, I am trying to read the field names in a PDF file. This works with most files, but in some files we are not able to read the field ID/name. The affected files have field names such as:
topmostSubform[0].Page1[0].c1_04_0_[0]
topmostSubform[0].Page1[0].c1_09_0_
topmostSubform[0].Page2[0].Table_Line4a[0].#subform[1].p2-t69[0]
All of these field names start with topmostSubform[0]. When we read a name with PDField.getPartialName(), it appears to be truncated at the '.' and we get only topmostSubform[0]. Because every field name starts with the same prefix, the total field count comes out as 1. We suspect that special characters such as '.', '_' and '#' are causing the issue. Could you please advise? This is very critical for us.



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PDFBOX-1123) Not able to read field values from a PDF File if the field contains special characters.

Posted by "Rubesh MX (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117890#comment-13117890 ] 

Rubesh MX commented on PDFBOX-1123:
-----------------------------------

Hi Andreas,
Thanks very much for the quick reply. The code snippet is given below. It just shows all field names, types and values in a message box; once this succeeds, we will build the code the project actually needs. This is with the latest version for .NET. I have also tried the latest version that is not yet officially released, as mentioned by you, from http://pdfbox.lehmi.de/

string fileIn = @"C:\fspl.pdf";
PDDocument pdDoc = PDDocument.load(fileIn);
PDDocumentCatalog pdCat = pdDoc.getDocumentCatalog();
var obj = pdCat.getAcroForm().getFields().toArray();
foreach (var stx in obj)
{
    PDField pdd = (PDField)stx;
    MessageBox.Show(pdd.getPartialName() + "|" + pdd.getFieldType() + "|" + pdd.getValue());
}
pdDoc.save(fileIn);
pdDoc.close();
Please let me know if there is something wrong with the way I am reading the field names. Your suggestions/comments on this will be much appreciated. Thanks again.



                
> Not able to read field values from a PDF File if the field contains special characters.
> ---------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1123
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1123
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Rubesh MX
>            Priority: Minor
>              Labels: acroform
>         Attachments: fspl.pdf
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h

        

[jira] [Updated] (PDFBOX-1123) Not able to read field values from a PDF File if the field contains special characters.

Posted by "Andreas Lehmkühler (Updated JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-1123:
---------------------------------------

    Priority: Minor  (was: Critical)
      Labels: acroform  (was: Bug)

How did you extract the field names? Please post the relevant code snippet.
                

       

[jira] [Commented] (PDFBOX-1123) Not able to read field values from a PDF File if the field contains special characters.

Posted by "Maruan Sahyoun (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118091#comment-13118091 ] 

Maruan Sahyoun commented on PDFBOX-1123:
----------------------------------------

Hi Rubesh,

When you are at the field topmostSubform[0], call getKids(). You will get two new fields, Page1[0] and Page2[0]. Call getKids() again on Page1[0] and you will get two children, c1_04_0_[0] and c1_09_0_, which are the fields whose values you want to get or set.

Good luck - Maruan
                

        

[jira] [Commented] (PDFBOX-1123) Not able to read field values from a PDF File if the field contains special characters.

Posted by "Maruan Sahyoun (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119252#comment-13119252 ] 

Maruan Sahyoun commented on PDFBOX-1123:
----------------------------------------

Hi Rubesh,


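// getFields() returns only the top-level field nodes.
List fields = document.getDocumentCatalog().getAcroForm().getFields();
PDField firstLevel = (PDField) fields.get(0);
// getKids() returns the next level below this node.
List kids = firstLevel.getKids();
PDField firstKid = (PDField) kids.get(0);
System.out.println(firstKid.getFullyQualifiedName());

Note that getKids() returns null if a field has no kids, so check for null before calling kids.get(0).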
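
The snippet above only looks at the first kid of the first top-level node. A minimal sketch of walking the whole tree is given below. It assumes a hypothetical stand-in class (FieldTreeWalk.Node) instead of the real PDFBox classes, so that it compiles on its own; the real getKids() has a different return type and may throw. The sketch also treats every node without kids as a terminal field, which is a simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: Node is a hypothetical stand-in for a form field node,
// not the PDFBox PDField class.
class FieldTreeWalk {

    static final class Node {
        private final String partialName;
        private final List<Node> kids; // null when the node has no kids

        Node(String partialName, Node... kids) {
            this.partialName = partialName;
            this.kids = kids.length == 0 ? null : Arrays.asList(kids);
        }

        String getPartialName() { return partialName; }

        // Modelled on the behaviour described above: null means "no kids".
        List<Node> getKids() { return kids; }
    }

    // Returns the dotted names of all nodes without kids, depth first.
    static List<String> terminalNames(List<Node> topLevel) {
        List<String> out = new ArrayList<>();
        for (Node node : topLevel) {
            collect(node, "", out);
        }
        return out;
    }

    private static void collect(Node node, String prefix, List<String> out) {
        String name = prefix.isEmpty()
                ? node.getPartialName()
                : prefix + "." + node.getPartialName();
        List<Node> kids = node.getKids();
        if (kids == null) { // null guard before touching the list
            out.add(name);
            return;
        }
        for (Node kid : kids) {
            collect(kid, name, out);
        }
    }

    public static void main(String[] args) {
        Node root = new Node("topmostSubform[0]",
                new Node("Page1[0]",
                        new Node("c1_04_0_[0]"),
                        new Node("c1_09_0_")));
        List<Node> topLevel = Arrays.asList(root);
        System.out.println("top-level nodes: " + topLevel.size());
        for (String name : terminalNames(topLevel)) {
            System.out.println(name);
        }
    }
}
```

With this sample tree the top-level list has a single entry, which matches the field count of 1 reported in the issue, while the walk reaches both fields under Page1[0].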

Kind regards Maruan



                

        

[jira] [Commented] (PDFBOX-1123) Not able to read field values from a PDF File if the field contains special characters.

Posted by "Rubesh MX (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119142#comment-13119142 ] 

Rubesh MX commented on PDFBOX-1123:
-----------------------------------

Hi Maruan, thanks very much for the help so far.
I just want to confirm again; sorry, I am still confused.
When I get the collection:
 var ob = pdCat.getAcroForm().getFields().toArray();
this gives me the collection of all the first-level field nodes (which is topmostSubform[0]). When I loop through this collection, cast each entry to a PDField and call getKids(), I should get the kids at that level. Am I correct?
getKids() appears to be of type String(), so how can I get the next set of kids from this collection? I am sorry that I have not understood this correctly. Do you have a sample that extracts the kids and so on? That would be of great help to me.

                

        

[jira] [Commented] (PDFBOX-1123) Not able to read field values from a PDF File if the field contains special characters.

Posted by "Rubesh MX (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118013#comment-13118013 ] 

Rubesh MX commented on PDFBOX-1123:
-----------------------------------

Hi Maruan, first, a big thanks for replying quickly. I am slightly confused now and need some clarification on your comment:
"You need to call .getKids() on each field node which will give you all the kids, inspect if you are at a field or at another node and move on until you get to the final field"
So you mean that once I have the collection (var obj = pdCat.getAcroForm().getFields().toArray();), instead of PDField pdd = (PDField)stx; followed by pdd.getPartialName(), I should call pdd.getKids()? I do not get the right result when I try this. Sorry, I have not fully understood your explanation.
Could you please clarify?


                

        

[jira] [Commented] (PDFBOX-1123) Not able to read field values from a PDF File if the field contains special characters.

Posted by "Rubesh MX (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117061#comment-13117061 ] 

Rubesh MX commented on PDFBOX-1123:
-----------------------------------

Hi Andreas, any confirmation from you on this issue would be much appreciated. Our application also needs to read the file I attached, so we are waiting to hear from you before we go ahead with the development plan.
                

        

[jira] [Issue Comment Edited] (PDFBOX-1123) Not able to read field values from a PDF File if the field contains special characters.

Posted by "Maruan Sahyoun (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117915#comment-13117915 ] 

Maruan Sahyoun edited comment on PDFBOX-1123 at 9/30/11 7:26 AM:
-----------------------------------------------------------------

@Rubesh
getAcroForm().getFields() returns only the first level of field nodes, which is topmostSubform in your case. You need to call .getKids() on each field node, which gives you all its kids; check whether you are at a field or at another node, and continue until you reach the final field. There you can use either .getPartialName() to retrieve only the field's own name or .getFullyQualifiedName() to get the name including the parents.

As an alternative, you might want to use the findKid() method on a top-level field node, which drills down based on an array of names created, e.g., by calling split("\\.") on the full names of the fields you are looking for.

Or you can pass the fully qualified name to getDocumentCatalog().getAcroForm().getField().
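
The drill-down by name can be sketched as follows. This is not the PDFBox implementation of findKid() or getField(); FieldPathLookup, Node and find are hypothetical names used only for illustration. The sketch splits the full name with split("\\.") and matches one partial name per level, assuming no partial name itself contains a '.'.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: Node is a hypothetical stand-in, not a PDFBox class.
class FieldPathLookup {

    static final class Node {
        final String partialName;
        final List<Node> kids; // empty for a terminal field

        Node(String partialName, Node... kids) {
            this.partialName = partialName;
            this.kids = Arrays.asList(kids);
        }
    }

    // Resolves a dotted name by matching one partial name per tree level.
    // Returns null when any segment has no matching node.
    static Node find(List<Node> topLevel, String fullyQualifiedName) {
        String[] segments = fullyQualifiedName.split("\\.");
        List<Node> candidates = topLevel;
        Node current = null;
        for (String segment : segments) {
            current = null;
            for (Node candidate : candidates) {
                if (candidate.partialName.equals(segment)) {
                    current = candidate;
                    break;
                }
            }
            if (current == null) {
                return null;
            }
            candidates = current.kids;
        }
        return current;
    }

    public static void main(String[] args) {
        Node root = new Node("topmostSubform[0]",
                new Node("Page2[0]",
                        new Node("Table_Line4a[0]",
                                new Node("#subform[1]",
                                        new Node("p2-t69[0]")))));
        Node hit = find(Arrays.asList(root),
                "topmostSubform[0].Page2[0].Table_Line4a[0].#subform[1].p2-t69[0]");
        System.out.println(hit == null ? "not found" : hit.partialName);
    }
}
```

Only the '.' acts as a separator here; characters such as '[', ']', '_', '#' and '-' stay part of each segment, so names like #subform[1] are matched as-is.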
                
                  

        

[jira] [Commented] (PDFBOX-1123) Not able to read field values from a PDF File if the field contains special characters.

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113966#comment-13113966 ] 

Andreas Lehmkühler commented on PDFBOX-1123:
--------------------------------------------

Can you provide us with a sample pdf?


       

[jira] [Updated] (PDFBOX-1123) Not able to read field values from a PDF File if the field contains special characters.

Posted by "Rubesh MX (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rubesh MX updated PDFBOX-1123:
------------------------------

    Attachment: fspl.pdf

The file that contains the special characters in the field names is attached. As I mentioned earlier, when I try to read the field names they get truncated at '.', and I am not able to read all of the field names either.


        

[jira] [Issue Comment Edited] (PDFBOX-1123) Not able to read field values from a PDF File if the field contains special characters.

Posted by "Maruan Sahyoun (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117915#comment-13117915 ] 

Maruan Sahyoun edited comment on PDFBOX-1123 at 9/30/11 7:20 AM:
-----------------------------------------------------------------

@Rubesh
getAcroForm().getFields() returns only the first level of field nodes, which is topmostSubform in your case. You need to call .getKids() on each field node, which gives you all its kids; check whether you are at a field or at another node, and continue until you reach the final field. There you can use either .getPartialName() to retrieve only the field's own name or .getFullyQualifiedName() to get the name including the parents.

As an alternative, you might want to use the findKid() method on a top-level field node, which drills down based on an array of names created, e.g., by calling split("\\.") on the full names of the fields you are looking for.

@Andreas
If I'm not mistaken, there is currently no easier way to get a field by supplying its fullyQualifiedName, right? Should we add a convenience method to do so?
                
                  

        

[jira] [Commented] (PDFBOX-1123) Not able to read field values from a PDF File if the field contains special characters.

Posted by "Maruan Sahyoun (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117915#comment-13117915 ] 

Maruan Sahyoun commented on PDFBOX-1123:
----------------------------------------

@Rubesh
getAcroForm().getFields() returns only the first level of field nodes, which is topmostSubform in your case. You need to call .getKids() on each field node, which gives you all its kids; check whether you are at a field or at another node, and continue until you reach the final field. There you can use either .getPartialName() to retrieve only the field's own name or .getFullyQualifiedName() to get the name including the parents.
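
To illustrate the difference between the two names, here is a minimal sketch that derives a fully qualified name from partial names by walking up the parent chain. FieldNames, Node and fullyQualifiedName are hypothetical names chosen for this example; this is not how PDFBox stores or computes the names internally.

```java
// Sketch only: Node is a hypothetical stand-in, not a PDFBox class.
class FieldNames {

    static final class Node {
        final String partialName;
        final Node parent; // null for a top-level node

        Node(String partialName, Node parent) {
            this.partialName = partialName;
            this.parent = parent;
        }
    }

    // Joins the partial names from the top-level ancestor down to the node.
    static String fullyQualifiedName(Node node) {
        StringBuilder name = new StringBuilder(node.partialName);
        for (Node p = node.parent; p != null; p = p.parent) {
            name.insert(0, p.partialName + ".");
        }
        return name.toString();
    }

    public static void main(String[] args) {
        Node root = new Node("topmostSubform[0]", null);
        Node page = new Node("Page1[0]", root);
        Node leaf = new Node("c1_04_0_[0]", page);
        System.out.println(leaf.partialName);
        System.out.println(fullyQualifiedName(leaf));
    }
}
```

Seen this way, a partial name of topmostSubform[0] on a top-level node is not a truncated name; it is simply the first segment of the full dotted name.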

@Andreas
If I'm not mistaken, there is currently no easier way to get a field by supplying its fullyQualifiedName, right? Should we add a convenience method to do so?
                