
Posted to users@pdfbox.apache.org by Roberto Nibali <rn...@gmail.com> on 2015/08/18 20:50:18 UTC

How to extract the object id from a form field?

Hi

I'd like to print out the corresponding object id given a specific form
field. How would I do that with PDFBox programmatically?

Let's assume, for the sake of argument, that the form field is
represented by the following obj:

obj 218 0
  <<
    /DA <2B94B0298F2FD7F81F32C6E22043>
    /F 4
    /FT /Tx
    /Ff 4194304
    /MK
    /P 28 0 R
    /Parent 46 0 R
    /Rect [159.781 764.53 347.142 777.195]
    /Subtype /Widget
    /T <5EB6B730886188AB3D3194B9654C18094C>
    /Type /Annot
    /V <45BBBA249C618BBD3974A4BE61501E57181D>
    /AP 666 0 R
  >>

If I am going over all PDField entries of a PDF, how would I get to the
underlying obj number (in the above case 218) from a PDField object?

Best regards
Roberto

Re: How to extract the object id from a form field?

Posted by John Hewson <jo...@jahewson.com>.

> On 19 Aug 2015, at 12:36, Tilman Hausherr <TH...@t-online.de> wrote:
> 
> Am 19.08.2015 um 21:16 schrieb Roberto Nibali:
>> 
>> At least for simple geometric shapes, like rectangles, this should be feasible, no? Anyway, after constantly getting "null" from the getCOSObject().getItem(COSName.KIDS) and nothing out of getChildren() from a given PDField, I kind of gave up.
> 
> That should work higher up elements.... You have to start at /Acroform and work with the intermediate elements (non terminal fields). The 2.0 PDFDebugger shows you what elements are indirect objects. And then go down until you "hit" one of your PDField objects.
> 
> It is tricky because the PDFBox code tries to hide the indirect objects from the ordinary user most of the time, and the COSBase objects don't know from what indirect objects they come.

I added a method a while ago to solve exactly this problem:

	COSDocument#getKey(COSBase)

It returns the COSObjectKey which corresponds to a given indirect object. It does this by examining every indirect object until it finds the one it’s after.
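
To make the lookup described above concrete, here is a minimal sketch of the idea in plain Java. It is a toy model, not PDFBox code: the class ObjectKeyLookupSketch, its Key type and the methods register and keyOf are all hypothetical names made up for illustration. With PDFBox 2.0 the real call would presumably be something like srcDoc.getDocument().getKey(field.getCOSObject()), but that line is an assumption and has not been run against the PDF from this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: these classes are hypothetical and are not PDFBox types.
class ObjectKeyLookupSketch {

    // Stands in for an object key: object number plus generation number.
    static final class Key {
        final long number;
        final int generation;

        Key(long number, int generation) {
            this.number = number;
            this.generation = generation;
        }

        @Override
        public String toString() {
            return number + " " + generation + " R";
        }
    }

    // Stands in for the document's table of indirect objects.
    private final Map<Key, Object> pool = new LinkedHashMap<>();

    void register(long number, int generation, Object value) {
        pool.put(new Key(number, generation), value);
    }

    // Reverse lookup: scan every indirect object and compare by identity (==),
    // because a dereferenced object does not remember where it came from.
    Key keyOf(Object target) {
        for (Map.Entry<Key, Object> entry : pool.entrySet()) {
            if (entry.getValue() == target) {
                return entry.getKey();
            }
        }
        return null; // a direct (non-indirect) object has no key
    }

    public static void main(String[] args) {
        ObjectKeyLookupSketch doc = new ObjectKeyLookupSketch();
        Map<String, Object> field = new LinkedHashMap<>();
        field.put("FT", "Tx");
        doc.register(46, 0, new LinkedHashMap<String, Object>());
        doc.register(218, 0, field);

        System.out.println(doc.keyOf(field));                 // 218 0 R
        System.out.println(doc.keyOf(new LinkedHashMap<>())); // null
    }
}
```

The second lookup returns null even though an equal (empty) dictionary is registered as object 46, because the scan compares identity rather than content. The scan is linear in the number of indirect objects, so calling it once per field on a large document may be noticeably slow.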

— John

> Tilman
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
> 



Re: How to extract the object id from a form field?

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 19.08.2015 um 21:16 schrieb Roberto Nibali:
>
> At least for simple geometric shapes, like rectangles, this should be 
> feasible, no? Anyway, after constantly getting "null" from the 
> getCOSObject().getItem(COSName.KIDS) and nothing out of getChildren() 
> from a given PDField, I kind of gave up.

That should work for elements higher up. You have to start at /AcroForm
and work with the intermediate elements (non-terminal fields). The 2.0
PDFDebugger shows you which elements are indirect objects. Then go
down until you "hit" one of your PDField objects.

It is tricky because the PDFBox code tries to hide the indirect objects
from the ordinary user most of the time, and the COSBase objects don't
know which indirect objects they come from.
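
The walk described above can be sketched as follows. This is a toy model in plain Java, not PDFBox code: Dict and Ref are hypothetical stand-ins for a dictionary and an indirect reference, and collect is a made-up helper. The point it illustrates is that the object number is only visible on the reference, so it has to be recorded while descending from the top-level fields. In PDFBox itself the raw /Kids entries would be COSObject instances (as noted elsewhere in this thread); treating their object number as the field's id is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: Ref and Dict are hypothetical stand-ins, not PDFBox classes.
class FieldTreeWalkSketch {

    // A dictionary that, like a dereferenced object, does not know its own number.
    static final class Dict {
        final String partialName;                 // models /T
        final List<Ref> kids = new ArrayList<>(); // models /Kids

        Dict(String partialName) {
            this.partialName = partialName;
        }
    }

    // An indirect reference: the only place where the object number is visible.
    static final class Ref {
        final long objectNumber;
        final Dict target;

        Ref(long objectNumber, Dict target) {
            this.objectNumber = objectNumber;
            this.target = target;
        }
    }

    // Walk down from the form's top-level references and note each object
    // number at the moment the reference is followed.
    static Map<String, Long> collect(List<Ref> topLevelFields) {
        Map<String, Long> numbersByName = new LinkedHashMap<>();
        for (Ref ref : topLevelFields) {
            walk(ref, "", numbersByName);
        }
        return numbersByName;
    }

    private static void walk(Ref ref, String parentName, Map<String, Long> out) {
        Dict dict = ref.target;
        String fullName = parentName.isEmpty()
                ? dict.partialName
                : parentName + "." + dict.partialName;
        if (dict.kids.isEmpty()) {
            out.put(fullName, ref.objectNumber); // terminal field
        } else {
            for (Ref kid : dict.kids) {          // non-terminal field
                walk(kid, fullName, out);
            }
        }
    }

    public static void main(String[] args) {
        Dict parent = new Dict("form");
        parent.kids.add(new Ref(218, new Dict("name")));
        parent.kids.add(new Ref(228, new Dict("flag")));

        List<Ref> fields = new ArrayList<>();
        fields.add(new Ref(46, parent));

        System.out.println(collect(fields)); // {form.name=218, form.flag=228}
    }
}
```

The sketch treats every kid as a field. In a real PDF the kids of a terminal field can also be pure widget annotations, which this model does not cover.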

Tilman


Re: How to extract the object id from a form field?

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hi,

> Am 19.08.2015 um 21:16 schrieb Roberto Nibali <rn...@gmail.com>:
> 
> Hi Tilman
> 
> Thanks for your reply ... I did not really succeed. We'll probably end up looking at how the PDFDebugger code does it ;).
> 
> On Tue, Aug 18, 2015 at 9:08 PM, Tilman Hausherr <THausherr@t-online.de> wrote:
> Am 18.08.2015 um 20:50 schrieb Roberto Nibali:
>> Hi
>> 
>> I'd like to print out the corresponding object id given a specific form
>> field. How would I do that with PDFBox programmatically?
>> 
>> Let's for the sake of the argument, assume that the form field is
>> represented by the following obj:
>> 
>> obj 218 0
>>   <<
>>     /DA <2B94B0298F2FD7F81F32C6E22043>
>>     /F 4
>>     /FT /Tx
>>     /Ff 4194304
>>     /MK
>>     /P 28 0 R
>>     /Parent 46 0 R
>>     /Rect [159.781 764.53 347.142 777.195]
>>     /Subtype /Widget
>>     /T <5EB6B730886188AB3D3194B9654C18094C>
>>     /Type /Annot
>>     /V <45BBBA249C618BBD3974A4BE61501E57181D>
>>     /AP 666 0 R
>>   >>
>> 
>> If I am going over all PDField entries of a PDF, how would I get to the
>> underlying obj number (in the above case 218) from a PDField object?
> 
> I haven't tried this myself, but I think you could "synchronise" the getChildren() results with the getCOSObject().getItem(COSName.KIDS) array, i.e. sort out which indirect type is which item returned from getChildren(). The Kids COSArray has indirect objects (= COSObject type), as seen here:
> 
> 
> 
> COSObject.getObject() returns the dereferenced object.
> 
> The reason I asked about this is that while migrating some documents, we found out that the originating PDFs not only have textual changes in the PDF (mostly legal aspect changes in the fix text); the client in certain cases modified the PDFs by adding borders or other graphical elements inside. Those obviously do not show up in the template PDF. 
> 
> My somewhat (maybe stupid) idea was to simply print out the obj id or even the whole object and subsequently insert it into the template for the final PDF during the form field migration, on top of updating all references to the new obj id.
> 
> At least for simple geometric shapes, like rectangles, this should be feasible, no? Anyway, after constantly getting "null" from the getCOSObject().getItem(COSName.KIDS) and nothing out of getChildren() from a given PDField, I kind of gave up.
> 
> Imagine you had the following code, and wanted to additionally dump out the underlying object id and the referencing ids of the PDField:
> @Test
> private void excuteDumpFields() throws IOException {
>     PDDocument srcDoc = null;
>     try {
>         srcDoc = PDDocument.load(new File(srcDocName)); 
>         PDAcroForm acroForm = srcDoc.getDocumentCatalog().getAcroForm();
>         List<PDField> fields = acroForm.getFields();
>         for (PDField field : fields) {
>             dumpField(srcDoc, field);
>         }
>         srcDoc.close();
>     } catch (Exception e) {
>         logerr(e.getMessage());
>     } finally {
>         if (srcDoc != null) {
>             srcDoc.close();
>         }
>     }
> }
> 
> private void dumpField(PDDocument srcDoc, PDField srcField) throws IOException {
>     if (srcField instanceof PDNonTerminalField) {
>         for (PDField child : ((PDNonTerminalField) srcField).getChildren()) {
>             dumpField(srcDoc, child);
>         }
>     } else if (!(srcField instanceof PDSignatureField)) {
>         String fqName = srcField.getFullyQualifiedName();
>         String fTypes[] = srcField.getClass().getName().split("\\.");
>         System.out.printf("fqName=%s type=%s%n", fqName, fTypes[fTypes.length-1]);
>     }
> }
> 
> I have made it a habit to dump the objects using pdf-parser (http://blog.didierstevens.com/programs/pdf-tools/) as follows to further investigate issues (excerpt showing the dump of object 228):
> 
> $ python pdf-parser.py -o 228 ../../ccmig2.pdf
> 
> obj 228 0
>  Type: /Annot
>  Referencing: 685 0 R, 28 0 R, 46 0 R, 686 0 R
> 
>   <<
>     /AA
>       <<
>         /K 685 0 R
>       >>
>     /DA <92F8913CB200CF3C13A363C2D20D>
>     /F 4
>     /FT /Tx
>     /Ff 12582912
>     /MK
>     /MaxLen 1
>     /P 28 0 R
>     /Parent 46 0 R
>     /Q 1
>     /Rect [454.437 769.504 465.482 782.169]
>     /Subtype /Widget
>     /T <8C8A>
>     /Type /Annot
>     /V ()
>     /AP 686 0 R
>   >>
> 
> And to get the objects referencing object 228:
> 
> $ python pdf-parser.py -r 228 ../../ccmig2.pdf
> 
> obj 28 0
>  Type: /Page
>  Referencing: 101 0 R, 217 0 R, 218 0 R, 219 0 R, 220 0 R, 221 0 R, 222 0 R, 223 0 R, 224 0 R, 225 0 R, 226 0 R, 227 0 R, 228 0 R, 229 0 R, 230 0 R, 231 0 R, 232 0 R, 61 0 R, 60 0 R, 62 0 R, 63 0 R, 64 0 R, 65 0 R, 66 0 R, 67 0 R, 69 0 R, 68 0 R, 70 0 R, 71 0 R, 72 0 R, 73 0 R, 74 0 R, 75 0 R, 76 0 R, 77 0 R, 78 0 R, 79 0 R, 80 0 R, 81 0 R, 82 0 R, 83 0 R, 84 0 R, 86 0 R, 87 0 R, 88 0 R, 89 0 R, 90 0 R, 91 0 R, 92 0 R, 93 0 R, 94 0 R, 95 0 R, 96 0 R, 97 0 R, 85 0 R, 233 0 R, 234 0 R, 235 0 R, 236 0 R, 237 0 R, 238 0 R, 239 0 R, 22 0 R, 240 0 R, 241 0 R, 242 0 R, 243 0 R, 244 0 R, 245 0 R, 246 0 R, 247 0 R, 103 0 R, 248 0 R, 6 0 R, 205 0 R, 206 0 R, 207 0 R, 208 0 R, 209 0 R, 210 0 R, 211 0 R, 213 0 R, 212 0 R
> 
>   <<
>     /Annots '[101 0 R 217 0 R 218 0 R 219 0 R 220 0 R 221 0 R 222 0 R 223 0 R 224 0 R 225 0 R\n226 0 R 227 0 R 228 0 R 229 0 R 230 0 R 231 0 R 232 0 R 61 0 R 60 0 R 62 0 R\n63 0 R 64 0 R 65 0 R 66 0 R 67 0 R 69 0 R 68 0 R 70 0 R 71 0 R 72 0 R\n73 0 R 74 0 R 75 0 R 76 0 R 77 0 R 78 0 R 79 0 R 80 0 R 81 0 R 82 0 R\n83 0 R 84 0 R 86 0 R 87 0 R 88 0 R 89 0 R 90 0 R 91 0 R 92 0 R 93 0 R\n94 0 R 95 0 R 96 0 R 97 0 R 85 0 R 233 0 R 234 0 R 235 0 R 236 0 R 237 0 R\n238 0 R 239 0 R 22 0 R 240 0 R 241 0 R 242 0 R 243 0 R 244 0 R 245 0 R 246 0 R\n247 0 R 103 0 R]'
>     /BleedBox [0.0 0.0 595.276 841.89]
>     /Contents 248 0 R
>     /CropBox [0.0 0.0 595.276 841.89]
>     /MediaBox [0.0 0.0 595.276 841.89]
>     /Parent 6 0 R
>     /Resources
>       <<
>         /ExtGState
>           <<
>             /GS0 205 0 R
>             /GS1 206 0 R
>             /GS2 207 0 R
>             /GS3 208 0 R
>           >>
>         /Font
>           <<
>             /C2_0 209 0 R
>             /C2_1 210 0 R
>             /TT0 211 0 R
>             /TT1 213 0 R
>             /TT2 212 0 R
>           >>
>         /ProcSet [/PDF /Text]
>       >>
>     /Rotate 0
>     /Tabs /W
>     /TrimBox [0.0 0.0 595.276 841.89]
>     /Type /Page
>   >>
> 
> 
> obj 46 0
>  Type:
>  Referencing: 218 0 R, 230 0 R, 231 0 R, 232 0 R, 219 0 R, 217 0 R, 220 0 R, 221 0 R, 222 0 R, 223 0 R, 224 0 R, 225 0 R, 226 0 R, 227 0 R, 228 0 R, 229 0 R, 17 0 R
> 
>   <<
>     /Kids '[218 0 R 230 0 R 231 0 R 232 0 R 219 0 R 217 0 R 220 0 R 221 0 R 222 0 R 223 0 R\n224 0 R 225 0 R 226 0 R 227 0 R 228 0 R 229 0 R]'
>     /Parent 17 0 R
>     /T <32AB37>
>   >>
> 
> It would be tremendous if I could get at least the proper object id out of the PDFields using PDFBox.

A PDField is uniquely identified by its fully qualified name, which can also be used to find it within the template. If someone added a border to a field in the source document and you would like to add it to the corresponding field in the template document, that border is part of the widget definition for the field, e.g. the /MK entry. Acrobat also applies some defaults, e.g. when a border color is defined there will be a small border around the field even if no border width is defined.

If I understood your use case correctly, knowing the object id of the field wouldn't help in this case.
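
As a rough illustration of matching by name instead of by object id, here is a toy sketch in plain Java. It is not PDFBox code: fields are modelled as plain maps keyed by their fully qualified name, and copyAppearance is a hypothetical helper. With PDFBox the lookup on the template side would go through something like PDAcroForm.getField(fullyQualifiedName); how the /MK entry is then copied is left open here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: fields are plain maps keyed by fully qualified name.
class AppearanceCopySketch {

    // Copy the MK entry from each source field to the template field with the
    // same fully qualified name; returns how many fields were updated.
    static int copyAppearance(Map<String, Map<String, Object>> source,
                              Map<String, Map<String, Object>> template) {
        int updated = 0;
        for (Map.Entry<String, Map<String, Object>> entry : source.entrySet()) {
            Map<String, Object> target = template.get(entry.getKey());
            Object mk = entry.getValue().get("MK");
            if (target != null && mk != null) {
                target.put("MK", mk);
                updated++;
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        Map<String, Object> srcField = new LinkedHashMap<>();
        srcField.put("MK", "border colour: black"); // placeholder value
        Map<String, Map<String, Object>> source = new LinkedHashMap<>();
        source.put("form.name", srcField);

        Map<String, Map<String, Object>> template = new LinkedHashMap<>();
        template.put("form.name", new LinkedHashMap<String, Object>());

        System.out.println(copyAppearance(source, template)); // 1
        System.out.println(template.get("form.name").get("MK"));
    }
}
```

No object number appears anywhere in this matching step, which is the point made above: the name is enough to pair a source field with its template counterpart.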

BR
Maruan


> 
> Take care
> Roberto
> 
>

Re: How to extract the object id from a form field?

Posted by Roberto Nibali <rn...@gmail.com>.

Hi Tilman

Thanks for your reply ... I did not really succeed. We'll probably end up
looking at how the PDFDebugger code does it ;).

On Tue, Aug 18, 2015 at 9:08 PM, Tilman Hausherr <TH...@t-online.de>
wrote:

> Am 18.08.2015 um 20:50 schrieb Roberto Nibali:
>
> Hi
>
> I'd like to print out the corresponding object id given a specific form
> field. How would I do that with PDFBox programmatically?
>
> Let's for the sake of the argument, assume that the form field is
> represented by the following obj:
>
> obj 218 0
>   <<
>     /DA <2B94B0298F2FD7F81F32C6E22043>
>     /F 4
>     /FT /Tx
>     /Ff 4194304
>     /MK
>     /P 28 0 R
>     /Parent 46 0 R
>     /Rect [159.781 764.53 347.142 777.195]
>     /Subtype /Widget
>     /T <5EB6B730886188AB3D3194B9654C18094C>
>     /Type /Annot
>     /V <45BBBA249C618BBD3974A4BE61501E57181D>
>     /AP 666 0 R
>   >>
>
> If I am going over all PDField entries of a PDF, how would I get to the
> underlying obj number (in the above case 218) from a PDField object?
>
>
> I haven't tried this myself, but I think you could "synchronise" the
> getChildren() results with the getCOSObject().getItem(COSName.KIDS) array,
> i.e. sort out which indirect type is which item returned from
> getChildren(). The Kids COSArray has indirect objects (= COSObject type),
> as seen here:
>
>
>
> COSObject.getObject() returns the dereferenced object.
>

The reason I asked about this is that while migrating some documents, we
found out that the originating PDFs not only have textual changes
(mostly changes to legal aspects in the fixed text); in certain cases the
client also modified the PDFs by adding borders or other graphical
elements. Those obviously do not show up in the template PDF.

My (maybe stupid) idea was to simply print out the obj id or even
the whole object and subsequently insert it into the template for the final
PDF during the form field migration, on top of updating all references to
the new obj id.

At least for simple geometric shapes, like rectangles, this should be
feasible, no? Anyway, after constantly getting "null" from the
getCOSObject().getItem(COSName.KIDS) and nothing out of getChildren() from
a given PDField, I kind of gave up.

Imagine you had the following code, and wanted to additionally dump out the
underlying object id and the referencing ids of the PDField:

@Test
public void executeDumpFields() throws IOException {
    // try-with-resources closes the document exactly once, even on failure
    try (PDDocument srcDoc = PDDocument.load(new File(srcDocName))) {
        PDAcroForm acroForm = srcDoc.getDocumentCatalog().getAcroForm();
        if (acroForm == null) {
            return; // the document has no form
        }
        for (PDField field : acroForm.getFields()) {
            dumpField(srcDoc, field);
        }
    } catch (Exception e) {
        logerr(e.getMessage());
    }
}

private void dumpField(PDDocument srcDoc, PDField srcField) throws IOException {
    if (srcField instanceof PDNonTerminalField) {
        for (PDField child : ((PDNonTerminalField) srcField).getChildren()) {
            dumpField(srcDoc, child);
        }
    } else if (!(srcField instanceof PDSignatureField)) {
        String fqName = srcField.getFullyQualifiedName();
        // getSimpleName() yields the class name without its package
        String type = srcField.getClass().getSimpleName();
        System.out.printf("fqName=%s type=%s%n", fqName, type);
    }
}

I have made it a habit to dump the objects using pdf-parser (
http://blog.didierstevens.com/programs/pdf-tools/) as follows to further
investigate issues (excerpt showing the dump of object 228):

$ python pdf-parser.py -o 228 ../../ccmig2.pdf

obj 228 0
 Type: /Annot
 Referencing: 685 0 R, 28 0 R, 46 0 R, 686 0 R

  <<
    /AA
      <<
        /K 685 0 R
      >>
    /DA <92F8913CB200CF3C13A363C2D20D>
    /F 4
    /FT /Tx
    /Ff 12582912
    /MK
    /MaxLen 1
    /P 28 0 R
    /Parent 46 0 R
    /Q 1
    /Rect [454.437 769.504 465.482 782.169]
    /Subtype /Widget
    /T <8C8A>
    /Type /Annot
    /V ()
    /AP 686 0 R
  >>

And to get the objects referencing object 228:

$ python pdf-parser.py -r 228 ../../ccmig2.pdf

obj 28 0
 Type: /Page
 Referencing: 101 0 R, 217 0 R, 218 0 R, 219 0 R, 220 0 R, 221 0 R, 222 0 R, 223 0 R, 224 0 R, 225 0 R, 226 0 R, 227 0 R, 228 0 R, 229 0 R, 230 0 R, 231 0 R, 232 0 R, 61 0 R, 60 0 R, 62 0 R, 63 0 R, 64 0 R, 65 0 R, 66 0 R, 67 0 R, 69 0 R, 68 0 R, 70 0 R, 71 0 R, 72 0 R, 73 0 R, 74 0 R, 75 0 R, 76 0 R, 77 0 R, 78 0 R, 79 0 R, 80 0 R, 81 0 R, 82 0 R, 83 0 R, 84 0 R, 86 0 R, 87 0 R, 88 0 R, 89 0 R, 90 0 R, 91 0 R, 92 0 R, 93 0 R, 94 0 R, 95 0 R, 96 0 R, 97 0 R, 85 0 R, 233 0 R, 234 0 R, 235 0 R, 236 0 R, 237 0 R, 238 0 R, 239 0 R, 22 0 R, 240 0 R, 241 0 R, 242 0 R, 243 0 R, 244 0 R, 245 0 R, 246 0 R, 247 0 R, 103 0 R, 248 0 R, 6 0 R, 205 0 R, 206 0 R, 207 0 R, 208 0 R, 209 0 R, 210 0 R, 211 0 R, 213 0 R, 212 0 R

  <<
    /Annots '[101 0 R 217 0 R 218 0 R 219 0 R 220 0 R 221 0 R 222 0 R 223 0 R 224 0 R 225 0 R\n226 0 R 227 0 R 228 0 R 229 0 R 230 0 R 231 0 R 232 0 R 61 0 R 60 0 R 62 0 R\n63 0 R 64 0 R 65 0 R 66 0 R 67 0 R 69 0 R 68 0 R 70 0 R 71 0 R 72 0 R\n73 0 R 74 0 R 75 0 R 76 0 R 77 0 R 78 0 R 79 0 R 80 0 R 81 0 R 82 0 R\n83 0 R 84 0 R 86 0 R 87 0 R 88 0 R 89 0 R 90 0 R 91 0 R 92 0 R 93 0 R\n94 0 R 95 0 R 96 0 R 97 0 R 85 0 R 233 0 R 234 0 R 235 0 R 236 0 R 237 0 R\n238 0 R 239 0 R 22 0 R 240 0 R 241 0 R 242 0 R 243 0 R 244 0 R 245 0 R 246 0 R\n247 0 R 103 0 R]'
    /BleedBox [0.0 0.0 595.276 841.89]
    /Contents 248 0 R
    /CropBox [0.0 0.0 595.276 841.89]
    /MediaBox [0.0 0.0 595.276 841.89]
    /Parent 6 0 R
    /Resources
      <<
        /ExtGState
          <<
            /GS0 205 0 R
            /GS1 206 0 R
            /GS2 207 0 R
            /GS3 208 0 R
          >>
        /Font
          <<
            /C2_0 209 0 R
            /C2_1 210 0 R
            /TT0 211 0 R
            /TT1 213 0 R
            /TT2 212 0 R
          >>
        /ProcSet [/PDF /Text]
      >>
    /Rotate 0
    /Tabs /W
    /TrimBox [0.0 0.0 595.276 841.89]
    /Type /Page
  >>


obj 46 0
 Type:
 Referencing: 218 0 R, 230 0 R, 231 0 R, 232 0 R, 219 0 R, 217 0 R, 220 0 R, 221 0 R, 222 0 R, 223 0 R, 224 0 R, 225 0 R, 226 0 R, 227 0 R, 228 0 R, 229 0 R, 17 0 R

  <<
    /Kids '[218 0 R 230 0 R 231 0 R 232 0 R 219 0 R 217 0 R 220 0 R 221 0 R 222 0 R 223 0 R\n224 0 R 225 0 R 226 0 R 227 0 R 228 0 R 229 0 R]'
    /Parent 17 0 R
    /T <32AB37>
  >>

It would be tremendous if I could get at least the proper object id out of
the PDFields using PDFBox.

Take care
Roberto

Re: How to extract the object id from a form field?

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 18.08.2015 um 20:50 schrieb Roberto Nibali:
> Hi
>
> I'd like to print out the corresponding object id given a specific form
> field. How would I do that with PDFBox programmatically?
>
> Let's for the sake of the argument, assume that the form field is
> represented by the following obj:
>
> obj 218 0
>    <<
>      /DA <2B94B0298F2FD7F81F32C6E22043>
>      /F 4
>      /FT /Tx
>      /Ff 4194304
>      /MK
>      /P 28 0 R
>      /Parent 46 0 R
>      /Rect [159.781 764.53 347.142 777.195]
>      /Subtype /Widget
>      /T <5EB6B730886188AB3D3194B9654C18094C>
>      /Type /Annot
>      /V <45BBBA249C618BBD3974A4BE61501E57181D>
>      /AP 666 0 R
>    >>
>
> If I am going over all PDField entries of a PDF, how would I get to the
> underlying obj number (in the above case 218) from a PDField object?

I haven't tried this myself, but I think you could "synchronise" the 
getChildren() results with the getCOSObject().getItem(COSName.KIDS) 
array, i.e. sort out which indirect type is which item returned from 
getChildren(). The Kids COSArray has indirect objects (= COSObject 
type), as seen here:



COSObject.getObject() returns the dereferenced object.

Tilman