Posted to users@pdfbox.apache.org by Sriram Varadharajan <va...@gmail.com> on 2015/10/31 02:37:18 UTC

Strip Data out of PDF and save only skeleton.

We are using PDFBox to process PDFs that contain sensitive data. Currently
we don't store these PDFs (even after encrypting) due to security compliance.
If there is a way to strip the data out of a PDF, we can save the file
and use it for analytical purposes.

The question is: does PDFBox or any other utility out there give the ability
to blank out all the data in the PDF and save just the skeleton?
Please share any custom solutions or ideas, if any!

Thanks

Re: Strip Data out of PDF and save only skeleton.

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.
We have used PDFBox (through our extension PDF2SVG) to extract vectors and
overlays in scientific images and, as part of this, to find "questionable
practice". As an example, here is a scientific graph:
http://www.slideshare.net/petermurrayrust/contentmining-in-neuroscience
(slides 22, 23, 24). Slide 22 shows an acceptable spectrum, but analysis of
the vectors and layers in slide 23 shows a white square, which has been used
to obscure an impurity peak. The student involved confessed and has probably
ruined their career.

This may even have been introduced *before* creating the PDF - if rich
graphical formats are imported into PDF they often preserve all the vectors
and layers. The positive side of this is that we can often use this to
extract high quality data from PDFs, as long as the images are not
mutilated into bitmaps. An example of this is the reconstruction of high
quality astronomical data from graphs: see slides 34/36/37/38.

If anyone is interested in data extraction from graphs using PDF2SVG,
contact me off-list. It's alpha, but it may be useful for those who are happy
to do some hacking.

On Sat, Oct 31, 2015 at 12:02 PM, Tilman Hausherr <TH...@t-online.de>
wrote:

> On 31.10.2015 at 04:05, Sriram Varadharajan wrote:
>
>> Is there any other alternative, like overlaying an opaque rectangle on top
>> of the rectangle box that contains the data? I know the coordinates, as I
>> use them to extract the data from the PDF in the first place.
>>
>> I am also OK with filling the rectangles with dark colors. In the end I
>> need only the borders and no data.
>>
>
> Heh heh...:
>
> http://news.bbc.co.uk/2/hi/europe/4504589.stm
>
> Tilman

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: Strip Data out of PDF and save only skeleton.

Posted by Tilman Hausherr <TH...@t-online.de>.
On 31.10.2015 at 04:05, Sriram Varadharajan wrote:
> Is there any other alternative, like overlaying an opaque rectangle on top
> of the rectangle box that contains the data? I know the coordinates, as I use
> them to extract the data from the PDF in the first place.
>
> I am also OK with filling the rectangles with dark colors. In the end I need
> only the borders and no data.

Heh heh...:

http://news.bbc.co.uk/2/hi/europe/4504589.stm

Tilman





---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: Strip Data out of PDF and save only skeleton.

Posted by John Hewson <jo...@jahewson.com>.
> On 30 Oct 2015, at 20:05, Sriram Varadharajan <va...@gmail.com> wrote:
> 
> Is there any other alternative, like overlaying an opaque rectangle on top
> of the rectangle box that contains the data?

No. Your sensitive data will of course still be in the document even if you cover it with an opaque square!

Do not go down this road.

-- John
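
To illustrate the point above, here is a minimal sketch in plain Java. It
does not use PDFBox and does not parse a real PDF: the content stream is a
hypothetical, uncompressed example, and the class and method names
(OverlayLeak, coverWithRectangle, stillContains) are made up for this
illustration. It only shows that appending the operators for a filled black
rectangle (q, g, re, f, Q) adds drawing instructions after the text-showing
operator (Tj) and leaves the original string in place.

```java
public class OverlayLeak {
    // Hypothetical, uncompressed page content stream with one text-showing
    // operation. Real PDFs usually compress content streams, but extraction
    // tools decompress them before reading the text.
    static final String ORIGINAL =
            "BT /F1 12 Tf 72 700 Td (ACCOUNT 12345678) Tj ET\n";

    // "Covers" a region by appending a filled black rectangle.
    // Nothing that is already in the stream is modified or removed.
    static String coverWithRectangle(String content, int x, int y, int w, int h) {
        return content + "q 0 g " + x + " " + y + " " + w + " " + h + " re f Q\n";
    }

    // Reports whether the sensitive string is still present in the stream.
    static boolean stillContains(String content, String secret) {
        return content.contains(secret);
    }

    public static void main(String[] args) {
        String covered = coverWithRectangle(ORIGINAL, 70, 695, 160, 18);
        // The rectangle hides the text on screen, but the data is still there.
        System.out.println(stillContains(covered, "12345678")); // prints "true"
    }
}
```

Anyone who extracts text from such a file, or simply deletes the rectangle,
gets the covered data back.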

> I know the coordinates, as I use them
> to extract the data from the PDF in the first place.
>
> I am also OK with filling the rectangles with dark colors. In the end I need
> only the borders and no data.


Re: Strip Data out of PDF and save only skeleton.

Posted by Sriram Varadharajan <va...@gmail.com>.
Is there any other alternative, like overlaying an opaque rectangle on top
of the rectangle box that contains the data? I know the coordinates, as I use
them to extract the data from the PDF in the first place.

I am also OK with filling the rectangles with dark colors. In the end I need
only the borders and no data.



On Fri, Oct 30, 2015 at 7:11 PM, John Hewson <jo...@jahewson.com> wrote:

> This is a very hard thing to get right, especially if you have compliance
> needs.
> There are just so many ways that sensitive data could remain embedded in
> the resulting document.
>
> If you want my advice, don’t attempt this.
>
> — John

Re: Strip Data out of PDF and save only skeleton.

Posted by John Hewson <jo...@jahewson.com>.
This is a very hard thing to get right, especially if you have compliance needs.
There are just so many ways that sensitive data could remain embedded in
the resulting document.

If you want my advice, don’t attempt this.

— John

> On 30 Oct 2015, at 18:37, Sriram Varadharajan <va...@gmail.com> wrote:
> 
> We are using PDFBox to process PDFs that contain sensitive data. Currently
> we don't store these PDFs (even after encrypting) due to security compliance.
> If there is a way to strip the data out of a PDF, we can save the file
> and use it for analytical purposes.
>
> The question is: does PDFBox or any other utility out there give the ability
> to blank out all the data in the PDF and save just the skeleton?
> Please share any custom solutions or ideas, if any!
> 
> Thanks

