You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@poi.apache.org by Stuart Owen <st...@manchester.ac.uk> on 2020/09/17 10:28:42 UTC

Perculiar problem reading validations from a XLSX exported from Google Sheets

Hi,

I've encountered an odd problem reading validations from an xlsx 
spreadsheet uploaded to Google Drive, and then exported.
To cut a long story short, my particular use-case involves a pipeline of 
creating an Excel spreadsheet, which is then shared and used via Google 
Sheets to allow scientists to collaborate and annotate biological data 
(using data validations), and is then exported and the annotations read 
via Apache POI.

I have found that Apache POI can detect the validations in the original 
spreadsheet, but not in the exported version. The validations are read 
via XSSFSheet.getDataValidations(). The validations however, can be read 
fine when opening in either Excel or LibreOffice, and if resaved can 
then again be read by Apache POI.

To clarify things, I've put together a test case to demonstrate the 
problem at https://github.com/stuzart/poi-google-export-testcase

I'm not sure if this is a problem with Google export, or Apache POI.

thanks,

Stuart

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Perculiar problem reading validations from a XLSX exported from Google Sheets

Posted by Stuart Owen <st...@manchester.ac.uk>.

On 17/09/2020 16:42, Andreas Reichel wrote:
> On Thu, 2020-09-17 at 16:26 +0100, Stuart Owen wrote:
>> original -
>> https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/original-unpacked/xl/worksheets/sheet1.xml#L17
>>
>> exported -
>> https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/exported-unpacked/xl/worksheets/sheet1.xml#L1003
>>
>> resaved -
>> https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/resaved-unpacked/xl/worksheets/sheet1.xml#L1007
> Stuart,
>
> maybe you would like to open a ticket in the POI's bug tracker with
> your finding.
> In the meantime, you could try to add the missing count="1" attribute
> by yourself? For example:
>
>
> 1) unzip the XLSX
> 2) find the xml node dataValidations, count the child elements and add
> the count="..." accordingly
> 3) zip into the XLSX file again
>
>
> If your files are small enough, than I would read the complete DOM of
> the XML.
> If  your files are large, then maybe GREP/SED could be helpful.
>
>
> Maybe test your/our assumption first with a test case having count="3".
> When editing the XML manually and adding the count="..." attribute
> helps, it was definitely worth to file a ticket.

Yep, OK. I will try this tomorrow and add the information to a ticket.

many thanks,
Stuart
>
>
> Good luck and best regards
> Andreas
>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Perculiar problem reading validations from a XLSX exported from Google Sheets

Posted by Andreas Reichel <an...@manticore-projects.com>.
On Thu, 2020-09-17 at 16:26 +0100, Stuart Owen wrote:
> original - 
> https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/original-unpacked/xl/worksheets/sheet1.xml#L17
> 
> exported - 
> https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/exported-unpacked/xl/worksheets/sheet1.xml#L1003
> 
> resaved - 
> https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/resaved-unpacked/xl/worksheets/sheet1.xml#L1007

Stuart,

maybe you would like to open a ticket in the POI's bug tracker with
your finding.
In the meantime, you could try to add the missing count="1" attribute
by yourself? For example:


1) unzip the XLSX
2) find the xml node dataValidations, count the child elements and add
the count="..." accordingly
3) zip into the XLSX file again


If your files are small enough, than I would read the complete DOM of
the XML.
If  your files are large, then maybe GREP/SED could be helpful.


Maybe test your/our assumption first with a test case having count="3".
When editing the XML manually and adding the count="..." attribute
helps, it was definitely worth to file a ticket.


Good luck and best regards
Andreas



Re: Perculiar problem reading validations from a XLSX exported from Google Sheets

Posted by Andreas Reichel <an...@manticore-projects.com>.
Stuart,

On Thu, 2020-09-17 at 17:07 +0100, Stuart Owen wrote:
> > I'm not familiar with the format enough to know if this is legal or not.
> 
> to add, I've tried validating the xml against its schema, and it appears 
> 
> to be legal.


I do not doubt that. I would suspect that POI is falsely expecting that
attribute and interprets it as Zero when missing, not reading the
actual children.
Might be related to SAX vs. DOM, so they can't count the children when
that attribute is missing.

Best regards
Andreas

Re: Perculiar problem reading validations from a XLSX exported from Google Sheets

Posted by Stuart Owen <st...@manchester.ac.uk>.

On 17/09/2020 16:26, Stuart Owen wrote:
> Hi,
>
> comments below ...
>
> On 17/09/2020 11:48, Andreas Reichel wrote:
>> Stuart,
>>
>> maybe you could provide the simpliest possible test case, e. g.
>> creating your form with only 2 columns and 1 row, add the validation
>> and then:
> I found when creating a sheet as simple as you suggest, POI wasn't 
> able find the validations in the original either. Looking at the 
> unpacked xml for the sheet, the validation is defined in a weird 
> looking block:
> <ext uri="{CCE6A557-97BC-4b89-ADB6-D9C93CAAB3DF}"
> xmlns:x14="http://schemas.microsoft.com/office/spreadsheetml/2009/9/main"><x14:dataValidations 
> count="1"
> xmlns:xm="http://schemas.microsoft.com/office/excel/2006/main"><x14:dataValidation 
> type="list" allowBlank="1" showInputMessage="1"
> showErrorMessage="1"><x14:formula1><xm:f>Sheet2!$A$1</xm:f></x14:formula1><xm:sqref>A1</xm:sqref></x14:dataValidation></x14:dataValidations></ext> 
>
> </extLst>
> (I added the file to the github test case here: 
> https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/original-very-simple.xlsx) 
>
>>
>> 1) extract the XML after your created the template
>> 2) extract the XML after you have exporting it from Google spreadsheets
>> 3) extract the XML after importing and re-exporting it from Libreoffice
>>
>> Then compare the XML files and see what is actually different.
>
> So, sticking with the original files I have unpacked the xlsx and 
> taken a look at the xml differences. There are actually a lot of 
> differences, too much to do a simple diff. However manually 
> inspecting, I've noticed a difference in the dataValidation validation 
> block for the sheet1.xml that looks suspicous. In the problematic 
> exported case, it seems to be missing theĀ  count="1" attribute, 
> whereas the other cases that work do have this.
> i.e
> original - 
> https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/original-unpacked/xl/worksheets/sheet1.xml#L17
> exported - 
> https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/exported-unpacked/xl/worksheets/sheet1.xml#L1003
> resaved - 
> https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/resaved-unpacked/xl/worksheets/sheet1.xml#L1007
>
> I'm not familiar with the format enough to know if this is legal or not.

to add, I've tried validating the xml against its schema, and it appears 
to be legal.

>
>> Also I do not understand yet, why do you want to read the validation in
>> POI again? In my limited understanding, the Google Spreadsheet will
>> validate the user inputs already and assure, you will read a valid
>> file?
> I work on a tool that is based on embedding OWL ontologies (semantic 
> web stuff) into spreadsheets, allowing scientists (mostly biologists) 
> to annotate data in a way that is user friendly and familiar to them, 
> and hides the nasty stuff. We use a trick with data validation to map 
> human readable labels, with their sementic URI identifier, and then be 
> able to export this again. Using data validations allows us to do this 
> without any additional macros or vb scripting. The exported 
> annotations and data can then be fed into a database as a graph, and 
> be queried and reasoned across. You can find out more at 
> https://rightfield.org.uk/
>>
>> Possible work-arounds in the meantime:
>>
>> 1) when opening the Google Spreadsheet file, read the Validations from
>> the original XLSX file in parallel and merge that information
>> (e.g. take the data from the Google Spreadsheet, but the structure and
>> definitions from the original XLSX).
>>
>> 2) alternatively, run it automated through Libreoffice first (it has a
>> command-line/shell only option)
> Yes, option 2 is something I am thinking of as a fallback option.
>
> many thanks,
> Stuart
>>
>> Best regards
>> Andreas
>>
>> On Thu, 2020-09-17 at 11:28 +0100, Stuart Owen wrote:
>>> Hi,
>>> I've encountered an odd problem reading validations from an xlsx
>>> spreadsheet uploaded to Google Drive, and then exported.To cut a long
>>> story short, my particular use-case involves a pipeline of creating
>>> an Excel spreadsheet, which is then shared and used via Google Sheets
>>> to allow scientists to collaborate and annotate biological data
>>> (using data validations), and is then exported and the annotations
>>> read via Apache POI.
>>> I have found that Apache POI can detect the validations in the
>>> original spreadsheet, but not in the exported version. The
>>> validations are read via XSSFSheet.getDataValidations(). The
>>> validations however, can be read fine when opening in either Excel or
>>> LibreOffice, and if resaved can then again be read by Apache POI.
>>> To clarify things, I've put together a test case to demonstrate the
>>> problem at https://github.com/stuzart/poi-google-export-testcase
>>>
>>> I'm not sure if this is a problem with Google export, or Apache POI.
>>> thanks,
>>> Stuart
>>> -------------------------------------------------------------------
>>> --To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
>>> For additional commands, e-mail: user-help@poi.apache.org
>>>
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
> For additional commands, e-mail: user-help@poi.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Perculiar problem reading validations from a XLSX exported from Google Sheets

Posted by Stuart Owen <st...@manchester.ac.uk>.
Hi,

comments below ...

On 17/09/2020 11:48, Andreas Reichel wrote:
> Stuart,
>
> maybe you could provide the simpliest possible test case, e. g.
> creating your form with only 2 columns and 1 row, add the validation
> and then:
I found when creating a sheet as simple as you suggest, POI wasn't able 
find the validations in the original either. Looking at the unpacked xml 
for the sheet, the validation is defined in a weird looking block:
<ext uri="{CCE6A557-97BC-4b89-ADB6-D9C93CAAB3DF}"
xmlns:x14="http://schemas.microsoft.com/office/spreadsheetml/2009/9/main"><x14:dataValidations 
count="1"
xmlns:xm="http://schemas.microsoft.com/office/excel/2006/main"><x14:dataValidation 
type="list" allowBlank="1" showInputMessage="1"
showErrorMessage="1"><x14:formula1><xm:f>Sheet2!$A$1</xm:f></x14:formula1><xm:sqref>A1</xm:sqref></x14:dataValidation></x14:dataValidations></ext>
</extLst>
(I added the file to the github test case here: 
https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/original-very-simple.xlsx) 

>
> 1) extract the XML after your created the template
> 2) extract the XML after you have exporting it from Google spreadsheets
> 3) extract the XML after importing and re-exporting it from Libreoffice
>
> Then compare the XML files and see what is actually different.

So, sticking with the original files I have unpacked the xlsx and taken 
a look at the xml differences. There are actually a lot of differences, 
too much to do a simple diff. However manually inspecting, I've noticed 
a difference in the dataValidation validation block for the sheet1.xml 
that looks suspicous. In the problematic exported case, it seems to be 
missing theĀ  count="1" attribute, whereas the other cases that work do 
have this.
i.e
original - 
https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/original-unpacked/xl/worksheets/sheet1.xml#L17
exported - 
https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/exported-unpacked/xl/worksheets/sheet1.xml#L1003
resaved - 
https://github.com/stuzart/poi-google-export-testcase/blob/master/src/test/resources/resaved-unpacked/xl/worksheets/sheet1.xml#L1007

I'm not familiar with the format enough to know if this is legal or not.
> Also I do not understand yet, why do you want to read the validation in
> POI again? In my limited understanding, the Google Spreadsheet will
> validate the user inputs already and assure, you will read a valid
> file?
I work on a tool that is based on embedding OWL ontologies (semantic web 
stuff) into spreadsheets, allowing scientists (mostly biologists) to 
annotate data in a way that is user friendly and familiar to them, and 
hides the nasty stuff. We use a trick with data validation to map human 
readable labels, with their sementic URI identifier, and then be able to 
export this again. Using data validations allows us to do this without 
any additional macros or vb scripting. The exported annotations and data 
can then be fed into a database as a graph, and be queried and reasoned 
across. You can find out more at https://rightfield.org.uk/
>
> Possible work-arounds in the meantime:
>
> 1) when opening the Google Spreadsheet file, read the Validations from
> the original XLSX file in parallel and merge that information
> (e.g. take the data from the Google Spreadsheet, but the structure and
> definitions from the original XLSX).
>
> 2) alternatively, run it automated through Libreoffice first (it has a
> command-line/shell only option)
Yes, option 2 is something I am thinking of as a fallback option.

many thanks,
Stuart
>
> Best regards
> Andreas
>
> On Thu, 2020-09-17 at 11:28 +0100, Stuart Owen wrote:
>> Hi,
>> I've encountered an odd problem reading validations from an xlsx
>> spreadsheet uploaded to Google Drive, and then exported.To cut a long
>> story short, my particular use-case involves a pipeline of creating
>> an Excel spreadsheet, which is then shared and used via Google Sheets
>> to allow scientists to collaborate and annotate biological data
>> (using data validations), and is then exported and the annotations
>> read via Apache POI.
>> I have found that Apache POI can detect the validations in the
>> original spreadsheet, but not in the exported version. The
>> validations are read via XSSFSheet.getDataValidations(). The
>> validations however, can be read fine when opening in either Excel or
>> LibreOffice, and if resaved can then again be read by Apache POI.
>> To clarify things, I've put together a test case to demonstrate the
>> problem at https://github.com/stuzart/poi-google-export-testcase
>>
>> I'm not sure if this is a problem with Google export, or Apache POI.
>> thanks,
>> Stuart
>> -------------------------------------------------------------------
>> --To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
>> For additional commands, e-mail: user-help@poi.apache.org
>>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Perculiar problem reading validations from a XLSX exported from Google Sheets

Posted by Andreas Reichel <an...@manticore-projects.com>.
Stuart,

maybe you could provide the simpliest possible test case, e. g.
creating your form with only 2 columns and 1 row, add the validation
and then:

1) extract the XML after your created the template
2) extract the XML after you have exporting it from Google spreadsheets
3) extract the XML after importing and re-exporting it from Libreoffice

Then compare the XML files and see what is actually different.
Also I do not understand yet, why do you want to read the validation in
POI again? In my limited understanding, the Google Spreadsheet will
validate the user inputs already and assure, you will read a valid
file?

Possible work-arounds in the meantime:

1) when opening the Google Spreadsheet file, read the Validations from
the original XLSX file in parallel and merge that information
(e.g. take the data from the Google Spreadsheet, but the structure and
definitions from the original XLSX).

2) alternatively, run it automated through Libreoffice first (it has a
command-line/shell only option)

Best regards
Andreas

On Thu, 2020-09-17 at 11:28 +0100, Stuart Owen wrote:
> Hi,
> I've encountered an odd problem reading validations from an xlsx
> spreadsheet uploaded to Google Drive, and then exported.To cut a long
> story short, my particular use-case involves a pipeline of creating
> an Excel spreadsheet, which is then shared and used via Google Sheets
> to allow scientists to collaborate and annotate biological data
> (using data validations), and is then exported and the annotations
> read via Apache POI.
> I have found that Apache POI can detect the validations in the
> original spreadsheet, but not in the exported version. The
> validations are read via XSSFSheet.getDataValidations(). The
> validations however, can be read fine when opening in either Excel or
> LibreOffice, and if resaved can then again be read by Apache POI.
> To clarify things, I've put together a test case to demonstrate the
> problem at https://github.com/stuzart/poi-google-export-testcase
> 
> I'm not sure if this is a problem with Google export, or Apache POI.
> thanks,
> Stuart
> -------------------------------------------------------------------
> --To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
> For additional commands, e-mail: user-help@poi.apache.org
>