Posted to user@tika.apache.org by CL <ba...@gmail.com> on 2013/03/05 16:26:36 UTC

How to hide some Excel content

Hi,
I just started using Tika (1.3) for converting Excel (OOXML) content to
HTML. Looking good. Two things I'm wondering...

1) Is there a way to convert only a specific worksheet of a workbook that
has multiple worksheets?

2) I have hidden columns and worksheets in the workbook, but they become
visible in the output HTML. Is there a way to keep these out of the output?

Thanks!!

Re: How to hide some Excel content

Posted by CL <ba...@gmail.com>.
>
> There are several examples in Apache POI, and the code behind Tika is open
> source. Skipping certain sheets should be fairly easy; other things will
> depend on how "hiding" ends up written into the file.
>
>
OK. I was just wondering if there was a built-in way to specify a custom
handler that could do something like this, to avoid compiling a custom
version of the project.


>> Seems like that should be the default anyway, doesn't it?
>
> A lot of people use Tika to feed indexing systems, and want all the text
> they can get!
>

I see. Good point.

Thanks

Re: How to hide some Excel content

Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 5 Mar 2013, CL wrote:
> Thanks for your feedback. I may go that route if I have to, but I'm not 
> finding any good converters. I was hoping to avoid writing my own, which 
> is why I'm trying Tika. Do you know if there's a relatively simple way 
> to extend a Tika class to filter out hidden content?

There are several examples in Apache POI, and the code behind Tika is open
source. Skipping certain sheets should be fairly easy; other things will
depend on how "hiding" ends up written into the file.

> Seems like that should be the default anyway, doesn't it?

A lot of people use Tika to feed indexing systems, and want all the text 
they can get!

Nick
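
To illustrate the point about how "hiding" ends up written into the file: in
an OOXML workbook, each sheet entry in the workbook part (typically
xl/workbook.xml) can carry a state attribute of "hidden" or "veryHidden", and
a worksheet part can mark column ranges with <col min=".." max=".."
hidden="1"/>. The sketch below reads those two flags using only the JDK XML
parser. HiddenFlags and its method names are hypothetical helpers invented
for this example; they are not part of Tika or POI, and the code assumes the
XML parts have already been pulled out of the .xlsx zip as strings.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not a Tika or POI class.
final class HiddenFlags {

    // Names of sheets in workbook.xml whose "state" is absent or "visible".
    static List<String> visibleSheets(String workbookXml) {
        List<String> names = new ArrayList<>();
        NodeList sheets = parse(workbookXml).getElementsByTagNameNS("*", "sheet");
        for (int i = 0; i < sheets.getLength(); i++) {
            Element sheet = (Element) sheets.item(i);
            String state = sheet.getAttribute("state"); // empty string when absent
            if (state.isEmpty() || state.equals("visible")) {
                names.add(sheet.getAttribute("name"));
            }
        }
        return names;
    }

    // 1-based column numbers covered by <col ... hidden="1"/> ranges in a worksheet part.
    static List<Integer> hiddenColumns(String sheetXml) {
        List<Integer> columns = new ArrayList<>();
        NodeList cols = parse(sheetXml).getElementsByTagNameNS("*", "col");
        for (int i = 0; i < cols.getLength(); i++) {
            Element col = (Element) cols.item(i);
            String hidden = col.getAttribute("hidden");
            if (hidden.equals("1") || hidden.equals("true")) {
                int min = Integer.parseInt(col.getAttribute("min"));
                int max = Integer.parseInt(col.getAttribute("max"));
                for (int c = min; c <= max; c++) {
                    columns.add(c);
                }
            }
        }
        return columns;
    }

    // Parses a namespace-aware DOM and returns the root element; DOCTYPEs are rejected.
    private static Element parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML part", e);
        }
    }
}
```

Calling HiddenFlags.visibleSheets(...) on the workbook part gives the sheet
names that are safe to convert, and HiddenFlags.hiddenColumns(...) on a
worksheet part gives the columns to drop. When going through Apache POI
instead, Workbook.isSheetHidden(int), Workbook.isSheetVeryHidden(int) and
Sheet.isColumnHidden(int) report the same flags without touching the XML.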

Re: How to hide some Excel content

Posted by CL <ba...@gmail.com>.
Thanks for your feedback. I may go that route if I have to, but I'm not
finding any good converters. I was hoping to avoid writing my own, which is
why I'm trying Tika. Do you know if there's a relatively simple way to
extend a Tika class to filter out hidden content?
Seems like that should be the default anyway, doesn't it?

Thanks


On Tue, Mar 5, 2013 at 8:32 AM, Nick Burch <ap...@gagravarr.org> wrote:

> On Tue, 5 Mar 2013, CL wrote:
>
>> 1) Is there a way to convert only a specific worksheet of a workbook that
>> has multiple worksheets?
>>
>> 2) I have hidden columns and worksheets in the workbook, but they become
>> visible in the output HTML. Is there a way to keep these out of the
>> output?
>>
>
> If you have quite specific requirements (which it sounds like you do), and
> only need to work with one file format, you're probably better off calling
> Apache POI directly. That way, you can have full control over everything.
>
> Nick
>

Re: How to hide some Excel content

Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 5 Mar 2013, CL wrote:
> 1) Is there a way to convert only a specific worksheet of a workbook that
> has multiple worksheets?
>
> 2) I have hidden columns and worksheets in the workbook, but they become
> visible in the output HTML. Is there a way to keep these out of the output?

If you have quite specific requirements (which it sounds like you do), and
only need to work with one file format, you're probably better off calling 
Apache POI directly. That way, you can have full control over everything.

Nick