Posted to user@tika.apache.org by Sergiy Karpenko <se...@exoplatform.com> on 2010/09/08 09:40:57 UTC

How can I configure Tika to extract dates in single format?

Hello, friends,

I'm using Tika 0.7.

While testing content and metadata extraction with Tika, I ran into two
use cases:
- Dates in metadata (DublinCore.DATE, MSOffice.LAST_SAVED,
MSOffice.CREATION_DATE)
The date is returned as a String, but the format differs between document
types. You are probably already working on this (I saw a Date object in the
metadata in Tika 0.8), but if not, how can I configure Tika to use a single
date format?

- Dates in Excel file content.
Excel has date fields, and Tika extracts them well, but the format is
not acceptable for me.

For example:
I have a field containing 03/10/2005
Tika extracts it as 10/03/2005
But I need "yyyy-MM-dd HH:mm:ss.SSSZ", i.e. 2005-10-03 00:00:00.000+0300

So, the questions are:
- Can I configure Tika to use a single date format?
- Can I configure the Excel parser to extract date/time values in a
specified date format?


Thanks

Re: How can I configure Tika to extract dates in single format?

Posted by Sergiy Karpenko <se...@exoplatform.com>.
Thanks for the quick response. I will write my own Excel parser.


Re: How can I configure Tika to extract dates in single format?

Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 8 Sep 2010, Sergiy Karpenko wrote:
> While testing content and metadata extraction with Tika, I ran into two
> use cases:
> - Dates in metadata (DublinCore.DATE, MSOffice.LAST_SAVED,
> MSOffice.CREATION_DATE)
> The date is returned as a String, but the format differs between document
> types. You are probably already working on this (I saw a Date object in the
> metadata in Tika 0.8), but if not, how can I configure Tika to use a single
> date format?

This has only recently been fixed:
    https://issues.apache.org/jira/browse/TIKA-451

You'll want to upgrade to a recent svn checkout / nightly build to get
these improvements.


> - Dates in Excel file content.
> Excel has date fields, and Tika extracts them well, but the format is
> not acceptable for me.
>
> For example:
> I have a field containing 03/10/2005
> Tika extracts it as 10/03/2005
> But I need "yyyy-MM-dd HH:mm:ss.SSSZ", i.e. 2005-10-03 00:00:00.000+0300

Tika does its best to return the dates in the format that they show up in
Excel.

If you want the dates to be in ISO 8601 format, you have two options:
* Set all your date cells in Excel to be formatted as ISO 8601, rather
  than whatever they currently are
* Write your own Excel parser for Tika, which ignores the date formatting
  set for cells, and always uses ISO 8601

For the latter, you'd probably start with Tika's ExcelExtractor, then in
the NumberRecord switch case, use POI's DateUtil class to detect whether
the cell is a date cell or not. If it is, have the cell value turned into
a date object, and format it as you require. If it isn't, then let the
default "format this like Excel does" logic kick in.
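As a rough illustration of that branch, here is a minimal Java sketch that uses only the standard library. It is not Tika's or POI's code: the class IsoDateCells and its methods are hypothetical names, the isDateCell flag stands in for whatever POI's date-format check reports for the cell, and the +03:00 offset is simply taken from the example in the question. It also assumes the workbook uses Excel's 1900 date system, where a cell stores a date as a day count plus a fraction of a day.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

/** Hypothetical helper for illustration; not part of Tika or POI. */
final class IsoDateCells {
    // Day 0 of Excel's 1900 date system (assumption: serials from 61,
    // i.e. 1900-03-01, onward; earlier values are off by one day).
    private static final LocalDate EXCEL_EPOCH = LocalDate.of(1899, 12, 30);

    // The output pattern requested in the question.
    private static final DateTimeFormatter TARGET =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSZ");

    /** Converts an Excel serial number to text in the target pattern. */
    static String serialToText(double serial, ZoneOffset offset) {
        long days = (long) Math.floor(serial);                    // whole days
        long millis = Math.round((serial - days) * 86_400_000L);  // time of day
        LocalDateTime local = EXCEL_EPOCH.atStartOfDay()
                .plusDays(days)
                .plus(millis, ChronoUnit.MILLIS);
        // Excel stores no time zone; the offset is attached to the
        // wall-clock value without shifting it.
        return local.atOffset(offset).format(TARGET);
    }

    /**
     * Date cells get the fixed pattern; everything else keeps the text
     * that the default "format like Excel" logic produced.
     */
    static String formatNumber(double value, boolean isDateCell,
                               String excelText, ZoneOffset offset) {
        return isDateCell ? serialToText(value, offset) : excelText;
    }

    public static void main(String[] args) {
        ZoneOffset plus3 = ZoneOffset.ofHours(3);
        // 38628 is the serial for 3 October 2005.
        System.out.println(formatNumber(38628.0, true, "10/03/2005", plus3));
        // prints 2005-10-03 00:00:00.000+0300
        System.out.println(formatNumber(42.5, false, "42.5", plus3));
        // prints 42.5
    }
}
```

In a real parser the conversion itself could be left to POI (DateUtil.getJavaDate) rather than hand-rolled; the sketch spells it out only to show what the stored number means.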

For the former, if your users don't want to reformat all their date cells 
for you, you could probably pre-process the file with POI and change all 
the formats.

Nick