Posted to user@tika.apache.org by Sergiy Karpenko <se...@exoplatform.com> on 2010/09/08 09:40:57 UTC
How can I configure Tika to extract dates in single format?
Hello, friends
I'm using Tika 0.7.
While testing content and metadata extraction with Tika, I ran into the following use cases:
- Date in metadata (DublinCore.DATE, MSOffice.LAST_SAVED,
MSOffice.CREATION_DATE)
Dates are returned as Strings, but the format differs between document
types. You are probably already working on this problem (I saw a Date object in
the metadata in Tika 0.8), but if not, how can I configure Tika to use a single
date format?
- Date in Excel file content.
As we know, Excel has date fields, and Tika extracts them well. But the format is
not acceptable for me.
For example:
I have a field 03/10/2005
Tika extracts it as 10/03/2005
But I need "yyyy-MM-dd HH:mm:ss.SSSZ" - 2005-10-03 00:00:00.000+0300
So, the questions are:
- Can I configure Tika to use a single date format?
- Can I configure the Excel parser to extract date/time objects with a specified
date format?
Thanks
Re: How can I configure Tika to extract dates in single format?
Posted by Sergiy Karpenko <se...@exoplatform.com>.
Thanks for the quick response. I will write my own Excel parser.
Re: How can I configure Tika to extract dates in single format?
Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 8 Sep 2010, Sergiy Karpenko wrote:
> When I test content and metadata extraction by Tika, I met next usecases:
> - Date in metadata (DublinCore.DATE, MSOffice.LAST_SAVED,
> MSOffice.CREATION_DATE)
> Date returned as String, but format is different for different document
> types. Probably you already working on this problem (I saw Date object in
> metadata in Tika 0.8) but if not, how can I configure Tika to use single
> Date format?
This has only recently been fixed:
https://issues.apache.org/jira/browse/TIKA-451
You'll want to upgrade to a recent svn checkout / nightly build to get
these improvements
> - Date in Excel file content.
> As we know, Excel have Date fields, and Tika extract it well. But format is
> not acceptable for me.
>
> For example
> I have field 03/10/2005
> Tika extracts it as 10/03/2005
> But, I need "yyyy-MM-dd HH:mm:ss.SSSZ" - 2005-10-03 00:00:00.000+0300
Tika does its best to return the dates in the format that they show up in
Excel.
If you want the dates to be in ISO 8601 format, you have two options:
* Set all your date cells in Excel to be formatted as ISO 8601, rather
than whatever they currently are
* Write your own Excel parser for Tika, which ignores the date formatting
set for cells, and always uses ISO 8601
For the latter, you'd probably start with Tika's ExcelExtractor, then in
the NumberRecord switch case, use POI's DateUtil class to detect if the
cell is a date cell or not. If it is, have the cell value turned into a
date object, and format it as you require. If it isn't, then let the
default "format this like Excel does" logic kick in.
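A minimal sketch of the "turn the cell value into a date object and format
it" step, using only the JDK so it runs without POI on the classpath. The
class name ExcelDateFormatter and its methods are hypothetical and are not
part of Tika or POI. It assumes the workbook uses the 1900 date system and
that the serial number is 61 or greater (on or after 1900-03-01). In a real
parser you would more likely call POI's DateUtil.isADateFormat(int, String)
to detect a date cell and DateUtil.getJavaDate(double) for the conversion,
rather than the hand-rolled arithmetic shown here.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of Tika or POI.
class ExcelDateFormatter {
    // The output pattern requested in the original question.
    static final String PATTERN = "yyyy-MM-dd HH:mm:ss.SSSZ";

    // Converts an Excel serial date (1900 date system, serial >= 61) to a
    // java.util.Date, treating the serial as local time in the given zone.
    static Date toDate(double serial, TimeZone zone) {
        int wholeDays = (int) Math.floor(serial);
        int millisOfDay = (int) Math.round((serial - wholeDays) * 86400000L);
        Calendar cal = new GregorianCalendar(zone, Locale.ROOT);
        cal.clear();
        // For serials after Excel's phantom 1900-02-29, day 0 is 1899-12-30.
        cal.set(1899, Calendar.DECEMBER, 30, 0, 0, 0);
        cal.add(Calendar.DAY_OF_MONTH, wholeDays);
        cal.add(Calendar.MILLISECOND, millisOfDay);
        return cal.getTime();
    }

    // Formats the serial date with the fixed pattern, ignoring whatever
    // display format the cell carries in the workbook.
    static String format(double serial, TimeZone zone) {
        SimpleDateFormat fmt = new SimpleDateFormat(PATTERN, Locale.ROOT);
        fmt.setTimeZone(zone);
        return fmt.format(toDate(serial, zone));
    }

    public static void main(String[] args) {
        // 38628 is the serial number for 2005-10-03 in the 1900 date system.
        TimeZone zone = TimeZone.getTimeZone("GMT+03:00");
        System.out.println(format(38628.0, zone)); // 2005-10-03 00:00:00.000+0300
    }
}
```

The time zone is passed in explicitly because an Excel serial date carries
no zone of its own. The "+0300" in the desired output has to come from the
caller.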
For the former, if your users don't want to reformat all their date cells
for you, you could probably pre-process the file with POI and change all
the formats.
Nick