Posted to dev@tika.apache.org by nilesh gorle <ni...@gmail.com> on 2013/02/13 10:50:48 UTC
Fwd: Query On Apache Tika
Hello,
I am using Apache Tika, and it is a really good choice.
However, I need your help with word counting. I used the following command to
get the word count from the metadata:
input -: java -jar tika_cmd.jar --metadata XXX.doc
output -:
Application-Name: Microsoft Office Word
Author: XXX
Character Count: 10329
Company:
Content-Length: 47616
Content-Type: application/msword
Creation-Date: 2012-08-01T14:34:00Z
Edit-Time: 600000000
Last-Modified: 2012-08-01T14:34:00Z
Last-Printed: 2012-08-01T14:32:00Z
Last-Save-Date: 2012-08-01T14:34:00Z
Page-Count: 6
Revision-Number: 2
Template: Normal.dotm
Word-Count: 1812
cp:revision: 2
creator: xXX
date: 2012-08-01T14:34:00Z
dc:creator: XXX
dc:title: MUTUAL CONFIDENTIALITY AGREEMENT
dcterms:created: 2012-08-01T14:34:00Z
dcterms:modified: 2012-08-01T14:34:00Z
extended-properties:Application: Microsoft Office Word
extended-properties:Company:
extended-properties:Template: Normal.dotm
meta:author: XXX
meta:character-count: 10329
meta:creation-date: 2012-08-01T14:34:00Z
meta:last-author: Roxanne Potgieter
meta:page-count: 6
meta:print-date: 2012-08-01T14:32:00Z
meta:save-date: 2012-08-01T14:34:00Z
meta:word-count: 1812
modified: 2012-08-01T14:34:00Z
resourceName: Confidentiality Agreement.doc
title: MUTUAL CONFIDENTIALITY AGREEMENT
xmpTPg:NPages: 6
Now I am using the same command for other documents that were created in
OpenOffice or LibreOffice and saved as doc, docx, xls, xlsx, ppt, or pptx.
For these I am not getting a word count.
input -: java -jar tika_cmd.jar --metadata XXX.doc ( XXX.doc is a
file which was created in OpenOffice or LibreOffice)
output -:
Application-Name: Microsoft Excel
Application-Version: 12.0000
Author: XXX
Content-Length: 15986
Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Creation-Date: 2013-01-30T16:15:54Z
Last-Modified: 2013-02-05T14:13:31Z
Last-Save-Date: 2013-02-05T14:13:31Z
creator: XXX
date: 2013-01-30T16:15:54Z
dc:creator: XXX
dc:publisher: XXX
dcterms:created: 2013-01-30T16:15:54Z
dcterms:modified: 2013-02-05T14:13:31Z
extended-properties:AppVersion: 12.0000
extended-properties:Application: Microsoft Excel
extended-properties:Company: XXX
meta:author: XXX
meta:creation-date: 2013-01-30T16:15:54Z
meta:last-author: XXX
meta:save-date: 2013-02-05T14:13:31Z
modified: 2013-02-05T14:13:31Z
protected: false
publisher: leosys
resourceName: XXX
Please advise why I am not getting the Word-Count field.
---------- Forwarded message ----------
From: nilesh gorle <ni...@gmail.com>
Date: 13 February 2013 11:38
Subject: Query On Apache Tika
To: user@tika.apache.org
--
Thanks & Regards -:
Nilesh G.
nileshgorle@gmail.com
9970056516
RE: Query On Apache Tika
Posted by Uwe Schindler <uw...@thetaphi.de>.
Tika relies on this information being stored in the Word file (it is just metadata somewhere in the header of the file). Tika does *not* count the words itself, so it depends on the application that saved the file. If OpenOffice/LibreOffice does not write the word count, Tika cannot get it.
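If the saving application did not store a word count, one workaround is to count the words yourself from the extracted text. The following is only a minimal sketch under stated assumptions: the plain text has already been extracted (for example with the text output option of the Tika command-line app, if your build has one), the class and method names are made up for illustration, and the definition of a "word" is a simple guess that will not necessarily match the number Microsoft Word stores.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: counts words in already-extracted plain text.
final class WordCountSketch {

    // Assumption: a "word" is a run of letters or digits, optionally joined
    // by single apostrophes or hyphens (e.g. "it's", "well-known").
    private static final Pattern WORD =
            Pattern.compile("[\\p{L}\\p{N}]+(?:['-][\\p{L}\\p{N}]+)*");

    static int countWords(String text) {
        if (text == null) {
            return 0;
        }
        Matcher matcher = WORD.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // In practice the argument would be the text extracted from the document.
        System.out.println(countWords("MUTUAL CONFIDENTIALITY AGREEMENT"));
    }
}
```

Because this counts words in the extracted text, the result depends on what the parser emits (headers, footers, table cells), so treat it as an approximation rather than a replacement for the stored metadata value.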
You can verify that the word count is missing on Microsoft Windows: right-click the Word file, open its properties, and select the "Details" tab. Windows Explorer shows the metadata on this tab; if it does not display a word count, the value is really not in the file.
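The same check can be scripted against the `--metadata` output shown in the question. This is a rough sketch, assuming the output is a list of `Key: value` lines as in the listings above; the class and method names are hypothetical, and the two key names it looks for (`Word-Count` and `meta:word-count`) are simply the ones that appear in the first listing.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Tika: inspects captured "--metadata" output.
final class MetadataCheckSketch {

    // Parses "Key: value" lines into a map. Lines without ": " (empty values
    // such as "Company:" or wrapped continuation lines) are skipped.
    static Map<String, String> parse(String output) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (String line : output.split("\\R")) {
            int separator = line.indexOf(": ");
            if (separator <= 0) {
                continue;
            }
            String key = line.substring(0, separator).trim();
            String value = line.substring(separator + 2).trim();
            metadata.putIfAbsent(key, value);
        }
        return metadata;
    }

    // Returns the stored word count, or empty if the file does not carry one.
    static Optional<Integer> wordCount(Map<String, String> metadata) {
        for (String key : new String[] {"Word-Count", "meta:word-count"}) {
            String value = metadata.get(key);
            if (value != null) {
                try {
                    return Optional.of(Integer.parseInt(value));
                } catch (NumberFormatException e) {
                    // Not a plain integer; try the next candidate key.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String sample = "Page-Count: 6\nWord-Count: 1812\nmeta:word-count: 1812\n";
        System.out.println(wordCount(parse(sample)).isPresent());
    }
}
```

An empty result means the saving application did not write the field, which is the situation described for the OpenOffice/LibreOffice files.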
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de