Posted to dev@tika.apache.org by nilesh gorle <ni...@gmail.com> on 2013/02/13 10:50:48 UTC

Fwd: Query On Apache Tika

Hello,

I am using Apache Tika, and it is a really good choice.
However, I need your help with word counting. I used the following command to
get the Word-Count from the metadata:

    input -: java -jar tika_cmd.jar --metadata XXX.doc

    output -:

Application-Name: Microsoft Office Word
Author: XXX
Character Count: 10329
Company:
Content-Length: 47616
Content-Type: application/msword
Creation-Date: 2012-08-01T14:34:00Z
Edit-Time: 600000000
Last-Modified: 2012-08-01T14:34:00Z
Last-Printed: 2012-08-01T14:32:00Z
Last-Save-Date: 2012-08-01T14:34:00Z
Page-Count: 6
Revision-Number: 2
Template: Normal.dotm
Word-Count: 1812
cp:revision: 2
creator: xXX
date: 2012-08-01T14:34:00Z
dc:creator: XXX
dc:title: MUTUAL CONFIDENTIALITY AGREEMENT
dcterms:created: 2012-08-01T14:34:00Z
dcterms:modified: 2012-08-01T14:34:00Z
extended-properties:Application: Microsoft Office Word
extended-properties:Company:
extended-properties:Template: Normal.dotm
meta:author: XXX
meta:character-count: 10329
meta:creation-date: 2012-08-01T14:34:00Z
meta:last-author: Roxanne Potgieter
meta:page-count: 6
meta:print-date: 2012-08-01T14:32:00Z
meta:save-date: 2012-08-01T14:34:00Z
meta:word-count: 1812
modified: 2012-08-01T14:34:00Z
resourceName: Confidentiality Agreement.doc
title: MUTUAL CONFIDENTIALITY AGREEMENT
xmpTPg:NPages: 6

Now I am using the same command on other documents that were created in
OpenOffice or LibreOffice and saved as doc, docx, xls, xlsx, ppt, or pptx.
For these I do not get a Word-Count.

      input -: java -jar tika_cmd.jar --metadata XXX.doc      (XXX.doc is a
file that was created in OpenOffice or LibreOffice)

      output -:

Application-Name: Microsoft Excel
Application-Version: 12.0000
Author: XXX
Content-Length: 15986
Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Creation-Date: 2013-01-30T16:15:54Z
Last-Modified: 2013-02-05T14:13:31Z
Last-Save-Date: 2013-02-05T14:13:31Z
creator: XXX
date: 2013-01-30T16:15:54Z
dc:creator: XXX
dc:publisher: XXX
dcterms:created: 2013-01-30T16:15:54Z
dcterms:modified: 2013-02-05T14:13:31Z
extended-properties:AppVersion: 12.0000
extended-properties:Application: Microsoft Excel
extended-properties:Company: XXX
meta:author: XXX
meta:creation-date: 2013-01-30T16:15:54Z
meta:last-author: XXX
meta:save-date: 2013-02-05T14:13:31Z
modified: 2013-02-05T14:13:31Z
protected: false
publisher: leosys
resourceName: XXX

      Please suggest why I am not getting a Word-Count.

---------- Forwarded message ----------
From: nilesh gorle <ni...@gmail.com>
Date: 13 February 2013 11:38
Subject: Query On Apache Tika
To: user@tika.apache.org


-- 
Thanks & Regards -:

Nilesh G.
nileshgorle@gmail.com
9970056516

RE: Query On Apache Tika

Posted by Uwe Schindler <uw...@thetaphi.de>.
Tika relies on this information being present in the Word file (it is just metadata stored somewhere in the header of the file). Tika does *not* count the words itself, so it depends on the application that saved the file. If OpenOffice/LibreOffice does not write it, Tika cannot report it.
You can verify this on Microsoft Windows: right-click the Word file, choose Properties, and select the "Details" tab. Windows Explorer shows the metadata on this tab; if it does not display a word count, the value really is not in the file.
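If a count is needed even when the saving application did not store one, one possible workaround is to fall back to counting words in the extracted text. The following is only a minimal sketch under stated assumptions, not something Tika provides: the class name WordCountFallback and its methods are hypothetical, the metadata is assumed to be the "key: value" text printed by --metadata as shown in this thread, and a "word" is assumed to be any whitespace-separated token, so the result will usually differ somewhat from the number Word itself would store.

```java
import java.util.OptionalInt;

// Hypothetical helper (not part of Tika): prefer the stored Word-Count,
// otherwise count whitespace-separated tokens in the extracted text.
public class WordCountFallback {

    // Metadata keys seen in the --metadata output quoted in this thread.
    private static final String[] KEYS = {"Word-Count", "meta:word-count"};

    // Returns the stored word count if a matching "key: value" line exists.
    static OptionalInt metadataWordCount(String metadataOutput) {
        for (String line : metadataOutput.split("\\R")) {
            String trimmed = line.trim();
            for (String key : KEYS) {
                if (trimmed.startsWith(key + ":")) {
                    String value = trimmed.substring(key.length() + 1).trim();
                    try {
                        return OptionalInt.of(Integer.parseInt(value));
                    } catch (NumberFormatException e) {
                        // Malformed value: keep looking at the remaining lines.
                    }
                }
            }
        }
        return OptionalInt.empty();
    }

    // Assumption: a word is any run of non-whitespace characters.
    static int countWords(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    // Uses the metadata value when present, else counts words in the text.
    static int wordCount(String metadataOutput, String extractedText) {
        return metadataWordCount(metadataOutput)
                .orElseGet(() -> countWords(extractedText));
    }

    public static void main(String[] args) {
        String withCount = "Author: XXX\nWord-Count: 1812\n";
        String withoutCount = "Author: XXX\nprotected: false\n";
        String text = "MUTUAL CONFIDENTIALITY AGREEMENT";
        System.out.println(wordCount(withCount, text));
        System.out.println(wordCount(withoutCount, text));
    }
}
```

The text passed to countWords could come from the plain-text output of the command-line tool, for example java -jar tika_cmd.jar --text XXX.doc, assuming the jar accepts the same --text option as the standard Tika application.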

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

