Posted to dev@tika.apache.org by "Paul Borgermans (JIRA)" <ji...@apache.org> on 2008/12/06 17:07:44 UTC

[jira] Created: (TIKA-179) Tika stand alone CLI --text output mostly not working, other output formats are fine

Tika stand alone CLI --text output mostly not working, other output formats are fine
------------------------------------------------------------------------------------

                 Key: TIKA-179
                 URL: https://issues.apache.org/jira/browse/TIKA-179
             Project: Tika
          Issue Type: Bug
          Components: cli
    Affects Versions: 0.2, 0.3
         Environment: Java 1.5 (also tried Java 1.6). OS used:  Mac OS X, Linux (CentOS)
            Reporter: Paul Borgermans


When running the Tika standalone jar in CLI mode after mvn install, the plain text output option (-t or --text) produces no output for most of my test documents (pdf, doc, ppt, odt). With the other options (xml, html, metadata), the output is correct. Activating debug mode (-v) does not produce additional info either.

When using the GUI, dragging and dropping does produce the expected results, including in the plain text tab/window.

I have rebuilt Tika from svn many times in the past 2 months (clearing the .m2 directory every time; latest revision tried: 724002), and the CLI --text result is always the same: usually missing output.

For now, I use the -x output option chained to html2txt as a workaround, but would prefer to use just tika to convert to plain text (which is used for further indexing in Solr).

Thanks




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-179) Tika stand alone CLI --text output mostly not working, other output formats are fine

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680420#action_12680420 ] 

Michael McCandless commented on TIKA-179:
-----------------------------------------

I'm still seeing this issue on the 0.3 RC1.  I'm on Debian Linux, and when I run it on a trivial PDF doc, like this:

{code}
cat PDF.pdf | java -cp target/tika-0.3-standalone.jar org.apache.tika.cli.TikaCLI --text
{code}

I get no output...

But if I leave off the --text, I do get output.  Same with --html, --xml and --metadata.  My CLASSPATH is otherwise empty.  Not sure what's going on...



[jira] Resolved: (TIKA-179) Tika stand alone CLI --text output mostly not working, other output formats are fine

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-179.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.3
         Assignee: Jukka Zitting

Fixed in revision 724176. Thanks for reporting this!



[jira] Commented: (TIKA-179) Tika stand alone CLI --text output mostly not working, other output formats are fine

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680450#action_12680450 ] 

Michael McCandless commented on TIKA-179:
-----------------------------------------

OK, phew, just user error.  Sorry for the noise.

Though, it's odd that without any args, it works.  Oh I see, the code simply falls back to "-" when there are no args.
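The behaviour described above can be modelled with a small sketch. This is not the TikaCLI source; the class and method names are hypothetical, and the option handling is an assumption made only to show why "no args" reads stdin while "--text" alone processes nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the argument handling discussed in this thread.
public class ArgFallbackSketch {

    /** With no arguments at all, behave as if "-" (read stdin) had been given. */
    static List<String> effectiveArgs(String[] args) {
        if (args.length == 0) {
            return Arrays.asList("-");
        }
        return Arrays.asList(args);
    }

    /** Inputs are "-" or anything that does not look like an option. */
    static List<String> inputs(List<String> args) {
        List<String> result = new ArrayList<>();
        for (String arg : args) {
            if (arg.equals("-") || !arg.startsWith("-")) {
                result.add(arg);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // No args: stdin is processed. Only "--text": nothing is processed.
        System.out.println(inputs(effectiveArgs(new String[0])));
        System.out.println(inputs(effectiveArgs(new String[] {"--text"})));
        System.out.println(inputs(effectiveArgs(new String[] {"--text", "-"})));
    }
}
```

Under this model, an option by itself suppresses the fallback, so piped input is never read unless "-" is passed explicitly.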

I agree this can wait until 0.4.  Thanks!



[jira] Commented: (TIKA-179) Tika stand alone CLI --text output mostly not working, other output formats are fine

Posted by "Jonathan Koren (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680425#action_12680425 ] 

Jonathan Koren commented on TIKA-179:
-------------------------------------

@Michael

You're not running the command correctly.  
See java -jar tika-0.3-standalone.jar --help   




[jira] Commented: (TIKA-179) Tika stand alone CLI --text output mostly not working, other output formats are fine

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654221#action_12654221 ] 

Jukka Zitting commented on TIKA-179:
------------------------------------

Yeah, you're right! The --text option only seems to work on large documents and even then the tail end of the text gets clipped. Looks very much like a buffering issue.
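For illustration, here is a minimal Java sketch of how an unflushed buffered writer produces exactly these symptoms. It is not Tika's code, and the thread does not say what the actual fix in the CLI was; the class and method names are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demonstration of output lost in an unflushed buffer.
public class BufferingSketch {

    /** Writes text through a BufferedWriter and returns what reached the sink. */
    static String write(String text, boolean flush) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        Writer writer = new BufferedWriter(
                new OutputStreamWriter(sink, StandardCharsets.UTF_8));
        try {
            writer.write(text);
            if (flush) {
                writer.flush(); // pushes buffered characters down to the sink
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(sink.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Builds a string of the given length for the "large document" case. */
    static String filler(int length) {
        char[] chars = new char[length];
        Arrays.fill(chars, 'x');
        return new String(chars);
    }

    public static void main(String[] args) {
        String large = filler(100000);
        // Small text never leaves the buffer; large text arrives with its tail missing.
        System.out.println("small, no flush: " + write("short text", false).length());
        System.out.println("large, no flush: " + write(large, false).length());
        System.out.println("large, flushed:  " + write(large, true).length());
    }
}
```

A small document fits entirely in the buffer and is never written, while a large one is written only up to the last full buffer, which matches "works on large documents, but the tail is clipped".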



[jira] Commented: (TIKA-179) Tika stand alone CLI --text output mostly not working, other output formats are fine

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680431#action_12680431 ] 

Jukka Zitting commented on TIKA-179:
------------------------------------

Also, even the following works (as documented in --help):

    cat PDF.pdf | java -cp target/tika-0.3-standalone.jar org.apache.tika.cli.TikaCLI --text -

Let's file an improvement request to make the pipe mode work also without the "-".
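One way such an improvement could look is sketched below. This is only an assumption about a possible approach, not the change that was eventually made; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: default to stdin whenever no input argument is present.
public class PipeDefaultSketch {

    /** Appends "-" when the arguments contain options only (or nothing at all). */
    static List<String> withDefaultInput(String[] args) {
        List<String> result = new ArrayList<>(Arrays.asList(args));
        for (String arg : args) {
            // "-" or any non-option argument counts as an explicit input.
            if (arg.equals("-") || !arg.startsWith("-")) {
                return result;
            }
        }
        result.add("-");
        return result;
    }

    public static void main(String[] args) {
        System.out.println(withDefaultInput(new String[] {"--text"}));
        System.out.println(withDefaultInput(new String[] {"--text", "PDF.pdf"}));
    }
}
```

With this rule, "--text" alone would be treated like "--text -", so the piped form would no longer need the trailing dash.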



[jira] Commented: (TIKA-179) Tika stand alone CLI --text output mostly not working, other output formats are fine

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680433#action_12680433 ] 

Jukka Zitting commented on TIKA-179:
------------------------------------

See TIKA-206. We can address it in Tika 0.4.



[jira] Commented: (TIKA-179) Tika stand alone CLI --text output mostly not working, other output formats are fine

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680430#action_12680430 ] 

Jukka Zitting commented on TIKA-179:
------------------------------------

Well, it should also work when the input is piped. I was able to reproduce this: the --text option works when the input is given as a file argument, but not when it's piped.

    java -cp target/tika-0.3-standalone.jar org.apache.tika.cli.TikaCLI --text PDF.pdf



