Posted to solr-user@lucene.apache.org by Peter Wolanin <pe...@acquia.com> on 2009/07/11 23:39:03 UTC

Select tika output for extract-only?

I had been assuming that I could choose among possible tika output
formats when using the extracting request handler in extract-only mode
as if from the CLI with the tika jar:

    -x or --xml        Output XHTML content (default)
    -h or --html       Output HTML content
    -t or --text       Output plain text content
    -m or --metadata   Output only metadata

However, looking at the docs and source, it seems that only the xml
option is available (hard-coded) in ExtractingDocumentLoader:

serializer = new XMLSerializer(writer, new OutputFormat("XML", "UTF-8", true));
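For illustration only, here is a minimal sketch of what a selectable output format could look like. It is not the Solr code or any proposed patch: it uses the JDK's own SAX TransformerHandler in place of XMLSerializer, and the class name OutputFormatSketch and the helpers newSerializer and render are hypothetical names made up for this example. The idea is that the serializer is built from a requested output method ("xml", "html" or "text") instead of being hard-coded.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.AttributesImpl;

public class OutputFormatSketch {

    // Hypothetical helper: build a SAX serializer for the requested
    // output method ("xml", "html" or "text") that writes into 'out'.
    static ContentHandler newSerializer(String method, StringWriter out) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, method);
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        handler.setResult(new StreamResult(out));
        return handler;
    }

    // Hypothetical helper: feed a tiny, hand-written stand-in for the SAX
    // events a parser would emit, and return the serialized result.
    static String render(String method) throws Exception {
        StringWriter out = new StringWriter();
        ContentHandler handler = newSerializer(method, out);
        AttributesImpl noAttributes = new AttributesImpl();
        char[] text = "Hello from the document body".toCharArray();

        handler.startDocument();
        handler.startElement("", "p", "p", noAttributes);
        handler.characters(text, 0, text.length);
        handler.endElement("", "p", "p");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Print the same SAX events serialized with each output method.
        for (String method : new String[] {"xml", "html", "text"}) {
            System.out.println(method + ": " + render(method));
        }
    }
}
```

With the "text" method the markup is dropped and only the character data is written, which is roughly the plain text behaviour asked about above. A metadata-only mode would need separate handling and is not covered by this sketch.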

In addition, it seems that the metadata is always appended to the response.

Are there any open issues relating to this, or opinions on whether
adding additional flexibility to the response format would be of
interest for 1.4?

Thanks,

Peter

-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia, Inc.
peter.wolanin@acquia.com

Re: Select tika output for extract-only?

Posted by Peter Wolanin <pe...@acquia.com>.

Ok, thanks. I played with it enough to get plain text out at least,
but I'll wait for the resolution of SOLR-284.

-Peter

On Sun, Jul 12, 2009 at 9:20 AM, Yonik Seeley<yo...@lucidimagination.com> wrote:
> Peter, I'm hacking up Solr Cell right now, trying to simplify the
> parameters and fix some bugs (see SOLR-284).
> A quick patch to specify the output format should make it into 1.4 -
> but you may want to wait until I finish.
>
> -Yonik
> http://www.lucidimagination.com

Re: Select tika output for extract-only?

Posted by Yonik Seeley <yo...@lucidimagination.com>.

Peter, I'm hacking up Solr Cell right now, trying to simplify the
parameters and fix some bugs (see SOLR-284).
A quick patch to specify the output format should make it into 1.4 -
but you may want to wait until I finish.

-Yonik
http://www.lucidimagination.com


Re: Select tika output for extract-only?

Posted by Grant Ingersoll <gs...@apache.org>.

On Jul 11, 2009, at 5:39 PM, Peter Wolanin wrote:

> I had been assuming that I could choose among possible tika output
> formats when using the extracting request handler in extract-only mode
> as if from the CLI with the tika jar:
>
>    -x or --xml        Output XHTML content (default)
>    -h or --html       Output HTML content
>    -t or --text       Output plain text content
>    -m or --metadata   Output only metadata
>
> However, looking at the docs and source, it seems that only the xml
> option is available (hard-coded) in ExtractingDocumentLoader:
>
> serializer = new XMLSerializer(writer, new OutputFormat("XML", "UTF-8", true));
>
> In addition, it seems that the metadata is always appended to the response.
>
> Are there any open issues relating to this,

Not that I know of.

> or opinions on whether
> adding additional flexibility to the response format would be of
> interest for 1.4?
>

Sure, patches welcome.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search