Posted to solr-user@lucene.apache.org by Erik Hatcher <er...@ehatchersolutions.com> on 2009/02/10 04:17:50 UTC

Solr Cell (ExtractingRequestHandler) and plain text files

One other person has reported this to me off-list, and I just  
encountered it myself.  ExtractingRequestHandler does not handle plain  
text files properly (no text is extracted).  Here's an example:

curl "http://localhost:8982/solr/update/extract?ext.ignore.und.fl=true&wt=ruby&stream.file=/Users/erikhatcher/dev/cvreg/docs/JoeProgrammer.txt&ext.idx.attr=false&ext.literal.file_type_facet=txt&ext.def.fl=text_t&ext.literal.type_s=Resume&ext.literal.id=Resume:20&ext.extract.only=true&ext.resource.name=foo.txt"

{'responseHeader'=>{'status'=>0,'QTime'=>1},'JoeProgrammer.txt'=>'<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
     <head>
         <title/>
     </head>
     <body/>
</html>
'}

Is this likely something simple, such as a Tika config setting or a  
missing JAR?

Anyone have the magic incantation?

Thanks,
	Erik
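
For anyone scripting a check for the symptom above, here is a minimal
sketch. The function name is_empty_extraction is hypothetical, and the
check assumes an empty extraction always shows up as a self-closed body
element, either raw (as in the wt=ruby output) or XML-escaped (as in the
default XML response).

```shell
#!/bin/sh
# Hypothetical helper, not part of Solr: succeeds (exit 0) when an
# extract-only response contains an empty <body/> element.
# Assumption: the response text is passed in as the first argument.
is_empty_extraction() {
  # Match the raw form (<body/>) or the XML-escaped form (&lt;body/&gt;).
  printf '%s' "$1" | grep -Eq '<body/>|&lt;body/&gt;'
}
```

It could be called as is_empty_extraction "$(curl -s "$url")" against an
extract-only URL; a zero exit status means no text came back.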


Re: Solr Cell (ExtractingRequestHandler) and plain text files

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Feb 10, 2009, at 10:57 AM, Grant Ingersoll wrote:

> So, this seems to be an issue with Tika and its MIME type detection  
> of plain text.  For some discussion on it, see http://www.lucidimagination.com/search/document/64e27546d23e67b9/mime_type_identification_of_plain_text_files
> and also https://issues.apache.org/jira/browse/TIKA-154, which has  
> been committed and should be in Tika 0.3.
>
> In the meantime, you can add the ext.stream.type=text/plain or the  
> ext.resource.name=foo.txt,

Thanks Grant.  Adding ext.resource.name DOES work for me.  I had  
reported it didn't work, but I had some conflicting JAR files from  
earlier versions of Solr Cell.

	Erik


Re: Solr Cell (ExtractingRequestHandler) and plain text files

Posted by Grant Ingersoll <gs...@apache.org>.
So, this seems to be an issue with Tika and its MIME type detection  
of plain text.  For some discussion on it, see http://www.lucidimagination.com/search/document/64e27546d23e67b9/mime_type_identification_of_plain_text_files
and also https://issues.apache.org/jira/browse/TIKA-154, which has  
been committed and should be in Tika 0.3.

In the meantime, you can add either ext.stream.type=text/plain or  
ext.resource.name=foo.txt to the request, e.g.:
  curl "http://localhost:8983/solr/update/extract/?ext.idx.attr=true&ext.extract.only=true&ext.stream.type=text/plain" -F "myfile=@foo.txt"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int></lst>
<str name="foo.txt">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
     &lt;head&gt;
         &lt;title/&gt;
     &lt;/head&gt;
     &lt;body&gt;
         &lt;p&gt;this is some text

here is some more text
&lt;/p&gt;
     &lt;/body&gt;
&lt;/html&gt;
</str>
</response>

or
  curl "http://localhost:8983/solr/update/extract/?ext.idx.attr=true&ext.extract.only=true&ext.resource.name=foo.txt" -F "myfile=@foo.txt"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int></lst>
<str name="foo.txt">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
     &lt;head&gt;
         &lt;title/&gt;
     &lt;/head&gt;
     &lt;body&gt;
         &lt;p&gt;this is some text

here is some more text
&lt;/p&gt;
     &lt;/body&gt;
&lt;/html&gt;
</str>
</response>


So, I guess the bottom line is that we should file a JIRA issue so we  
don't lose track of this, and retest with Tika 0.3.
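
As a rough sketch of how the workaround could be scripted: the helper
below builds an extract-only URL that carries one of the two hints shown
above. The function name extract_url and the SOLR_URL variable are
hypothetical, and the sketch assumes file names contain no characters
that need URL encoding.

```shell
#!/bin/sh
# Hypothetical helper, not part of Solr: build an extract-only URL that
# tells Solr Cell what kind of content to expect.
# SOLR_URL is an assumed setting; adjust it for your install.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

extract_url() {
  # Use only the file name, not the full path, in the hint.
  name=$(basename "$1")
  url="$SOLR_URL/update/extract?ext.extract.only=true"
  case "$name" in
    # Plain text: state the MIME type outright so detection is skipped.
    *.txt) url="$url&ext.stream.type=text/plain" ;;
    # Anything else: pass the file name as the resource name hint.
    *)     url="$url&ext.resource.name=$name" ;;
  esac
  printf '%s\n' "$url"
}
```

With a running Solr, usage would look something like
curl "$(extract_url foo.txt)" -F "myfile=@foo.txt". Each branch sends
only one hint because the two parameters were only tried separately here.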




Re: Solr Cell (ExtractingRequestHandler) and plain text files

Posted by Grant Ingersoll <gs...@apache.org>.
OK, I have reproduced this.  Let me debug for a moment and then we can  
likely file a JIRA


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/



Re: Solr Cell (ExtractingRequestHandler) and plain text files

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
And yes, the file does have textual content :)

And I tried both ext.resource.name and stream.contentType to no avail.

	Erik
