Posted to solr-user@lucene.apache.org by Erik Hatcher <er...@ehatchersolutions.com> on 2009/02/10 04:17:50 UTC
Solr Cell (ExtractingRequestHandler) and plain text files
One other person has reported this to me off-list, and I just
encountered it myself. ExtractingRequestHandler does not handle plain
text files properly (no text is extracted). Here's an example:
curl "http://localhost:8982/solr/update/extract?ext.ignore.und.fl=true&wt=ruby&stream.file=/Users/erikhatcher/dev/cvreg/docs/JoeProgrammer.txt&ext.idx.attr=false&ext.literal.file_type_facet=txt&ext.def.fl=text_t&ext.literal.type_s=Resume&ext.literal.id=Resume:20&ext.extract.only=true&ext.resource.name=foo.txt"
{'responseHeader'=>{'status'=>0,'QTime'=>1},'JoeProgrammer.txt'=>'<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body/>
</html>
'}
It's bound to be something simple, such as a Tika config setting or a
missing JAR.
Anyone have the magic incantation?
Thanks,
Erik
Re: Solr Cell (ExtractingRequestHandler) and plain text files
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Feb 10, 2009, at 10:57 AM, Grant Ingersoll wrote:
> So, this seems to be an issue with Tika and its MIME type detection
> of plain text. For some discussion on it, see http://www.lucidimagination.com/search/document/64e27546d23e67b9/mime_type_identification_of_plain_text_files
> and also https://issues.apache.org/jira/browse/TIKA-154, which has
> been committed and should be in 0.3.
>
> In the meantime, you can add the ext.stream.type=text/plain or the
> ext.resource.name=foo.txt,
Thanks, Grant. Adding ext.resource.name DOES work for me. I had
reported that it didn't, but that turned out to be caused by
conflicting JAR files left over from earlier versions of Solr Cell.
Erik
Re: Solr Cell (ExtractingRequestHandler) and plain text files
Posted by Grant Ingersoll <gs...@apache.org>.
So, this seems to be an issue with Tika and its MIME type detection
of plain text. For some discussion on it, see http://www.lucidimagination.com/search/document/64e27546d23e67b9/mime_type_identification_of_plain_text_files
and also https://issues.apache.org/jira/browse/TIKA-154, which has
been committed and should be in 0.3.
In the meantime, you can add either the ext.stream.type=text/plain or
the ext.resource.name=foo.txt parameter, e.g.:
curl "http://localhost:8983/solr/update/extract/?ext.idx.attr=true&ext.extract.only=true&ext.stream.type=text/plain" -F "myfile=@foo.txt"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int></lst>
<str name="foo.txt"><?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body>
<p>this is some text
here is some more text
</p>
</body>
</html>
</str>
</response>
or
curl "http://localhost:8983/solr/update/extract/?ext.idx.attr=true&ext.extract.only=true&ext.resource.name=foo.txt" -F "myfile=@foo.txt"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int></lst>
<str name="foo.txt"><?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body>
<p>this is some text
here is some more text
</p>
</body>
</html>
</str>
</response>
So, I guess the bottom line is that we should file a JIRA so we don't
lose track of it, and retest once Tika 0.3 is available.
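As a rough sketch of the workaround above, the hint parameters could also be
assembled programmatically. The Python below is illustrative only:
build_extract_url is a hypothetical helper, not part of Solr or Solr Cell. The
parameter names (ext.stream.type, ext.resource.name, ext.extract.only) and the
localhost URL are taken from the curl examples in this thread. The code only
builds the URL; it does not contact a server.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, stream_type=None, resource_name=None,
                      extract_only=True):
    """Build an /update/extract URL that carries a type hint for Tika.

    Hypothetical helper: it only mirrors the query parameters used in the
    curl examples of this thread and is not an official Solr API.
    """
    if stream_type is None and resource_name is None:
        # Without either hint, plain text was reported to come back empty.
        raise ValueError("pass stream_type or resource_name as a hint")

    params = {}
    if extract_only:
        params["ext.extract.only"] = "true"
    if stream_type is not None:
        # Explicit MIME type, e.g. "text/plain"
        params["ext.stream.type"] = stream_type
    if resource_name is not None:
        # File name whose extension lets Tika guess the type, e.g. "foo.txt"
        params["ext.resource.name"] = resource_name

    # urlencode percent-encodes reserved characters such as "/"
    return base_url + "?" + urlencode(params)


# Assumed local Solr instance, as in the curl examples above
url = build_extract_url("http://localhost:8983/solr/update/extract",
                        stream_type="text/plain")
print(url)
```

The resulting URL can be passed to curl together with -F "myfile=@foo.txt",
as in the commands above. Requiring at least one hint is a design choice of
this sketch, made so that the empty-body case fails early on the client side.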
On Feb 10, 2009, at 10:39 AM, Grant Ingersoll wrote:
> OK, I have reproduced this. Let me debug for a moment and then we
> can likely file a JIRA
Re: Solr Cell (ExtractingRequestHandler) and plain text files
Posted by Grant Ingersoll <gs...@apache.org>.
OK, I have reproduced this. Let me debug for a moment and then we can
likely file a JIRA
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search
Re: Solr Cell (ExtractingRequestHandler) and plain text files
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
And yes, the file does have textual content :)
And I tried both ext.resource.name and stream.contentType to no avail.
Erik