Posted to solr-user@lucene.apache.org by Joey Hanzel <ph...@nearinfinity.com> on 2011/04/11 04:35:26 UTC

Re: Extracting contents of zipped files with Tika and Solr 1.4.1

Hi Gary,

I have been experiencing the same problem: I am unable to extract content
from archive file formats.  I just tried again with a clean install of
Solr 3.1.0 (using Tika 0.8) and still see the same results.  Did you have
any success with this problem on Solr 1.4.1 or 3.1.0?

I'm using this curl command to send data to Solr.
curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" \
  -H "application/octet-stream" -F "myfile=@data.zip"

No problem extracting single rich text documents, but archive files only
result in the file names within the archive being indexed. Am I missing
something else in my configuration? Solr doesn't seem to be unpacking the
archive files. Based on the email chain associated with your first message,
some people have been able to get this functionality to work as desired.
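
To make the symptom concrete outside Solr, here is a minimal sketch of the
difference between listing an archive's entry names and reading each entry's
text. It uses Python's standard zipfile module and is not what Tika or Solr
Cell does internally. The entry names doc1.txt and doc2.txt and the doc1.txt
text mirror the test archive quoted below; the doc2.txt text is made up for
illustration.

```python
import io
import zipfile


def build_sample_zip():
    """Build a small test archive in memory and return its bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("doc1.txt", "The quick brown fox")
        # Hypothetical content; the thread never shows doc2.txt's text.
        zf.writestr("doc2.txt", "jumps over the lazy dog")
    return buf.getvalue()


def names_only(data):
    """What ends up indexed in the failing case: just the entry names."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()


def names_and_text(data):
    """What is wanted: each entry's name together with its decoded text."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return {name: zf.read(name).decode("utf-8") for name in zf.namelist()}


if __name__ == "__main__":
    archive = build_sample_zip()
    print(names_only(archive))
    print(names_and_text(archive))
```

The failing behaviour corresponds to names_only(); a working extraction
should produce something closer to names_and_text().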

On Mon, Jan 31, 2011 at 8:27 AM, Gary Taylor <gt...@inovem.com> wrote:

> Can anyone shed any light on this, and whether it could be a config issue?
>  I'm now using the latest SVN trunk, which includes the Tika 0.8 jars.
>
> When I send a ZIP file (containing two txt files, doc1.txt and doc2.txt) to
> the ExtractingRequestHandler, I get the following log entry (formatted for
> ease of reading) :
>
> SolrInputDocument[
>    {
>    ignored_meta=ignored_meta(1.0)={
>        [stream_source_info, file, stream_content_type,
> application/octet-stream, stream_size, 260, stream_name, solr1.zip,
> Content-Type, application/zip]
>        },
>    ignored_=ignored_(1.0)={
>        [package-entry, package-entry]
>        },
>    ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
>    ignored_stream_content_type=ignored_stream_content_type(1.0)={application/octet-stream},
>    ignored_stream_size=ignored_stream_size(1.0)={260},
>    ignored_stream_name=ignored_stream_name(1.0)={solr1.zip},
>    ignored_content_type=ignored_content_type(1.0)={application/zip},
>    docid=docid(1.0)={74},
>    type=type(1.0)={5},
>    text=text(1.0)={                  doc2.txt    doc1.txt    }
>    }
> ]
>
> So, the data coming back from Tika when parsing a ZIP file does not include
> the file contents, only the names of the files contained therein.  I've
> tried forcing stream.type=application/zip in the CURL string, but that makes
> no difference.  If I specify an invalid stream.type then I get an exception
> response, so I know it's being used.
>
> When I send one of those txt files individually to the
> ExtractingRequestHandler, I get:
>
> SolrInputDocument[
>    {
>    ignored_meta=ignored_meta(1.0)={
>        [stream_source_info, file, stream_content_type, text/plain,
> stream_size, 30, Content-Encoding, ISO-8859-1, stream_name, doc1.txt]
>        },
>    ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
>    ignored_stream_content_type=ignored_stream_content_type(1.0)={text/plain},
>    ignored_stream_size=ignored_stream_size(1.0)={30},
>    ignored_content_encoding=ignored_content_encoding(1.0)={ISO-8859-1},
>    ignored_stream_name=ignored_stream_name(1.0)={doc1.txt},
>    docid=docid(1.0)={74},
>    type=type(1.0)={5},
>    text=text(1.0)={                The quick brown fox  }
>    }
> ]
>
> and we see the file contents in the "text" field.
>
> I'm using the following requestHandler definition in solrconfig.xml:
>
> <!-- Solr Cell: http://wiki.apache.org/solr/ExtractingRequestHandler -->
> <requestHandler name="/update/extract"
> class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
> startup="lazy">
> <lst name="defaults">
> <!-- All the main content goes into "text"... if you need to return
>           the extracted text or do highlighting, use a stored field. -->
> <str name="fmap.content">text</str>
> <str name="lowernames">true</str>
> <str name="uprefix">ignored_</str>
>
> <!-- capture link hrefs but ignore div attributes -->
> <str name="captureAttr">true</str>
> <str name="fmap.a">links</str>
> <str name="fmap.div">ignored_</str>
> </lst>
> </requestHandler>
>
> Is there any further debug or diagnostic I can get out of Tika to help me
> work out why it's only returning the file names and not the file contents
> when parsing a ZIP file?
>
>
> Thanks and kind regards,
> Gary.
>
>
>
> On 25/01/2011 16:48, Jayendra Patil wrote:
>
>> Hi Gary,
>>
>> The latest Solr Trunk was able to extract and index the contents of the
>> zip file using the ExtractingRequestHandler.
>> The snapshot of Trunk we worked upon had the Tika 0.8 snapshot jars and
>> worked pretty well.
>>
>> Tested again with sample url and works fine -
>> curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"
>>
>> You would probably need to drill down to the Tika Jars and
>> the apache-solr-cell-4.0-dev.jar used for Rich documents indexing.
>>
>> Regards,
>> Jayendra
>>
>>
>

Re: Extracting contents of zipped files with Tika and Solr 1.4.1 (now Solr 3.1)

Posted by Gary Taylor <gt...@inovem.com>.

Jayendra,

I cleared out my local repository and replayed all of my steps from 
Friday, and now it works.  The only difference (or the only one that's 
obvious to me) was that I applied the patch before doing a full 
compile/test/dist.  But I assumed that given I was seeing my new log 
entries (from ExtractingDocumentLoader.java) I was running the correct 
code anyway.

However, I'm very pleased that it's working now - I get the full 
contents of the zipped files indexed and not just the file names.

Thank you again for your assistance, and the patch!

Kind regards,
Gary.

On 21/05/2011 03:12, Jayendra Patil wrote:
> Hi Gary,
>
> I tried the patch on the 3.1 source code (@
> http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_1/)
> as well and it worked fine.
> @Patch - https://issues.apache.org/jira/browse/SOLR-2416, which deals
> with the Solr Cell module.
>
> You may want to verify the contents from the results by enabling the
> stored attribute on the text field.
>
> e.g. URL curl "http://localhost:8983/solr/update/extract?stream.file=C:/Test.zip&literal.id=777045&literal.title=Test&commit=true"
>
> Let me know if it works. I would be happy to share the generated
> artifact you can test on.
>
> Regards,
> Jayendra

Re: Extracting contents of zipped files with Tika and Solr 1.4.1 (now Solr 3.1)

Posted by Jayendra Patil <ja...@gmail.com>.

Hi Gary,

I tried the patch on the 3.1 source code (@
http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_1/)
as well and it worked fine.
@Patch - https://issues.apache.org/jira/browse/SOLR-2416, which deals
with the Solr Cell module.

You may want to verify the contents from the results by enabling the
stored attribute on the text field.

e.g. URL curl "http://localhost:8983/solr/update/extract?stream.file=C:/Test.zip&literal.id=777045&literal.title=Test&commit=true"
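
As a rough sketch of that check, the snippet below builds the extract URL
from the example above and a follow-up query URL that asks for the text
field back. It assumes the default /select handler, that the document is
keyed on the id field, and that the text field has been made stored; the
function names are only illustrative.

```python
from urllib.parse import urlencode

# Host and port as in the curl example above.
SOLR_BASE = "http://localhost:8983/solr"


def extract_url(stream_file, doc_id, title):
    """URL that asks Solr Cell to extract and index a file on the server."""
    params = [
        ("stream.file", stream_file),
        ("literal.id", doc_id),
        ("literal.title", title),
        ("commit", "true"),
    ]
    return SOLR_BASE + "/update/extract?" + urlencode(params)


def verify_url(doc_id, field="text"):
    """URL that fetches the document back, including the extracted field.

    Assumes the standard /select handler and that `field` is stored.
    """
    params = [("q", "id:" + doc_id), ("fl", "id," + field)]
    return SOLR_BASE + "/select?" + urlencode(params)


if __name__ == "__main__":
    print(extract_url("C:/Test.zip", "777045", "Test"))
    print(verify_url("777045"))
```

If the patch is working, the field returned by the second URL should contain
the text of the files inside the archive rather than only their names.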

Let me know if it works. I would be happy to share the generated
artifact you can test on.

Regards,
Jayendra

On Fri, May 20, 2011 at 11:15 AM, Gary Taylor <gt...@inovem.com> wrote:
> Hello again.  Unfortunately, I'm still getting nowhere with this.  I have
> checked-out the 3.1 source and applied Jayendra's patches (see below) and it
> still appears that the contents of the files in the zipfile are not being
> indexed, only the filenames of those contained files.
>
> I'm using a simple CURL invocation to test this:
>
> curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" \
>   -F "commit=true" -F "file=@solr1.zip"
>
> solr1.zip contains two simple txt files (doc1.txt and doc2.txt).  I'm
> expecting the contents of those txt files to be extracted from the zip and
> indexed, but this isn't happening - or at least, I don't get the desired
> result when I do a query afterwards.  I do get a match if I search for
> either "doc1.txt" or "doc2.txt", but not if I search for a word that appears
> in their contents.
>
> If I index one of the txt files (instead of the zipfile), I can query the
> content OK, so I'm assuming my query is sensible and matches the field
> specified on the CURL string (ie. "text").  I'm also happy that the Solr
> Cell content extraction is working because I can successfully index PDF,
> Word, etc. files.
>
> In a fit of desperation I have added log.info statements into the files
> referenced by Jayendra's patches (SOLR-2416 and SOLR-2332) and I see those
> in the log when I submit the zipfile with CURL, so I know I'm running those
> patched files in the build.
>
> If anyone can shed any light on what's happening here, I'd be very grateful.
>
> Thanks and kind regards,
> Gary.
>
>
> On 11/04/2011 11:12, Gary Taylor wrote:
>>
>> Jayendra,
>>
>> Thanks for the info - been keeping an eye on this list in case this topic
>> cropped up again.  It's currently a background task for me, so I'll try and
>> take a look at the patches and re-test soon.
>>
>> Joey - glad you brought this issue up again.  I haven't progressed any
>> further with it.  I've not yet moved to Solr 3.1 but it's on my to-do list,
>> as is testing out the patches referenced by Jayendra.  I'll post my findings
>> on this thread - if you manage to test the patches before me, let me know
>> how you get on.
>>
>> Thanks and kind regards,
>> Gary.
>>
>>
>> On 11/04/2011 05:02, Jayendra Patil wrote:
>>>
>>> The migration of Tika to the latest 0.8 version seems to have
>>> reintroduced the issue.
>>>
>>> I was able to get this working again with the following patches. (Solr
>>> Cell and Data Import handler)
>>>
>>> https://issues.apache.org/jira/browse/SOLR-2416
>>> https://issues.apache.org/jira/browse/SOLR-2332
>>>
>>> You can try these.
>>>
>>> Regards,
>>> Jayendra
>>>
>>> On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel<ph...@nearinfinity.com>
>>>  wrote:
>>>>
>>>> Hi Gary,
>>>>
>>>> I have been experiencing the same problem... Unable to extract content
>>>> from archive file formats.  I just tried again with a clean install of
>>>> Solr 3.1.0 (using Tika 0.8) and continue to experience the same results.
>>>> Did you have any success with this problem with Solr 1.4.1 or 3.1.0 ?
>>>>
>>>> I'm using this curl command to send data to Solr.
>>>> curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" \
>>>>   -H "application/octet-stream" -F "myfile=@data.zip"
>>>>
>>>> No problem extracting single rich text documents, but archive files only
>>>> result in the file names within the archive being indexed. Am I missing
>>>> something else in my configuration? Solr doesn't seem to be unpacking
>>>> the archive files. Based on the email chain associated with your first
>>>> message, some people have been able to get this functionality to work as
>>>> desired.
>>>>
>>>
>>
>>
>
>
> --
> Gary Taylor
> INOVEM
>
> Tel +44 (0)1488 648 480
> Fax +44 (0)7092 115 933
> gary.taylor@inovem.com
> www.inovem.com
>
> INOVEM Ltd is registered in England and Wales No 4228932
> Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
>
>

Re: Extracting contents of zipped files with Tika and Solr 1.4.1 (now Solr 3.1)

Posted by Gary Taylor <gt...@inovem.com>.

Hello again.  Unfortunately, I'm still getting nowhere with this.  I 
have checked-out the 3.1 source and applied Jayendra's patches (see 
below) and it still appears that the contents of the files in the 
zipfile are not being indexed, only the filenames of those contained files.

I'm using a simple CURL invocation to test this:

curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" \
  -F "commit=true" -F "file=@solr1.zip"

solr1.zip contains two simple txt files (doc1.txt and doc2.txt).  I'm 
expecting the contents of those txt files to be extracted from the zip 
and indexed, but this isn't happening - or at least, I don't get the 
desired result when I do a query afterwards.  I do get a match if I 
search for either "doc1.txt" or "doc2.txt", but not if I search for a 
word that appears in their contents.
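
A small sketch of how that comparison could be scripted: it reads the
numFound attribute from Solr's XML query response, once for a file-name
query and once for a content query. The two sample responses are
hand-written placeholders showing the shape of the XML, not captured
output, and the function name is only illustrative.

```python
import xml.etree.ElementTree as ET

# Hand-written placeholder responses, not real captured output.
FILENAME_QUERY_RESPONSE = """
<response>
  <result name="response" numFound="1" start="0">
    <doc><str name="docid">74</str></doc>
  </result>
</response>
"""

CONTENT_QUERY_RESPONSE = """
<response>
  <result name="response" numFound="0" start="0"/>
</response>
"""


def num_found(solr_xml):
    """Return the hit count from a Solr XML query response."""
    root = ET.fromstring(solr_xml.strip())
    result = root.find("result")
    if result is None:
        raise ValueError("no <result> element in response")
    return int(result.get("numFound", "0"))


if __name__ == "__main__":
    # File-name query matches, content query does not: the symptom above.
    print("file name hits:", num_found(FILENAME_QUERY_RESPONSE))
    print("content hits:  ", num_found(CONTENT_QUERY_RESPONSE))
```

A non-zero count for the file-name query together with zero for the content
query is the pattern described above; once extraction works, both counts
should be non-zero.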

If I index one of the txt files (instead of the zipfile), I can query 
the content OK, so I'm assuming my query is sensible and matches the 
field specified on the CURL string (ie. "text").  I'm also happy that 
the Solr Cell content extraction is working because I can successfully 
index PDF, Word, etc. files.

In a fit of desperation I have added log.info statements into the files 
referenced by Jayendra's patches (SOLR-2416 and SOLR-2332) and I see 
those in the log when I submit the zipfile with CURL, so I know I'm 
running those patched files in the build.

If anyone can shed any light on what's happening here, I'd be very grateful.

Thanks and kind regards,
Gary.

On 11/04/2011 11:12, Gary Taylor wrote:
> Jayendra,
>
> Thanks for the info - been keeping an eye on this list in case this 
> topic cropped up again.  It's currently a background task for me, so 
> I'll try and take a look at the patches and re-test soon.
>
> Joey - glad you brought this issue up again.  I haven't progressed any 
> further with it.  I've not yet moved to Solr 3.1 but it's on my to-do 
> list, as is testing out the patches referenced by Jayendra.  I'll post 
> my findings on this thread - if you manage to test the patches before 
> me, let me know how you get on.
>
> Thanks and kind regards,
> Gary.
>
>
> On 11/04/2011 05:02, Jayendra Patil wrote:
>> The migration of Tika to the latest 0.8 version seems to have
>> reintroduced the issue.
>>
>> I was able to get this working again with the following patches. (Solr
>> Cell and Data Import handler)
>>
>> https://issues.apache.org/jira/browse/SOLR-2416
>> https://issues.apache.org/jira/browse/SOLR-2332
>>
>> You can try these.
>>
>> Regards,
>> Jayendra
>>
>> On Sun, Apr 10, 2011 at 10:35 PM, Joey 
>> Hanzel<ph...@nearinfinity.com>  wrote:
>>> Hi Gary,
>>>
>>> I have been experiencing the same problem... Unable to extract content
>>> from archive file formats.  I just tried again with a clean install of
>>> Solr 3.1.0 (using Tika 0.8) and continue to experience the same results.
>>> Did you have any success with this problem with Solr 1.4.1 or 3.1.0 ?
>>>
>>> I'm using this curl command to send data to Solr.
>>> curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" \
>>>   -H "application/octet-stream" -F "myfile=@data.zip"
>>>
>>> No problem extracting single rich text documents, but archive files only
>>> result in the file names within the archive being indexed. Am I missing
>>> something else in my configuration? Solr doesn't seem to be unpacking the
>>> archive files. Based on the email chain associated with your first message,
>>> some people have been able to get this functionality to work as desired.
>>>
>>
>
>

Re: Extracting contents of zipped files with Tika and Solr 1.4.1

Posted by Gary Taylor <gt...@inovem.com>.

Jayendra,

Thanks for the info - been keeping an eye on this list in case this 
topic cropped up again.  It's currently a background task for me, so 
I'll try and take a look at the patches and re-test soon.

Joey - glad you brought this issue up again.  I haven't progressed any 
further with it.  I've not yet moved to Solr 3.1 but it's on my to-do 
list, as is testing out the patches referenced by Jayendra.  I'll post 
my findings on this thread - if you manage to test the patches before 
me, let me know how you get on.

Thanks and kind regards,
Gary.

On 11/04/2011 05:02, Jayendra Patil wrote:
> The migration of Tika to the latest 0.8 version seems to have
> reintroduced the issue.
>
> I was able to get this working again with the following patches. (Solr
> Cell and Data Import handler)
>
> https://issues.apache.org/jira/browse/SOLR-2416
> https://issues.apache.org/jira/browse/SOLR-2332
>
> You can try these.
>
> Regards,
> Jayendra
>
> On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel<ph...@nearinfinity.com>  wrote:
>> Hi Gary,
>>
>> I have been experiencing the same problem... Unable to extract content from
>> archive file formats.  I just tried again with a clean install of Solr 3.1.0
>> (using Tika 0.8) and continue to experience the same results.  Did you have
>> any success with this problem with Solr 1.4.1 or 3.1.0 ?
>>
>> I'm using this curl command to send data to Solr.
>> curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" \
>>   -H "application/octet-stream" -F "myfile=@data.zip"
>>
>> No problem extracting single rich text documents, but archive files only
>> result in the file names within the archive being indexed. Am I missing
>> something else in my configuration? Solr doesn't seem to be unpacking the
>> archive files. Based on the email chain associated with your first message,
>> some people have been able to get this functionality to work as desired.
>>
>

Re: Extracting contents of zipped files with Tika and Solr 1.4.1

Posted by Joey Hanzel <ph...@nearinfinity.com>.

Awesome. Thanks Jayendra.  I hadn't caught these patches yet.

I applied SOLR-2416 patch to the solr-3.1 release tag. This resolved the
problem of archive files not being unpacked and indexed with Solr CELL.
Thanks for the FYI.
https://issues.apache.org/jira/browse/SOLR-2416

On Mon, Apr 11, 2011 at 12:02 AM, Jayendra Patil <
jayendra.patil.001@gmail.com> wrote:

> The migration of Tika to the latest 0.8 version seems to have
> reintroduced the issue.
>
> I was able to get this working again with the following patches. (Solr
> Cell and Data Import handler)
>
> https://issues.apache.org/jira/browse/SOLR-2416
> https://issues.apache.org/jira/browse/SOLR-2332
>
> You can try these.
>
> Regards,
> Jayendra
>
> On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel <ph...@nearinfinity.com>
> wrote:
> > Hi Gary,
> >
> > I have been experiencing the same problem... Unable to extract content
> > from archive file formats.  I just tried again with a clean install of
> > Solr 3.1.0 (using Tika 0.8) and continue to experience the same results.
> > Did you have any success with this problem with Solr 1.4.1 or 3.1.0 ?
> >
> > I'm using this curl command to send data to Solr.
> > curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" \
> >   -H "application/octet-stream" -F "myfile=@data.zip"
> >
> > No problem extracting single rich text documents, but archive files only
> > result in the file names within the archive being indexed. Am I missing
> > something else in my configuration? Solr doesn't seem to be unpacking the
> > archive files. Based on the email chain associated with your first
> > message, some people have been able to get this functionality to work as
> > desired.
> >
> > On Mon, Jan 31, 2011 at 8:27 AM, Gary Taylor <gt...@inovem.com> wrote:
> >
> >> Can anyone shed any light on this, and whether it could be a config
> >> issue?  I'm now using the latest SVN trunk, which includes the Tika 0.8
> >> jars.
> >>
> >> When I send a ZIP file (containing two txt files, doc1.txt and doc2.txt)
> >> to the ExtractingRequestHandler, I get the following log entry (formatted
> >> for ease of reading):
> >>
> >> SolrInputDocument[
> >>    {
> >>    ignored_meta=ignored_meta(1.0)={
> >>        [stream_source_info, file, stream_content_type,
> >> application/octet-stream, stream_size, 260, stream_name, solr1.zip,
> >> Content-Type, application/zip]
> >>        },
> >>    ignored_=ignored_(1.0)={
> >>        [package-entry, package-entry]
> >>        },
> >>    ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
> >>    ignored_stream_content_type=ignored_stream_content_type(1.0)={application/octet-stream},
> >>
> >>    ignored_stream_size=ignored_stream_size(1.0)={260},
> >>    ignored_stream_name=ignored_stream_name(1.0)={solr1.zip},
> >>    ignored_content_type=ignored_content_type(1.0)={application/zip},
> >>    docid=docid(1.0)={74},
> >>    type=type(1.0)={5},
> >>    text=text(1.0)={                  doc2.txt    doc1.txt    }
> >>    }
> >> ]
> >>
> >> So, the data coming back from Tika when parsing a ZIP file does not
> >> include the file contents, only the names of the files contained therein.
> >> I've tried forcing stream.type=application/zip in the CURL string, but
> >> that makes no difference.  If I specify an invalid stream.type then I get
> >> an exception response, so I know it's being used.
> >>
> >> When I send one of those txt files individually to the
> >> ExtractingRequestHandler, I get:
> >>
> >> SolrInputDocument[
> >>    {
> >>    ignored_meta=ignored_meta(1.0)={
> >>        [stream_source_info, file, stream_content_type, text/plain,
> >> stream_size, 30, Content-Encoding, ISO-8859-1, stream_name, doc1.txt]
> >>        },
> >>    ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
> >>    ignored_stream_content_type=ignored_stream_content_type(1.0)={text/plain},
> >>    ignored_stream_size=ignored_stream_size(1.0)={30},
> >>    ignored_content_encoding=ignored_content_encoding(1.0)={ISO-8859-1},
> >>    ignored_stream_name=ignored_stream_name(1.0)={doc1.txt},
> >>    docid=docid(1.0)={74},
> >>    type=type(1.0)={5},
> >>    text=text(1.0)={                The quick brown fox  }
> >>    }
> >> ]
> >>
> >> and we see the file contents in the "text" field.
> >>
> >> I'm using the following requestHandler definition in solrconfig.xml:
> >>
> >> <!-- Solr Cell: http://wiki.apache.org/solr/ExtractingRequestHandler -->
> >> <requestHandler name="/update/extract"
> >> class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
> >> startup="lazy">
> >> <lst name="defaults">
> >> <!-- All the main content goes into "text"... if you need to return
> >>           the extracted text or do highlighting, use a stored field. -->
> >> <str name="fmap.content">text</str>
> >> <str name="lowernames">true</str>
> >> <str name="uprefix">ignored_</str>
> >>
> >> <!-- capture link hrefs but ignore div attributes -->
> >> <str name="captureAttr">true</str>
> >> <str name="fmap.a">links</str>
> >> <str name="fmap.div">ignored_</str>
> >> </lst>
> >> </requestHandler>
> >>
> >> Is there any further debug or diagnostic I can get out of Tika to help
> >> me work out why it's only returning the file names and not the file
> >> contents when parsing a ZIP file?
> >>
> >>
> >> Thanks and kind regards,
> >> Gary.
> >>
> >>
> >>
> >> On 25/01/2011 16:48, Jayendra Patil wrote:
> >>
> >>> Hi Gary,
> >>>
> >>> The latest Solr Trunk was able to extract and index the contents of the
> >>> zip file using the ExtractingRequestHandler.
> >>> The snapshot of Trunk we worked upon had the Tika 0.8 snapshot jars and
> >>> worked pretty well.
> >>>
> >>> Tested again with sample url and works fine -
> >>> curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"
> >>>
> >>> You would probably need to drill down to the Tika Jars and
> >>> the apache-solr-cell-4.0-dev.jar used for Rich documents indexing.
> >>>
> >>> Regards,
> >>> Jayendra
> >>>
> >>>
> >>
> >
>

Re: Extracting contents of zipped files with Tika and Solr 1.4.1

Posted by Jayendra Patil <ja...@gmail.com>.

The migration of Tika to the latest 0.8 version seems to have
reintroduced the issue.

I was able to get this working again with the following patches. (Solr
Cell and Data Import handler)

https://issues.apache.org/jira/browse/SOLR-2416
https://issues.apache.org/jira/browse/SOLR-2332

You can try these.

Regards,
Jayendra

On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel <ph...@nearinfinity.com> wrote:
> Hi Gary,
>
> I have been experiencing the same problem... Unable to extract content from
> archive file formats.  I just tried again with a clean install of Solr 3.1.0
> (using Tika 0.8) and continue to experience the same results.  Did you have
> any success with this problem with Solr 1.4.1 or 3.1.0 ?
>
> I'm using this curl command to send data to Solr.
> curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" \
>   -H "application/octet-stream" -F "myfile=@data.zip"
>
> No problem extracting single rich text documents, but archive files only
> result in the file names within the archive being indexed. Am I missing
> something else in my configuration? Solr doesn't seem to be unpacking the
> archive files. Based on the email chain associated with your first message,
> some people have been able to get this functionality to work as desired.
>
> On Mon, Jan 31, 2011 at 8:27 AM, Gary Taylor <gt...@inovem.com> wrote:
>
>> Can anyone shed any light on this, and whether it could be a config issue?
>>  I'm now using the latest SVN trunk, which includes the Tika 0.8 jars.
>>
>> When I send a ZIP file (containing two txt files, doc1.txt and doc2.txt) to
>> the ExtractingRequestHandler, I get the following log entry (formatted for
>> ease of reading) :
>>
>> SolrInputDocument[
>>    {
>>    ignored_meta=ignored_meta(1.0)={
>>        [stream_source_info, file, stream_content_type,
>> application/octet-stream, stream_size, 260, stream_name, solr1.zip,
>> Content-Type, application/zip]
>>        },
>>    ignored_=ignored_(1.0)={
>>        [package-entry, package-entry]
>>        },
>>    ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
>>    ignored_stream_content_type=ignored_stream_content_type(1.0)={application/octet-stream},
>>    ignored_stream_size=ignored_stream_size(1.0)={260},
>>    ignored_stream_name=ignored_stream_name(1.0)={solr1.zip},
>>    ignored_content_type=ignored_content_type(1.0)={application/zip},
>>    docid=docid(1.0)={74},
>>    type=type(1.0)={5},
>>    text=text(1.0)={                  doc2.txt    doc1.txt    }
>>    }
>> ]
>>
>> So, the data coming back from Tika when parsing a ZIP file does not include
>> the file contents, only the names of the files contained therein.  I've
>> tried forcing stream.type=application/zip in the CURL string, but that makes
>> no difference.  If I specify an invalid stream.type then I get an exception
>> response, so I know it's being used.
>>
>> When I send one of those txt files individually to the
>> ExtractingRequestHandler, I get:
>>
>> SolrInputDocument[
>>    {
>>    ignored_meta=ignored_meta(1.0)={
>>        [stream_source_info, file, stream_content_type, text/plain,
>> stream_size, 30, Content-Encoding, ISO-8859-1, stream_name, doc1.txt]
>>        },
>>    ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
>>    ignored_stream_content_type=ignored_stream_content_type(1.0)={text/plain},
>>    ignored_stream_size=ignored_stream_size(1.0)={30},
>>    ignored_content_encoding=ignored_content_encoding(1.0)={ISO-8859-1},
>>    ignored_stream_name=ignored_stream_name(1.0)={doc1.txt},
>>    docid=docid(1.0)={74},
>>    type=type(1.0)={5},
>>    text=text(1.0)={                The quick brown fox  }
>>    }
>> ]
>>
>> and we see the file contents in the "text" field.
>>
>> I'm using the following requestHandler definition in solrconfig.xml:
>>
>> <!-- Solr Cell: http://wiki.apache.org/solr/ExtractingRequestHandler -->
>> <requestHandler name="/update/extract"
>> class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
>> startup="lazy">
>> <lst name="defaults">
>> <!-- All the main content goes into "text"... if you need to return
>>           the extracted text or do highlighting, use a stored field. -->
>> <str name="fmap.content">text</str>
>> <str name="lowernames">true</str>
>> <str name="uprefix">ignored_</str>
>>
>> <!-- capture link hrefs but ignore div attributes -->
>> <str name="captureAttr">true</str>
>> <str name="fmap.a">links</str>
>> <str name="fmap.div">ignored_</str>
>> </lst>
>> </requestHandler>
>>
>> Is there any further debug or diagnostic I can get out of Tika to help me
>> work out why it's only returning the file names and not the file contents
>> when parsing a ZIP file?
>>
>>
>> Thanks and kind regards,
>> Gary.
>>
>>
>>
>> On 25/01/2011 16:48, Jayendra Patil wrote:
>>
>>> Hi Gary,
>>>
>>> The latest Solr Trunk was able to extract and index the contents of the
>>> zip file using the ExtractingRequestHandler.
>>> The snapshot of Trunk we worked upon had the Tika 0.8 snapshot jars and
>>> worked pretty well.
>>>
>>> Tested again with sample url and works fine -
>>> curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"
>>>
>>> You would probably need to drill down to the Tika Jars and
>>> the apache-solr-cell-4.0-dev.jar used for Rich documents indexing.
>>>
>>> Regards,
>>> Jayendra
>>>
>>>
>>
>