
Posted to solr-user@lucene.apache.org by Lee Carroll <le...@googlemail.com> on 2018/04/24 16:26:02 UTC

solr cell: write entire file content binary to index along with metadata

Does the solr cell contrib give access to the file's raw content along with
the extracted metadata?

cheers Lee C

Re: solr cell: write entire file content binary to index along with metadata

Posted by Rahul Singh <ra...@gmail.com>.

Lucene (the main underlying technology in Solr) can handle any data, but it’s optimized to be an index, not a file store. It is better to keep the raw files in another database or storage system, such as Cassandra or S3, than in Solr.

In our experience, using the Tika binary or microservice as a pre-index process can improve the overall stability of the Solr service.


--
Rahul Singh
rahul.singh@anant.us

Anant Corporation

On Apr 25, 2018, 12:49 PM -0400, Shawn Heisey <ap...@elyograg.org>, wrote:
> On 4/25/2018 4:02 AM, Lee Carroll wrote:
> > *We don't recommend using solr-cell for production indexing.*
> >
> > OK. Is that because of:
> >
> > Performance? I think we have a rather modest indexing requirement (1000
> > a day... on a busy day).
> >
> > Security? The indexing workflow is: files are uploaded to a public-facing
> > server with auth, written to disk, scanned, copied to an internal server,
> > and ingested into the index from there.
> >
> > Are there other reasons we should worry about?
>
> Tika is the underlying technology in solr-cell.  Tika is a separate
> Apache product designed for parsing common rich-text formats, like
> Microsoft Office, PDF, etc.
>
> http://tika.apache.org/
>
> The problems that can result are related to running Tika inside of Solr,
> which is what solr-cell does.
>
> The Tika authors try very hard to make sure that Tika doesn't misbehave,
> but the very nature of what Tika does means it is somewhat prone to
> misbehaving.  Many of the file formats that Tika processes are
> undocumented, or any documentation that is available is not available to
> open source developers.  Also, sometimes documents in those formats will
> be constructed in a way that the Tika authors have never seen before, or
> they may completely violate what conventions the authors DO know about.
>
> Long story short -- Tika can encounter documents that can cause it to
> crash, or to consume all the memory in the system, or misbehave in other
> ways.  If Tika is running inside Solr, then when it has a problem, Solr
> itself can blow up and have a problem too.
>
> For this reason, and because Tika can sometimes use a lot of resources
> even when it is working correctly, we recommend running it outside of
> Solr in another program that takes its output and sends it to Solr.
> Ideally, it will be running on a completely different machine than Solr
> is running on.
>
> Thanks,
> Shawn
>

Re: solr cell: write entire file content binary to index along with metadata

Posted by Shawn Heisey <ap...@elyograg.org>.

On 4/25/2018 4:02 AM, Lee Carroll wrote:
>     *We don't recommend using solr-cell for production indexing.*
>
> OK. Is that because of:
>
> Performance? I think we have a rather modest indexing requirement (1000
> a day... on a busy day).
>
> Security? The indexing workflow is: files are uploaded to a public-facing
> server with auth, written to disk, scanned, copied to an internal server,
> and ingested into the index from there.
>
> Are there other reasons we should worry about?

Tika is the underlying technology in solr-cell.  Tika is a separate 
Apache product designed for parsing common rich-text formats, like 
Microsoft Office, PDF, etc.

http://tika.apache.org/

The problems that can result are related to running Tika inside of Solr, 
which is what solr-cell does.

The Tika authors try very hard to make sure that Tika doesn't misbehave, 
but the very nature of what Tika does means it is somewhat prone to 
misbehaving.  Many of the file formats that Tika processes are 
undocumented, or any documentation that is available is not available to 
open source developers.  Also, sometimes documents in those formats will 
be constructed in a way that the Tika authors have never seen before, or 
they may completely violate what conventions the authors DO know about.

Long story short -- Tika can encounter documents that can cause it to 
crash, or to consume all the memory in the system, or misbehave in other 
ways.  If Tika is running inside Solr, then when it has a problem, Solr 
itself can blow up and have a problem too.

For this reason, and because Tika can sometimes use a lot of resources 
even when it is working correctly, we recommend running it outside of 
Solr in another program that takes its output and sends it to Solr.  
Ideally, it will be running on a completely different machine than Solr 
is running on.

Thanks,
Shawn
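
A minimal sketch of the "run Tika outside Solr" arrangement described above, in Python. It assumes a standalone Tika Server on localhost:9998 and a Solr collection named "docs"; both addresses, the field names (location_s, content_txt, meta_*_s) and every function name are illustrative assumptions, not something given in this thread. The point is only that extraction happens in a separate process, so a document that makes Tika crash or hang fails the indexer's HTTP call and never touches Solr.

```python
import json
import urllib.request

# Assumed addresses: adjust to your own Tika Server and Solr collection.
TIKA_URL = "http://localhost:9998"
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"


def tika_text_request(data):
    """Build a PUT request asking Tika Server for the extracted plain text."""
    return urllib.request.Request(
        TIKA_URL + "/tika", data=data, method="PUT",
        headers={"Accept": "text/plain"})


def tika_meta_request(data):
    """Build a PUT request asking Tika Server for metadata as JSON."""
    return urllib.request.Request(
        TIKA_URL + "/meta", data=data, method="PUT",
        headers={"Accept": "application/json"})


def build_solr_doc(doc_id, location, text, metadata):
    """Combine extracted text, metadata and a pointer to the original file.

    Field names are hypothetical; they rely on *_s and *_txt dynamic fields
    being defined in the schema.
    """
    doc = {"id": doc_id, "location_s": location, "content_txt": text}
    for key, value in metadata.items():
        safe = "".join(c.lower() if c.isalnum() else "_" for c in key)
        doc["meta_" + safe + "_s"] = str(value)
    return doc


def solr_update_request(docs):
    """Build a POST request sending a JSON array of documents to Solr."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        SOLR_UPDATE_URL, data=body, method="POST",
        headers={"Content-Type": "application/json"})


def index_file(path, timeout=30):
    """Extract with Tika, then index in Solr. A Tika failure or timeout
    raises here, in the indexer, before anything is sent to Solr."""
    with open(path, "rb") as fh:
        data = fh.read()
    with urllib.request.urlopen(tika_text_request(data), timeout=timeout) as r:
        text = r.read().decode("utf-8", errors="replace")
    with urllib.request.urlopen(tika_meta_request(data), timeout=timeout) as r:
        metadata = json.loads(r.read().decode("utf-8"))
    doc = build_solr_doc(path, path, text, metadata)
    with urllib.request.urlopen(solr_update_request([doc]), timeout=timeout) as r:
        return r.status
```

Calling index_file("/data/incoming/report.pdf") (a made-up path) would then be wrapped in whatever retry or quarantine logic suits the upload workflow; the timeout is the simplest guard against a document that makes Tika hang.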

Re: solr cell: write entire file content binary to index along with metadata

Posted by Lee Carroll <le...@googlemail.com>.

> *That's not usually the kind of information you want to have in a
> Solr index.  Most of the time, there will be an entry in the Solr index
> that tells the system making queries how to locate the actual data --
> a filename, a URL, a database lookup key, etc.*

Agreed. The app will have a few implementations for storing the binary
file. The easiest for a user to configure for prototyping would be a
store-in-index implementation. A live implementation would probably use
the filesystem.

   *We don't recommend using solr-cell for production indexing.*

OK. Is that because of:

Performance? I think we have a rather modest indexing requirement (1000
a day... on a busy day).

Security? The indexing workflow is: files are uploaded to a public-facing
server with auth, written to disk, scanned, copied to an internal server,
and ingested into the index from there.

Are there other reasons we should worry about?

Cheers Lee C

On 25 April 2018 at 00:37, Shawn Heisey <ap...@elyograg.org> wrote:

> On 4/24/2018 10:26 AM, Lee Carroll wrote:
> > Does the solr cell contrib give access to the file's raw content along
> > with the extracted metadata?
>
> That's not usually the kind of information you want to have in a Solr
> index.  Most of the time, there will be an entry in the Solr index that
> tells the system making queries how to locate the actual data -- a
> filename, a URL, a database lookup key, etc.
>
> I have no idea whether solr-cell can put the info in the index.  My best
> guess would be that it can't, since putting the entire binary content
> into the index isn't recommended.
>
> We don't recommend using solr-cell for production indexing.  If you
> follow recommendations and write your own indexing program using Tika,
> then you can do pretty much anything you want, including writing the
> full content into the index.
>
> Thanks,
> Shawn
>
>

Re: solr cell: write entire file content binary to index along with metadata

Posted by Shawn Heisey <ap...@elyograg.org>.

On 4/24/2018 10:26 AM, Lee Carroll wrote:
> Does the solr cell contrib give access to the file's raw content along with
> the extracted metadata?

That's not usually the kind of information you want to have in a Solr
index.  Most of the time, there will be an entry in the Solr index that
tells the system making queries how to locate the actual data -- a
filename, a URL, a database lookup key, etc.

I have no idea whether solr-cell can put the info in the index.  My best
guess would be that it can't, since putting the entire binary content
into the index isn't recommended.

We don't recommend using solr-cell for production indexing.  If you
follow recommendations and write your own indexing program using Tika,
then you can do pretty much anything you want, including writing the
full content into the index.

Thanks,
Shawn
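
To make the two storage options discussed in this thread concrete, here is a small Python sketch of what a custom indexing program might add to each Solr document. It is an illustration under assumptions, not an established recipe: the field names location_s and raw_content_bin are hypothetical, and the second variant assumes the schema defines a binary field type (such as solr.BinaryField) that accepts Base64-encoded values.

```python
import base64
import hashlib
from pathlib import Path


def store_on_filesystem(data, root):
    """Write the file under a content-hash path and return a pointer field.

    The index then holds only the location, which is the arrangement
    recommended in the thread.
    """
    digest = hashlib.sha256(data).hexdigest()
    target = Path(root) / digest[:2] / digest
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return {"location_s": str(target)}


def store_in_index(data):
    """Return a field carrying the whole file, Base64-encoded.

    Convenient for prototyping, but it makes every stored document in the
    index as large as the file it came from.
    """
    return {"raw_content_bin": base64.b64encode(data).decode("ascii")}
```

Either function returns a dict that can be merged into the document sent to Solr, so switching between the prototype and live setups is a one-line change in the indexer.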