Posted to dev@solr.apache.org by Jan Høydahl <ja...@cominvent.com> on 2023/03/07 22:48:01 UTC

[DISCUSS] Future of SolrCell in Solr

Hi,

This has been a recurring topic, and there have been many suggestions for what to do with "Solr Cell" aka Extracting Request Handler aka Tika.

Most agree it's a bad idea to parse huge PDFs in Solr's JVM process like we do.

Proposals over the years have been:

* Move SolrCell to a package, outside of Solr's tarball SOLR-15951 <https://issues.apache.org/jira/browse/SOLR-15951>
* Deprecate SolrCell SOLR-13973 <https://issues.apache.org/jira/browse/SOLR-13973>
* Keep in Solr but use Tika-Server <https://cwiki.apache.org/confluence/display/TIKA/TikaServer>,  SOLR-7632 <https://issues.apache.org/jira/browse/SOLR-7632>
* Integrate Tika client-side SOLR-1526 <https://issues.apache.org/jira/browse/SOLR-1526>

We should make a plan now for what the Tika story will be for Solr 10.0. We should not underestimate the number of Solr users who rely on SolrCell, and should therefore not take this decision lightly. A well-communicated story and a well-executed migration path will keep users satisfied. A bad experience will repel users.

Personally I prefer to run Tika on the client side and index the already-extracted text into Solr. We already document that Solr Cell is not recommended for production use <https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-with-tika.html#solr-cell-performance-implications>.
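
A minimal sketch of the client-side flow, assuming the text has already been extracted by Tika outside Solr and only needs to be posted as a JSON document. The helper name, the collection name "docs", and the "title"/"content" field names are hypothetical; the request is built but not sent, so it can be inspected without a running Solr.

```python
import json
import urllib.request


def build_solr_update_request(solr_base_url, collection, doc):
    """Build (but do not send) a JSON update request for one pre-extracted document."""
    url = f"{solr_base_url.rstrip('/')}/{collection}/update?commit=true"
    body = json.dumps([doc]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical document: the text was extracted by Tika on the client side.
request = build_solr_update_request(
    "http://localhost:8983/solr",
    "docs",
    {"id": "report-1", "title": "Annual report", "content": "Extracted body text"},
)
# urllib.request.urlopen(request) would send it to a running Solr.
```

With this approach Solr never sees the original PDF, so a parser failure only affects the client that ran the extraction.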

My current thinking / proposal is to:
* Build a new, thin Solr module that exposes a compatible /update/extract handler, delegating to Tika-Server (user-hosted)
* Deprecate SolrCell in current form
* From 10.0, Solr will not ship with embedded Tika, only the new handler delegating to Tika-Server
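
To make the "thin module" idea concrete, here is a rough sketch of the mapping step such a handler might perform. It is not a design for the actual module. The function names and the "content" target field are hypothetical, the "X-TIKA:content" key is an assumption about Tika's JSON output that should be checked against the Tika Server docs, and the Tika call is injected as a plain callable so the sketch runs without a server.

```python
def build_solr_doc(params, extract):
    """Turn request params plus extracted Tika output into a Solr document.

    `extract` is any callable returning a dict of extracted fields; in a real
    module it would call a user-hosted Tika Server over HTTP.
    """
    extracted = extract()
    doc = {}
    # Copy the extracted text into a hypothetical "content" field.
    if "X-TIKA:content" in extracted:
        doc["content"] = extracted["X-TIKA:content"].strip()
    # Honour literal.<field>=<value> request params, as SolrCell users expect.
    for name, value in params.items():
        if name.startswith("literal."):
            doc[name[len("literal."):]] = value
    return doc


def fake_tika():
    """Stand-in for a Tika Server response (assumed shape)."""
    return {"X-TIKA:content": "  Hello from a PDF \n", "Content-Type": "application/pdf"}


doc = build_solr_doc({"literal.id": "doc1", "commit": "true"}, fake_tika)
print(doc)
```

Keeping the extraction call behind a callable is one way to let the same handler code be tested without Tika, and swapped to another extraction backend later.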

WDYT?

Jan

Re: [DISCUSS] Future of SolrCell in Solr

Posted by Tim Allison <ta...@apache.org>.

Sounds good, Jan.  If you're heading in this direction, I'd recommend
the /tika endpoint with an Accept header set to "application/json".
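
A small sketch of what such a request could look like from Python, assuming a Tika Server listening on localhost:9998 (an assumption, adjust to your deployment). The helper name is hypothetical, and the request is only constructed here, not sent; the exact shape of the JSON response should be checked against the Tika Server documentation.

```python
import urllib.request


def build_tika_request(tika_base_url, raw_bytes):
    """Build (but do not send) a PUT to Tika Server's /tika endpoint asking for JSON."""
    return urllib.request.Request(
        f"{tika_base_url.rstrip('/')}/tika",
        data=raw_bytes,
        headers={"Accept": "application/json"},
        method="PUT",
    )


# The payload here is a placeholder, not a valid PDF.
request = build_tika_request("http://localhost:9998", b"%PDF-1.7 ...")
# urllib.request.urlopen(request) would send it to a running Tika Server.
```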

Please let me know if I can help.

Best,

      Tim


On Thu, Mar 23, 2023 at 2:43 PM Jan Høydahl <ja...@cominvent.com> wrote:
>
> Documentation-wise, we can rewrite the chapter we have on rich text indexing to mention several options, including tika-server and tika-pipes with the Solr emitter.
>
> Wrt the SolrCell successor, I still think a super-thin module forwarding to TikaServer is the best option. Users would get the same features and API as today, so those who rely on SolrCell have a simple migration path. It may also be a benefit that they get better control over their Tika Server wrt version, scaling, which parsers are included, etc. I want to do a quick POC on this to see how it flies.
>
> Jan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@solr.apache.org
For additional commands, e-mail: dev-help@solr.apache.org

Re: [DISCUSS] Future of SolrCell in Solr

Posted by Jan Høydahl <ja...@cominvent.com>.

Documentation-wise, we can rewrite the chapter we have on rich text indexing to mention several options, including tika-server and tika-pipes with the Solr emitter.

Wrt the SolrCell successor, I still think a super-thin module forwarding to TikaServer is the best option. Users would get the same features and API as today, so those who rely on SolrCell have a simple migration path. It may also be a benefit that they get better control over their Tika Server wrt version, scaling, which parsers are included, etc. I want to do a quick POC on this to see how it flies.

Jan

> 23. mar. 2023 kl. 17:14 skrev Tim Allison <ta...@apache.org>:
> 
> Apologies for being late to the show, and thank you Eric for pinging me on this.
> 
> I'm 100% for factoring out Tika from the same JVM as Solr. I see three options for removing Tika from Solr's JVM, making it easier for users and keeping Tika's jar hell all to itself.
>
> 1) As already proposed, use Tika server and somehow figure out how to integrate that seamlessly.
>
> 2) Use Tika pipes within Solr directly or within a package (as Eric suggests). This forks a process for parsing, and all the heavy dependencies go into the forked process. Solr would need tika-core, but could specify a directory with tika-app.jar in it. The dependency nightmare in tika-app.jar would not get loaded into Solr's JVM. We'd probably have to make some mods to tika-pipes for this to work roughly as Tika is being used now, but I think something like this is doable...
>
> 3) Point users to tika-pipes directly. We have a Solr emitter. Users can aim tika-pipes at a directory of files, an S3 bucket, a GCS bucket, etc., and Tika will safely parse the files in a forked process and forward the results to Solr. This is not as easy as curling bytes to Solr and having those bytes parsed, but it is possible.
> 
> Please let me know how I can help.
> 
> Best,
> 
>    Tim

Re: [DISCUSS] Future of SolrCell in Solr

Posted by Tim Allison <ta...@apache.org>.

Apologies for being late to the show, and thank you Eric for pinging me on this.

I'm 100% for factoring out Tika from the same JVM as Solr. I see three options for removing Tika from Solr's JVM, making it easier for users and keeping Tika's jar hell all to itself.

1) As already proposed, use Tika server and somehow figure out how to integrate that seamlessly.

2) Use Tika pipes within Solr directly or within a package (as Eric suggests). This forks a process for parsing, and all the heavy dependencies go into the forked process. Solr would need tika-core, but could specify a directory with tika-app.jar in it. The dependency nightmare in tika-app.jar would not get loaded into Solr's JVM. We'd probably have to make some mods to tika-pipes for this to work roughly as Tika is being used now, but I think something like this is doable...

3) Point users to tika-pipes directly. We have a Solr emitter. Users can aim tika-pipes at a directory of files, an S3 bucket, a GCS bucket, etc., and Tika will safely parse the files in a forked process and forward the results to Solr. This is not as easy as curling bytes to Solr and having those bytes parsed, but it is possible.
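
The forked-process idea in options 2 and 3 can be illustrated with a generic sketch. This is not tika-pipes itself: the "parsers" below are stand-in Python one-liners and all names are hypothetical. It only shows the property being discussed, namely that a parser that dies or hangs in a child process is reported as a failure while the parent keeps running.

```python
import subprocess
import sys


def parse_in_child(child_code, data, timeout_s=10.0):
    """Run parsing code in a separate Python process and report the outcome."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", child_code],
            input=data,
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left hanging.
        return {"ok": False, "reason": "timeout", "text": ""}
    if result.returncode != 0:
        return {"ok": False, "reason": f"exit {result.returncode}", "text": ""}
    return {"ok": True, "reason": "", "text": result.stdout.decode("utf-8")}


# Stand-in "parsers": one echoes its input upper-cased, one dies abruptly.
GOOD_PARSER = "import sys; sys.stdout.write(sys.stdin.read().upper())"
CRASHING_PARSER = "import os; os._exit(3)"

good = parse_in_child(GOOD_PARSER, b"hello")
bad = parse_in_child(CRASHING_PARSER, b"hello")
```

The parent only ever handles the child's exit status and output, which is why the heavy parsing dependencies never need to be loaded into the parent process.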

Please let me know how I can help.

Best,

    Tim

On 2023/03/10 03:57:45 Gus Heck wrote:
> I totally think that for any heavy-duty use case, or any use case
> where the documents are not constrained to a known set with polite
> characteristics (i.e. known not to be password protected, of reasonable
> length, etc.), Tika should not run inside Solr. That said, as I see it the
> key downside of not having solr-cell as part of Solr would be that we would
> likely remove the docs for it too, and the entire concept of how to get a
> "normal" document into Solr evaporates from our ref guide. So I like the
> sound of it being an official package as Eric suggests, and perhaps even
> the canonical example of how to install a package... along with heavy
> documentation caveats about why Tika should run outside of Solr for most
> production purposes, of course.
> 
> -Gus

Re: [DISCUSS] Future of SolrCell in Solr

Posted by Gus Heck <gu...@gmail.com>.

I totally think that for any heavy-duty use case, or any use case
where the documents are not constrained to a known set with polite
characteristics (i.e. known not to be password protected, of reasonable
length, etc.), Tika should not run inside Solr. That said, as I see it the
key downside of not having solr-cell as part of Solr would be that we would
likely remove the docs for it too, and the entire concept of how to get a
"normal" document into Solr evaporates from our ref guide. So I like the
sound of it being an official package as Eric suggests, and perhaps even
the canonical example of how to install a package... along with heavy
documentation caveats about why Tika should run outside of Solr for most
production purposes, of course.

-Gus


On Thu, Mar 9, 2023 at 8:09 AM Eric Pugh <ep...@opensourceconnections.com>
wrote:

> I did a series of blog posts about Tika, and while conventional wisdom is
> that running Tika in Solr is bad, I’ve had GREAT luck with it over the
> years:
> https://opensourceconnections.com/blog/2019/10/24/it-s-okay-to-run-tika-inside-of-solr-if-and-only-if/
>
> Having said that, my bigger beef with Tika in Solr is all the
> dependencies that it drags along. I am constantly looking up a package,
> wondering how we use it in Solr, just to find it’s a Tika package... So
> for that reason I think we need to do something better.
>
> I like moving SolrCell to a package
> (https://issues.apache.org/jira/browse/SOLR-15951). We have this
> powerful packaging feature, and yet we hardly dogfood it ourselves... I’d
> love to see us separate out SolrCell and make it easy to do `bin/solr
> package install solrcell` and have it work! It would both validate the
> whole Package concept and minimize the dependencies in Solr’s tarball.
>
> Secondly, for folks who really do want to run a separate Tika server, I’d
> love to make it easier to use. Tika has introduced a new “pipes” concept
> to reduce the amount of back and forth when working with Tika Server, which
> might tie nicely into the Solr update pipeline. I don’t think any real
> work has been done on this... Hoping Tim Allison weighs in on this topic ;-)
>
> Eric

-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: [DISCUSS] Future of SolrCell in Solr

Posted by Eric Pugh <ep...@opensourceconnections.com>.

I did a series of blog posts about Tika, and while conventional wisdom is that running Tika in Solr is bad, I’ve had GREAT luck with it over the years: https://opensourceconnections.com/blog/2019/10/24/it-s-okay-to-run-tika-inside-of-solr-if-and-only-if/

Having said that, my bigger beef with Tika in Solr is all the dependencies that it drags along. I am constantly looking up a package, wondering how we use it in Solr, just to find it’s a Tika package... So for that reason I think we need to do something better.

I like moving SolrCell to a package (https://issues.apache.org/jira/browse/SOLR-15951). We have this powerful packaging feature, and yet we hardly dogfood it ourselves... I’d love to see us separate out SolrCell and make it easy to do `bin/solr package install solrcell` and have it work! It would both validate the whole Package concept and minimize the dependencies in Solr’s tarball.

Secondly, for folks who really do want to run a separate Tika server, I’d love to make it easier to use. Tika has introduced a new “pipes” concept to reduce the amount of back and forth when working with Tika Server, which might tie nicely into the Solr update pipeline. I don’t think any real work has been done on this... Hoping Tim Allison weighs in on this topic ;-)

Eric

> On Mar 8, 2023, at 9:50 PM, Shawn Heisey <ap...@elyograg.org> wrote:
> 
> On 3/7/2023 3:48 PM, Jan Høydahl wrote:
>> * Move SolrCell to a package, outside of Solr's tarball SOLR-15951 <https://issues.apache.org/jira/browse/SOLR-15951>
>> * Deprecate SolrCell SOLR-13973 <https://issues.apache.org/jira/browse/SOLR-13973>
>> * Keep in Solr but use Tika-Server <https://cwiki.apache.org/confluence/display/TIKA/TikaServer>,  SOLR-7632 <https://issues.apache.org/jira/browse/SOLR-7632>
>> * Integrate Tika client-side SOLR-1526 <https://issues.apache.org/jira/browse/SOLR-1526>
> 
> As you likely know, the big problem is that Tika has a habit of crashing or misbehaving, particularly with PDFs, and if it's running inside Solr, then Solr itself is going to suffer whatever bad effects Tika causes.
> 
>> My current thinking / proposal is to:
>> * Build a new, thin Solr module that exposes a compatible /update/extract handler, delegating to Tika-Server (user-hosted)
>> * Deprecate SolrCell in current form
>> * From 10.0, Solr will not ship with embedded Tika, only the new handler delegating to Tika-Server
> 
> I was thinking something along these lines too.  A separate JVM running Tika Server that can crash without taking Solr down, and communication so ERH can send commands to it, receive extracted data, and hopefully know when the other JVM crashes.  If we design it well, then the framework could be used to integrate with other extraction mechanisms besides Tika.  I think that would be quite a bit of work.
> 
> It might be a good idea to make that a separate project as was done for DIH, but I have no way of guessing whether there is enough interest in the community to keep it maintained.  If it's a separate project, then I think it would just incorporate SolrJ and Tika, rather than using a special handler.  I have never used ERH in a production setting, and barely have experience with it in non-production.
> 
> Thanks,
> Shawn
> 
> 

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>

Re: [DISCUSS] Future of SolrCell in Solr

Posted by Shawn Heisey <ap...@elyograg.org>.

On 3/7/2023 3:48 PM, Jan Høydahl wrote:
> * Move SolrCell to a package, outside of Solr's tarball SOLR-15951 <https://issues.apache.org/jira/browse/SOLR-15951>
> * Deprecate SolrCell SOLR-13973 <https://issues.apache.org/jira/browse/SOLR-13973>
> * Keep in Solr but use Tika-Server <https://cwiki.apache.org/confluence/display/TIKA/TikaServer>,  SOLR-7632 <https://issues.apache.org/jira/browse/SOLR-7632>
> * Integrate Tika client-side SOLR-1526 <https://issues.apache.org/jira/browse/SOLR-1526>

As you likely know, the big problem is that Tika has a habit of crashing 
or misbehaving, particularly with PDFs, and if it's running inside Solr, 
then Solr itself is going to suffer whatever bad effects Tika causes.

> My current thinking / proposal is to:
> * Build a new, thin Solr module that exposes a compatible /update/extract handler, delegating to Tika-Server (user-hosted)
> * Deprecate SolrCell in current form
> * From 10.0, Solr will not ship with embedded Tika, only the new handler delegating to Tika-Server

I was thinking something along these lines too.  A separate JVM running 
Tika Server that can crash without taking Solr down, and communication 
so ERH can send commands to it, receive extracted data, and hopefully 
know when the other JVM crashes.  If we design it well, then the 
framework could be used to integrate with other extraction mechanisms 
besides Tika.  I think that would be quite a bit of work.
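
To make the idea concrete, here is a minimal sketch of such a pluggable extraction layer. It is written in Python for brevity; a real handler inside Solr would be Java. The names Extractor, TikaServerExtractor and ExtractorUnavailable are hypothetical and exist only for this illustration. The sketch assumes a Tika Server on its default port 9998 and uses its PUT /tika endpoint with "Accept: text/plain".

```python
import urllib.error
import urllib.request
from typing import Protocol


class ExtractorUnavailable(Exception):
    """Hypothetical error: the extraction backend could not be reached,
    for example because its JVM crashed or is restarting."""


class Extractor(Protocol):
    """Hypothetical interface that any extraction backend would implement."""

    def extract(self, data: bytes) -> str: ...


class TikaServerExtractor:
    """Sends raw document bytes to an external Tika Server, returns plain text."""

    def __init__(self, base_url: str = "http://localhost:9998",
                 timeout: float = 30.0):
        self.base_url = base_url.rstrip("/")
        self.timeout = timeout

    def extract(self, data: bytes) -> str:
        req = urllib.request.Request(
            self.base_url + "/tika",
            data=data,
            method="PUT",
            headers={
                "Accept": "text/plain",
                # Set explicitly; urllib would otherwise default to a
                # form-encoded content type, which is a misleading hint.
                "Content-Type": "application/octet-stream",
            },
        )
        try:
            with urllib.request.urlopen(req, timeout=self.timeout) as resp:
                return resp.read().decode("utf-8")
        except urllib.error.HTTPError as e:
            # The server is alive but could not process this document.
            raise ValueError(f"extraction failed with HTTP {e.code}") from e
        except OSError as e:
            # Connection refused, reset, or timed out: the backend is down.
            # (urllib.error.URLError is a subclass of OSError.)
            raise ExtractorUnavailable(str(e)) from e
```

The caller can tell a bad document (ValueError) apart from a dead backend (ExtractorUnavailable), and a bad PDF can at worst take down the Tika process, not Solr. Another extraction mechanism would only need to provide the same extract() method.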

It might be a good idea to make that a separate project as was done for 
DIH, but I have no way of guessing whether there is enough interest in 
the community to keep it maintained.  If it's a separate project, then I 
think it would just incorporate SolrJ and Tika, rather than using a 
special handler.  I have never used ERH in a production setting, and 
barely have experience with it in non-production.
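
For illustration, the client-side flow (extract first, then index plain text) could look roughly like the sketch below. It swaps SolrJ for plain HTTP in Python to stay short, so it is not the SolrJ approach itself. The function name build_solr_update, the field name content_txt, and the host and collection in the usage note are assumptions made for this example; the text is expected to have been extracted already, for instance by Tika.

```python
import json
import urllib.request


def build_solr_update(solr_url: str, collection: str, doc_id: str, text: str,
                      text_field: str = "content_txt") -> urllib.request.Request:
    """Build a JSON update request that indexes already-extracted text.

    text_field is a hypothetical field name; it must exist in the schema
    (or match a dynamic field) of the target collection.
    """
    doc = {"id": doc_id, text_field: text}
    # Solr's /update handler accepts a JSON array of documents.
    body = json.dumps([doc]).encode("utf-8")
    return urllib.request.Request(
        f"{solr_url.rstrip('/')}/solr/{collection}/update?commit=true",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

Sending it is then a single call such as urllib.request.urlopen(build_solr_update("http://localhost:8983", "docs", "report-1", extracted_text)), assuming a collection named docs exists. Solr only ever sees plain text, so no parsing happens in its JVM.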

Thanks,
Shawn
