Posted to users@solr.apache.org by "Markwood, Crag" <cm...@epizyme.com> on 2022/08/22 17:27:46 UTC

getting started?

Hello,

I've run a couple of the demos and am ready to try Solr on some of my own documents. Is there a 'tips/tricks' document for this, or should I just use schemaless mode and point Solr at a subset (10%?) of my repository (~100k Microsoft/text/pdf files)?

Thank you in advance!
Crag

Crag Markwood
Senior Director, Research Informatics
Epizyme, an Ipsen Company

Re: getting started?

Posted by Gus Heck <gu...@gmail.com>.

If you're moving towards mocking up a production system, I'd move away from
schemaless mode. It allows the number of fields to explode if you get bad or
unexpected data, and it is prone to difficult-to-fix errors where it
misidentifies numbers and strings. In particular, if a string field happens
to first appear with only digits, you can be in trouble. "Schemaless" is a
misleading name; "Schema Guessing" is more accurate, since there is in fact
a schema, and it will subsequently reject strings sent to a field it
initially guessed to be numeric.
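
To make the alternative concrete, here is a minimal sketch of declaring
fields explicitly through the Schema API instead of letting Solr guess them.
The field names (title, body, report_number), the base URL and the
collection name are assumptions chosen for illustration; "string" and
"text_general" are field types from the _default configset, and the
update.autoCreateFields user property is the switch the Reference Guide
describes for turning field guessing off in that configset.

```python
import json
import urllib.request


def add_field_command(name, field_type, stored=True, indexed=True, multi_valued=False):
    """Build one Schema API 'add-field' command as a plain dict."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "stored": stored,
            "indexed": indexed,
            "multiValued": multi_valued,
        }
    }


def disable_field_guessing_command():
    """Build the Config API command that turns off automatic field creation."""
    return {"set-user-property": {"update.autoCreateFields": "false"}}


def post_json(url, payload):
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def apply_schema(base_url, collection, commands):
    """Turn off guessing, then send each add-field command in its own request."""
    post_json(f"{base_url}/{collection}/config", disable_field_guessing_command())
    for command in commands:
        post_json(f"{base_url}/{collection}/schema", command)


# Hypothetical fields. report_number is declared as a string on purpose:
# values such as "12345" would otherwise be guessed to be numeric.
EXAMPLE_FIELDS = [
    add_field_command("title", "text_general"),
    add_field_command("body", "text_general"),
    add_field_command("report_number", "string"),
]
```

Calling apply_schema("http://localhost:8983/solr", "mycollection",
EXAMPLE_FIELDS) against a running Solr would apply these; nothing is sent
until that call is made.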

Also, it sounds like you might be using the extracting request handler
(Solr Cell), so be sure to read this section:
https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-with-tika.html#solr-cell-performance-implications
Note especially the warning (which should probably be a larger warning box):
>
> For these reasons, Solr Cell is not recommended for use in a production
> system.
>
> It is a best practice to use Solr Cell as a proof-of-concept tool during
> development and then run Tika as an external process that sends the
> extracted documents to Solr (via SolrJ
> <https://solr.apache.org/guide/solr/latest/deployment-guide/solrj.html>)
> for indexing. This way, any extraction failures that occur are isolated
> from Solr itself and can be handled gracefully.
>
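
The shape of that recommendation can be sketched roughly as follows. This is
not SolrJ: it is a plain-HTTP Python stand-in for the same idea, and
extract_text is a placeholder for however you invoke Tika outside of Solr.
The "id" and "body" field names and the update URL are assumptions. The
point is only that an extraction failure is caught and recorded by the
external process and never reaches Solr.

```python
import json
import urllib.request


def build_documents(paths, extract_text):
    """Extract text for each path, keeping failures separate from successes.

    extract_text is any callable taking a path and returning a string;
    it stands in for an external Tika invocation.
    """
    docs, failures = [], []
    for path in paths:
        try:
            text = extract_text(path)
        except Exception as exc:  # a bad file must not stop the whole run
            failures.append((str(path), repr(exc)))
            continue
        docs.append({"id": str(path), "body": text})
    return docs, failures


def send_to_solr(update_url, docs):
    """POST a JSON array of documents to a Solr update endpoint."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

With a running Solr, something like
send_to_solr("http://localhost:8983/solr/mycollection/update?commit=true", docs)
would index the successfully extracted documents, while the failures list
can be logged and retried separately.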

Starting with a subset of your data is, however, an excellent idea. It's
normal early on to index things, see if you like the result, tweak the
schema or Tika settings, and try again. Keeping the initial set small but
representative makes iteration easier.
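
One possible way to pick a "smaller but representative" set is to sample the
same fraction from each file type, so that rare formats are not lost. The
sketch below assumes file extension is a good enough proxy for document
type; the function name, the 10% default and the fixed seed are
illustrative choices, not anything Solr prescribes.

```python
import random
from collections import defaultdict
from pathlib import Path


def sample_by_extension(paths, fraction=0.10, seed=42):
    """Return roughly `fraction` of the files from each extension group."""
    groups = defaultdict(list)
    for path in paths:
        groups[Path(path).suffix.lower()].append(path)

    rng = random.Random(seed)  # fixed seed so reruns index the same subset
    sample = []
    for extension in sorted(groups):
        files = sorted(groups[extension], key=str)
        # Keep at least one file per extension, even for tiny groups.
        count = max(1, round(len(files) * fraction))
        sample.extend(rng.sample(files, count))
    return sample
```

Reusing the same seed between runs keeps the subset stable, which makes it
easier to compare results after a schema or Tika change.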

Best,
Gus

On Tue, Aug 23, 2022 at 7:45 AM Mikhail Khludnev <mk...@apache.org> wrote:

> Hello, Crag.
> It's probably something like
> https://solr.apache.org/guide/solr/latest/getting-started/tutorial-diy.html
>
> On Tue, Aug 23, 2022 at 10:57 AM Markwood, Crag <cm...@epizyme.com>
> wrote:
>
> > Hello,
> >
> > I've run a couple of the demos and am ready to try Solr on some of my own
> > documents. Is there a 'tips/tricks' document for this, or should I just use
> > schemaless mode and point Solr at a subset (10%?) of my repository (~100k
> > Microsoft/text/pdf files)?
> >
> > Thank you in advance!
> > Crag
> >
> > Crag Markwood
> > Senior Director, Research Informatics
> > Epizyme, an Ipsen Company
> >
> >
>
> --
> Sincerely yours
> Mikhail Khludnev
>

-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: getting started?

Posted by Mikhail Khludnev <mk...@apache.org>.

Hello, Crag.
It's probably something like
https://solr.apache.org/guide/solr/latest/getting-started/tutorial-diy.html

On Tue, Aug 23, 2022 at 10:57 AM Markwood, Crag <cm...@epizyme.com>
wrote:

> Hello,
>
> I've run a couple of the demos and am ready to try Solr on some of my own
> documents. Is there a 'tips/tricks' document for this, or should I just use
> schemaless mode and point Solr at a subset (10%?) of my repository (~100k
> Microsoft/text/pdf files)?
>
> Thank you in advance!
> Crag
>
> Crag Markwood
> Senior Director, Research Informatics
> Epizyme, an Ipsen Company
>
>

-- 
Sincerely yours
Mikhail Khludnev