Posted to solr-user@lucene.apache.org by Terry Steichen <te...@net-frame.com> on 2018/03/29 21:59:13 UTC

Three Indexing Questions

First question: When indexing content in a directory, Solr's normal
behavior is to recursively index all the files found in that directory
and its subdirectories.  However, it turns out that when the files are
of the form *.eml (email), Solr won't do that.  I can use a wildcard to
get it to index the current directory, but it won't recurse.

I note this message that's displayed when I begin indexing: "Entering
auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log"

Is there a way to get it to recurse through files with different
extensions, such as .eml?  When I manually add all the
subdirectory content, Solr seems to parse the content very well,
recognizing all the standard email metadata.  I just can't get it to do
the indexing recursively.

Second question: if I want to index files from many different source
directories, is there a way to specify these different sources in one
command? (Right now I have to issue a separate indexing command for each
directory - which means I have to sit around and wait till each is
finished.)

Third question: I have a very large directory structure that includes a
couple of subdirectories I'd like to exclude from indexing.  Is there a
way to index recursively, but exclude specified directories?


Re: Three Indexing Questions

Posted by Shawn Heisey <ap...@elyograg.org>.
On 3/29/2018 3:59 PM, Terry Steichen wrote:
> First question: When indexing content in a directory, Solr's normal
> behavior is to recursively index all the files found in that directory
> and its subdirectories.  However, turns out that when the files are of
> the form *.eml (email), solr won't do that.  I can use a wildcard to get
> it to index the current directory, but it won't recurse.

At first I had no idea what program you were using, but I think I've
figured it out; see below.

> I note this message that's displayed when I begin indexing: "Entering
> auto mode. File endings considered are
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log

That looks like the simple post tool included with Solr.  If it is, type
"bin/post -help" and you will see that there is a -filetypes option that
lets you change the list of extensions that are considered valid.
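For instance, an invocation that picks up .eml files might look like the
following (the collection name "emails" and the path are placeholders,
and you should confirm the exact flags with "bin/post -help" on your
version):

```shell
# Index .eml files, overriding the default extension list with -filetypes.
# -recursive takes a depth; "emails" is a placeholder collection name.
bin/post -c emails -filetypes eml -recursive 10 /path/to/mail/archive
```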

Note that the post tool included with Solr is a SIMPLE post tool.  It's
designed as a way to get your feet wet, not for heavy production usage. 
It does not have extensive capability.  We strongly recommend that you
graduate to a better indexing program.  Usually that means that you're
going to have to write one yourself, to be sure that it does everything
YOU want it to do.  The one included with Solr probably can't do some of
the things that you want it to do.

Also, indexing files using the post tool is going to run Tika extraction
inside Solr.  Tika is a separate Apache project.  Solr happens to
include a subset of Tika's capability that can run inside Solr.  That
program is known to sometimes behave explosively when it processes
documents.  If an explosion happens in Tika and it's running inside
Solr, then Solr itself might crash.  Running Tika outside Solr, usually
in a program that you write yourself, is highly recommended.  Doing this
will also give you access to the full range of Tika's capabilities.
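As a quick way to see what standalone Tika extracts from one of your
emails, you can run the tika-app command-line jar directly (the jar
filename here is a placeholder; download the current version from the
Tika project):

```shell
# Run Tika outside Solr on a single email to inspect what it extracts.
# tika-app-X.Y.Z.jar is a placeholder for the version you download.
java -jar tika-app-X.Y.Z.jar --text message.eml      # extracted body text
java -jar tika-app-X.Y.Z.jar --metadata message.eml  # email headers/metadata
```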

Here's an example of a program that uses both JDBC and Tika to index to
Solr:

https://lucidworks.com/2012/02/14/indexing-with-solrj/

If you search Google for "tika index solr" (without the quotes), you'll
find some other examples of custom programs that use Tika to index to
Solr.  There may be better searches you can do on Google as well.

Thanks,
Shawn


Re: Three Indexing Questions

Posted by Erik Hatcher <er...@gmail.com>.
Terry -

You’re speaking of bin/post, looks like.   bin/post is _just_ a simple tool to provide some basic utility.   The fact that it can recurse a directory structure at all is an extra bonus that really isn’t about “Solr” per se, but about posting content into it.   

Frankly (even as the author of bin/post), I don’t think bin/post is the right way to go for file system crawling.   Having Solr parse content itself (which is what happens when bin/post sends files into Solr’s /update/extract handler) is not recommended for production/scale.

All caveats aside, and recommendations to upsize your file crawler noted…. it’s just a bin/post shell script and a Java class called SimplePostTool.   I’d encourage you to adapt what it does to your requirements, so that it will send over .eml files, which apparently works when you add them manually (how did you test that?  curious on the details), and so that it handles multiple directories.   It wasn’t designed for robust file crawls, but it’s certainly there for the taking to adjust to your needs if it’s close enough.   And of course, if you want to generalize the handling and submit that back, then bin/post can improve!

In short: no, bin/post can’t do the things you’re asking of it, but there’s no reason it couldn’t be evolved to handle those things.
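One sketch of the kind of adaptation described above, done outside of
bin/post with standard tools: a single find pass over several source
roots, pruning an excluded subtree, feeding everything to one bin/post
run.  The paths, the collection name "docs", and the excluded directory
name "archive" are all placeholders for your own layout.

```shell
# Walk two source trees in one pass, skip any directory named "archive",
# and hand all matching .eml files to a single bin/post invocation.
find /data/mail /data/legal \
     -type d -name archive -prune -o \
     -type f -name '*.eml' -print0 |
  xargs -0 bin/post -c docs -filetypes eml
```

Listing multiple roots in one find command addresses the
multiple-directories question; the -prune branch addresses the exclusion
question.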

	Erik


> 
> I note this message that's displayed when I begin indexing: "Entering
> auto mode. File endings considered are
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> 
> Is there a way to get it to recurse through files with different
> extensions, for example, like .eml?  When I manually add all the
> subdirectory content, solr seems to parse the content very well,
> recognizing all the standard email metadata.  I just can't get it to do
> the indexing recursively.
> 
> Second question: if I want to index files from many different source
> directories, is there a way to specify these different sources in one
> command? (Right now I have to issue a separate indexing command for each
> directory - which means I have to sit around and wait till each is
> finished.)
> 
> Third question: I have a very large directory structure that includes a
> couple of subdirectories I'd like to exclude from indexing.  Is there a
> way to index recursively, but exclude specified directories?
>