Posted to solr-user@lucene.apache.org by Syao Work <sy...@gmail.com> on 2013/03/05 10:38:39 UTC

Indexing directories and files in a File System. (Fetched: 2, Processed: 0)

Hello,

I am trying to index a file system folder tree.
I have spent two days trying to find the problem and got nowhere :) There are
not many examples of indexing a file system.
I can't find any exceptions in the logs that explain why nothing is processed.
The data import configuration and the debug response are attached.


Using:
1. Solr web admin tool,
2. Java version "1.7.0_09-icedtea"
   OpenJDK Runtime Environment (fedora-2.3.7.0.fc17-x86_64)
   OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)

Thank you for your time,
Ro

P.S. Excuse my bad English, I am not a native English speaker.

Re: Indexing directories and files in a File System. (Fetched: 2, Processed: 0)

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi Syao,

You should just write a simple (Java) app that traverses the dir tree, gets
info about each file, uses it to construct Solr doc objects
(SolrInputDocuments if you are working in Java with SolrJ) and sends them
to Solr for indexing.  Should be about 30 minutes of work or less.
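
A minimal sketch of the traversal step described above, using only the
Java 7 standard library. The class name FsTreeDocs and the field names
(id, name, type, size) are assumptions made for illustration; they are not
part of Solr and would have to match the fields in your schema.xml. Each
map could then be copied into a SolrInputDocument with addField(...) and
sent through SolrJ; that step is left out here so the sketch runs without
the SolrJ jar.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Solr): one field map per directory or file.
final class FsTreeDocs {

    // Field names are assumptions; adjust them to the fields in schema.xml.
    static Map<String, Object> describe(Path path, BasicFileAttributes attrs) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", path.toAbsolutePath().toString());
        Path name = path.getFileName(); // null for a file system root
        doc.put("name", name == null ? path.toString() : name.toString());
        // Anything that is not a directory is labelled "file" in this sketch.
        doc.put("type", attrs.isDirectory() ? "directory" : "file");
        if (attrs.isRegularFile()) {
            doc.put("size", attrs.size()); // size in bytes
        }
        return doc;
    }

    // Walks the tree under root and returns entries in visit order
    // (each directory before its contents).
    static List<Map<String, Object>> collect(Path root) throws IOException {
        final List<Map<String, Object>> docs = new ArrayList<>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
                docs.add(describe(dir, attrs));
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                docs.add(describe(file, attrs));
                return FileVisitResult.CONTINUE;
            }
        });
        return docs;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Map<String, Object> doc : collect(root)) {
            // With SolrJ, each entry here would become a field of a SolrInputDocument.
            System.out.println(doc);
        }
    }
}
```

Only metadata is read here; the file contents are never opened, which fits
the case where the tree itself is what needs to be searchable.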

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Wed, Mar 6, 2013 at 3:37 AM, Syao Work <sy...@gmail.com> wrote:

> So you are suggesting me to iterate file system and index fs tree entities
> including: directory names, file names, file size etc. and then post it to
> solr?
> I need to index the FS tree, not the file contents.

Re: Indexing directories and files in a File System. (Fetched: 2, Processed: 0)

Posted by Syao Work <sy...@gmail.com>.
So you are suggesting that I iterate over the file system, collect the tree
entities (directory names, file names, file sizes, etc.) and then post them
to Solr?
I need to index the FS tree, not the file contents.

On Tue, Mar 5, 2013 at 5:54 PM, Erik Hatcher <er...@gmail.com> wrote:

> Would Solr's post.jar work for you?   It has a directory recurse option.
>  The usage/help output is pasted below.
>
> Here's what should work for you: "java -Dauto -Drecursive -jar post.jar
> /some/folder"
>
>         Erik

Re: Indexing directories and files in a File System. (Fetched: 2, Processed: 0)

Posted by Erik Hatcher <er...@gmail.com>.
Would Solr's post.jar work for you? It has a directory recurse option. The usage/help output is pasted below.

Here's what should work for you: "java -Dauto -Drecursive -jar post.jar /some/folder"

	Erik



exampledocs  java -jar post.jar --help
SimplePostTool version 1.5
Usage: java [SystemProperties] -jar post.jar [-h|-] [<file|folder|url|arg> [<file|folder|url|arg>...]]

Supported System Properties and their defaults:
  -Ddata=files|web|args|stdin (default=files)
  -Dtype=<content-type> (default=application/xml)
  -Durl=<solr-update-url> (default=http://localhost:8983/solr/update)
  -Dauto=yes|no (default=no)
  -Drecursive=yes|no|<depth> (default=0)
  -Ddelay=<seconds> (default=0 for files, 10 for web)
  -Dfiletypes=<type>[,<type>,...] (default=xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log)
  -Dparams="<key>=<value>[&<key>=<value>...]" (values must be URL-encoded)
  -Dcommit=yes|no (default=yes)
  -Doptimize=yes|no (default=no)
  -Dout=yes|no (default=no)

This is a simple command line tool for POSTing raw data to a Solr
port.  Data can be read from files specified as commandline args,
URLs specified as args, as raw commandline arg strings or via STDIN.
Examples:
  java -jar post.jar *.xml
  java -Ddata=args  -jar post.jar '<delete><id>42</id></delete>'
  java -Ddata=stdin -jar post.jar < hd.xml
  java -Ddata=web -jar post.jar http://example.com/
  java -Dtype=text/csv -jar post.jar *.csv
  java -Dtype=application/json -jar post.jar *.json
  java -Durl=http://localhost:8983/solr/update/extract -Dparams=literal.id=a -Dtype=application/pdf -jar post.jar a.pdf
  java -Dauto -jar post.jar *
  java -Dauto -Drecursive -jar post.jar afolder
  java -Dauto -Dfiletypes=ppt,html -jar post.jar afolder
The options controlled by System Properties include the Solr
URL to POST to, the Content-Type of the data, whether a commit
or optimize should be executed, and whether the response should
be written to STDOUT. If auto=yes the tool will try to set type
and url automatically from file name. When posting rich documents
the file name will be propagated as "resource.name" and also used
as "literal.id". You may override these or any other request parameter
through the -Dparams property. To do a commit only, use "-" as argument.
The web mode is a simple crawler following links within domain, default delay=10s.
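
As far as the help text above indicates, post.jar sends the files
themselves. If only metadata about each file (name, path, size) should be
indexed, one possible approach is to generate a Solr XML update message
and post that instead, in the same way as the *.xml example above. The
sketch below assumes the metadata is already available as field maps; the
class name SolrXmlWriter and the field names are hypothetical, and the
escaping does not deal with characters that are illegal in XML 1.0.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Solr): renders field maps as a Solr
// XML update message of the form <add><doc><field name="...">...</field></doc></add>.
final class SolrXmlWriter {

    // Escapes the characters that are significant in XML text and attributes.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    static String toAddXml(List<Map<String, Object>> docs) {
        StringBuilder sb = new StringBuilder("<add>\n");
        for (Map<String, Object> doc : docs) {
            sb.append("  <doc>\n");
            for (Map.Entry<String, Object> field : doc.entrySet()) {
                sb.append("    <field name=\"")
                  .append(escape(field.getKey()))
                  .append("\">")
                  .append(escape(String.valueOf(field.getValue())))
                  .append("</field>\n");
            }
            sb.append("  </doc>\n");
        }
        return sb.append("</add>\n").toString();
    }

    public static void main(String[] args) {
        // Sample entry; the field names must exist in schema.xml.
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", "/some/folder/report.pdf");
        doc.put("name", "report.pdf");
        doc.put("size", 12345L);
        List<Map<String, Object>> docs = new ArrayList<>();
        docs.add(doc);
        System.out.print(toAddXml(docs));
    }
}
```

The output can be redirected to a file (the name tree.xml is only an
example) and sent with "java -jar post.jar tree.xml", which uses the
default application/xml content type listed above.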


On Mar 5, 2013, at 04:38 , Syao Work wrote:

> Hello,
> 
> I am trying to index some FS folder tree. 
> Spent 2 days finding what could be the problem - got nothing :) There are not so much examples on indexing File System.
> In the logs I cant find any exceptions why it does not process the info
> Data import configuration and debug response are attached 
> 
> 
> Using: 
> 1. solr web admin tool, 
> 2. Java version "1.7.0_09-icedtea"
>    OpenJDK Runtime Environment (fedora-2.3.7.0.fc17-x86_64) 
>    OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)
> 
> Thank you for your time,
> Ro
> 
> P.S. Excuse my bad English, I am not a native English speaker.
> <data-config.xml><import-debug-response.json>


Re: Indexing directories and files in a File System. (Fetched: 2, Processed: 0)

Posted by Syao Work <sy...@gmail.com>.
Can you send an example?

On Tue, Mar 5, 2013 at 5:11 PM, Gora Mohanty <go...@mimirtech.com> wrote:

> On 5 March 2013 18:22, Syao Work <sy...@gmail.com> wrote:
> > And if I need to index file name, path, size and/or mime?
> [...]
>
> You would need to create separate entities for each field that
> you need to index. The referenced Wiki page on DIH has
> other examples of configurations with multiple entities.
>
> Regards,
> Gora
>

Re: Indexing directories and files in a File System. (Fetched: 2, Processed: 0)

Posted by Gora Mohanty <go...@mimirtech.com>.
On 5 March 2013 18:22, Syao Work <sy...@gmail.com> wrote:
> And if I need to index file name, path, size and/or mime?
[...]

You would need to create separate entities for each field that
you need to index. The referenced Wiki page on DIH has
other examples of configurations with multiple entities.

Regards,
Gora

Re: Indexing directories and files in a File System. (Fetched: 2, Processed: 0)

Posted by Syao Work <sy...@gmail.com>.
And what if I need to index the file name, path, size and/or MIME type?

On Tue, Mar 5, 2013 at 2:45 PM, Gora Mohanty <go...@mimirtech.com> wrote:

> On 5 March 2013 15:08, Syao Work <sy...@gmail.com> wrote:
> > Hello,
> >
> > I am trying to index some FS folder tree.
> > Spent 2 days finding what could be the problem - got nothing :) There are
> > not so much examples on indexing File System.
> > In the logs I cant find any exceptions why it does not process the info
> > Data import configuration and debug response are attached
> [...]
>
> Please look more closely at the sample data configuration file at
> http://wiki.apache.org/solr/DataImportHandler#FileListEntityProcessor
> You need to use something like XPathEntityProcessor to define
> entities for indexing. Other entity processors, such as
> PlainTextEntityProcessor,
> can instead be used if you are not using XML files. Also, make sure
> that the field definitions in your schema.xml match the field names
> here.
>
> Regards,
> Gora
>

Re: Indexing directories and files in a File System. (Fetched: 2, Processed: 0)

Posted by Gora Mohanty <go...@mimirtech.com>.
On 5 March 2013 15:08, Syao Work <sy...@gmail.com> wrote:
> Hello,
>
> I am trying to index some FS folder tree.
> Spent 2 days finding what could be the problem - got nothing :) There are
> not so much examples on indexing File System.
> In the logs I cant find any exceptions why it does not process the info
> Data import configuration and debug response are attached
[...]

Please look more closely at the sample data configuration file at
http://wiki.apache.org/solr/DataImportHandler#FileListEntityProcessor
You need to use something like XPathEntityProcessor to define
entities for indexing. Other entity processors, such as
PlainTextEntityProcessor, can instead be used if you are not using
XML files. Also, make sure that the field definitions in your
schema.xml match the field names here.

Regards,
Gora