Posted to solr-user@lucene.apache.org by Zheng Lin Edwin Yeo <ed...@gmail.com> on 2015/10/16 05:22:46 UTC

Recursively scan documents for indexing in a folder in SolrJ

Hi,

I understand that in SimplePostTool (post.jar), there is a command to
automatically detect content types in a folder and recursively scan it for
documents to index into a collection:
bin/post -c gettingstarted afolder/

This has been useful for me to do mass indexing of all the files that are
in the folder. Now that I'm moving to production, I plan to use SolrJ to
do the indexing, as it can do more things like robustness checks and retries
for indexing that fails.

However, I can't seem to find a way to do the same in SolrJ. Is it
possible to do this in SolrJ? I'm using Solr 5.3.0.

Thank you.

Regards,
Edwin

Re: Recursively scan documents for indexing in a folder in SolrJ

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Yes, I've managed to "steal" some code from post.jar so that only
rich-text document formats are sent to /update/extract.
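
A minimal sketch of that kind of call, assuming a local Solr instance and the
"gettingstarted" collection (the file path and the literal.id value are just
placeholders), looks roughly like this:

import java.io.File;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractPostSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and collection name - adjust to your own setup.
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/gettingstarted");

        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        // Content type is set explicitly here; post.jar guesses it from the file name.
        req.addFile(new File("afolder/report.pdf"), "application/pdf");
        req.setParam("literal.id", "report.pdf"); // unique key for the extracted document
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        client.request(req);
        client.close();
    }
}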

I've also changed the setting in Eclipse at Window -> Preferences ->
General -> Workspace. Under Text file encoding, select Other and choose
UTF-8. Eclipse is now able to read the Chinese characters successfully.

Thank you for your help.

Regards,
Edwin



On 19 October 2015 at 16:33, Duck Geraint (ext) GBJH <
geraint.duck@syngenta.com> wrote:

> "The problem for this is that it is indexing all the files regardless of
> the formats, instead of just those formats in post.jar. So I guess still
> have to "steal" some codes from there to detect the file format?"
>
> If you've not worked it out yourself yet, try something like:
>
> http://docs.oracle.com/javase/7/docs/api/java/io/File.html#listFiles(java.io.FilenameFilter)
>
> http://stackoverflow.com/questions/5751335/using-file-listfiles-with-filenameextensionfilter
>
> Geraint
>
> Geraint Duck
> Data Scientist
> Toxicology and Health Sciences
> Syngenta UK
> Email: geraint.duck@syngenta.com
>
> -----Original Message-----
> From: Zheng Lin Edwin Yeo [mailto:edwinyeozl@gmail.com]
> Sent: 17 October 2015 00:55
> To: solr-user@lucene.apache.org
> Subject: Re: Recursively scan documents for indexing in a folder in SolrJ
>
> Thanks for your advice. I also found this method which so far has been
> able to traverse all the documents in the folder and index them in Solr.
>
> public static void showFiles(File[] files) {
>     for (File file : files) {
>         if (file.isDirectory()) {
>             System.out.println("Directory: " + file.getName());
>             showFiles(file.listFiles()); // Calls same method again.
>         } else {
>             System.out.println("File: " + file.getName());
>         }
>     }
> }
>
> The problem with this is that it indexes all the files regardless of
> format, instead of just the formats in post.jar. So I guess I still have
> to "steal" some code from there to detect the file format?
>
> As for files that contain non-English characters (e.g. Chinese
> characters), it is currently not able to read them, and they are all read
> as a series of "???". Any idea how to solve this problem?
>
> Thank you.
>
> Regards,
> Edwin
>
>
> On 16 October 2015 at 21:16, Duck Geraint (ext) GBJH <
> geraint.duck@syngenta.com> wrote:
>
> > Also, check this link for SolrJ example code (including the recursion):
> > https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
> >
> > Geraint
> >
> >
> > Geraint Duck
> > Data Scientist
> > Toxicology and Health Sciences
> > Syngenta UK
> > Email: geraint.duck@syngenta.com
> >
> > -----Original Message-----
> > From: Jan Høydahl [mailto:jan.asf@cominvent.com]
> > Sent: 16 October 2015 12:14
> > To: solr-user@lucene.apache.org
> > Subject: Re: Recursively scan documents for indexing in a folder in
> > SolrJ
> >
> > SolrJ does not have any file crawler built in.
> > But you are free to steal code from SimplePostTool.java related to
> > directory traversal, and then index each document found using SolrJ.
> >
> > Note that SimplePostTool.java tries to be smart about which endpoint to
> > post files to: XML, CSV and JSON content will be posted to /update,
> > while office docs go to /update/extract.
> >
> > --
> > Jan Høydahl, search solution architect Cominvent AS -
> > www.cominvent.com
> >
> > > 16. okt. 2015 kl. 05.22 skrev Zheng Lin Edwin Yeo
> > ><edwinyeozl@gmail.com
> > >:
> > >
> > > Hi,
> > >
> > > I understand that in SimplePostTool (post.jar), there is this
> > > command to automatically detect content types in a folder, and
> > > recursively scan it for documents for indexing into a collection:
> > > bin/post -c gettingstarted afolder/
> > >
> > > This has been useful for me to do mass indexing of all the files
> > > that are in the folder. Now that I'm moving to production, I plan
> > > to use SolrJ to do the indexing, as it can do more things like
> > > robustness checks and retries for indexing that fails.
> > >
> > > However, I can't seem to find a way to do the same in SolrJ. Is it
> > > possible to do this in SolrJ? I'm using Solr 5.3.0.
> > >
> > > Thank you.
> > >
> > > Regards,
> > > Edwin
> >
> >

RE: Recursively scan documents for indexing in a folder in SolrJ

Posted by "Duck Geraint (ext) GBJH" <ge...@syngenta.com>.
"The problem for this is that it is indexing all the files regardless of the formats, instead of just those formats in post.jar. So I guess still have to "steal" some codes from there to detect the file format?"

If you've not worked it out yourself yet, try something like:
http://docs.oracle.com/javase/7/docs/api/java/io/File.html#listFiles(java.io.FilenameFilter)
http://stackoverflow.com/questions/5751335/using-file-listfiles-with-filenameextensionfilter
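
Something along those lines, as a rough sketch (the extension list below is illustrative, not the exact set post.jar handles):

import java.io.File;
import java.io.FilenameFilter;
import java.util.Arrays;
import java.util.List;

public class ExtensionFilterSketch {
    // Illustrative list of formats to accept - adjust to whatever you want to index.
    private static final List<String> EXTENSIONS =
            Arrays.asList(".pdf", ".doc", ".docx", ".xml", ".csv", ".json");

    public static void main(String[] args) {
        File dir = new File("afolder");
        // listFiles(FilenameFilter) only returns the entries the filter accepts.
        File[] matches = dir.listFiles(new FilenameFilter() {
            @Override
            public boolean accept(File parent, String name) {
                String lower = name.toLowerCase();
                for (String ext : EXTENSIONS) {
                    if (lower.endsWith(ext)) {
                        return true;
                    }
                }
                return false;
            }
        });

        if (matches != null) { // listFiles returns null if the directory can't be read
            for (File f : matches) {
                System.out.println("Matched: " + f.getName());
            }
        }
    }
}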

Geraint

Geraint Duck
Data Scientist
Toxicology and Health Sciences
Syngenta UK
Email: geraint.duck@syngenta.com

-----Original Message-----
From: Zheng Lin Edwin Yeo [mailto:edwinyeozl@gmail.com]
Sent: 17 October 2015 00:55
To: solr-user@lucene.apache.org
Subject: Re: Recursively scan documents for indexing in a folder in SolrJ

Thanks for your advice. I also found this method, which so far has been able to traverse all the documents in the folder and index them in Solr.

public static void showFiles(File[] files) {
    for (File file : files) {
        if (file.isDirectory()) {
            System.out.println("Directory: " + file.getName());
            showFiles(file.listFiles()); // Calls same method again.
        } else {
            System.out.println("File: " + file.getName());
        }
    }
}

The problem with this is that it indexes all the files regardless of format, instead of just the formats in post.jar. So I guess I still have to "steal" some code from there to detect the file format?

As for files that contain non-English characters (e.g. Chinese characters), it is currently not able to read them, and they are all read as a series of "???". Any idea how to solve this problem?

Thank you.

Regards,
Edwin


On 16 October 2015 at 21:16, Duck Geraint (ext) GBJH < geraint.duck@syngenta.com> wrote:

> Also, check this link for SolrJ example code (including the recursion):
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>
> Geraint
>
>
> Geraint Duck
> Data Scientist
> Toxicology and Health Sciences
> Syngenta UK
> Email: geraint.duck@syngenta.com
>
> -----Original Message-----
> From: Jan Høydahl [mailto:jan.asf@cominvent.com]
> Sent: 16 October 2015 12:14
> To: solr-user@lucene.apache.org
> Subject: Re: Recursively scan documents for indexing in a folder in
> SolrJ
>
> SolrJ does not have any file crawler built in.
> But you are free to steal code from SimplePostTool.java related to
> directory traversal, and then index each document found using SolrJ.
>
> Note that SimplePostTool.java tries to be smart about which endpoint to
> post files to: XML, CSV and JSON content will be posted to /update,
> while office docs go to /update/extract.
>
> --
> Jan Høydahl, search solution architect Cominvent AS -
> www.cominvent.com
>
> > 16. okt. 2015 kl. 05.22 skrev Zheng Lin Edwin Yeo
> ><edwinyeozl@gmail.com
> >:
> >
> > Hi,
> >
> > I understand that in SimplePostTool (post.jar), there is this
> > command to automatically detect content types in a folder, and
> > recursively scan it for documents for indexing into a collection:
> > bin/post -c gettingstarted afolder/
> >
> > This has been useful for me to do mass indexing of all the files
> > that are in the folder. Now that I'm moving to production, I plan
> > to use SolrJ to do the indexing, as it can do more things like
> > robustness checks and retries for indexing that fails.
> >
> > However, I can't seem to find a way to do the same in SolrJ. Is it
> > possible to do this in SolrJ? I'm using Solr 5.3.0.
> >
> > Thank you.
> >
> > Regards,
> > Edwin
>
>

Re: Recursively scan documents for indexing in a folder in SolrJ

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Thanks for your advice. I also found this method, which so far has been able
to traverse all the documents in the folder and index them in Solr.

public static void showFiles(File[] files) {
    for (File file : files) {
        if (file.isDirectory()) {
            System.out.println("Directory: " + file.getName());
            showFiles(file.listFiles()); // Calls same method again.
        } else {
            System.out.println("File: " + file.getName());
        }
    }
}

The problem with this is that it indexes all the files regardless of
format, instead of just the formats in post.jar. So I guess I still have
to "steal" some code from there to detect the file format?

As for files that contain non-English characters (e.g. Chinese characters),
it is currently not able to read them, and they are all read as a series
of "???". Any idea how to solve this problem?

Thank you.

Regards,
Edwin


On 16 October 2015 at 21:16, Duck Geraint (ext) GBJH <
geraint.duck@syngenta.com> wrote:

> Also, check this link for SolrJ example code (including the recursion):
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>
> Geraint
>
>
> Geraint Duck
> Data Scientist
> Toxicology and Health Sciences
> Syngenta UK
> Email: geraint.duck@syngenta.com
>
> -----Original Message-----
> From: Jan Høydahl [mailto:jan.asf@cominvent.com]
> Sent: 16 October 2015 12:14
> To: solr-user@lucene.apache.org
> Subject: Re: Recursively scan documents for indexing in a folder in SolrJ
>
> SolrJ does not have any file crawler built in.
> But you are free to steal code from SimplePostTool.java related to
> directory traversal, and then index each document found using SolrJ.
>
> Note that SimplePostTool.java tries to be smart about which endpoint to post
> files to: XML, CSV and JSON content will be posted to /update, while office
> docs go to /update/extract.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 16. okt. 2015 kl. 05.22 skrev Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
> >:
> >
> > Hi,
> >
> > I understand that in SimplePostTool (post.jar), there is this command
> > to automatically detect content types in a folder, and recursively
> > scan it for documents for indexing into a collection:
> > bin/post -c gettingstarted afolder/
> >
> > This has been useful for me to do mass indexing of all the files that
> > are in the folder. Now that I'm moving to production, I plan to use
> > SolrJ to do the indexing, as it can do more things like robustness
> > checks and retries for indexing that fails.
> >
> > However, I can't seem to find a way to do the same in SolrJ. Is it
> > possible to do this in SolrJ? I'm using Solr 5.3.0.
> >
> > Thank you.
> >
> > Regards,
> > Edwin
>
>

RE: Recursively scan documents for indexing in a folder in SolrJ

Posted by "Duck Geraint (ext) GBJH" <ge...@syngenta.com>.
Also, check this link for SolrJ example code (including the recursion):
https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
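
The general shape of that approach, as a rough sketch rather than the blog's
actual code (the "filename_s" field below is an assumed dynamic field), is
something like:

import java.io.File;
import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class RecursiveIndexSketch {
    public static void main(String[] args) throws IOException, SolrServerException {
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/gettingstarted");
        indexFolder(client, new File("afolder"));
        client.commit();
        client.close();
    }

    // Recurse through the folder, adding one document per regular file.
    static void indexFolder(SolrClient client, File dir) throws IOException, SolrServerException {
        File[] files = dir.listFiles();
        if (files == null) {
            return; // not a directory, or not readable
        }
        for (File file : files) {
            if (file.isDirectory()) {
                indexFolder(client, file);
            } else {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", file.getAbsolutePath());
                doc.addField("filename_s", file.getName()); // assumed *_s dynamic field
                client.add(doc);
            }
        }
    }
}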

Geraint


Geraint Duck
Data Scientist
Toxicology and Health Sciences
Syngenta UK
Email: geraint.duck@syngenta.com

-----Original Message-----
From: Jan Høydahl [mailto:jan.asf@cominvent.com]
Sent: 16 October 2015 12:14
To: solr-user@lucene.apache.org
Subject: Re: Recursively scan documents for indexing in a folder in SolrJ

SolrJ does not have any file crawler built in.
But you are free to steal code from SimplePostTool.java related to directory traversal, and then index each document found using SolrJ.

Note that SimplePostTool.java tries to be smart about which endpoint to post files to: XML, CSV and JSON content will be posted to /update, while office docs go to /update/extract.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 16. okt. 2015 kl. 05.22 skrev Zheng Lin Edwin Yeo <ed...@gmail.com>:
>
> Hi,
>
> I understand that in SimplePostTool (post.jar), there is this command
> to automatically detect content types in a folder, and recursively
> scan it for documents for indexing into a collection:
> bin/post -c gettingstarted afolder/
>
> This has been useful for me to do mass indexing of all the files that
> are in the folder. Now that I'm moving to production, I plan to use
> SolrJ to do the indexing, as it can do more things like robustness
> checks and retries for indexing that fails.
>
> However, I can't seem to find a way to do the same in SolrJ. Is it
> possible to do this in SolrJ? I'm using Solr 5.3.0.
>
> Thank you.
>
> Regards,
> Edwin



Re: Recursively scan documents for indexing in a folder in SolrJ

Posted by Jan Høydahl <ja...@cominvent.com>.
SolrJ does not have any file crawler built in.
But you are free to steal code from SimplePostTool.java related to directory traversal,
and then index each document found using SolrJ.

Note that SimplePostTool.java tries to be smart about which endpoint to post files to:
XML, CSV and JSON content will be posted to /update, while office docs go to /update/extract.
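
As a rough sketch of that routing idea in SolrJ (not the actual SimplePostTool
logic; the content-type mapping and the literal.id parameter are illustrative,
and committing is left to the caller):

import java.io.File;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class EndpointRoutingSketch {

    // Post one file to the endpoint that suits its format.
    static void postFile(SolrClient client, File file) throws Exception {
        String name = file.getName().toLowerCase();
        ContentStreamUpdateRequest req;

        if (name.endsWith(".xml")) {
            req = new ContentStreamUpdateRequest("/update");
            req.addFile(file, "application/xml");
        } else if (name.endsWith(".csv")) {
            req = new ContentStreamUpdateRequest("/update");
            req.addFile(file, "text/csv");
        } else if (name.endsWith(".json")) {
            req = new ContentStreamUpdateRequest("/update");
            req.addFile(file, "application/json");
        } else {
            // Anything else (PDF, Word, ...) goes through the extracting handler.
            req = new ContentStreamUpdateRequest("/update/extract");
            req.addFile(file, "application/octet-stream"); // Tika will try to detect the real type
            req.setParam("literal.id", file.getAbsolutePath()); // extracted docs need a unique key
        }

        client.request(req); // commit separately once the whole folder is done
    }
}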

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 16. okt. 2015 kl. 05.22 skrev Zheng Lin Edwin Yeo <ed...@gmail.com>:
> 
> Hi,
> 
> I understand that in SimplePostTool (post.jar), there is this command to
> automatically detect content types in a folder, and recursively scan it for
> documents for indexing into a collection:
> bin/post -c gettingstarted afolder/
> 
> This has been useful for me to do mass indexing of all the files that are
> in the folder. Now that I'm moving to production, I plan to use SolrJ to
> do the indexing, as it can do more things like robustness checks and retries
> for indexing that fails.
> 
> However, I can't seem to find a way to do the same in SolrJ. Is it
> possible to do this in SolrJ? I'm using Solr 5.3.0.
> 
> Thank you.
> 
> Regards,
> Edwin