Posted to solr-user@lucene.apache.org by pavangolla <pa...@gmail.com> on 2013/03/07 21:39:19 UTC

Search a folder with File name and retrieve all the files matched

Hi,
I am new to Apache Solr.

I am doing a POC where there is a folder (on the system or in some repository) that
contains files with different extensions (pdf, doc, xls, ...).

I want to search by file name and retrieve all the files whose names
match.

How do I proceed with this?

Please help me with this.



--
View this message in context: http://lucene.472066.n3.nabble.com/Search-a-folder-with-File-name-and-retrieve-all-the-files-matched-tp4045629.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Search a folder with File name and retrieve all the files matched

Posted by Jan Høydahl <ja...@cominvent.com>.
Sure Erik,

Or, since we already default to the full path name as "id", perhaps we could change literal.resourcename to be the filename only. I guess that one is mostly there to give Tika more hints for guessing the file type, so it doesn't need to be absolute, especially when the full path is already in the ID. Do you see any downsides? Please just go ahead with whatever you think is best :)

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com



Re: Search a folder with File name and retrieve all the files matched

Posted by Erik Hatcher <er...@gmail.com>.
Thanks, Jan, for making the post tool do this type of thing.  Great stuff.

The filename would be a good one to add for out-of-the-box goodness.  We can easily add just the filename to the index with something like the patch below.  And on that note, what else would folks want in an easy-to-use document search system like this?

	Erik

Index: core/src/java/org/apache/solr/util/SimplePostTool.java
===================================================================
--- core/src/java/org/apache/solr/util/SimplePostTool.java	(revision 1450270)
+++ core/src/java/org/apache/solr/util/SimplePostTool.java	(working copy)
@@ -749,6 +749,7 @@
               urlStr = appendParam(urlStr, "resource.name=" + URLEncoder.encode(file.getAbsolutePath(), "UTF-8"));
             if(urlStr.indexOf("literal.id")==-1)
               urlStr = appendParam(urlStr, "literal.id=" + URLEncoder.encode(file.getAbsolutePath(), "UTF-8"));
+            urlStr = appendParam(urlStr, "literal.filename_s=" + URLEncoder.encode(file.getName(), "UTF-8"));
             url = new URL(urlStr);
           }
         } else {
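
For anyone who wants to try the idea in the patch outside of SimplePostTool, here is a minimal, self-contained sketch of the same URL-building logic. It is only an illustration: the class name, the buildExtractUrl and enc methods, the base URL used in main, and this simplified appendParam are assumptions, not the tool's actual code. Only the three parameter names and the getAbsolutePath()/getName() calls are taken from the patch.

```java
import java.io.File;
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class FilenameParamSketch {

    // Hypothetical stand-in for SimplePostTool.appendParam:
    // uses '?' for the first parameter and '&' for the rest.
    static String appendParam(String url, String param) {
        return url + (url.indexOf('?') >= 0 ? "&" : "?") + param;
    }

    // URL-encodes a value as UTF-8 (UTF-8 is always available on the JVM).
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Mirrors the patched block: the full path is used for resource.name and
    // literal.id (unless already present), the bare name for literal.filename_s.
    static String buildExtractUrl(String baseUrl, File file) {
        String urlStr = baseUrl;
        if (urlStr.indexOf("resource.name") == -1)
            urlStr = appendParam(urlStr, "resource.name=" + enc(file.getAbsolutePath()));
        if (urlStr.indexOf("literal.id") == -1)
            urlStr = appendParam(urlStr, "literal.id=" + enc(file.getAbsolutePath()));
        urlStr = appendParam(urlStr, "literal.filename_s=" + enc(file.getName()));
        return urlStr;
    }

    public static void main(String[] args) {
        // The base URL is an assumed local example endpoint.
        System.out.println(buildExtractUrl(
                "http://localhost:8983/solr/update/extract",
                new File("docs/annual report.pdf")));
    }
}
```

Assuming the schema maps *_s to an untokenized string field, as the stock example schema does, the stored value is the exact file name, so a prefix query such as filename_s:annual* should match it.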





Re: Search a folder with File name and retrieve all the files matched

Posted by Jan Høydahl <ja...@cominvent.com>.
Since this is a POC you could simply run this command with the default example schema:

cd solr/example/exampledocs
java -Dauto -Drecursive=0 -jar post.jar path/to/folder

You will get the full file name, including the path, in the field "resourcename".
If you need to search on just the filename, you can add a new field "filename" with a copyField resourcename->filename, and give "filename" a custom fieldType that uses a PatternReplaceFilterFactory to remove the path.
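
The path-stripping step can be prototyped in plain Java before it goes into the schema. The sketch below is not Solr code; it only shows the kind of regular expression one might hand to PatternReplaceFilterFactory, assuming the field type keeps the whole path as a single token (for example with a keyword tokenizer). The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class StripPathSketch {

    // Greedily matches everything up to and including the last '/' or '\'.
    private static final Pattern LEADING_PATH = Pattern.compile("^.*[/\\\\]");

    // Returns only the part after the last path separator;
    // a value without separators is returned unchanged.
    static String stripPath(String resourceName) {
        return LEADING_PATH.matcher(resourceName).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(stripPath("/home/user/docs/report.pdf"));
        System.out.println(stripPath("C:\\docs\\plan.xls"));
        System.out.println(stripPath("notes.txt"));
    }
}
```

If the expression behaves as wanted here, the same regex (without the Java string escaping, i.e. ^.*[/\\]) would be the candidate pattern for the filter, with an empty replacement.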

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com



Re: Search a folder with File name and retrieve all the files matched

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
You could use DataImportHandler with FileListEntityProcessor to get the
file names in:
http://wiki.apache.org/solr/DataImportHandler#FileListEntityProcessor

Then, if it is a recursive enumeration and not just one level, you probably
want a tokenizer that splits on path separator characters (e.g. /). Or
maybe you want to index the filename as a separate field from the full path
(you can do that in FileListEntityProcessor itself).
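
To make the two options concrete, here is a small plain-Java sketch of what each would produce for a given path. It does not use any Solr or DataImportHandler classes; the class and method names are hypothetical, and it only imitates the effect of splitting on path separators versus keeping the filename on its own.

```java
import java.util.ArrayList;
import java.util.List;

class PathTokensSketch {

    // Splits a path on '/' or '\' and drops empty pieces,
    // roughly what a separator-splitting tokenizer would emit.
    static List<String> tokens(String path) {
        List<String> out = new ArrayList<>();
        for (String part : path.split("[/\\\\]+")) {
            if (!part.isEmpty()) {
                out.add(part);
            }
        }
        return out;
    }

    // The last token is the filename, i.e. the value one would
    // put in a separate filename field.
    static String fileName(String path) {
        List<String> t = tokens(path);
        return t.isEmpty() ? "" : t.get(t.size() - 1);
    }

    public static void main(String[] args) {
        String path = "/data/reports/2013/budget.xls";
        System.out.println(tokens(path));
        System.out.println(fileName(path));
    }
}
```

With the first option a search for a directory name such as "reports" also matches the file; with the second, only the filename itself is searchable in that field.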

And if you combine the list of files with an inner entity that uses Tika, you can
load the file content for searching as well:
http://wiki.apache.org/solr/DataImportHandler#Tika_Integration

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)

