
Posted to user@nutch.apache.org by ahmad ajiloo <ah...@gmail.com> on 2011/09/14 05:27:25 UTC

How to search on specific file types?

Hello,
I want to search for articles via Solr, so I need to find only specific file
types such as doc, docx, and pdf.
I don't need any HTML pages, so the search results should consist only of
doc, docx, and pdf files.

I'm using Nutch to crawl web pages and sending Nutch's data to Solr for
indexing. One approach to searching on specific file types is to put the
file extension into the index. However, I don't know what schema Nutch uses
when indexing into Solr, whether it creates a specific field for the file
extension, or how we can modify the Nutch indexer to create such a field
ourselves.

Re: How to search on specific file types?

Posted by lewis john mcgibbney <le...@gmail.com>.

In addition to Markus' comments, I would suggest that, instead of getting your
users to search across specific fields, you simply filter out HTML documents
if you do not wish to store ANY of them. This simplifies searching for the
users of your system.
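
To make the "filter for this" idea concrete, here is a minimal sketch in Python of the filtering logic only. It is not Nutch code: the `content_type` key, the `WANTED_TYPES` set, and both function names are assumptions for illustration. The crawler presumably still has to fetch HTML pages to discover links, so a filter like this would belong at indexing time rather than at fetch time.

```python
# MIME types for the formats named in the question (pdf, doc, docx).
WANTED_TYPES = {
    "application/pdf",
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}


def keep_document(doc):
    """Return True if the document's content type is one of the wanted types.

    `doc` is a plain dict; the "content_type" key is a hypothetical field name.
    """
    content_type = doc.get("content_type", "")
    # Drop parameters such as "; charset=UTF-8" before comparing.
    base_type = content_type.split(";", 1)[0].strip().lower()
    return base_type in WANTED_TYPES


def filter_documents(docs):
    """Keep only the documents that pass keep_document()."""
    return [doc for doc in docs if keep_document(doc)]
```

With such a filter in front of the index, users would not need to add a file-type restriction to every query, which seems to be the simplification Lewis has in mind.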

On Wed, Sep 14, 2011 at 10:24 AM, Markus Jelsma <ma...@openindex.io> wrote:

> As I just wrote on the Solr list: use the index-more plugin, or copyField the
> url to an extension field in which you can use a pattern replace char filter
> to skip everything up to the first dot.



-- 
*Lewis*

Re: How to search on specific file types?

Posted by Markus Jelsma <ma...@openindex.io>.

> I indexed my data using the index-more plugin and added my required field
> (like content_type) to schema.xml.
> Now how can I search for PDF files (one kind of content type) using this new
> index? What query should I enter to search only PDF files?

This is a Solr-specific question, and the answer depends on the fieldType
defined in Solr for the content type field. Refer to the Solr manual or
mailing list.
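
As a rough illustration of what such a query could look like, the Python sketch below builds a Solr select URL with a filter query (`fq`) on the content type. Everything specific here is an assumption: the base URL, the field name `content_type` (taken from the question), the value `application/pdf`, and the helper name `build_select_url`. It also assumes the field is indexed as an untokenized string type, so that the quoted value matches exactly; with a tokenized fieldType the behaviour may differ.

```python
from urllib.parse import urlencode


def build_select_url(base_url, field, values, user_query="*:*"):
    """Build a Solr select URL that restricts results to the given field values.

    All arguments are supplied by the caller; nothing here is read from Solr.
    """
    # Quote each value so characters such as "/" are taken literally.
    quoted = [
        '"{}"'.format(value.replace("\\", "\\\\").replace('"', '\\"'))
        for value in values
    ]
    filter_query = "{}:({})".format(field, " OR ".join(quoted))
    # urlencode percent-encodes the query syntax for use in a URL.
    params = urlencode({"q": user_query, "fq": filter_query})
    return "{}?{}".format(base_url, params)


if __name__ == "__main__":
    # Hypothetical local Solr instance and field name.
    print(build_select_url("http://localhost:8983/solr/select",
                           "content_type", ["application/pdf"]))
```

Using `fq` rather than adding the restriction to `q` keeps the file-type filter separate from whatever the user actually typed.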


How to search on specific file types?

Posted by ahmad ajiloo <ah...@gmail.com>.

I indexed my data using the index-more plugin and added my required field
(like content_type) to schema.xml.
Now how can I search for PDF files (one kind of content type) using this new
index? What query should I enter to search only PDF files?


On Thu, Sep 29, 2011 at 9:33 AM, ahmad ajiloo <ah...@gmail.com> wrote:

> How can I use the index-more plugin? I'm new to Nutch and need detailed
> help.
> Thanks

Re: How to search on specific file types?

Posted by Markus Jelsma <ma...@openindex.io>.

As I just wrote on the Solr list: use the index-more plugin, or copyField the
url to an extension field in which you can use a pattern replace char filter
to skip everything up to the first dot.
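
The pattern-replace idea can be sketched outside Solr. The Python below only illustrates the regular-expression logic such a char filter would have to implement; it is not a Solr configuration, and the function name `extension_of` and the pattern are assumptions. A URL's host name usually contains dots as well, so this sketch anchors on the last dot of the final path segment rather than the first dot in the URL. That is one reading of the advice, not necessarily the exact pattern intended.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern: capture the text after the last dot of the final
# path segment.
_EXTENSION = re.compile(r"\.([A-Za-z0-9]+)$")


def extension_of(url):
    """Return the lower-cased file extension of a URL's path, or "" if none."""
    path = urlsplit(url).path  # drops host, query string and fragment
    last_segment = path.rsplit("/", 1)[-1]
    match = _EXTENSION.search(last_segment)
    return match.group(1).lower() if match else ""


if __name__ == "__main__":
    for url in ("http://example.org/papers/report.v2.PDF?download=1",
                "http://example.org/index.html",
                "http://example.org/docs/"):
        print(url, "->", repr(extension_of(url)))
```

URLs that serve a PDF without a file extension in the path would yield an empty string here, which is one reason the content type added by the index-more plugin may be the more reliable of the two options.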
