Posted to general@lucene.apache.org by Kunal Wku <wk...@yahoo.com> on 2007/09/07 18:49:57 UTC

Regarding Lucene & Nutch

Hello Everyone,
   
  I am using Lucene & Nutch in my project to search the content of webpages.
For a webpage or any other document, Lucene takes all the words in the page, indexes them, and returns the page as a result when those words are searched.
   
  Let's say I have 2 webpages as shown below:
   
  Webpage1
----------------------------------------------------------------------
This is the course page of Computer Science Department
  Subject: Operating System I
Professor: Qi Li
  Details:
The course Operating System I deals with the basics of the operating system. The three main topics covered are process management, storage management & memory management, etc.
..................................................................
----------------------------------------------------------------------
   
  Webpage2
----------------------------------------------------------------------
This is the home page of Computer Science Department
  The computer science department offers courses at undergraduate level and
graduate level. The core courses for the graduate students are Mathematical Foundations of Computer Science, Compilers, Advanced Database, Analysis of Algorithms and Operating Systems, etc.
..................................................................
----------------------------------------------------------------------
   
  Now if I search for the phrase "operating system", the results show both webpages (webpage1 & webpage2), since the phrase "operating system" exists in both of them.
   
  But my requirement is different. I want to search for "Operating System" only where it appears in the subject field, as in webpage1, so the result should show only webpage1. How can I achieve this?
   
  Please help me in this regard.
  Thanks & Regards,
Kunal Gosar


       

Re: Regarding Lucene & Nutch

Posted by MOHIT GOYAL <mo...@gmail.com>.
I am using Nutch to crawl a local directory on my system. I have modified all
the conf files, such as default.xml, crawl-urlfilter, etc.
I have also modified HttpResponse.java,
but it is skipping all the URLs. Please help.

Re: Regarding Lucene & Nutch

Posted by Kunal Wku <wk...@yahoo.com>.
Hello Aditya,
   
  Thank you for your reply. I just received your e-mail and I will try implementing your idea.
  I think that with this idea, the search returns the files in which the required word appears in the content as well as in the metadata of the file. My requirement is that the search should return only the files in which the required word appears in the metadata, i.e., it should search only the metadata. (The required word may also appear in the content of the file, but the content need not be searched.) How can I achieve this?
   
  Thanks & Regards,
  Kunal
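
The requirement above (match only in the subject metadata, never in the body) can be pictured without Lucene itself. The following is a minimal, stdlib-only Python sketch, not Lucene or Nutch code: the `FieldedIndex` class, its methods, and the crude plural-stripping tokenizer are hypothetical stand-ins for a real analyzer and index. It keeps tokens per field, so a phrase query against `subject` never looks at `content`. In Lucene's query parser syntax the analogous restriction is written as `subject:"operating system"`, assuming the page's subject line has been indexed into a field named `subject`.

```python
import re


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and crudely strip a plural 's'.

    This is only a stand-in for a real analyzer (assumption for this sketch).
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


class FieldedIndex:
    """Toy index that stores a separate token list per (document, field)."""

    def __init__(self):
        self.docs = {}  # doc_id -> {field_name: [tokens]}

    def add(self, doc_id, **fields):
        self.docs[doc_id] = {name: tokenize(value) for name, value in fields.items()}

    def phrase_search(self, field, phrase):
        """Return ids of documents whose `field` contains the phrase as consecutive tokens."""
        wanted = tokenize(phrase)
        if not wanted:
            return []
        hits = []
        for doc_id, fields in self.docs.items():
            tokens = fields.get(field, [])  # a missing field simply never matches
            if any(tokens[i:i + len(wanted)] == wanted
                   for i in range(len(tokens) - len(wanted) + 1)):
                hits.append(doc_id)
        return hits


index = FieldedIndex()
index.add(
    "webpage1",
    subject="Operating System I",
    content="The course operating system I deals with the basics of the operating system.",
)
index.add(
    "webpage2",  # the home page has no subject line, so no subject field is added
    content="The core courses for the graduate students are Compilers, "
            "Analysis of Algorithms and Operating Systems.",
)

subject_hits = index.phrase_search("subject", "operating system")
content_hits = index.phrase_search("content", "operating system")
print(subject_hits)
print(content_hits)
```

The point of the design is that the restriction happens at query time by naming the field, so the content can stay indexed for ordinary searches.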

aditya naga hemanth kumar <ad...@gmail.com> wrote:
  Hi
You can search both the metadata fields and the default fields that are
indexed by the search engine. Say you have a set of files that belong to
an operating systems course. You can add a metadata field "subject" with the value
"operating systems" to all the files directly by using XMP.
Then, when you index with Lucene, you can add a separate field called
subject for each document. When searching, you can boost the score if the
query matches the value of the subject field, which brings those documents to the
top. Hope this helps.

Cheers
Aditya V



       

Re: Regarding Lucene & Nutch

Posted by aditya naga hemanth kumar <ad...@gmail.com>.
Hi
You can search both the metadata fields and the default fields that are
indexed by the search engine. Say you have a set of files that belong to
an operating systems course. You can add a metadata field "subject" with the value
"operating systems" to all the files directly by using XMP.
Then, when you index with Lucene, you can add a separate field called
subject for each document. When searching, you can boost the score if the
query matches the value of the subject field, which brings those documents to the
top. Hope this helps.

Cheers
Aditya V
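
The boosting suggestion above can be illustrated with a small stdlib-only Python sketch. It is not Lucene code and does not reproduce Lucene's scoring: the `score` function, the boost values, and the two sample documents are hypothetical, chosen only to show how a per-field weight can lift a subject match above a page that merely mentions the phrase more often in its body. In Lucene's query parser syntax a boost is written with `^`, for example `subject:"operating system"^5`.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumerics (stand-in for a real analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def count_phrase(tokens, phrase_tokens):
    """Count positions where phrase_tokens occurs as consecutive tokens."""
    n = len(phrase_tokens)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase_tokens)


def score(doc, phrase, boosts):
    """Sum of (field boost x phrase occurrences in that field); a toy scoring rule."""
    wanted = tokenize(phrase)
    return sum(
        boost * count_phrase(tokenize(doc.get(field, "")), wanted)
        for field, boost in boosts.items()
    )


# Hypothetical documents: the phrase is in the subject of one page and
# appears twice in the body of the other.
docs = {
    "course_page": {
        "subject": "Operating System I",
        "content": "Basics of process management.",
    },
    "home_page": {
        "content": "Core courses include the operating system course. "
                   "The operating system lab is optional.",
    },
}


def rank(boosts):
    """Order document ids by descending score for the query phrase."""
    return sorted(docs, key=lambda d: score(docs[d], "operating system", boosts), reverse=True)


unboosted = rank({"subject": 1.0, "content": 1.0})
boosted = rank({"subject": 5.0, "content": 1.0})
print(unboosted)
print(boosted)
```

Note that boosting only reorders results: a page that matches in the content alone is still returned, just lower down.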


Re: Regarding Lucene & Nutch

Posted by AllelB <be...@gmail.com>.
Hello,
can you post your code?

--
View this message in context: http://lucene.472066.n3.nabble.com/Regarding-Lucene-Nutch-tp642834p2838736.html
Sent from the Lucene - General mailing list archive at Nabble.com.