Posted to user@nutch.apache.org by "Matthias W." <Ma...@e-projecta.com> on 2008/10/15 11:47:50 UTC

Using Nutch for crawling and Lucene for searching (Wildcard/Fuzzy)

Hi,
I want to use Nutch to crawl content and the Lucene webapp to search the
Nutch-created index.
I thought Nutch creates a Lucene-interoperable index, but when I search
the index with the Lucene webapp I get no results.
I'm using Nutch 0.9 and Lucene 2.4.0.
Should I use an older Lucene version like 2.0, or is this not crucial?

I want to use Lucene because of its wildcard search and fuzzy search.
Are there other possibilities to solve this?
-- 
View this message in context: http://www.nabble.com/Using-Nutch-for-crawling-and-Lucene-for-searching-%28Wildcard-Fuzzy%29-tp19990219p19990219.html
Sent from the Nutch - User mailing list archive at Nabble.com.


RE: Using Nutch for crawling and Lucene for searching (Wildcard/Fuzzy)

Posted by "Matthias W." <Ma...@e-projecta.com>.

Patrick Markiewicz wrote:
> 
> I'm not sure what you're using for searching, but wherever you
> reference an analyzer in Lucene, you need to change that from
> StandardAnalyzer to
> AnalyzerFactory.get(NutchConfiguration.create().get("en")) (which may
> require importing nutch-specific classes).
> 
I changed:
Analyzer analyzer = new StandardAnalyzer();

to:
Configuration nutchConfig = NutchConfiguration.create();
AnalyzerFactory an = new AnalyzerFactory(nutchConfig);
NutchAnalyzer analyzer = an.get(nutchConfig.get("en"));

Now I get the following error message from Tomcat:
org.apache.jasper.JasperException
	org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:372)
	org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
	org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
	sun.reflect.GeneratedMethodAccessor52.invoke(Unknown Source)
	sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	java.lang.reflect.Method.invoke(Method.java:585)
	org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:243)
	java.security.AccessController.doPrivileged(Native Method)
	javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
	org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:272)
	org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:161)

root cause

java.lang.NullPointerException
	java.io.Reader.<init>(Reader.java:61)
	java.io.BufferedReader.<init>(BufferedReader.java:76)
	java.io.BufferedReader.<init>(BufferedReader.java:91)
	org.apache.nutch.analysis.CommonGrams.init(CommonGrams.java:152)
	org.apache.nutch.analysis.CommonGrams.<init>(CommonGrams.java:52)
	org.apache.nutch.analysis.NutchDocumentAnalyzer$ContentAnalyzer.<init>(NutchDocumentAnalyzer.java:64)
	org.apache.nutch.analysis.NutchDocumentAnalyzer.<init>(NutchDocumentAnalyzer.java:55)
	org.apache.nutch.analysis.AnalyzerFactory.<init>(AnalyzerFactory.java:49)
	org.apache.jsp.results_jsp._jspService(results_jsp.java:167)
	org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
	org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
	org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
	org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
	sun.reflect.GeneratedMethodAccessor52.invoke(Unknown Source)
	sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	java.lang.reflect.Method.invoke(Method.java:585)
	org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:243)
	java.security.AccessController.doPrivileged(Native Method)
	javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
	org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:272)
	org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:161)
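The root cause can be reproduced in isolation: BufferedReader's constructor hands its argument to Reader(Object lock), which throws a NullPointerException when that argument is null, matching the Reader.<init> frame above. A likely explanation, though only an assumption, is that CommonGrams was given a null Reader because a Nutch configuration resource (such as the common-terms file) was not on the webapp's classpath:

```java
import java.io.BufferedReader;
import java.io.Reader;

public class NullReaderDemo {
    // Returns true if constructing a BufferedReader around a null Reader
    // throws NullPointerException -- the same failure mode as the trace above.
    static boolean reproduces() {
        try {
            new BufferedReader((Reader) null);
            return false;
        } catch (NullPointerException e) {
            return true; // thrown from Reader.<init>, as in the Tomcat trace
        }
    }

    public static void main(String[] args) {
        System.out.println(reproduces() ? "NPE reproduced" : "no exception");
    }
}
```

If that assumption holds, the fix is to put the Nutch conf directory (or the missing resource files) on the webapp classpath rather than to change the JSP.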


Full source code of results.jsp:
<%@ page import="org.apache.hadoop.conf.*"
  import="org.apache.nutch.util.NutchConfiguration"
  import="org.apache.nutch.analysis.*"
  import="javax.servlet.*, javax.servlet.http.*, java.io.*,
          org.apache.lucene.document.*, org.apache.lucene.index.*,
          org.apache.lucene.search.*, org.apache.lucene.queryParser.*,
          org.apache.lucene.demo.*, org.apache.lucene.demo.html.Entities,
          java.net.URLEncoder"
%>

<%
/*
        Author: Andrew C. Oliver, SuperLink Software, Inc.
        (acoliver2@users.sourceforge.net)

        This JSP page is deliberately written in the horrible
        Java-embedded-directly-in-the-page style for an easy and concise
        demonstration of Lucene. Do note: if you write pages that look like
        this, sooner or later you'll have a maintenance nightmare. If you use
        JSPs, use taglibs and beans! That being said, this should be
        acceptable for a small page demonstrating how one uses Lucene in a
        web app.

        This is also deliberately overcommented. ;-)
*/
%>
<%!
public String escapeHTML(String s) {
  s = s.replaceAll("&", "&amp;");
  s = s.replaceAll("<", "&lt;");
  s = s.replaceAll(">", "&gt;");
  s = s.replaceAll("\"", "&quot;");
  s = s.replaceAll("'", "&apos;");
  return s;
}
%>
<%@include file="header.jsp"%>
<%
        boolean error = false;              // used to control flow for error messages
        String indexName = indexLocation;   // local copy of the configuration variable
        IndexSearcher searcher = null;      // the searcher used to open/search the index
        Query query = null;                 // the Query created by the QueryParser
        Hits hits = null;                   // the search results
        int startindex = 0;                 // the first index displayed on this page
        int maxpage    = 50;                // the maximum items displayed on this page
        String queryString = null;          // the query entered on the previous page
        String startVal    = null;          // string version of startindex
        String maxresults  = null;          // string version of maxpage
        int thispage = 0;                   // used for the loop: either maxpage or
                                            // hits.length() - startindex,
                                            // whichever is less

        try {
          searcher = new IndexSearcher(indexName);  // create an IndexSearcher for our page
                                                    // NOTE: this operation is slow for
                                                    // large indices (much slower than the
                                                    // search itself), so you might want to
                                                    // keep an IndexSearcher open
        } catch (Exception e) {                     // any error that happens is probably
                                                    // due to a permission problem or a
                                                    // non-existent or otherwise corrupt
                                                    // index
%>
                <p>ERROR opening the Index - contact sysadmin!</p>
                <p>Error message: <%=escapeHTML(e.getMessage())%></p>
<%              error = true;                       // don't do anything up to the footer
        }
%>
<%
        if (error == false) {                                     // did we open the index?
                queryString = request.getParameter("query");      // get the search criteria
                startVal    = request.getParameter("startat");    // get the start index
                maxresults  = request.getParameter("maxresults"); // get max results per page
                try {
                        maxpage    = Integer.parseInt(maxresults); // parse the max results first
                        startindex = Integer.parseInt(startVal);   // then the start index
                } catch (Exception e) { }         // if something goes wrong we just
                                                  // start at 0 and end at 50

                if (queryString == null)
                        throw new ServletException("no query " +  // if you don't have a query
                                                   "specified");  // then you probably played
                                                                  // with the query string, so
                                                                  // you get the treatment

                Configuration nutchConfig = NutchConfiguration.create();
                AnalyzerFactory an = new AnalyzerFactory(nutchConfig);
                NutchAnalyzer analyzer = an.get(nutchConfig.get("en")); // construct our usual analyzer
                try {
                        QueryParser qp = new QueryParser("contents", analyzer);
                        query = qp.parse(queryString);            // parse the query and
                } catch (ParseException e) {                      // construct the Query object;
                                                                  // if it's just "operator
                                                                  // error", send them a nice
                                                                  // error HTML
%>
                        <p>Error while parsing query: <%=escapeHTML(e.getMessage())%></p>
<%
                        error = true;                             // don't bother with the
                                                                  // rest of the page
                }
        }
%>
<%
        if (error == false && searcher != null) {       // if we've had no errors
                                                        // (searcher != null was to handle
                                                        // a weird compilation bug)
                thispage = maxpage;                     // default last element to maxpage
                hits = searcher.search(query);          // run the query
                if (hits.length() == 0) {               // if we got no results, tell the user
%>
                <p> I'm sorry I couldn't find what you were looking for. </p>
<%
                        error = true;                   // don't bother with the rest of
                                                        // the page
                }
        }

        if (error == false && searcher != null) {
%>
                <table>
                <tr>
                        <td>Document</td>
                        <td>Summary</td>
                </tr>
<%
                if ((startindex + maxpage) > hits.length()) {
                        thispage = hits.length() - startindex;  // set the max index to maxpage
                }                                               // or the last actual search
                                                                // result, whichever is less

                for (int i = startindex; i < (thispage + startindex); i++) { // for each element
%>
                <tr>
<%
                        Document doc = hits.doc(i);             // get the next document
                        String doctitle = doc.get("title");     // get its title
                        String url = doc.get("path");           // get its path field
                        if (url != null && url.startsWith("../webapps/")) {
                                url = url.substring(10);        // strip off the ../webapps
                        }                                       // prefix if present
                        if ((doctitle == null) || doctitle.equals("")) // use the path if it
                                doctitle = url;                        // has no title
                                                                // then output!
%>
                        <td><a href="<%=url%>"><%=doctitle%></a></td>
                        <td><%=doc.get("summary")%></td>
                </tr>
<%
                }
%>
<%
                if ((startindex + maxpage) < hits.length()) {   // if there are more results,
                                                                // display the "more" link

                        String moreurl = "results.jsp?query=" +
                                         URLEncoder.encode(queryString) +  // construct the
                                         "&amp;maxresults=" + maxpage +    // "more" link
                                         "&amp;startat=" + (startindex + maxpage);
%>
                <tr>
                        <td></td><td><a href="<%=moreurl%>">More Results&gt;&gt;</a></td>
                </tr>
<%
                }
%>
                </table>

<%      }                                               // then include our footer
        if (searcher != null)
                searcher.close();
%>
<%@include file="footer.jsp"%> 
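The paging arithmetic in the JSP above (thispage, startindex, maxpage) can be isolated in plain Java; the names follow the JSP, everything else is a sketch:

```java
public class PageWindow {
    // Number of rows to render on the current page: maxpage, or the hits
    // remaining past startindex, whichever is smaller (never negative).
    static int rowsToShow(int totalHits, int startindex, int maxpage) {
        int remaining = totalHits - startindex;
        return Math.max(0, Math.min(maxpage, remaining));
    }

    public static void main(String[] args) {
        System.out.println(rowsToShow(120, 0, 50));   // a full first page
        System.out.println(rowsToShow(120, 100, 50)); // a partial last page
    }
}
```

The "more" link is shown exactly when startindex + maxpage is still less than the total hit count, i.e. when rowsToShow at the next offset would be positive.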


What can I do now?


nutch OR again

Posted by Christopher Condit <co...@sdsc.edu>.
I've got nutch 0.9 and applied the patch from 479 as described here:
http://thethoughtlab.blogspot.com/2007/06/adding-non-required-term-patch
-to-nutch.html

However, I'm still not seeing functioning OR queries when using the
nutch web application or the NutchBean from the command line... 
1) Am I missing something else regarding the application of this patch?
2) Does the SVN trunk incorporate this patch? Should I try that?

Thanks,
-Chris


RE: Using Nutch for crawling and Lucene for searching (Wildcard/Fuzzy)

Posted by Patrick Markiewicz <pm...@sim-gtech.com>.
Hi Matt,
     If you read the Lucene documentation you will discover that the
Analyzer used for searching needs to be the same type that indexed the
content.  I'm not sure what you're using for searching, but wherever you
reference an analyzer in Lucene, you need to change that from
StandardAnalyzer to
AnalyzerFactory.get(NutchConfiguration.create().get("en")) (which may
require importing nutch-specific classes).  In order to display the URL,
you need to reference the "url" field as opposed to the "path" field
that Lucene uses initially.  Use Luke to see what field stores the
content of the URL.  That may have to change from "content" to
"contents".
     To be honest, I never tried just changing the "path" field to
"url".  You could try that and see if the StandardAnalyzer would work,
but I don't have enough knowledge about the analyzers to know if that
would work.

Patrick
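Patrick's point about matching analyzers can be illustrated with a toy, Lucene-free sketch. The "analyzers" here are stand-ins (a lowercasing tokenizer versus a verbatim one), not real Lucene classes: tokens produced at index time only match query tokens produced by the same analysis.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class AnalyzerMismatchDemo {
    // Stand-in "analyzer" 1: lowercase and split on whitespace.
    static List<String> lowercasing(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // Stand-in "analyzer" 2: split on whitespace, keep case.
    static List<String> verbatim(String text) {
        return Arrays.asList(text.split("\\s+"));
    }

    // A query matches only if every query token appears among the indexed tokens.
    static boolean matches(List<String> indexedTokens, List<String> queryTokens) {
        return indexedTokens.containsAll(queryTokens);
    }

    public static void main(String[] args) {
        List<String> indexed = lowercasing("Nutch Crawling");        // index-time analysis
        System.out.println(matches(indexed, lowercasing("Nutch")));  // same analysis: match
        System.out.println(matches(indexed, verbatim("Nutch")));     // mismatched analysis: miss
    }
}
```

The same effect explains the original problem: an index built with Nutch's analyzer, searched with StandardAnalyzer, silently returns nothing.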

-----Original Message-----
From: Matthias W. [mailto:Matthias.Wangler@e-projecta.com] 
Sent: Wednesday, October 15, 2008 6:22 AM
To: nutch-user@lucene.apache.org
Subject: Re: Using Nutch for crawling and Lucene for searching
(Wildcard/Fuzzy)


Thanks, but what does this mean for me?
I already tried to search the index with the Lucene webapp
(lucenewebapp.war
from Lucene package) including my nutch index 'nutchcrawl/index' and
'nutchcrawl/indexes/part-00000' but with both of them I get no results.
And my index is correct, because with Luke and the nutch webapp I get
results.

Andrzej Bialecki wrote:
> 
> Matthias W. wrote:
>> Hi,
>> I want to use Nutch for crawling contents and Lucene webapp to search the
>> Nutch-created index.
>> I thought nutch creates a Lucene interoperable index, but when I'm
>> searching the index with the Lucene webapp I get no results.
>> I'm using Nutch 0.9 and Lucene 2.4.0.
>> Should I use an older Lucene version like 2.0 or is this not crucial?
>> 
>> I want to use Lucene, because of its Wildcardsearch and Fuzzysearch, ...
>> Are there other possibilities to solve this?
> 
> Nutch indexes are plain Lucene indexes. The only difference is that as a
> side-effect of map-reduce processing these indexes may come in several
> parts, found in subdirectories named like part-xxxxx. Each subdirectory
> holds a valid Lucene index.
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> 



Re: Using Nutch for crawling and Lucene for searching (Wildcard/Fuzzy)

Posted by "Matthias W." <Ma...@e-projecta.com>.
Thanks, but what does this mean for me?
I already tried to search the index with the Lucene webapp (lucenewebapp.war
from Lucene package) including my nutch index 'nutchcrawl/index' and
'nutchcrawl/indexes/part-00000' but with both of them I get no results.
And my index is correct, because with Luke and the nutch webapp I get
results.

Andrzej Bialecki wrote:
> 
> Matthias W. wrote:
>> Hi,
>> I want to use Nutch for crawling contents and Lucene webapp to search the
>> Nutch-created index.
>> I thought nutch creates a Lucene interoperable index, but when I'm
>> searching
>> the index with the Lucene webapp I get no results.
>> I'm using Nutch 0.9 and Lucene 2.4.0.
>> Should I use an older Lucene version like 2.0 or is this not crucial?
>> 
>> I want to use Lucene, because of its Wildcardsearch and Fuzzysearch, ...
>> Are there other possibilities to solve this?
> 
> Nutch indexes are plain Lucene indexes. The only difference is that as a 
> side-effect of map-reduce processing these indexes may come in several 
> parts, found in subdirectories named like part-xxxxx. Each subdirectory 
> holds a valid Lucene index.
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> 



Re: Using Nutch for crawling and Lucene for searching (Wildcard/Fuzzy)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Matthias W. wrote:
> Hi,
> I want to use Nutch for crawling contents and Lucene webapp to search the
> Nutch-created index.
> I thought nutch creates a Lucene interoperable index, but when I'm searching
> the index with the Lucene webapp I get no results.
> I'm using Nutch 0.9 and Lucene 2.4.0.
> Should I use an older Lucene version like 2.0 or is this not crucial?
> 
> I want to use Lucene, because of its Wildcardsearch and Fuzzysearch, ...
> Are there other possibilities to solve this?

Nutch indexes are plain Lucene indexes. The only difference is that as a 
side-effect of map-reduce processing these indexes may come in several 
parts, found in subdirectories named like part-xxxxx. Each subdirectory 
holds a valid Lucene index.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
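A small stdlib-only sketch of discovering the per-part directories Andrzej describes. The part-xxxxx naming is taken from the post; the directory-walking code is an assumption about how one might enumerate them before opening each as a Lucene index:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PartIndexes {
    // Collect subdirectory names matching the part-xxxxx pattern under a
    // Nutch "indexes" directory; per the post, each such subdirectory holds
    // a valid, independently openable Lucene index.
    static List<String> partDirs(File indexesDir) {
        List<String> parts = new ArrayList<>();
        File[] children = indexesDir.listFiles();
        if (children != null) {
            for (File f : children) {
                if (f.isDirectory() && f.getName().matches("part-\\d+")) {
                    parts.add(f.getName());
                }
            }
        }
        Collections.sort(parts); // deterministic order
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(partDirs(new File(args.length > 0 ? args[0] : ".")));
    }
}
```

One could then open each part with its own IndexSearcher, or merge the parts into a single index first.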


Re: Using Nutch for crawling and Lucene for searching (Wildcard/Fuzzy)

Posted by Alexander Aristov <al...@gmail.com>.
Or, as an option, you could modify Nutch to store content in the index.

Andrzej, is that a bad idea? What do you think?

Best Regards
Alexander Aristov


2009/5/14 Andrzej Bialecki <ab...@getopt.org>

> inghe wrote:
>
>>
>> Hi,
>> I want to use Nutch for crawling contents and Lucene for extract and
>> analyze
>> the contents of the index created by Nutch. I'm trying to extract from the
>> index the contents of web pages, but i don' know how to set the
>> NutchDocumentAnalyzer in my application, if i use the StandardAnalyzer of
>> Lucene, i'll get to extract the fields "title", "url" but not the
>> "content".
>> I'm using Nutch1.0 and Lucene2.4.0
>>
>
> There is no content in Lucene indexes. The original content is stored in
> Nutch segments. You can use the command bin/nutch readseg to retrieve all
> (or selected) pages.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: Using Nutch for crawling and Lucene for searching (Wildcard/Fuzzy)

Posted by Andrzej Bialecki <ab...@getopt.org>.
inghe wrote:
> 
> Andrzej Bialecki wrote:
>> Page content is NOT stored in Lucene indexes that Nutch creates. It's 
>> only indexed, which is not the same. Luke can show you the text in the 
>> "content" field only because it reconstructs it from the index. This 
>> reconstruction is incomplete because some information is missing (the 
>> information discarded by NutchDocumentAnalyzer).
>>
>> As I wrote before, full content is stored in Nutch segments. That's why 
>> Nutch can show you the full content, but Luke cannot.
>>
>>
> 
> Thanks again, but is there a method to get a "content" informations through
> the libraries of Lucene? I would like to work on the content of the web
> pages extracted.
> 

As it is now - there is no method. You would have to modify Nutch to 
create indexes where "content" is both indexed and stored - but then 
performance of your index will suffer.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
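The indexed-versus-stored distinction running through this thread can be sketched with plain Java maps. This is a toy model, not Lucene's actual data structures: an indexed field only contributes tokens to an inverted index, while a stored field keeps the original text retrievable per document. Nutch indexes "content" without storing it, which is why the text cannot be read back.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class IndexedVsStored {
    final Map<String, Set<Integer>> inverted = new HashMap<>(); // indexed form: token -> doc ids
    final Map<Integer, String> stored = new HashMap<>();        // stored form: doc id -> text

    void add(int docId, String text, boolean store) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            inverted.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
        if (store) stored.put(docId, text); // only stored fields keep the original
    }

    Set<Integer> search(String token) {
        return inverted.getOrDefault(token.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    String retrieve(int docId) {
        return stored.get(docId); // null when the field was indexed but not stored
    }

    public static void main(String[] args) {
        IndexedVsStored ix = new IndexedVsStored();
        ix.add(1, "Nutch crawl", false); // indexed only, like Nutch's "content"
        System.out.println(ix.search("nutch")); // searchable
        System.out.println(ix.retrieve(1));     // but not retrievable
    }
}
```

Storing the field as well (the modification Andrzej mentions) makes retrieve work, at the cost of a larger, slower index.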


Re: Using Nutch for crawling and Lucene for searching (Wildcard/Fuzzy)

Posted by inghe <in...@gmail.com>.

Andrzej Bialecki wrote:
> 
> Page content is NOT stored in Lucene indexes that Nutch creates. It's 
> only indexed, which is not the same. Luke can show you the text in the 
> "content" field only because it reconstructs it from the index. This 
> reconstruction is incomplete because some information is missing (the 
> information discarded by NutchDocumentAnalyzer).
> 
> As I wrote before, full content is stored in Nutch segments. That's why 
> Nutch can show you the full content, but Luke cannot.
> 
> 

Thanks again, but is there a way to get the "content" information through
the Lucene libraries? I would like to work on the content of the extracted
web pages.



Re: Using Nutch for crawling and Lucene for searching (Wildcard/Fuzzy)

Posted by Andrzej Bialecki <ab...@getopt.org>.
inghe wrote:
> Thank you for answer, but i have still a doubt!
> Why can i read the filed "content" in Luke, if i load the index file created
> by nutch?
> So, i load in Luke the index file created by Nutch-1.0, then I can view the
> fields "url" "title" "host" "ecc, but not all field; if i click on an Edit
> Botton opens a window that contains other fields including the field
> "content" with the his value, but as it uses the seampleAnalyzer and the
> content is not displayed correctly. I tried to change the analyzer and
> insert NutchDocumenAnalyzer but I do not know how to do it
> 
> help :(

Page content is NOT stored in Lucene indexes that Nutch creates. It's 
only indexed, which is not the same. Luke can show you the text in the 
"content" field only because it reconstructs it from the index. This 
reconstruction is incomplete because some information is missing (the 
information discarded by NutchDocumentAnalyzer).

As I wrote before, full content is stored in Nutch segments. That's why 
Nutch can show you the full content, but Luke cannot.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Using Nutch for crawling and Lucene for searching (Wildcard/Fuzzy)

Posted by inghe <in...@gmail.com>.
Thank you for the answer, but I still have a doubt!
Why can I read the field "content" in Luke if I load the index file created
by Nutch?
So, I load the index created by Nutch-1.0 in Luke; I can then view the
fields "url", "title", "host", etc., but not all fields. If I click the
Edit button, a window opens that contains other fields, including the field
"content" with its value, but since it uses the SimpleAnalyzer the content
is not displayed correctly. I tried to change the analyzer to
NutchDocumentAnalyzer but I do not know how to do it.

help :(


Andrzej Bialecki wrote:
> 
> inghe wrote:
>> 
>> Hi,
>> I want to use Nutch for crawling contents and Lucene for extract and
>> analyze
>> the contents of the index created by Nutch. I'm trying to extract from
>> the
>> index the contents of web pages, but i don' know how to set the
>> NutchDocumentAnalyzer in my application, if i use the StandardAnalyzer of
>> Lucene, i'll get to extract the fields "title", "url" but not the
>> "content".
>> I'm using Nutch1.0 and Lucene2.4.0
> 
> There is no content in Lucene indexes. The original content is stored in 
> Nutch segments. You can use the command bin/nutch readseg to retrieve 
> all (or selected) pages.
> 
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> 



Re: Using Nutch for crawling and Lucene for searching (Wildcard/Fuzzy)

Posted by Andrzej Bialecki <ab...@getopt.org>.
inghe wrote:
> 
> Hi,
> I want to use Nutch for crawling contents and Lucene for extract and analyze
> the contents of the index created by Nutch. I'm trying to extract from the
> index the contents of web pages, but i don' know how to set the
> NutchDocumentAnalyzer in my application, if i use the StandardAnalyzer of
> Lucene, i'll get to extract the fields "title", "url" but not the "content".
> I'm using Nutch1.0 and Lucene2.4.0

There is no content in Lucene indexes. The original content is stored in 
Nutch segments. You can use the command bin/nutch readseg to retrieve 
all (or selected) pages.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Using Nutch for crawling and Lucene for searching (Wildcard/Fuzzy)

Posted by inghe <in...@gmail.com>.

Hi,
I want to use Nutch to crawl content and Lucene to extract and analyze the
contents of the index created by Nutch. I'm trying to extract the contents
of web pages from the index, but I don't know how to set the
NutchDocumentAnalyzer in my application. If I use Lucene's
StandardAnalyzer, I manage to extract the fields "title" and "url", but not
"content".
I'm using Nutch 1.0 and Lucene 2.4.0.

