Posted to dev@nutch.apache.org by Nils Hoeller <ni...@arcor.de> on 2005/08/08 12:35:43 UTC

Creation of a Graph File with the DB Link Graph Database

Hi,

my searcher is now running on the index I built with Nutch.

Everything seems to work, so I am moving on to a main part of my app.

Before Nutch, I used Arachnid as a crawler.

During crawling, I used this method:
    /**
     * Stores each page considered for the sitemap graph in the created
     * directories and writes its information (ID, title, URL, request
     * count) and its internal links to the graph file.
     */
    protected void handleLink(PageInfo p) {
        task.setCurrent(task.getCurrent() + 1);
        task.setMessage("Tracking...");

        String link = p.getUrl().toString();
        String title = p.getTitle();
        System.out.println("Link :" + link);

        // Skip pages without a usable URL or title.
        if (link == null || title == null
                || link.length() == 0 || title.length() == 0) {
            return;
        }

        // The node ID is the hash code of the URL path.
        int id = p.getUrl().getPath().hashCode();
        int accessCount = (int) logs.getAccessCountByID(String.valueOf(id));

        try {
            storeFile(p.getUrl());
            out.write("Node{" + "\r\n"
                    + "ID=" + id + "\r\n"
                    + "Title=" + title + "\r\n"
                    + "URL=" + link + "\r\n"
                    + "Number of Request=" + accessCount + "\r\n"
                    + "}" + "\r\n");

            // One edge per internal link, from this page to the link target.
            for (int i = 0; i < p.getLinksIntern().size(); i++) {
                URL urllink = (URL) p.getLinksIntern().get(i);
                out.write("Edge{" + "\r\n"
                        + "Node1=" + id + "\r\n"
                        + "Node2=" + urllink.getPath().hashCode() + "\r\n"
                        + "}" + "\r\n");
            }
        } catch (IOException ie) {
            ie.printStackTrace();
        }
    }

which builds a graph file that looks like this:

Node{
ID=2144181430
Title=Institute of Information Systems Universität zu Lübeck Schleswig-Holstein
URL=http://www.ifis.uni-luebeck.de/index.html
Number of Request=0
}
Edge{
Node1=2144181430
Node2=-66623770
}
Edge{
Node1=2144181430
Node2=150343685
}
Edge{
Node1=2144181430
Node2=1049931629
}

and so on.....
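
To make the target format concrete independently of any crawler, here is a minimal, self-contained sketch that writes the same Node and Edge records. The class and method names (GraphFileSketch, nodeId, writeNode, writeEdge) and the link target /contact.html are hypothetical and exist only for this illustration; nothing here uses the Nutch API. The only things taken from the method above are the record layout and the use of the URL path's hash code as the node ID.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.net.URI;
import java.net.URL;

// Hypothetical helper, not part of Nutch or Arachnid.
class GraphFileSketch {

    // Node ID as in handleLink: the hash code of the URL path.
    static int nodeId(URL url) {
        return url.getPath().hashCode();
    }

    // Writes one Node record in the graph file format.
    static void writeNode(Writer out, int id, String title, String url,
                          int requests) throws IOException {
        out.write("Node{\r\n"
                + "ID=" + id + "\r\n"
                + "Title=" + title + "\r\n"
                + "URL=" + url + "\r\n"
                + "Number of Request=" + requests + "\r\n"
                + "}\r\n");
    }

    // Writes one Edge record from a parent node to a target node.
    static void writeEdge(Writer out, int parentId, int targetId)
            throws IOException {
        out.write("Edge{\r\n"
                + "Node1=" + parentId + "\r\n"
                + "Node2=" + targetId + "\r\n"
                + "}\r\n");
    }

    public static void main(String[] args) throws IOException {
        URL page = URI.create("http://www.ifis.uni-luebeck.de/index.html").toURL();
        // Assumed link target, for illustration only.
        URL target = URI.create("http://www.ifis.uni-luebeck.de/contact.html").toURL();

        StringWriter out = new StringWriter();
        writeNode(out, nodeId(page), "Example title", page.toString(), 0);
        writeEdge(out, nodeId(page), nodeId(target));
        System.out.print(out);
    }
}
```

With helpers like these, whatever source supplies the pages and links only has to provide a URL, a title, a request count, and the list of outgoing link URLs per page.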

Now my question: 

How can this be done with the help of the Nutch WebDB?
Can I query for all Nodes (Sites) with ID, Title, URL and Number of Request
and for all Edges (Links) with parent and target Node ID?


Thank you for your help,

Nils