Posted to dev@nutch.apache.org by Nils Hoeller <ni...@arcor.de> on 2005/08/08 12:35:43 UTC
Creation of a Graph File with the DB Link Graph Database
Hi,
my searcher is now running on the index I built with Nutch.
Everything seems to work, so I am moving on to a main part of my application.
Before Nutch, I used Arachnid as a crawler.
During crawling, I used the following method:
/**
 * Stores each page considered for the sitemap graph in the created
 * directories and appends its information (ID, title, URL, request count)
 * to the graph file.
 * Requires java.io.IOException and java.net.URL.
 */
protected void handleLink(PageInfo p) {
    task.setCurrent(task.getCurrent() + 1);
    task.setMessage("Tracking...");

    String link = p.getUrl().toString();
    System.out.println("Link :" + link);
    // The node ID is the hash code of the URL path, not of the full URL.
    int id = p.getUrl().getPath().hashCode();
    String title = p.getTitle();
    int accessCount = (int) (logs.getAccessCountByID("" + id));

    // Skip pages that lack a URL or a title.
    if (link == null || title == null || link.length() == 0 || title.length() == 0) {
        return;
    }

    try {
        storeFile(p.getUrl());
        out.write("Node{" + "\r\n" + "ID=" + id + "\r\n" + "Title=" + title + "\r\n"
                + "URL=" + link + "\r\n" + "Number of Request=" + accessCount + "\r\n" + "}" + "\r\n");
        // Write one edge per internal link, from this page to the link target.
        for (int i = 0; i < p.getLinksIntern().size(); i++) {
            URL urllink = (URL) (p.getLinksIntern().get(i));
            out.write("Edge{" + "\r\n" + "Node1=" + id + "\r\n"
                    + "Node2=" + urllink.getPath().hashCode() + "\r\n" + "}" + "\r\n");
        }
    } catch (IOException ie) {
        ie.printStackTrace();
    }
}
which builds a graph file that looks like this:
Node{
ID=2144181430
Title=Institute of Information Systems Universität zu Lübeck Schleswig-Holstein
URL=http://www.ifis.uni-luebeck.de/index.html
Number of Request=0
}
Edge{
Node1=2144181430
Node2=-66623770
}
Edge{
Node1=2144181430
Node2=150343685
}
Edge{
Node1=2144181430
Node2=1049931629
}
and so on.....
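
To make the file layout above concrete without depending on any particular crawler, here is a minimal, self-contained sketch in Java. The class name GraphFileSketch and its methods are hypothetical and are not part of Arachnid or Nutch; the sketch only assumes what the method above shows, namely that a node ID is the hash code of the URL path and that lines end with "\r\n". Since an int hash code can collide for different paths, the IDs are not guaranteed to be unique.

```java
import java.util.Arrays;
import java.util.List;

public class GraphFileSketch {

    /** Formats one Node{...} block in the layout shown above. */
    static String formatNode(int id, String title, String url, int requests) {
        return "Node{\r\n"
                + "ID=" + id + "\r\n"
                + "Title=" + title + "\r\n"
                + "URL=" + url + "\r\n"
                + "Number of Request=" + requests + "\r\n"
                + "}\r\n";
    }

    /** Formats one Edge{...} block from a source node ID to a target node ID. */
    static String formatEdge(int fromId, int toId) {
        return "Edge{\r\n"
                + "Node1=" + fromId + "\r\n"
                + "Node2=" + toId + "\r\n"
                + "}\r\n";
    }

    /**
     * Formats a page and its outgoing internal links.
     * Assumption: IDs are derived from the URL path via String.hashCode(),
     * as in the handleLink method above.
     */
    static String formatPage(String title, String url, String path,
                             int requests, List<String> targetPaths) {
        int id = path.hashCode();
        StringBuilder sb = new StringBuilder(formatNode(id, title, url, requests));
        for (String target : targetPaths) {
            sb.append(formatEdge(id, target.hashCode()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample page with two internal links.
        System.out.print(formatPage(
                "Example page",
                "http://www.example.org/index.html",
                "/index.html",
                0,
                Arrays.asList("/a.html", "/b.html")));
    }
}
```

Any data source that can list pages (title, URL) and their outgoing links could feed formatPage in the same way.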
Now my question:
How can this be done with the help of the Nutch WebDB?
Can I query for all nodes (sites) with ID, title, URL, and number of requests,
and for all edges (links) with the parent and target node IDs?
Thank you for your help,
Nils