Posted to user@nutch.apache.org by JohnRodey <ti...@yahoo.com> on 2012/03/25 18:39:53 UTC

Out-of-the-box Nutch indexing url source to Solr

I am doing a simple project for my Information Retrieval class.  I am
currently using Nutch to fetch a set of pages, and it is indexing and storing
the parsed pages in Solr.  What I really want is for it to store the page
source, with HTML tags, as well.  Is there an easy way to tell Nutch to do
that?

If not, once my pages are indexed, what command would I use to retrieve their
original source from Nutch?

--
View this message in context: http://lucene.472066.n3.nabble.com/Out-of-the-box-Nutch-indexing-url-source-to-Solr-tp3855918p3855918.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Out-of-the-box Nutch indexing url source to Solr

Posted by JohnRodey <ti...@yahoo.com>.

Thanks for the help!
Do you happen to know why this is failing?

/cygdrive/c/Users/me/Downloads/nutch/apache-nutch-1.4-bin/apache-nutch-1.4-bin/runtime/local
$ bin/nutch readseg -dump crawl/segments/20120325130007 http://www.imdb.com/title/tt1231460/fullcredits
cygpath: can't convert empty path
SegmentReader: dump segment: crawl/segments/20120325130007
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: http://www.imdb.com/title/tt1231460/fullcredits/dump, expected: file:///
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
        at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357)
        at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:430)
        at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:231)
        at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:564)



Re: Out-of-the-box Nutch indexing url source to Solr

Posted by remi tassing <ta...@gmail.com>.

Hey,

Try the command "bin/nutch readseg -dump"[1][2].
It reads a segment (or multiple segments) and outputs their content,
including outlinks, HTML content, and parsed content.

I hope it helps!

Remi

[1]:
http://www.marco.bianchi.name/myPortal/using-the-binnutch-readseg-command.aspx
[2]:  http://wiki.apache.org/nutch/bin/nutch_readseg
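
For reference, a minimal sketch of the -dump form suggested above. Per the
readseg usage text, -dump takes the segment directory followed by an output
directory, and the dump is written to a file named "dump" inside it. The
segment name is taken from this thread; the output directory name "segdump"
is an assumption chosen for illustration. The -no* switches are the general
options readseg lists for skipping parts of a segment, used here so that
only the stored raw content (the fetched HTML) is written.

```shell
#!/bin/sh
# Segment directory as it appears in this thread.
SEGMENT="crawl/segments/20120325130007"
# Hypothetical output directory (assumption); readseg writes a file named "dump" inside it.
OUT_DIR="segdump"

# Dump only the raw content by switching off every other segment part.
dump_cmd="bin/nutch readseg -dump $SEGMENT $OUT_DIR -nofetch -nogenerate -noparse -noparsedata -noparsetext"

# Always show the command line; run it only when a Nutch install is present here.
echo "$dump_cmd"
if [ -x bin/nutch ]; then
    # $dump_cmd is left unquoted on purpose so the shell splits it into arguments.
    $dump_cmd && head -n 40 "$OUT_DIR/dump"
fi
```

Run it from runtime/local so that bin/nutch resolves. Dropping the -no*
switches should give the full dump (fetch data, parse data, parse text)
rather than the raw content alone.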


Re: Out-of-the-box Nutch indexing url source to Solr

Posted by JohnRodey <ti...@yahoo.com>.

Thanks Markus.

Oh, and the problem I was having was that I wanted -get instead of -dump.
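
To spell out the difference, as a sketch rather than a definitive recipe:
-dump treats its second argument as an output directory, which appears to be
why the URL ended up in the "Wrong FS: .../fullcredits/dump" message, whereas
-get takes the segment directory followed by a record key (the page URL) and
prints that one record to standard output. The segment and URL are the ones
from this thread; the file name fullcredits.txt is an assumption chosen for
illustration.

```shell
#!/bin/sh
# Segment directory and page URL as they appear in this thread.
SEGMENT="crawl/segments/20120325130007"
URL="http://www.imdb.com/title/tt1231460/fullcredits"
# Hypothetical file to capture the record in (assumption).
OUT_FILE="fullcredits.txt"

# -get <segment_dir> <key>: look up a single record by its URL.
get_cmd="bin/nutch readseg -get $SEGMENT $URL"

# Always show the command line; run it only when a Nutch install is present here.
echo "$get_cmd"
if [ -x bin/nutch ]; then
    # The record is printed to standard output, so redirect it to keep a copy.
    # $get_cmd is left unquoted on purpose so the shell splits it into arguments.
    $get_cmd > "$OUT_FILE"
fi
```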


Re: Out-of-the-box Nutch indexing url source to Solr

Posted by Markus Jelsma <ma...@openindex.io>.

You can also write a parse filter that copies the raw page structure to
another field, and then have that field indexed later by an index filter.


-- 
Markus Jelsma - CTO - Openindex