Posted to user@nutch.apache.org by peterbarretto <pe...@gmail.com> on 2013/01/29 14:46:04 UTC

Re: How to get page content of crawled pages

Hi

Is there a way I can dump the url and url content in mongodb?


Klemens Muthmann wrote
> Hi,
> 
> Super. That works. Thank you. I thereby also found the class that shows 
> how to achieve this within Java code, which is 
> org.apache.nutch.segment.SegmentReader.
> 
> Thanks again and bye
>      Klemens
> 
> On 22.11.2010 10:49, Hannes Carl Meyer wrote:
>> Hi Klemens,
>>
>> you should run ./bin/nutch readseg!
>>
>> For example: ./bin/nutch readseg -dump crawl/segments/XXX/ dump_folder
>> -nofetch -nogenerate -noparse -noparsedata -noparsetext
>>
>> Kind Regards from Hannover
>>
>> Hannes
>>
>> On Mon, Nov 22, 2010 at 9:23 AM, Klemens Muthmann <klemens.muthmann@...> wrote:
>>
>>> Hi,
>>>
>>> I did a small crawl of some pages on the web and want to get the raw
>>> HTML
>>> content of these pages now. Reading the documentation in the wiki I
>>> guess
>>> this content might be somewhere under
>>> crawl/segments/20101122071139/content/part-00000.
>>>
>>> I also guess I can access this content using the Hadoop API like
>>> described
>>> here: http://wiki.apache.org/nutch/Getting_Started
>>>
>>> However I have absolutely no idea how to configure:
>>>
>>> MapFile.Reader reader = new MapFile.Reader (fs, seqFile, conf);
>>>
>>>
>>> The Hadoop documentation is not very helpful either. May someone please
>>> point me in the right direction to get the page content?
>>>
>>> Thank you and regards
>>>     Klemens Muthmann
>>>
> 
> 
> -- 
> --------------------------------
> Dipl.-Medieninf., Klemens Muthmann
> Wissenschaftlicher Mitarbeiter
> 
> Technische Universität Dresden
> Fakultät Informatik
> Institut für Systemarchitektur
> Lehrstuhl Rechnernetze
> 01062 Dresden
> Tel.: +49 (351) 463-38214
> Fax: +49 (351) 463-38251
> E-Mail: klemens.muthmann@...
> --------------------------------





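For the Java route mentioned in the quoted mails, org.apache.nutch.segment.SegmentReader
shows the pattern, and the MapFile.Reader question comes down to pointing the
reader at one partition of a segment's content directory. A minimal sketch,
assuming a Nutch 1.x segment on the local filesystem and the Hadoop 1.x API
(the path handling and printing are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.protocol.Content;

    public class SegmentContentReader {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // args[0] is one partition of a segment's content directory, e.g.
        // crawl/segments/20101122071139/content/part-00000
        MapFile.Reader reader = new MapFile.Reader(fs, args[0], conf);
        Text url = new Text();
        Content content = new Content();
        while (reader.next(url, content)) {
          // The key is the page URL; the value holds the raw fetched bytes.
          System.out.println(url + "\t" + new String(content.getContent()));
        }
        reader.close();
      }
    }

Run it with the Nutch and Hadoop jars on the classpath.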

Re: How to get page content of crawled pages

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Peter,
The patch attached to the issue is for trunk.
If you were able to make a patch for 2.x and upload it to the issue that
would be great. There are API differences, so I can tell you that even
though the mongodb indexer classes have been applied, it will most likely
be a fruitless effort.
Please see Julien's recent work on the pluggable indexing architecture for
Nutch trunk. It generally represents better software design for your
indexing requirements.
Thanks
Lewis

On Tuesday, April 2, 2013, peterbarretto <pe...@gmail.com> wrote:
> Hi Lewis,
>
> I tried applying the patch on 2.1 but it gives the below error:
> patching file pom.xml
> patching file ivy/ivy.xml
> Hunk #1 succeeded at 34 with fuzz 2 (offset 4 lines).
> patching file src/bin/nutch
> Hunk #1 FAILED at 61.
> Hunk #2 succeeded at 220 with fuzz 2 (offset 2 lines).
> 1 out of 2 hunks FAILED -- saving rejects to file src/bin/nutch.rej
> patching file src/java/org/apache/nutch/indexer/mongodb/MongoDbWriter.java
> patching file src/java/org/apache/nutch/indexer/mongodb/MongoDbConstants.java
> patching file src/java/org/apache/nutch/indexer/mongodb/MongoDbIndexer.java
>
>
>
> --
> View this message in context:
http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4053146.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

-- 
*Lewis*

Re: How to get page content of crawled pages

Posted by peterbarretto <pe...@gmail.com>.
Hi Lewis,

I tried applying the patch on 2.1 but it gives the below error:
patching file pom.xml
patching file ivy/ivy.xml
Hunk #1 succeeded at 34 with fuzz 2 (offset 4 lines).
patching file src/bin/nutch
Hunk #1 FAILED at 61.
Hunk #2 succeeded at 220 with fuzz 2 (offset 2 lines).
1 out of 2 hunks FAILED -- saving rejects to file src/bin/nutch.rej
patching file src/java/org/apache/nutch/indexer/mongodb/MongoDbWriter.java
patching file src/java/org/apache/nutch/indexer/mongodb/MongoDbConstants.java
patching file src/java/org/apache/nutch/indexer/mongodb/MongoDbIndexer.java
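
When a hunk fails like this, patch saves the rejected fragment to the listed
.rej file (src/bin/nutch.rej here), and that piece has to be merged by hand;
the patch was cut against trunk, so the 2.1 sources have drifted. A dry run
shows what will apply before anything is touched, along these lines (the
patch filename is illustrative):

    cd apache-nutch-2.1
    patch -p0 --dry-run < NUTCH-1528.patch
    patch -p0 < NUTCH-1528.patch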




Re: How to get page content of crawled pages

Posted by peterbarretto <pe...@gmail.com>.
Hi Lewis,

I have never used a patch before, but after searching a bit I managed to
apply the patch in cygwin (I had to reinstall cygwin with the patch tool, as
the patch command was not present in the previous install).

I applied the patch, skipping the pom.xml file, and it worked.
I can now copy all the crawled urls to mongodb.

I can get the html content of crawled urls from the readseg -dump command in
nutch 1.6, so I guess it should be possible to get the full html along with
just the text part?
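
For single pages, readseg also has a -get mode that prints one record
instead of dumping the whole segment, along these lines (the URL is
illustrative):

    ./bin/nutch readseg -get crawl/segments/20090903121951 http://example.com/ \
      -nofetch -nogenerate -noparse -noparsedata -noparsetext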




lewis john mcgibbney wrote
> Hi Peter
> 
> On Saturday, February 16, 2013, peterbarretto <peterbarretto08@gmail.com> wrote:
>> Where do I make the pom.xml changes? I can't find the file.
> 
> What are you talking about? I made a patch which pulls everything for you.
> There should be no changes required.
> 
>> I havent built the patch changes as i cant find pom.xml file.
> 
> The maven project file is in the root project. We do not build nutch with
> Maven. Currently for development we use ant tasks and ivy for
> dependencies.
> 
>>
>>
>> lewis john mcgibbney wrote
>>> https://issues.apache.org/jira/browse/NUTCH-1528
>>>
>>> This is the mongodb indexer patch ported to trunk.
>>>
>>> Can I mention that there is usually no timeline on these things, e.g.
>>> feature requests.
>>> I'm sure you can appreciate that we are all extremely busy at work with
>>> an
>>> array of other things, so if it takes a bit of time, then that's OK. The
>>> world goes on and keeps spinning. Even if we are getting bombarded by
>>> meteorites in Russia!!!
>>>
>>> Please check the patch out and comment accordingly.
>>>
>>> Regarding your issue with the full page content, I am not sure
>>> if this is currently available in Nutch trunk without you writing some
>>> code.
>>> Full html markup is certainly stored in 2.x... but I don't know whether
>>> you
>>> are prepared to move to 2.x for your operations?
>>>
>>> hth
>>> Lewis
>>>
>>> On Fri, Feb 15, 2013 at 1:58 AM, peterbarretto <peterbarretto08@gmail.com> wrote:
>>>
>>>> Hi Lewis,
>>>>
>>>> Is this patch done??
>>>>
>>>>
>>>> lewis john mcgibbney wrote
>>>> > Hi,
>>>> > Once I get access to my office I am going to build the patches from
>>>> trunk.
>>>> > Is it trunk that you are using?
>>>> > Thanks
>>>> > Lewis
>>>> >
>>>> > On Fri, Feb 8, 2013 at 9:00 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
>>>> >
>>>> >> Hi Lewis,
>>>> >>
>>>> >> I managed to get the code working by adding the below function to
>>>> >> MongodbWriter.java in the public class MongodbWriter  implements
>>>> >> NutchIndexWriter :-
>>>> >>
>>>> >>          public void delete(String key) throws IOException{
>>>> >>                 return;
>>>> >>         }
>>>> >>
>>>> >> And the crawled data was getting stored in mongodb.
>>>> >> The only issue was it was storing only the text of the page and not
>>>> the
>>>> >> full
>>>> >> html content of the page.
>>>> >> How do i store the full html content of the page also?
>>>> >> Hope to see the patches soon.
>>>> >> Thanks
>>>> >>
>>>> >>
>>>> >>
>>>> >> lewis john mcgibbney wrote
>>>> >> > Certainly.
>>>> >> > I am currently reviewing the code and will hopefully have patches
>>>> for
>>>> >> > Nutch trunk cooked up for tomorrow.
>>>> >> > I'll update this thread likewise.
>>>> >> > Thanks
>>>> >> > Lewis
>>>> >> >
>>>> >> > On Wed, Jan 30, 2013 at 10:02 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
>>>> >> >> Hi Lewis,
>>>> >> >>
>>>> >> >> I am new to java and i dont know how to inherit all public
>>>> methods
>>>> >> from
>>>> >> >> NutchIndexWriter
>>>> >> >> Can you help me with that? Then i can rebuild and check if it
>>>> works.
>>>> >> >>
>>>> >> >>
>>>> >> >> lewis john mcgibbney wrote
>>>> >> >>> As you will see the code has not been amended in a year or so.
>>>> >> >>> The positive side is that you only seem to be getting one issue
>>>> with
>>>> >> >>> javac
>>>> >> >>>
>>>> >> >>> On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
>>>> >> >>>
>>>> >> >>>>
>>>> >> >>>>
>>>> >> >>>>
>>>> >>
>>>>
> C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18:
>>>> >> >>>> error: MongodbWriter is not abstract and does not override
>>>> abstract
>>>> >> >>>> method
>>>> >> >>>> delete(String) in NutchIndexWriter
>>>> >> >>>>     [javac] public class MongodbWriter  implements
>>>> NutchIndexWriter{
>>>> >> >>>>
>>>> >> >>>> Sort this error out by inheriting all public methods from
>>>> >> >>>> NutchIndexWriter
>>>> >> >>> for starts. I take it you are not developing from within
>>>> Eclipse?
>>>> As
>>>> >> >>> this
>>>> >> >>> would have been flagged up immediately. This should at least
>>>> enable
>>>> >> you
>>>> >> >>> to
>>>> >> >>> compile the code.
>>>> >> >>>
>>>> >> >>>
>>>> >> >>>>
>>>> >> >>>> I have already crawled some urls now and i need to move those
>>>> to
>>>> >> >>>> mongodb.
>>>> >> >>>> Is
>> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4040944.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
> 
> -- 
> *Lewis*






Re: How to get page content of crawled pages

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Peter

On Saturday, February 16, 2013, peterbarretto <pe...@gmail.com> wrote:
> Where do I make the pom.xml changes? I can't find the file.

What are you talking about? I made a patch which pulls everything for you.
There should be no changes required.

> I haven't built the patch changes as I can't find the pom.xml file.

The maven project file is in the root project. We do not build nutch with
Maven. Currently for development we use ant tasks and ivy for dependencies.
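
For completeness, building a 1.x source checkout normally just takes the
default ant targets from the project root, e.g. (directory name
illustrative):

    cd apache-nutch-1.6
    ant clean runtime

ivy resolves the dependencies during the build and a runnable install ends
up under runtime/local.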

>
>
> lewis john mcgibbney wrote
>> https://issues.apache.org/jira/browse/NUTCH-1528
>>
>> This is the mongodb indexer patch ported to trunk.
>>
>> Can I mention that there is usually no timeline on these things, e.g.
>> feature requests.
>> I'm sure you can appreciate that we are all extremely busy at work with
>> an
>> array of other things, so if it takes a bit of time, then that's OK. The
>> world goes on and keeps spinning. Even if we are getting bombarded by
>> meteorites in Russia!!!
>>
>> Please check the patch out and comment accordingly.
>>
>> Regarding your issue with the full page content, I am not sure
>> if this is currently available in Nutch trunk without you writing some
>> code.
>> Full html markup is certainly stored in 2.x... but I don't know whether
>> you
>> are prepared to move to 2.x for your operations?
>>
>> hth
>> Lewis
>>
>> On Fri, Feb 15, 2013 at 1:58 AM, peterbarretto <peterbarretto08@gmail.com> wrote:
>>
>>> Hi Lewis,
>>>
>>> Is this patch done??
>>>
>>>
>>> lewis john mcgibbney wrote
>>> > Hi,
>>> > Once I get access to my office I am going to build the patches from
>>> trunk.
>>> > Is it trunk that you are using?
>>> > Thanks
>>> > Lewis
>>> >
>>> > On Fri, Feb 8, 2013 at 9:00 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
>>> >
>>> >> Hi Lewis,
>>> >>
>>> >> I managed to get the code working by adding the below function to
>>> >> MongodbWriter.java in the public class MongodbWriter  implements
>>> >> NutchIndexWriter :-
>>> >>
>>> >>          public void delete(String key) throws IOException{
>>> >>                 return;
>>> >>         }
>>> >>
>>> >> And the crawled data was getting stored in mongodb.
>>> >> The only issue was it was storing only the text of the page and not
>>> the
>>> >> full
>>> >> html content of the page.
>>> >> How do i store the full html content of the page also?
>>> >> Hope to see the patches soon.
>>> >> Thanks
>>> >>
>>> >>
>>> >>
>>> >> lewis john mcgibbney wrote
>>> >> > Certainly.
>>> >> > I am currently reviewing the code and will hopefully have patches
>>> for
>>> >> > Nutch trunk cooked up for tomorrow.
>>> >> > I'll update this thread likewise.
>>> >> > Thanks
>>> >> > Lewis
>>> >> >
>>> >> > On Wed, Jan 30, 2013 at 10:02 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
>>> >> >> Hi Lewis,
>>> >> >>
>>> >> >> I am new to java and i dont know how to inherit all public methods
>>> >> from
>>> >> >> NutchIndexWriter
>>> >> >> Can you help me with that? Then i can rebuild and check if it
>>> works.
>>> >> >>
>>> >> >>
>>> >> >> lewis john mcgibbney wrote
>>> >> >>> As you will see the code has not been amended in a year or so.
>>> >> >>> The positive side is that you only seem to be getting one issue
>>> with
>>> >> >>> javac
>>> >> >>>
>>> >> >>> On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
>>> >> >>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >>
>>>
C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18:
>>> >> >>>> error: MongodbWriter is not abstract and does not override
>>> abstract
>>> >> >>>> method
>>> >> >>>> delete(String) in NutchIndexWriter
>>> >> >>>>     [javac] public class MongodbWriter  implements
>>> NutchIndexWriter{
>>> >> >>>>
>>> >> >>>> Sort this error out by inheriting all public methods from
>>> >> >>>> NutchIndexWriter
>>> >> >>> for starts. I take it you are not developing from within Eclipse?
>>> As
>>> >> >>> this
>>> >> >>> would have been flagged up immediately. This should at least
>>> enable
>>> >> you
>>> >> >>> to
>>> >> >>> compile the code.
>>> >> >>>
>>> >> >>>
>>> >> >>>>
>>> >> >>>> I have already crawled some urls now and i need to move those to
>>> >> >>>> mongodb.
>>> >> >>>> Is
> View this message in context:
http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4040944.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

-- 
*Lewis*

Re: How to get page content of crawled pages

Posted by peterbarretto <pe...@gmail.com>.
Thanks for the patch Lewis.

Where do I make the pom.xml changes? I can't find the file.

Also, in 1.6 the below command returns the html content:
./bin/nutch readseg -dump crawl/segments/20090903121951 toto -nofetch
-nogenerate -noparse -noparsedata -noparsetext

I haven't built the patch changes as I can't find the pom.xml file.


 

lewis john mcgibbney wrote
> https://issues.apache.org/jira/browse/NUTCH-1528
> 
> This is the mongodb indexer patch ported to trunk.
> 
> Can I mention that there is usually no timeline on these things, e.g.
> feature requests.
> I'm sure you can appreciate that we are all extremely busy at work with an
> array of other things, so if it takes a bit of time, then that's OK. The
> world goes on and keeps spinning. Even if we are getting bombarded by
> meteorites in Russia!!!
> 
> Please check the patch out and comment accordingly.
> 
> Regarding your issue with the full page content, I am not sure
> if this is currently available in Nutch trunk without you writing some
> code.
> Full html markup is certainly stored in 2.x... but I don't know whether
> you
> are prepared to move to 2.x for your operations?
> 
> hth
> Lewis
> 
> On Fri, Feb 15, 2013 at 1:58 AM, peterbarretto <peterbarretto08@gmail.com> wrote:
> 
>> Hi Lewis,
>>
>> Is this patch done??
>>
>>
>> lewis john mcgibbney wrote
>> > Hi,
>> > Once I get access to my office I am going to build the patches from
>> trunk.
>> > Is it trunk that you are using?
>> > Thanks
>> > Lewis
>> >
>> > On Fri, Feb 8, 2013 at 9:00 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
>> >
>> >> Hi Lewis,
>> >>
>> >> I managed to get the code working by adding the below function to
>> >> MongodbWriter.java in the public class MongodbWriter  implements
>> >> NutchIndexWriter :-
>> >>
>> >>          public void delete(String key) throws IOException{
>> >>                 return;
>> >>         }
>> >>
>> >> And the crawled data was getting stored in mongodb.
>> >> The only issue was it was storing only the text of the page and not
>> the
>> >> full
>> >> html content of the page.
>> >> How do i store the full html content of the page also?
>> >> Hope to see the patches soon.
>> >> Thanks
>> >>
>> >>
>> >>
>> >> lewis john mcgibbney wrote
>> >> > Certainly.
>> >> > I am currently reviewing the code and will hopefully have patches
>> for
>> >> > Nutch trunk cooked up for tomorrow.
>> >> > I'll update this thread likewise.
>> >> > Thanks
>> >> > Lewis
>> >> >
>> >> > On Wed, Jan 30, 2013 at 10:02 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
>> >> >> Hi Lewis,
>> >> >>
>> >> >> I am new to java and i dont know how to inherit all public methods
>> >> from
>> >> >> NutchIndexWriter
>> >> >> Can you help me with that? Then i can rebuild and check if it
>> works.
>> >> >>
>> >> >>
>> >> >> lewis john mcgibbney wrote
>> >> >>> As you will see the code has not been amended in a year or so.
>> >> >>> The positive side is that you only seem to be getting one issue
>> with
>> >> >>> javac
>> >> >>>
>> >> >>> On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
>> >> >>>
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >>
>> C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18:
>> >> >>>> error: MongodbWriter is not abstract and does not override
>> abstract
>> >> >>>> method
>> >> >>>> delete(String) in NutchIndexWriter
>> >> >>>>     [javac] public class MongodbWriter  implements
>> NutchIndexWriter{
>> >> >>>>
>> >> >>>> Sort this error out by inheriting all public methods from
>> >> >>>> NutchIndexWriter
>> >> >>> for starts. I take it you are not developing from within Eclipse?
>> As
>> >> >>> this
>> >> >>> would have been flagged up immediately. This should at least
>> enable
>> >> you
>> >> >>> to
>> >> >>> compile the code.
>> >> >>>
>> >> >>>
>> >> >>>>
>> >> >>>> I have already crawled some urls now and i need to move those to
>> >> >>>> mongodb.
>> >> >>>> Is
>> >> >>>> there a easy to use code to do that?
>> >> >>>
>> >> >>>
>> >> >>> Not apart from hacking the code as you are already doing. The code
>> >> you
>> >> >>> are
>> >> >>> pulling is not part of the official nutch codebase and to be
>> honest
>> a
>> >> >>> few
>> >> >>> of us didn't even know about it until you brought it to our
>> attention
>> >> >>> :0)
>> >> >>>
>> >> >>> There is no silver bullet here, just take your time and we will
>> get
>> >> it
>> >> >>> working.
>> >> >>> Lewis
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> View this message in context:
>> >> >>
>> >>
>> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037621.html
>> >> >> Sent from the Nutch - User mailing list archive at Nabble.com.
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Lewis
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> View this message in context:
>> >>
>> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4039401.html
>> >> Sent from the Nutch - User mailing list archive at Nabble.com.
>> >>
>> >
>> >
>> >
>> > --
>> > *Lewis*
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4040596.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
> 
> 
> 
> -- 
> *Lewis*






Re: How to get page content of crawled pages

Posted by Lewis John Mcgibbney <le...@gmail.com>.
https://issues.apache.org/jira/browse/NUTCH-1528

This is the mongodb indexer patch ported to trunk.

Can I mention that there is usually no timeline on these things, e.g.
feature requests.
I'm sure you can appreciate that we are all extremely busy at work with an
array of other things, so if it takes a bit of time, then that's OK. The
world goes on and keeps spinning. Even if we are getting bombarded by
meteorites in Russia!!!

Please check the patch out and comment accordingly.

Regarding your issue with the full page content, I am not sure
if this is currently available in Nutch trunk without you writing some
code.
Full html markup is certainly stored in 2.x... but I don't know whether you
are prepared to move to 2.x for your operations?

hth
Lewis

On Fri, Feb 15, 2013 at 1:58 AM, peterbarretto <pe...@gmail.com> wrote:

> Hi Lewis,
>
> Is this patch done??
>
>
> lewis john mcgibbney wrote
> > Hi,
> > Once I get access to my office I am going to build the patches from
> trunk.
> > Is it trunk that you are using?
> > Thanks
> > Lewis
> >
> > On Fri, Feb 8, 2013 at 9:00 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
> >
> >> Hi Lewis,
> >>
> >> I managed to get the code working by adding the below function to
> >> MongodbWriter.java in the public class MongodbWriter  implements
> >> NutchIndexWriter :-
> >>
> >>          public void delete(String key) throws IOException{
> >>                 return;
> >>         }
> >>
> >> And the crawled data was getting stored in mongodb.
> >> The only issue was it was storing only the text of the page and not the
> >> full
> >> html content of the page.
> >> How do i store the full html content of the page also?
> >> Hope to see the patches soon.
> >> Thanks
> >>
> >>
> >>
> >> lewis john mcgibbney wrote
> >> > Certainly.
> >> > I am currently reviewing the code and will hopefully have patches for
> >> > Nutch trunk cooked up for tomorrow.
> >> > I'll update this thread likewise.
> >> > Thanks
> >> > Lewis
> >> >
> >> > On Wed, Jan 30, 2013 at 10:02 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
> >> >> Hi Lewis,
> >> >>
> >> >> I am new to java and i dont know how to inherit all public methods
> >> from
> >> >> NutchIndexWriter
> >> >> Can you help me with that? Then i can rebuild and check if it works.
> >> >>
> >> >>
> >> >> lewis john mcgibbney wrote
> >> >>> As you will see the code has not been amended in a year or so.
> >> >>> The positive side is that you only seem to be getting one issue with
> >> >>> javac
> >> >>>
> >> >>> On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
> >> >>>
> >> >>>>
> >> >>>>
> >> >>>>
> >>
> C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18:
> >> >>>> error: MongodbWriter is not abstract and does not override abstract
> >> >>>> method
> >> >>>> delete(String) in NutchIndexWriter
> >> >>>>     [javac] public class MongodbWriter  implements
> NutchIndexWriter{
> >> >>>>
> >> >>>> Sort this error out by inheriting all public methods from
> >> >>>> NutchIndexWriter
> >> >>> for starts. I take it you are not developing from within Eclipse? As
> >> >>> this
> >> >>> would have been flagged up immediately. This should at least enable
> >> you
> >> >>> to
> >> >>> compile the code.
> >> >>>
> >> >>>
> >> >>>>
> >> >>>> I have already crawled some urls now and i need to move those to
> >> >>>> mongodb.
> >> >>>> Is
> >> >>>> there a easy to use code to do that?
> >> >>>
> >> >>>
> >> >>> Not apart from hacking the code as you are already doing. The code
> >> you
> >> >>> are
> >> >>> pulling is not part of the official nutch codebase and to be honest
> a
> >> >>> few
> >> >>> of us didn't even know about it until you brought it to our
> attention
> >> >>> :0)
> >> >>>
> >> >>> There is no silver bullet here, just take your time and we will get
> >> it
> >> >>> working.
> >> >>> Lewis
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> View this message in context:
> >> >>
> >>
> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037621.html
> >> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >> >
> >> >
> >> >
> >> > --
> >> > Lewis
> >>
> >>
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4039401.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >>
> >
> >
> >
> > --
> > *Lewis*
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4040596.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
*Lewis*

Re: How to get page content of crawled pages

Posted by peterbarretto <pe...@gmail.com>.
Hi Lewis,

Is this patch done??


lewis john mcgibbney wrote
> Hi,
> Once I get access to my office I am going to build the patches from trunk.
> Is it trunk that you are using?
> Thanks
> Lewis
> 
> On Fri, Feb 8, 2013 at 9:00 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
> 
>> Hi Lewis,
>>
>> I managed to get the code working by adding the below function to
>> MongodbWriter.java in the public class MongodbWriter  implements
>> NutchIndexWriter :-
>>
>>          public void delete(String key) throws IOException{
>>                 return;
>>         }
>>
>> And the crawled data was getting stored in mongodb.
>> The only issue was it was storing only the text of the page and not the
>> full
>> html content of the page.
>> How do i store the full html content of the page also?
>> Hope to see the patches soon.
>> Thanks
>>
>>
>>
>> lewis john mcgibbney wrote
>> > Certainly.
>> > I am currently reviewing the code and will hopefully have patches for
>> > Nutch trunk cooked up for tomorrow.
>> > I'll update this thread likewise.
>> > Thanks
>> > Lewis
>> >
>> > On Wed, Jan 30, 2013 at 10:02 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
>> >> Hi Lewis,
>> >>
>> >> I am new to java and i dont know how to inherit all public methods
>> from
>> >> NutchIndexWriter
>> >> Can you help me with that? Then i can rebuild and check if it works.
>> >>
>> >>
>> >> lewis john mcgibbney wrote
>> >>> As you will see the code has not been amended in a year or so.
>> >>> The positive side is that you only seem to be getting one issue with
>> >>> javac
>> >>>
>> >>> On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
>> >>>
>> >>>>
>> >>>>
>> >>>>
>> C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18:
>> >>>> error: MongodbWriter is not abstract and does not override abstract
>> >>>> method
>> >>>> delete(String) in NutchIndexWriter
>> >>>>     [javac] public class MongodbWriter  implements NutchIndexWriter{
>> >>>>
>> >>>> Sort this error out by inheriting all public methods from
>> >>>> NutchIndexWriter
>> >>> for starts. I take it you are not developing from within Eclipse? As
>> >>> this
>> >>> would have been flagged up immediately. This should at least enable
>> you
>> >>> to
>> >>> compile the code.
>> >>>
>> >>>
>> >>>>
>> >>>> I have already crawled some urls now and i need to move those to
>> >>>> mongodb.
>> >>>> Is
>> >>>> there a easy to use code to do that?
>> >>>
>> >>>
>> >>> Not apart from hacking the code as you are already doing. The code
>> you
>> >>> are
>> >>> pulling is not part of the official nutch codebase and to be honest a
>> >>> few
>> >>> of us didn't even know about it until you brought it to our attention
>> >>> :0)
>> >>>
>> >>> There is no silver bullet here, just take your time and we will get
>> it
>> >>> working.
>> >>> Lewis
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> View this message in context:
>> >>
>> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037621.html
>> >> Sent from the Nutch - User mailing list archive at Nabble.com.
>> >
>> >
>> >
>> > --
>> > Lewis
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4039401.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
> 
> 
> 
> -- 
> *Lewis*






Re: How to get page content of crawled pages

Posted by peterbarretto <pe...@gmail.com>.
Hi Lewis,

I downloaded the nutch copy from
http://apache.techartifact.com/mirror/nutch/1.6/


lewis john mcgibbney wrote
> Hi,
> Once I get access to my office I am going to build the patches from trunk.
> Is it trunk that you are using?
> Thanks
> Lewis
> 
> On Fri, Feb 8, 2013 at 9:00 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
> 
>> Hi Lewis,
>>
>> I managed to get the code working by adding the below function to
>> MongodbWriter.java in the public class MongodbWriter  implements
>> NutchIndexWriter :-
>>
>>          public void delete(String key) throws IOException{
>>                 return;
>>         }
>>
>> And the crawled data was getting stored in mongodb.
>> The only issue was it was storing only the text of the page and not the
>> full
>> html content of the page.
>> How do i store the full html content of the page also?
>> Hope to see the patches soon.
>> Thanks
>>
>>
>>
>> lewis john mcgibbney wrote
>> > Certainly.
>> > I am currently reviewing the code and will hopefully have patches for
>> > Nutch trunk cooked up for tomorrow.
>> > I'll update this thread likewise.
>> > Thanks
>> > Lewis
>> >
>> > On Wed, Jan 30, 2013 at 10:02 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
>> >> Hi Lewis,
>> >>
>> >> I am new to java and i dont know how to inherit all public methods
>> from
>> >> NutchIndexWriter
>> >> Can you help me with that? Then i can rebuild and check if it works.
>> >>
>> >>
>> >> lewis john mcgibbney wrote
>> >>> As you will see the code has not been amended in a year or so.
>> >>> The positive side is that you only seem to be getting one issue with
>> >>> javac
>> >>>
>> >>> On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
>> >>>
>> >>>>
>> >>>>
>> >>>>
>> C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18:
>> >>>> error: MongodbWriter is not abstract and does not override abstract
>> >>>> method
>> >>>> delete(String) in NutchIndexWriter
>> >>>>     [javac] public class MongodbWriter  implements NutchIndexWriter{
>> >>>>
>> >>>> Sort this error out by inheriting all public methods from
>> >>>> NutchIndexWriter
>> >>> for starts. I take it you are not developing from within Eclipse? As
>> >>> this
>> >>> would have been flagged up immediately. This should at least enable
>> you
>> >>> to
>> >>> compile the code.
>> >>>
>> >>>
>> >>>>
>> >>>> I have already crawled some urls now and i need to move those to
>> >>>> mongodb.
>> >>>> Is
>> >>>> there a easy to use code to do that?
>> >>>
>> >>>
>> >>> Not apart from hacking the code as you are already doing. The code
>> you
>> >>> are
>> >>> pulling is not part of the official nutch codebase and to be honest a
>> >>> few
>> >>> of us didn't even know about it until you brought it to our attention
>> >>> :0)
>> >>>
>> >>> There is no silver bullet here, just take your time and we will get
>> it
>> >>> working.
>> >>> Lewis
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> View this message in context:
>> >>
>> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037621.html
>> >> Sent from the Nutch - User mailing list archive at Nabble.com.
>> >
>> >
>> >
>> > --
>> > Lewis
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4039401.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
> 
> 
> 
> -- 
> *Lewis*






Re: How to get page content of crawled pages

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,
Once I get access to my office I am going to build the patches from trunk.
Is it trunk that you are using?
Thanks
Lewis

On Fri, Feb 8, 2013 at 9:00 PM, peterbarretto <pe...@gmail.com> wrote:

> Hi Lewis,
>
> I managed to get the code working by adding the below function to
> MongodbWriter.java in the public class MongodbWriter  implements
> NutchIndexWriter :-
>
>          public void delete(String key) throws IOException{
>                 return;
>         }
>
> And the crawled data was getting stored in mongodb.
> The only issue was it was storing only the text of the page and not the
> full
> html content of the page.
> How do i store the full html content of the page also?
> Hope to see the patches soon.
> Thanks
>
>
>
> lewis john mcgibbney wrote
> > Certainly.
> > I am currently reviewing the code and will hopefully have patches for
> > Nutch trunk cooked up for tomorrow.
> > I'll update this thread likewise.
> > Thanks
> > Lewis
> >
> > On Wed, Jan 30, 2013 at 10:02 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
> >> Hi Lewis,
> >>
> >> I am new to java and i dont know how to inherit all public methods from
> >> NutchIndexWriter
> >> Can you help me with that? Then i can rebuild and check if it works.
> >>
> >>
> >> lewis john mcgibbney wrote
> >>> As you will see the code has not been amended in a year or so.
> >>> The positive side is that you only seem to be getting one issue with
> >>> javac
> >>>
> >>> On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
> >>>
> >>>>
> >>>>
> >>>>
> C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18:
> >>>> error: MongodbWriter is not abstract and does not override abstract
> >>>> method
> >>>> delete(String) in NutchIndexWriter
> >>>>     [javac] public class MongodbWriter  implements NutchIndexWriter{
> >>>>
> >>>> Sort this error out by inheriting all public methods from
> >>>> NutchIndexWriter
> >>> for starts. I take it you are not developing from within Eclipse? As
> >>> this
> >>> would have been flagged up immediately. This should at least enable you
> >>> to
> >>> compile the code.
> >>>
> >>>
> >>>>
> >>>> I have already crawled some urls now and i need to move those to
> >>>> mongodb.
> >>>> Is
> >>>> there a easy to use code to do that?
> >>>
> >>>
> >>> Not apart from hacking the code as you are already doing. The code you
> >>> are
> >>> pulling is not part of the official nutch codebase and to be honest a
> >>> few
> >>> of us didn't even know about it until you brought it to our attention
> >>> :0)
> >>>
> >>> There is no silver bullet here, just take your time and we will get it
> >>> working.
> >>> Lewis
> >>
> >>
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037621.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >
> >
> >
> > --
> > Lewis
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4039401.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
*Lewis*

Re: How to get page content of crawled pages

Posted by peterbarretto <pe...@gmail.com>.
Hi Lewis,

I managed to get the code working by adding the function below to
MongodbWriter.java, in the class MongodbWriter implements NutchIndexWriter:

	 public void delete(String key) throws IOException {
		return;
	}

And the crawled data is now stored in mongodb.
The only issue is that it stores only the text of the page, and not the full
html content of the page.
How do I store the full html content of the page as well?
Hope to see the patches soon.
Thanks
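
For anyone reading along, a minimal sketch of what such a writer looks like
against the 1.6 NutchIndexWriter interface and the legacy Mongo Java driver.
The class name, the method signatures and the delete() stub come from this
thread; the connection settings and field handling are illustrative, not
the actual patch code:

    import java.io.IOException;
    import java.util.Map;

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.indexer.NutchField;
    import org.apache.nutch.indexer.NutchIndexWriter;

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.Mongo;

    public class MongodbWriter implements NutchIndexWriter {
      private Mongo mongo;
      private DBCollection pages;

      public void open(JobConf job, String name) throws IOException {
        // Host, database and collection names are illustrative defaults.
        mongo = new Mongo(job.get("mongodb.host", "localhost"));
        pages = mongo.getDB("nutch").getCollection("pages");
      }

      public void write(NutchDocument doc) throws IOException {
        // One mongodb document per page, with every indexed field copied
        // over. Note that the "content" field carries the parsed text, not
        // the raw markup, which is why only text ends up in mongodb.
        BasicDBObject page = new BasicDBObject();
        for (Map.Entry<String, NutchField> e : doc) {
          page.put(e.getKey(), e.getValue().getValues());
        }
        pages.insert(page);
      }

      // Required by NutchIndexWriter in 1.6; a no-op is enough to compile.
      public void delete(String key) throws IOException {
      }

      public void close() throws IOException {
        mongo.close();
      }
    }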



lewis john mcgibbney wrote
> Certainly.
> I am currently reviewing the code and will hopefully have patches for
> Nutch trunk cooked up for tomorrow.
> I'll update this thread likewise.
> Thanks
> Lewis
> 
> On Wed, Jan 30, 2013 at 10:02 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
>> Hi Lewis,
>>
>> I am new to java and i dont know how to inherit all public methods from
>> NutchIndexWriter
>> Can you help me with that? Then i can rebuild and check if it works.
>>
>>
>> lewis john mcgibbney wrote
>>> As you will see the code has not been amended in a year or so.
>>> The positive side is that you only seem to be getting one issue with
>>> javac
>>>
>>> On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
>>>
>>>>
>>>>
>>>> C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18:
>>>> error: MongodbWriter is not abstract and does not override abstract
>>>> method
>>>> delete(String) in NutchIndexWriter
>>>>     [javac] public class MongodbWriter  implements NutchIndexWriter{
>>>>
>>>> Sort this error out by inheriting all public methods from
>>>> NutchIndexWriter
>>> for starts. I take it you are not developing from within Eclipse? As
>>> this
>>> would have been flagged up immediately. This should at least enable you
>>> to
>>> compile the code.
>>>
>>>
>>>>
>>>> I have already crawled some urls now and i need to move those to
>>>> mongodb.
>>>> Is
>>>> there a easy to use code to do that?
>>>
>>>
>>> Not apart from hacking the code as you are already doing. The code you
>>> are
>>> pulling is not part of the official nutch codebase and to be honest a
>>> few
>>> of us didn't even know about it until you brought it to our attention
>>> :0)
>>>
>>> There is no silver bullet here, just take your time and we will get it
>>> working.
>>> Lewis
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037621.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
> 
> 
> 
> -- 
> Lewis






Re: How to get page content of crawled pages

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Certainly.
I am currently reviewing the code and will hopefully have patches for
Nutch trunk cooked up for tomorrow.
I'll update this thread likewise.
Thanks
Lewis

On Wed, Jan 30, 2013 at 10:02 PM, peterbarretto
<pe...@gmail.com> wrote:
> Hi Lewis,
>
> I am new to Java and I don't know how to inherit all the public methods
> from NutchIndexWriter.
> Can you help me with that? Then I can rebuild and check if it works.
>
>
> lewis john mcgibbney wrote
>> As you will see the code has not been amended in a year or so.
>> The positive side is that you only seem to be getting one issue with javac
>>
>> On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
>>
>>>
>>>
>>> C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18:
>>> error: MongodbWriter is not abstract and does not override abstract
>>> method
>>> delete(String) in NutchIndexWriter
>>>     [javac] public class MongodbWriter  implements NutchIndexWriter{
>>>
>>> Sort this error out by inheriting all public methods from
>>> NutchIndexWriter
>> for starts. I take it you are not developing from within Eclipse? As this
>> would have been flagged up immediately. This should at least enable you to
>> compile the code.
>>
>>
>>>
>>> I have already crawled some urls now and i need to move those to mongodb.
>>> Is
>>> there a easy to use code to do that?
>>
>>
>> Not apart from hacking the code as you are already doing. The code you are
>> pulling is not part of the official nutch codebase and to be honest a few
>> of us didn't even know about it until you brought it to our attention :0)
>>
>> There is no silver bullet here, just take your time and we will get it
>> working.
>> Lewis
>
>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037621.html
> Sent from the Nutch - User mailing list archive at Nabble.com.



-- 
Lewis

Re: How to get page content of crawled pages

Posted by peterbarretto <pe...@gmail.com>.
Hi Lewis,

I am new to Java and I don't know how to inherit all the public methods from
NutchIndexWriter.
Can you help me with that? Then I can rebuild and check if it works.


lewis john mcgibbney wrote
> As you will see the code has not been amended in a year or so.
> The positive side is that you only seem to be getting one issue with javac
> 
> On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <peterbarretto08@gmail.com> wrote:
> 
>>
>>
>> C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18:
>> error: MongodbWriter is not abstract and does not override abstract
>> method
>> delete(String) in NutchIndexWriter
>>     [javac] public class MongodbWriter  implements NutchIndexWriter{
>>
>> Sort this error out by inheriting all public methods from
>> NutchIndexWriter
> for starts. I take it you are not developing from within Eclipse? As this
> would have been flagged up immediately. This should at least enable you to
> compile the code.
> 
> 
>>
>> I have already crawled some urls now and i need to move those to mongodb.
>> Is
>> there a easy to use code to do that?
> 
> 
> Not apart from hacking the code as you are already doing. The code you are
> pulling is not part of the official nutch codebase and to be honest a few
> of us didn't even know about it until you brought it to our attention :0)
> 
> There is no silver bullet here, just take your time and we will get it
> working.
> Lewis






Re: How to get page content of crawled pages

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Guys,

I'm working on a tool to grab the file content of the crawled pages. I
created a JIRA ticket and Review Board for this:

https://issues.apache.org/jira/browse/NUTCH-1526

https://reviews.apache.org/r/9119/


I am still working on finishing the patch, but you can see the sketch on my
Github, and also from my conversation on the Nutch ML and from the interface
spec on the JIRA ticket, etc.

Hopefully will have this done before next week.

Cheers,
Chris

On 1/30/13 11:05 PM, "Lewis John Mcgibbney" <le...@gmail.com>
wrote:

>As you will see the code has not been amended in a year or so.
>The positive side is that you only seem to be getting one issue with javac
>
>On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <pe...@gmail.com> wrote:
>
>>
>>
>> C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18:
>> error: MongodbWriter is not abstract and does not override abstract
>>method
>> delete(String) in NutchIndexWriter
>>     [javac] public class MongodbWriter  implements NutchIndexWriter{
>>
>> Sort this error out by inheriting all public methods from
>>NutchIndexWriter
>for starts. I take it you are not developing from within Eclipse? As this
>would have been flagged up immediately. This should at least enable you to
>compile the code.
>
>
>>
>> I have already crawled some urls now and i need to move those to
>>mongodb.
>> Is
>> there a easy to use code to do that?
>
>
>Not apart from hacking the code as you are already doing. The code you are
>pulling is not part of the official nutch codebase and to be honest a few
>of us didn't even know about it until you brought it to our attention :0)
>
>There is no silver bullet here, just take your time and we will get it
>working.
>Lewis


Re: How to get page content of crawled pages

Posted by Lewis John Mcgibbney <le...@gmail.com>.
As you will see the code has not been amended in a year or so.
The positive side is that you only seem to be getting one issue with javac

On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <pe...@gmail.com> wrote:

>
>
> C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18:
> error: MongodbWriter is not abstract and does not override abstract method
> delete(String) in NutchIndexWriter
>     [javac] public class MongodbWriter  implements NutchIndexWriter{
>
> Sort this error out by inheriting all public methods from NutchIndexWriter
for starts. I take it you are not developing from within Eclipse? As this
would have been flagged up immediately. This should at least enable you to
compile the code.


>
> I have already crawled some urls now and i need to move those to mongodb.
> Is
> there a easy to use code to do that?


Not apart from hacking the code as you are already doing. The code you are
pulling is not part of the official nutch codebase and to be honest a few
of us didn't even know about it until you brought it to our attention :0)

There is no silver bullet here, just take your time and we will get it
working.
Lewis

Re: How to get page content of crawled pages

Posted by peterbarretto <pe...@gmail.com>.
I have tried the repo https://github.com/ctjmorgan/nutch-mongodb-indexer and
it does not work.
I guess this is because it is mentioned to be for nutch 1.3 and I am
using 1.6.

I get the below output when I try to rebuild:

Buildfile: C:\nutch-16\build.xml
  [taskdef] Could not load definitions from resource
org/sonar/ant/antlib.xml. It could not be found.

ivy-probe-antlib:

ivy-download:
  [taskdef] Could not load definitions from resource
org/sonar/ant/antlib.xml. It could not be found.

ivy-download-unchecked:

ivy-init-antlib:

ivy-init:

init:

clean-lib:
   [delete] Deleting directory C:\nutch-16\build\lib

resolve-default:
[ivy:resolve] :: Ivy 2.2.0 - 20100923230623 :: http://ant.apache.org/ivy/ ::
[ivy:resolve] :: loading settings :: file = C:\nutch-16\ivy\ivysettings.xml
  [taskdef] Could not load definitions from resource
org/sonar/ant/antlib.xml. It could not be found.

copy-libs:

compile-core:
    [javac] C:\nutch-16\build.xml:96: warning: 'includeantruntime' was not
set, defaulting to build.sysclasspath=last; set to false for repeatable
builds
    [javac] Compiling 1 source file to C:\nutch-16\build\classes
    [javac] warning: [path] bad path element
"C:\nutch-16\build\lib\activation.jar": no such file or directory
    [javac] warning: [options] bootstrap class path not set in conjunction
with -source 1.6
    [javac]
C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:7:
warning: [deprecation] JobConf in org.apache.hadoop.mapred has been
deprecated
    [javac] import org.apache.hadoop.mapred.JobConf;
    [javac]                                ^
    [javac]
C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18:
error: MongodbWriter is not abstract and does not override abstract method
delete(String) in NutchIndexWriter
    [javac] public class MongodbWriter  implements NutchIndexWriter{
    [javac]        ^
    [javac]
C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:23:
warning: [deprecation] JobConf in org.apache.hadoop.mapred has been
deprecated
    [javac] 	public void open(JobConf job, String name) throws IOException {
    [javac] 	                 ^
    [javac] 1 error
    [javac] 4 warnings


I have already crawled some urls and now I need to move those to mongodb. Is
there any easy-to-use code to do that? I am new to Java, so I will need all
the steps for how to add the code.



Jorge Luis Betancourt Gonzalez wrote
> I suppose you can write a custom indexer to store the data in mongodb
> instead of solr; I think there is an open repo on github about this.
> 
> ----- Original Message -----
> From: "peterbarretto" <peterbarretto08@gmail.com>
> To: user@nutch.apache.org
> Sent: Tuesday, 29 January 2013 8:46:04
> Subject: Re: How to get page content of crawled pages
> 
> Hi
> 
> Is there a way I can dump the url and url content in mongodb?
> 
> 
> Klemens Muthmann wrote
>> Hi,
>>
>> Super. That works. Thank you. I thereby also found the class that shows
>> how to achieve this within Java code, which is
>> org.apache.nutch.segment.SegmentReader.
>>
>> Thanks again and bye
>>      Klemens
>>
>>> On 22.11.2010 10:49, Hannes Carl Meyer wrote:
>>> Hi Klemens,
>>>
>>> you should run ./bin/nutch readseg!
>>>
>>> For example: ./bin/nutch readseg -dump crawl/segments/XXX/ dump_folder
>>> -nofetch -nogenerate -noparse -noparsedata -noparsetext
>>>
>>> Kind Regards from Hannover
>>>
>>> Hannes
>>>
>>> On Mon, Nov 22, 2010 at 9:23 AM, Klemens Muthmann <klemens.muthmann@...> wrote:
>>>
>>>> Hi,
>>>>
>>>> I did a small crawl of some pages on the web and want to get the raw
>>>> HTML
>>>> content of these pages now. Reading the documentation in the wiki I
>>>> guess
>>>> this content might be somewhere under
>>>> crawl/segments/20101122071139/content/part-00000.
>>>>
>>>> I also guess I can access this content using the Hadoop API like
>>>> described
>>>> here: http://wiki.apache.org/nutch/Getting_Started
>>>>
>>>> However I have absolutely no idea how to configure:
>>>>
>>>> MapFile.Reader reader = new MapFile.Reader (fs, seqFile, conf);
>>>>
>>>>
>>>> The Hadoop documentation is not very helpful either. May someone please
>>>> point me in the right direction to get the page content?
>>>>
>>>> Thank you and regards
>>>>     Klemens Muthmann
>>>>
>>
>>
>> --
>> --------------------------------
>> Dipl.-Medieninf., Klemens Muthmann
>> Wissenschaftlicher Mitarbeiter
>>
>> Technische Universität Dresden
>> Fakultät Informatik
>> Institut für Systemarchitektur
>> Lehrstuhl Rechnernetze
>> 01062 Dresden
>> Tel.: +49 (351) 463-38214
>> Fax: +49 (351) 463-38251
>> E-Mail: klemens.muthmann@...
>> --------------------------------
> 
> 
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037023.html
> Sent from the Nutch - User mailing list archive at Nabble.com.






Re: How to get page content of crawled pages

Posted by Jorge Luis Betancourt Gonzalez <jl...@uci.cu>.
I suppose you can write a custom indexer to store the data in mongodb instead of solr; I think there is an open repo on github about this.

----- Original Message -----
From: "peterbarretto" <pe...@gmail.com>
To: user@nutch.apache.org
Sent: Tuesday, 29 January 2013 8:46:04
Subject: Re: How to get page content of crawled pages

Hi

Is there a way I can dump the url and url content in mongodb?


Klemens Muthmann wrote
> Hi,
> 
> Super. That works. Thank you. I thereby also found the class that shows 
> how to achieve this within Java code, which is 
> org.apache.nutch.segment.SegmentReader.
> 
> Thanks again and bye
>      Klemens
> 
>> On 22.11.2010 10:49, Hannes Carl Meyer wrote:
>> Hi Klemens,
>>
>> you should run ./bin/nutch readseg!
>>
>> For example: ./bin/nutch readseg -dump crawl/segments/XXX/ dump_folder
>> -nofetch -nogenerate -noparse -noparsedata -noparsetext
>>
>> Kind Regards from Hannover
>>
>> Hannes
>>
>> On Mon, Nov 22, 2010 at 9:23 AM, Klemens Muthmann <klemens.muthmann@...> wrote:
>>
>>> Hi,
>>>
>>> I did a small crawl of some pages on the web and want to get the raw
>>> HTML
>>> content of these pages now. Reading the documentation in the wiki I
>>> guess
>>> this content might be somewhere under
>>> crawl/segments/20101122071139/content/part-00000.
>>>
>>> I also guess I can access this content using the Hadoop API like
>>> described
>>> here: http://wiki.apache.org/nutch/Getting_Started
>>>
>>> However I have absolutely no idea how to configure:
>>>
>>> MapFile.Reader reader = new MapFile.Reader (fs, seqFile, conf);
>>>
>>>
>>> The Hadoop documentation is not very helpful either. May someone please
>>> point me in the right direction to get the page content?
>>>
>>> Thank you and regards
>>>     Klemens Muthmann
>>>
> 
> 
> -- 
> --------------------------------
> Dipl.-Medieninf., Klemens Muthmann
> Wissenschaftlicher Mitarbeiter
> 
> Technische Universität Dresden
> Fakultät Informatik
> Institut für Systemarchitektur
> Lehrstuhl Rechnernetze
> 01062 Dresden
> Tel.: +49 (351) 463-38214
> Fax: +49 (351) 463-38251
> E-Mail: klemens.muthmann@...
> --------------------------------




