Posted to dev@nutch.apache.org by Xavier Morera <xa...@familiamorera.com> on 2014/12/09 22:35:07 UTC

Re: Crawling a site and saving the page html exactly as is in a database

Hi Chris Mattmann,

We will soon test it out. Is it ok if I let you know if I have questions or
comments?

Thanks,
Xavier

On Fri, Sep 19, 2014 at 12:31 AM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Please check out NUTCH-1526 [1] which I am currently targeting for
> contribution to 1.10-trunk and the 2.x branch. I'd be happy to
> discuss. Thank you!
>
> Please try the patch out - it will dump out the web pages, images,
> etc. (all content that is stored in the segments) as the original
> files that were crawled.
>
> There is a review board link here:
>
> https://reviews.apache.org/r/9119/
>
>
> Cheers,
> Chris
>
> [1] https://issues.apache.org/jira/browse/NUTCH-1526
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
>
> -----Original Message-----
> From: Xavier Morera <xa...@familiamorera.com>
> Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
> Date: Thursday, September 18, 2014 3:21 PM
> To: dev <de...@nutch.apache.org>
> Subject: Crawling a site and saving the page html exactly as is in a
> database
>
> >Hi,
> >
> >
> >I have a requirement to crawl a site and save the crawled html pages into
> >a database exactly as is. How complicated can this be? I need for it to
> >keep all html tags.
> >
> >
> >Also, are there any examples available that I could use as a base?
> >
> >
> >Regards,
> >Xavier
> >
> >
> >--
> >Xavier Morera
> >email: xavier@familiamorera.com
> >CR: +(506) 8849 8866
> >US: +1 (305) 600 4919
> >skype: xmorera
> >
> >
> >
> >
> >
>
>


-- 

*Xavier Morera*

Entrepreneur | Author & Trainer | Consultant | Developer & Scrum Master

*www.xaviermorera.com <http://www.xaviermorera.com/>*

office:  (305) 600-4919

cel:     +506 8849-8866

skype: xmorera
Twitter <https://twitter.com/xmorera> | LinkedIn
<https://www.linkedin.com/in/xmorera> | Pluralsight Author
<http://www.pluralsight.com/author/xavier-morera>

Re: Crawling a site and saving the page html exactly as is in a database

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.

Dear Xavier, yes, please contact me; I’d be happy to help!
So would some of the other devs here who have used it,
like Lewis, etc.

Thanks!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Xavier Morera <xa...@familiamorera.com>
Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
Date: Tuesday, December 9, 2014 at 1:35 PM
To: dev <de...@nutch.apache.org>
Subject: Re: Crawling a site and saving the page html exactly as is in a
database

>Hi Chris Mattmann,
>
>
>We will soon test it out. Is it ok if I let you know if I have questions
>or comments?
>
>
>Thanks,
>Xavier
>
>
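
A minimal sketch of the "save exactly as is" step asked about in this thread, assuming the crawled content has already been dumped to a local directory as the original files (which is what the NUTCH-1526 patch is described as doing above). The dump directory layout, the use of SQLite, the table name raw_pages, and both function names are assumptions made for illustration; none of them come from Nutch or from the messages in this thread. The point it shows is that reading each file as bytes and storing it as a BLOB keeps every HTML tag, line ending, and character encoding untouched.

```python
import hashlib
import sqlite3
from pathlib import Path


def store_raw_files(dump_dir, db_path):
    """Store every file under dump_dir in SQLite as an unmodified BLOB.

    Returns the number of files stored. Keep db_path outside dump_dir,
    otherwise the database file itself would be picked up as well.
    """
    conn = sqlite3.connect(db_path)
    count = 0
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS raw_pages ("
            "path TEXT PRIMARY KEY, sha256 TEXT NOT NULL, body BLOB NOT NULL)"
        )
        for path in sorted(Path(dump_dir).rglob("*")):
            if not path.is_file():
                continue
            # read_bytes() avoids any decoding, so the HTML is kept byte for byte
            body = path.read_bytes()
            conn.execute(
                "INSERT OR REPLACE INTO raw_pages (path, sha256, body) "
                "VALUES (?, ?, ?)",
                (
                    path.relative_to(dump_dir).as_posix(),
                    hashlib.sha256(body).hexdigest(),
                    body,
                ),
            )
            count += 1
    conn.close()
    return count


def load_raw_file(db_path, relative_path):
    """Return the stored bytes for relative_path, or None if it is absent."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT body FROM raw_pages WHERE path = ?", (relative_path,)
        ).fetchone()
    finally:
        conn.close()
    return None if row is None else bytes(row[0])
```

The stored sha256 column is there so a later check can confirm that what comes back out of the database matches the file that was dumped. A different database would need its own binary column type in place of BLOB.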