Posted to user@nutch.apache.org by "Musshorn, Kris T CTR USARMY CECOM (US)" <kr...@mail.mil> on 2018/09/28 11:19:20 UTC

RE: [Non-DoD Source] Re: Include parent URL in pdf data - nutch

Please remove me from this list

-----Original Message-----
From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com.INVALID] 
Sent: Friday, September 28, 2018 2:25 AM
To: user@nutch.apache.org
Subject: [Non-DoD Source] Re: Include parent URL in pdf data - nutch

Hi,

Could you explain in detail what is meant by "parent URL"?
- the page the PDF document is linked from
- a redirect pointing to the PDF doc
- the "directory" of the PDF URL (clip URL after last "/")
- ...
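
The third reading above (the "directory" of the PDF URL) is the only one that can be derived from the URL string alone. A minimal Python sketch of that clipping step follows; the function name parent_directory_url and the example.com URL are made up for illustration and are not part of Nutch or Solr.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_directory_url(url: str) -> str:
    """Return the URL with everything after the last "/" of its path removed."""
    parts = urlsplit(url)
    # Keep the path up to and including its last "/"; fall back to "/" when
    # the URL has no path at all. Query string and fragment are dropped.
    directory = parts.path[: parts.path.rfind("/") + 1] or "/"
    return urlunsplit((parts.scheme, parts.netloc, directory, "", ""))


# Hypothetical PDF URL, used only to show the clipping.
print(parent_directory_url("http://example.com/docs/reports/annual.pdf"))
```

The other readings (the linking page, or a redirect source) cannot be computed this way, since they depend on link or redirect information recorded during the crawl.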

Nutch indexes all successfully fetched pages but not redirects, 404s, etc. Of course, pages not crawled cannot be indexed.

Best,
Sebastian

On 09/27/2018 11:58 AM, UMA MAHESWAR wrote:
> I am using Nutch 1.x for website crawling and indexing in Solr (5.5.0).
> I am trying to include the parent URL along with the PDF data.
> Can someone please suggest a way to do it?
> 
> Thanks in advance for your comments and suggestions
> 
> 
> 
> --
> Sent from: 
> http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
> 


Re: [Non-DoD Source] Re: Include parent URL in pdf data - nutch

Posted by Jorge Betancourt <be...@gmail.com>.
Hi Musshorn,

See http://nutch.apache.org/mailing_lists.html for how to unsubscribe
from the mailing list. In short, send an email to
user-unsubscribe@nutch.apache.org.

Best Regards,
Jorge
