Posted to dev@nutch.apache.org by ma...@provinzial.com on 2005/11/28 10:37:28 UTC

Need metadata transport.

Hi dear nutchers,

I have implemented HTTP session support for Nutch. A patch will be
released as soon as I have switched to MapReduce.
I am crawling an intranet CMS and have successfully indexed its PDFs.
However, when I follow a link in the search results, the PDF is not retrieved
by the client's browser because the session cookie is not set. I need some
kind of metadata attached to the PDF that refers to the original HTML URL, where
the session cookie is set before the page is redirected to the URL of the PDF.
This information is only available while that HTML URL is being parsed.
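
A minimal sketch of the idea, in plain Java rather than any Nutch API (the
class name ReferrerMetadata and its methods are hypothetical, as are the
example URLs): while the HTML page is handled, remember which URL redirected
to the PDF, and later hand out that URL for the search result instead of the
PDF's own URL.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: carries the "original HTML URL"
// as metadata for a document that was reached through a redirect.
public class ReferrerMetadata {

    // Maps the redirect target (e.g. the PDF URL) to the URL that redirected to it.
    private final Map<String, String> sourceByTarget = new HashMap<>();

    // To be called at the moment the HTML URL is processed and the redirect is seen.
    public void recordRedirect(String fromUrl, String toUrl) {
        // Keep the first source seen for a target; later redirects do not overwrite it.
        sourceByTarget.putIfAbsent(toUrl, fromUrl);
    }

    // URL to show in the search result: the recorded HTML URL if there is one,
    // otherwise the document's own URL.
    public String resultUrl(String documentUrl) {
        return sourceByTarget.getOrDefault(documentUrl, documentUrl);
    }
}
```

With recordRedirect("http://cms.example/doc.html", "http://cms.example/doc.pdf"),
resultUrl("http://cms.example/doc.pdf") returns the HTML URL, so the browser
would pass through the page that sets the session cookie. How such a mapping
would be carried from the parse step to the index inside Nutch is exactly the
open question.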

Any ideas?

Thanks for your help.

Marcel Schnippe

Re: Need metadata transport.

Posted by Stefan Groschupf <sg...@media-style.com>.

Hi Marcel,

For version 0.7.x you can use a patch I uploaded to JIRA:
http://issues.apache.org/jira/browse/NUTCH-59

For version 0.8 this will no longer work.
I have already discussed the metadata issue with Doug, including how we
could solve it in 0.8, but I haven't found the time to write anything yet.
It is definitely on my to-do list.

Stefan



