Posted to dev@nutch.apache.org by ma...@provinzial.com on 2005/11/28 10:37:28 UTC
Need metadata transport.
Hi dear nutchers,
I have implemented HTTP session support for Nutch. A patch will be
released as soon as I have switched to MapReduce.
I am crawling an intranet CMS and have successfully indexed its PDFs.
However, when I follow a link in the search result pane, the client's
browser cannot retrieve the PDF because no session cookie is set. I need
some kind of metadata attached to the PDF that refers to the original
HTML URL, where the session cookie is set before the page is redirected
to the URL of the PDF. This information is only available while that
HTML URL is being parsed.
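To make the requirement concrete, here is a minimal sketch of the idea,
written in plain Python and independent of Nutch's actual Java API. The
names ReferrerIndex, record_outlinks, referrer_for and result_link are
hypothetical, as are the example URLs. It only illustrates that the
referring URL has to be captured at HTML parse time and carried along to
the point where the search result link is built.

```python
from typing import Dict, Iterable, Optional


class ReferrerIndex:
    """Remembers, for each outlink, the page on which it was found."""

    def __init__(self) -> None:
        self._referrer_by_target: Dict[str, str] = {}

    def record_outlinks(self, page_url: str, outlinks: Iterable[str]) -> None:
        # Called while the HTML page is parsed; only then is the link known.
        for target in outlinks:
            # Keep the first referrer seen; later pages do not overwrite it.
            self._referrer_by_target.setdefault(target, page_url)

    def referrer_for(self, target_url: str) -> Optional[str]:
        # Returns None when no referring page was recorded for this URL.
        return self._referrer_by_target.get(target_url)


def result_link(index: ReferrerIndex, hit_url: str) -> str:
    # Link to the page that sets the session cookie if one is known,
    # otherwise fall back to the URL of the hit itself.
    return index.referrer_for(hit_url) or hit_url


if __name__ == "__main__":
    index = ReferrerIndex()
    index.record_outlinks(
        "http://cms.example/page.html",   # hypothetical HTML page
        ["http://cms.example/doc.pdf"],   # hypothetical PDF behind it
    )
    print(result_link(index, "http://cms.example/doc.pdf"))
```

In a real crawl the mapping would have to travel with the document
through fetch, parse and index, which is exactly the metadata transport
I am missing.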
Any ideas?
Thanks for your help.
Marcel Schnippe
Re: Need metadata transport.
Posted by Stefan Groschupf <sg...@media-style.com>.
Hi Marcel,
For version 0.7.x you can use a patch I uploaded to JIRA:
http://issues.apache.org/jira/browse/NUTCH-59
For version 0.8 this will not work anymore.
I have already discussed the metadata issue with Doug, including how we
can solve it in 0.8. I haven't found the time to write anything yet, but
it is definitely on my to-do list.
Stefan