Posted to user@manifoldcf.apache.org by ritika jain <ri...@gmail.com> on 2020/05/11 13:35:17 UTC

Re: Extraction and storing parent URL while crawling

Hello Users,

Can anybody please respond to this? It would be highly appreciated.

On Fri, Apr 3, 2020 at 2:28 PM ritika jain <ri...@gmail.com> wrote:

> Hi All,
> I am using ManifoldCF 2.14 to crawl data from a website, with the Web
> repository connector and the Elasticsearch output connector.
> I would like to understand the crawling framework/hierarchy used by the
> web crawler. As far as I understand, the crawl of the URLs proceeds in
> the manner of a tree structure.
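>
> To make sure I am describing it right, here is a rough standalone sketch
> (not ManifoldCF code; fetchLinks() is a hypothetical helper) of what I
> mean by a tree-structured crawl that remembers each document's parent:
>
>   import java.util.*;
>
>   // Breadth-first crawl that records the parent URL of every document
>   // the first time it is discovered. fetchLinks() stands in for the
>   // real HTTP fetch + link extraction.
>   public class TreeCrawlSketch {
>       static final Map<String, String> parentOf = new HashMap<>();
>
>       static void crawl(String seedUrl) {
>           Deque<String> queue = new ArrayDeque<>();
>           parentOf.put(seedUrl, null);       // the seed has no parent
>           queue.add(seedUrl);
>           while (!queue.isEmpty()) {
>               String url = queue.poll();
>               for (String child : fetchLinks(url)) {
>                   if (!parentOf.containsKey(child)) {
>                       parentOf.put(child, url);   // remember who linked to it
>                       queue.add(child);
>                   }
>               }
>           }
>       }
>
>       static List<String> fetchLinks(String url) {
>           return Collections.emptyList();    // placeholder only
>       }
>   }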
>
> I want to know whether ManifoldCF currently supports any functionality
> to store the parent URL of a document.
> For example, the seed URL is www.example.com, and at position 80 in the
> document queue the document identifier is
> www.example.com/education/univeristy/234.html.
>
> Is there any way ManifoldCF stores the back-traced URLs, that is, the
> hierarchy through which the 80th document was reached? For example,
> storing the 79th, 78th, and 77th levels of the crawl that were followed
> from the seed document to reach the 80th document.
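>
> In other words, given the parentOf map from the sketch above, something
> like the following (again just a hypothetical illustration, not anything
> I have found in ManifoldCF) would recover the whole chain back to the
> seed for the 80th document:
>
>   import java.util.*;
>
>   // Walk back from any crawled document to the seed using the recorded
>   // parents, returning the chain seed-first.
>   public class PathToSeedSketch {
>       static List<String> pathToSeed(String url, Map<String, String> parentOf) {
>           List<String> path = new ArrayList<>();
>           for (String cur = url; cur != null; cur = parentOf.get(cur)) {
>               path.add(cur);                 // document, its parent, grandparent, ...
>           }
>           Collections.reverse(path);         // seed first, target document last
>           return path;
>       }
>   }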
>
> Is this crawling hierarchy (or even just the level) already stored
> somewhere in the ManifoldCF code? If yes, is that framework code shipped
> as part of a jar? If not, any clue as to which Java file implements this
> logic would be really helpful.
>
> Any kind of clue or help will be really appreciated.
>
> Many Thanks
> Ritika
>
>