Posted to solr-user@lucene.apache.org by Paul Libbrecht <pa...@activemath.org> on 2009/03/09 21:57:14 UTC
DIH with a list of changed documents?
Hello List,
How would I implement an entity processor if I were able to get the list
of recently changed documents on our sites?
Thanks for any hints.
paul
Re: DIH with a list of changed documents?
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Re file vs. URL - can't both be hidden behind a URL object (file:// vs. http:// scheme)?
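
As an illustration of that idea (not code from this thread, and in Python rather than the Java that DIH itself is written in), here is a minimal sketch in which one reader handles both a local manifest and a remote one, because the URL scheme selects the transport. The function name read_manifest_lines and the sample paths are hypothetical.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def read_manifest_lines(manifest_url: str) -> list[str]:
    """Return the non-empty lines of the manifest found at manifest_url."""
    # urlopen dispatches on the URL scheme, so file:// and http://
    # locations go through the same code path.
    with urlopen(manifest_url) as response:
        text = response.read().decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


# Demonstrate with a local file addressed through a file:// URL.
# The file content is made up for this example.
with tempfile.TemporaryDirectory() as tmp:
    manifest_path = Path(tmp) / "manifest.txt"
    manifest_path.write_text("docs/a.xml\ndocs/b.xml\n", encoding="utf-8")
    local_lines = read_manifest_lines(manifest_path.as_uri())
```

Passing an http:// address to the same function would fetch the list over the network instead, which is roughly what renaming "manifestFileName" to "manifestURL" would buy.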
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
> From: Fergus McMenemie <fe...@twig.me.uk>
> To: solr-user@lucene.apache.org
> Sent: Monday, March 9, 2009 7:00:43 PM
> Subject: Re: DIH with a list of changed documents?
>
> >On 09-Mar-09 at 22:29, Fergus McMenemie wrote:
> >>> how would I implement entity-processor if I were able to get the list
> >>> of recently changed documents of our sites?
> >>
> >> Hmmmm, this sounds like a job for my manifestEnityProcessor
> >> see if you can find the thread titled:-
> >>
> >> "a new DIH manifestEnityProcessor"
> >>
> >> is your list of changed documents a list of additions and
> >> updates only, or does it contain deletes as well?
> >
> >Fergus,
> >
> >I think you should then rename it... Manifest is not the right name to
> >me (manifest refers to something such as the manifest of a jar or of
> >an IMS-content-package; both are metadata about the data).
>
> It's all in the jargon, I guess. Our content repositories are changed
> by update kits, some of the kits come with manifests or in other cases
> we capture the output from un-tar or un-zip commands and we call these
> manifests. The name is up for grabs if a better suggestion comes along;
> I would have used FileListEntityProcessor except the name was taken;-)
>
>
> >I looked at your original description and I could not read anything
> >about the changed files.
> >The regex approach is a nice one for sure...
>
> Yep, our "manifest"s quite often include jpegs, avis etc which we
> do not want indexed. And if it's a tar output it will contain
> directory stubs as well.
>
> >I think a useful DIH Entity-processor that would maintain its deltas
> >well would have as parameters, url to a list of recently updated urls,
> >url to a list of recently deleted urls. Is this yours?
>
> URLs, huh! Never thought of that; I was just assuming it would be a local
> file. However I guess that could be added... so "manifestFileName" would
> become "manifestURL"? In my use cases some of the "manifests" are along
> the lines of
>
> ADD xxxx-checksum-xxx --pathname_1--
> DEL --pathname_b--
>
> Hence "manifestAddRegex" and "manifestDelRegex". I also, in other
> cases, have separate files, one for adding another for deleting.
> This I was going to deal with as two separate DIH imports.
>
> >I would have one for URLs with the list of recent things basically
> >from an RSS; the transformer is custom in all cases.
>
> The output from my manifestEnityProcessor is fed to an
> XPathEntityProcessor
>
> >
> >paul
> >
> Fergus.
> --
>
> ===============================================================
> Fergus McMenemie Email:fergus@twig.me.uk
> Techmore Ltd Phone:(UK) 07721 376021
>
> Unix/Mac/Intranets Analyst Programmer
> ===============================================================
Re: DIH with a list of changed documents?
Posted by Fergus McMenemie <fe...@twig.me.uk>.
>On 09-Mar-09 at 22:29, Fergus McMenemie wrote:
>>> how would I implement entity-processor if I were able to get the list
>>> of recently changed documents of our sites?
>>
>> Hmmmm, this sounds like a job for my manifestEnityProcessor
>> see if you can find the thread titled:-
>>
>> "a new DIH manifestEnityProcessor"
>>
>> is your list of changed documents a list of additions and
>> updates only, or does it contain deletes as well?
>
>Fergus,
>
>I think you should then rename it... Manifest is not the right name to
>me (manifest refers to something such as the manifest of a jar or of
>an IMS-content-package; both are metadata about the data).
It's all in the jargon, I guess. Our content repositories are changed
by update kits, some of the kits come with manifests or in other cases
we capture the output from un-tar or un-zip commands and we call these
manifests. The name is up for grabs if a better suggestion comes along;
I would have used FileListEntityProcessor except the name was taken;-)
>I looked at your original description and I could not read anything
>about the changed files.
>The regex approach is a nice one for sure...
Yep, our "manifest"s quite often include jpegs, avis etc which we
do not want indexed. And if it's a tar output it will contain
directory stubs as well.
>I think a useful DIH Entity-processor that would maintain its deltas
>well would have as parameters, url to a list of recently updated urls,
>url to a list of recently deleted urls. Is this yours?
URLs, huh! Never thought of that; I was just assuming it would be a local
file. However I guess that could be added... so "manifestFileName" would
become "manifestURL"? In my use cases some of the "manifests" are along
the lines of
ADD xxxx-checksum-xxx --pathname_1--
DEL --pathname_b--
Hence "manifestAddRegex" and "manifestDelRegex". I also, in other
cases, have separate files, one for adding another for deleting.
This I was going to deal with as two separate DIH imports.
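
The add/delete split described above could be sketched roughly as follows. This is not the manifestEnityProcessor itself (which is a Java DIH component whose code is not shown in this thread); it is a small Python illustration, and the two patterns, the function name split_manifest and the sample lines are all assumptions made for the example.

```python
import re

# Hypothetical patterns standing in for "manifestAddRegex" and
# "manifestDelRegex". The add pattern only accepts paths ending in .xml,
# so images, videos and directory stubs are skipped.
MANIFEST_ADD_REGEX = re.compile(r"^ADD\s+\S+\s+(\S+\.xml)$")
MANIFEST_DEL_REGEX = re.compile(r"^DEL\s+(\S+)$")


def split_manifest(lines):
    """Return (paths_to_add, paths_to_delete) extracted from manifest lines."""
    to_add, to_delete = [], []
    for raw in lines:
        line = raw.strip()
        add_match = MANIFEST_ADD_REGEX.match(line)
        if add_match:
            to_add.append(add_match.group(1))
            continue
        del_match = MANIFEST_DEL_REGEX.match(line)
        if del_match:
            to_delete.append(del_match.group(1))
    return to_add, to_delete


# Made-up manifest lines in the "ADD checksum path" / "DEL path" shape.
sample = [
    "ADD 1a2b3c docs/report.xml",
    "ADD 4d5e6f images/photo.jpeg",
    "ADD 7a8b9c docs/",
    "DEL docs/old.xml",
]
adds, deletes = split_manifest(sample)
```

With one combined manifest, a single pass yields both lists; with separate add and delete files, the same function can be run once per file and the unused list ignored.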
>I would have one for URLs with the list of recent things basically
>from an RSS; the transformer is custom in all cases.
The output from my manifestEnityProcessor is fed to an
XPathEntityProcessor
>
>paul
>
Fergus.
--
===============================================================
Fergus McMenemie Email:fergus@twig.me.uk
Techmore Ltd Phone:(UK) 07721 376021
Unix/Mac/Intranets Analyst Programmer
===============================================================
Re: DIH with a list of changed documents?
Posted by Paul Libbrecht <pa...@activemath.org>.
On 09-Mar-09 at 22:29, Fergus McMenemie wrote:
>> how would I implement entity-processor if I were able to get the list
>> of recently changed documents of our sites?
>
> Hmmmm, this sounds like a job for my manifestEnityProcessor
> see if you can find the thread titled:-
>
> "a new DIH manifestEnityProcessor"
>
> is your list of changed documents a list of additions and
> updates only, or does it contain deletes as well?
Fergus,
I think you should then rename it... Manifest is not the right name to
me (manifest refers to something such as the manifest of a jar or of
an IMS-content-package; both are metadata about the data).
I looked at your original description and I could not read anything
about the changed files.
The regex approach is a nice one for sure...
I think a useful DIH Entity-processor that would maintain its deltas
well would have as parameters, url to a list of recently updated urls,
url to a list of recently deleted urls. Is this yours?
I would have one for URLs with the list of recent things basically
from an RSS; the transformer is custom in all cases.
paul
Re: DIH with a list of changed documents?
Posted by Fergus McMenemie <fe...@twig.me.uk>.
>Hello List,
>
>how would I implement entity-processor if I were able to get the list
>of recently changed documents of our sites?
>
>thanks for hints.
>
>paul
>
Hmmmm, this sounds like a job for my manifestEnityProcessor
see if you can find the thread titled:-
"a new DIH manifestEnityProcessor"
is your list of changed documents a list of additions and
updates only, or does it contain deletes as well?
Fergus.
--
===============================================================
Fergus McMenemie Email:fergus@twig.me.uk
Techmore Ltd Phone:(UK) 07721 376021
Unix/Mac/Intranets Analyst Programmer
===============================================================