Posted to solr-user@lucene.apache.org by Paul Libbrecht <pa...@activemath.org> on 2009/03/09 21:57:14 UTC

DIH with a list of changed documents?

Hello List,

How would I implement an entity processor if I can get the list
of recently changed documents on our sites?

Thanks for any hints.

paul

Re: DIH with a list of changed documents?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Re file vs. URL: can't both be hidden behind a URL object (file:// vs. http:// scheme)?
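
To illustrate the point, here is a minimal sketch in Python rather than Java, so it is an analogue of a URL object and not DIH code. The helper name read_manifest_lines and the one-path-per-line file layout are assumptions made for this example. The same call reads a local manifest through file:// and would read a remote one through http://.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def read_manifest_lines(manifest_url):
    """Return the non-empty, stripped lines found at manifest_url.

    urlopen() dispatches on the URL scheme, so a file:// URL and an
    http:// URL go through the same code path.
    """
    with urlopen(manifest_url) as response:
        text = response.read().decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


# Demo with a local file; an http:// URL would be passed the same way.
with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "manifest.txt"
    manifest.write_text("docs/a.xml\ndocs/b.xml\n", encoding="utf-8")
    print(read_manifest_lines(manifest.as_uri()))
```

With this shape the caller never needs to know whether the manifest is local or remote, which is the reason a single URL-valued parameter could replace a file-name parameter.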


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





Re: DIH with a list of changed documents?

Posted by Fergus McMenemie <fe...@twig.me.uk>.
>On 9 March 2009 at 22:29, Fergus McMenemie wrote:
>>> how would I implement entity-processor if I were able to get the list
>>> of recently changed documents of our sites?
>>
>> Hmmmm, this sounds like a job for my manifestEnityProcessor
>> see if you can find the thread titled:-
>>
>>   "a new DIH manifestEnityProcessor"
>>
>> is your list of changed documents a list of additions and
>> updates only, or does it contain deletes as well?
>
>Fergus,
>
>I think you should then rename it... Manifest is not the right name to  
>me (manifest refers to something such as the manifest of a jar or of  
>an IMS-content-package, both are a metadata of the data).

It's all in the jargon, I guess. Our content repositories are changed
by update kits. Some of the kits come with manifests; in other cases
we capture the output from un-tar or un-zip commands and call that a
manifest. The name is up for grabs if a better suggestion comes along;
I would have used FileListEntityProcessor except the name was taken ;-)


>I looked at your original description and I could not read anything  
>about the changed files.
>The regex approach is a nice one for sure...

Yep, our "manifest"s quite often include JPEGs, AVIs, etc., which we
do not want indexed. And if it is tar output, it will contain
directory stubs as well.

>I think a useful DIH Entity-processor that would maintain its deltas  
>well would have as parameters, url to a list of recently updated urls,  
>url to a list of recently deleted urls. Is this yours?

URLs, huh! Never thought of that; I was just assuming it would be a local
file. However, I guess that could be added... so "manifestFileName" would
become "manifestURL"? In my use cases, some of the "manifests" are along
the lines of:

   ADD xxxx-checksum-xxx  --pathname_1--
   DEL --pathname_b--

Hence "manifestAddRegex" and "manifestDelRegex". In other cases I also
have separate files, one for additions and another for deletions.
I was going to handle those as two separate DIH imports.
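
A minimal sketch of the add/delete regex idea, in Python for illustration only; it is not the manifestEnityProcessor code. The three patterns, the skip rule for JPEG/AVI files and directory stubs, and the function name parse_manifest are all assumptions based on the ADD/DEL line layout shown above.

```python
import re

# Hypothetical patterns for the two line layouts; group 1 is the pathname.
ADD_REGEX = re.compile(r"^\s*ADD\s+\S+\s+(\S.*?)\s*$")  # ADD <checksum> <path>
DEL_REGEX = re.compile(r"^\s*DEL\s+(\S.*?)\s*$")        # DEL <path>
# Entries we do not want indexed: directory stubs, JPEGs, AVIs.
SKIP_REGEX = re.compile(r"(/|\.jpe?g|\.avi)$", re.IGNORECASE)


def parse_manifest(lines):
    """Split manifest lines into (paths_to_add, paths_to_delete)."""
    adds, deletes = [], []
    for line in lines:
        add = ADD_REGEX.match(line)
        if add:
            if not SKIP_REGEX.search(add.group(1)):
                adds.append(add.group(1))
            continue
        delete = DEL_REGEX.match(line)
        if delete and not SKIP_REGEX.search(delete.group(1)):
            deletes.append(delete.group(1))
        # Lines matching neither pattern are ignored.
    return adds, deletes


sample = [
    "   ADD 1a2b3c  docs/a.xml",
    "   ADD 4d5e6f  images/logo.jpg",
    "   ADD 7a8b9c  docs/",
    "   DEL docs/old.xml",
]
print(parse_manifest(sample))
```

When adds and deletes arrive in separate files, the same function can be run once per file, which matches the plan of two separate imports.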

>I would have one for URLs with the list of recent things basically  
>from an RSS; the transformer is custom in all cases.

The output from my manifestEnityProcessor is fed to an
XPathEntityProcessor.

>
>paul
>
Fergus.
-- 

===============================================================
Fergus McMenemie               Email:fergus@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================

Re: DIH with a list of changed documents?

Posted by Paul Libbrecht <pa...@activemath.org>.
On 9 March 2009 at 22:29, Fergus McMenemie wrote:
>> how would I implement entity-processor if I were able to get the list
>> of recently changed documents of our sites?
>
> Hmmmm, this sounds like a job for my manifestEnityProcessor
> see if you can find the thread titled:-
>
>   "a new DIH manifestEnityProcessor"
>
> is your list of changed documents a list of additions and
> updates only, or does it contain deletes as well?

Fergus,

I think you should then rename it... "Manifest" is not the right name to
me (a manifest refers to something such as the manifest of a JAR or of
an IMS content package; both are metadata about the data).

I looked at your original description and I could not find anything
about the changed files.
The regex approach is a nice one for sure...
I think a useful DIH entity processor that maintains its deltas
well would take as parameters a URL to a list of recently updated URLs
and a URL to a list of recently deleted URLs. Is this what yours does?
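
A rough sketch of how the two lists could be combined into one delta plan, in Python for illustration only. The function name plan_delta is hypothetical, and reporting URLs that appear in both lists as conflicts is an assumption of this example; nothing in this thread says how such overlaps should be resolved.

```python
def plan_delta(updated_urls, deleted_urls):
    """Return (to_index, to_delete, conflicts) from two URL lists.

    Duplicates are dropped while keeping first-seen order. A URL that
    appears in both lists is reported as a conflict and left out of
    both actions, so the caller decides what to do with it.
    """
    updated = list(dict.fromkeys(updated_urls))
    deleted = list(dict.fromkeys(deleted_urls))
    updated_set, deleted_set = set(updated), set(deleted)

    conflicts = [url for url in updated if url in deleted_set]
    to_index = [url for url in updated if url not in deleted_set]
    to_delete = [url for url in deleted if url not in updated_set]
    return to_index, to_delete, conflicts


to_index, to_delete, conflicts = plan_delta(
    ["http://example.org/a", "http://example.org/b", "http://example.org/a"],
    ["http://example.org/b", "http://example.org/c"],
)
print("index:", to_index)
print("delete:", to_delete)
print("conflicts:", conflicts)
```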

I would have one for URLs, with the list of recent things basically
coming from an RSS feed; the transformer is custom in all cases.

paul

Re: DIH with a list of changed documents?

Posted by Fergus McMenemie <fe...@twig.me.uk>.
>Hello List,
>
>how would I implement entity-processor if I were able to get the list  
>of recently changed documents of our sites?
>
>thanks for hints.
>
>paul
>



Hmmmm, this sounds like a job for my manifestEnityProcessor;
see if you can find the thread titled:

   "a new DIH manifestEnityProcessor"

Is your list of changed documents a list of additions and
updates only, or does it contain deletes as well?

Fergus.
