Posted to solr-user@lucene.apache.org by Paul Libbrecht <pa...@activemath.org> on 2009/01/23 00:09:53 UTC

URL-import field type?

Hello list,

after searching around for quite a while, including in the  
DataImportHandler documentation on the wiki (which looks amazing), I  
couldn't find a way to indicate to solr that the tokens of that field  
should be the result of analyzing the tokens of the stream at URL-xxx.

I know I was able to imitate that in plain Lucene by crafting a
particular analyzer filter that was given only the URL as content and
that passed on the tokens of the stream.

Is this the right way in solr?

thanks in advance.

paul

Re: URL-import field type?

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
On Fri, Jan 23, 2009 at 2:55 PM, Paul Libbrecht <pa...@activemath.org> wrote:
>
> On 23 Jan 2009, at 10:10, Noble Paul നോബിള്‍ नोब्ळ् wrote:
>>
>> if the response is not XML, then there is no EntityProcessor that can
>> consume this. We may need to add one.
>
> well, even binary data such as word documents (base64-encoded for example)
> run the risk of appearing here. They sure need a pile of filters!
>
>>> What bothers me with the HttpDataSource example is that, for now, at
>>> least,
>>> it is configured to pull a single URL while what is needed (and would
>>> provide delta ability) is really to index a list of URLs (for which one
>>> would pull regularly the list of recently updated URLs or simply use
>>> GET-if-modified-since on all of them).
>>
>> If-Modified-Since is not supported by HttpDataSource. However you
>> can write a transformer which pings the URL with an If-Modified-Since
>> header and skips the document using the $skipDoc option
>
> I still don't understand how you give several documents to the
> HttpDataSource.
> The configuration seems only to allow a single URL.
> Am I missing something?
The DataSource is like a helper class. The only intelligent piece here
is an EntityProcessor.
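
For example (just a rough sketch -- the URLs, entity and field names are made up), a nested-entity data-config.xml where an outer XPathEntityProcessor fetches the list of recently updated URLs and an inner one fetches each of those URLs, both through the same HttpDataSource:

  <dataConfig>
    <dataSource type="HttpDataSource" />
    <document>
      <!-- outer entity: the list of recently updated documents -->
      <entity name="changes"
              processor="XPathEntityProcessor"
              url="http://example.org/recent-changes.xml"
              forEach="/changes/doc">
        <field column="link" xpath="/changes/doc/link" />
        <!-- inner entity: one HTTP fetch per URL found above -->
        <entity name="page"
                processor="XPathEntityProcessor"
                url="${changes.link}"
                forEach="/page">
          <field column="id"       xpath="/page/id" />
          <field column="fulltext" xpath="/page/body" />
        </entity>
      </entity>
    </document>
  </dataConfig>

The data source never sees more than one URL at a time; it is the entities that decide how many fetches are made.
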
>
> paul
>
> PS: would it be worth chatting about that on irc.freenode.net#solr ?



-- 
--Noble Paul

Re: URL-import field type?

Posted by Paul Libbrecht <pa...@activemath.org>.
On 23 Jan 2009, at 10:10, Noble Paul നോബിള്‍ नोब्ळ् wrote:
> if the response is not XML, then there is no EntityProcessor that can
> consume this. We may need to add one.

well, even binary data such as word documents (base64-encoded for  
example) run the risk of appearing here. They sure need a pile of  
filters!

>> What bothers me with the HttpDataSource example is that, for now,  
>> at least,
>> it is configured to pull a single URL while what is needed (and would
>> provide delta ability) is really to index a list of URLs (for which  
>> one
>> would pull regularly the list of recently updated URLs or simply use
>> GET-if-modified-since on all of them).
> If-Modified-Since is not supported by HttpDataSource. However you
> can write a transformer which pings the URL with an If-Modified-Since
> header and skips the document using the $skipDoc option

I still don't understand how you give several documents to the  
HttpDataSource.
The configuration seems only to allow a single URL.
Am I missing something?

paul

PS: would it be worth chatting about that on irc.freenode.net#solr ?

Re: URL-import field type?

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
On Fri, Jan 23, 2009 at 2:28 PM, Paul Libbrecht <pa...@activemath.org> wrote:
> Well,
>
> the idea is that the solr engine indexes the contents of a web platform.
>
> Each document is a user-side-URL out of which several fields would be
> fetched through various URL-get-documents (e.g. the full-text-view, e.g. the
> future openmath representation, e.g. the topics (URIs in an ontology), ...).

if the responses of these URLs are well-formed XML, they can be
channeled to an XPathEntityProcessor (one per field) and they can be
processed

if the response is not XML, then there is no EntityProcessor that can
consume this. We may need to add one.
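
For the XML case, a rough sketch of that (the URLs, ids and xpaths are invented placeholders) -- one sub-entity, and therefore one fetch, per field:

  <entity name="doc" processor="XPathEntityProcessor"
          url="http://example.org/docs.xml" forEach="/docs/doc">
    <field column="id" xpath="/docs/doc/id" />
    <!-- one sub-entity per per-field URL -->
    <entity name="fulltext" processor="XPathEntityProcessor"
            url="http://example.org/doc/${doc.id}/fulltext.xml" forEach="/text">
      <field column="fulltext" xpath="/text/body" />
    </entity>
    <entity name="topics" processor="XPathEntityProcessor"
            url="http://example.org/doc/${doc.id}/topics.xml" forEach="/topics">
      <field column="topic" xpath="/topics/topic" />
    </entity>
  </entity>
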
>
> Would the alternate (and maybe equivalent) way be to stream all documents
> into one XML document and let the XPath triage act through all fields? That
> would also work and would take advantage of the XPathEntityProcessor's nice
> configuration.

>
> What bothers me with the HttpDataSource example is that, for now, at least,
> it is configured to pull a single URL while what is needed (and would
> provide delta ability) is really to index a list of URLs (for which one
> would pull regularly the list of recently updated URLs or simply use
> GET-if-modified-since on all of them).
If-Modified-Since is not supported by HttpDataSource. However you
can write a transformer which pings the URL with an If-Modified-Since
header and skips the document using the $skipDoc option
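
Roughly, such a transformer might look like this (untested sketch; the "url" column name, the package name and the way the last import time is obtained are placeholders):

  package com.example;

  import java.io.IOException;
  import java.net.HttpURLConnection;
  import java.net.URL;
  import java.util.Map;

  import org.apache.solr.handler.dataimport.Context;
  import org.apache.solr.handler.dataimport.Transformer;

  public class IfModifiedSinceTransformer extends Transformer {
    public Object transformRow(Map<String, Object> row, Context context) {
      String url = (String) row.get("url"); // placeholder: the column holding the document URL
      long lastImport = 0L;                 // placeholder: read the previous import time, e.g. from dataimport.properties
      try {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");
        conn.setIfModifiedSince(lastImport);
        if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
          row.put("$skipDoc", "true");      // tells DIH to drop this document
        }
        conn.disconnect();
      } catch (IOException e) {
        // on any error, fall through and index the document anyway
      }
      return row;
    }
  }

It would be attached to the entity with transformer="com.example.IfModifiedSinceTransformer".
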
>
> I didn't think that modifying the XPathEntityProcessor was the right thing
> since it seems based on a single stream.
>
> Hints for alternatives eagerly welcome.
>
> paul
>
>
> On 23 Jan 2009, at 05:45, Noble Paul നോബിള്‍ नोब्ळ् wrote:
>
>> where is this url coming from? what is the content type of the stream?
>> is it plain text or html?
>>
>> if yes, this is a possible enhancement to DIH
>>
>>
>>
>> On Fri, Jan 23, 2009 at 4:39 AM, Paul Libbrecht <pa...@activemath.org>
>> wrote:
>>>
>>> Hello list,
>>>
>>> after searching around for quite a while, including in the
>>> DataImportHandler
>>> documentation on the wiki (which looks amazing), I couldn't find a way to
>>> indicate to solr that the tokens of that field should be the result of
>>> analyzing the tokens of the stream at URL-xxx.
>>>
>>> I know I was able to imitate that in plain Lucene by crafting a
>>> particular analyzer filter that was given only the URL as content and
>>> that passed on the tokens of the stream.
>>>
>>> Is this the right way in solr?
>>>
>>> thanks in advance.
>>>
>>> paul
>>
>>
>>
>> --
>> --Noble Paul
>
>



-- 
--Noble Paul

Re: URL-import field type?

Posted by Paul Libbrecht <pa...@activemath.org>.
Well,

the idea is that the solr engine indexes the contents of a web platform.

Each document is a user-side-URL out of which several fields would be  
fetched through various URL-get-documents (e.g. the full-text-view,  
e.g. the future openmath representation, e.g. the topics (URIs in an  
ontology), ...).

Would the alternate (and maybe equivalent) way be to stream all documents
into one XML document and let the XPath triage act through all fields?
That would also work and would take advantage of the
XPathEntityProcessor's nice configuration.

What bothers me with the HttpDataSource example is that, for now, at  
least, it is configured to pull a single URL while what is needed (and  
would provide delta ability) is really to index a list of URLs (for  
which one would pull regularly the list of recently updated URLs or
simply use GET-if-modified-since on all of them).

I didn't think that modifying the XPathEntityProcessor was the right  
thing since it seems based on a single stream.

Hints for alternatives eagerly welcome.

paul


On 23 Jan 2009, at 05:45, Noble Paul നോബിള്‍ नोब्ळ् wrote:

> where is this url coming from? what is the content type of the stream?
> is it plain text or html?
>
> if yes, this is a possible enhancement to DIH
>
>
>
> On Fri, Jan 23, 2009 at 4:39 AM, Paul Libbrecht  
> <pa...@activemath.org> wrote:
>>
>> Hello list,
>>
>> after searching around for quite a while, including in the  
>> DataImportHandler
>> documentation on the wiki (which looks amazing), I couldn't find a  
>> way to
>> indicate to solr that the tokens of that field should be the result  
>> of
>> analyzing the tokens of the stream at URL-xxx.
>>
>> I know I was able to imitate that in plain Lucene by crafting a
>> particular analyzer filter that was given only the URL as content and
>> that passed on the tokens of the stream.
>>
>> Is this the right way in solr?
>>
>> thanks in advance.
>>
>> paul
>
>
>
> -- 
> --Noble Paul


Re: URL-import field type?

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
where is this url coming from? what is the content type of the stream?
is it plain text or html?

if yes, this is a possible enhancement to DIH



On Fri, Jan 23, 2009 at 4:39 AM, Paul Libbrecht <pa...@activemath.org> wrote:
>
> Hello list,
>
> after searching around for quite a while, including in the DataImportHandler
> documentation on the wiki (which looks amazing), I couldn't find a way to
> indicate to solr that the tokens of that field should be the result of
> analyzing the tokens of the stream at URL-xxx.
>
> I know I was able to imitate that in plain Lucene by crafting a particular
> analyzer filter that was given only the URL as content and that passed on
> the tokens of the stream.
>
> Is this the right way in solr?
>
> thanks in advance.
>
> paul



-- 
--Noble Paul

Re: URL-import field type?

Posted by Chris Hostetter <ho...@fucit.org>.
: But we do not have an inbuilt TokenFilter which does that. Nor does
: DIH support it now. I have opened an issue for DIH
: (https://issues.apache.org/jira/browse/SOLR-980).
: Is it desirable to have a TokenFilter which offers similar functionality?

Probably not (you would have to have a way of configuring what kind of 
analysis would be done on the file)

My point was specifically about the original poster's use case: he said he
already had a TokenFilter that parsed the URL target the way he wanted --
in which case it's easy for him to keep using that TokenFilter by
writing a factory for it.



-Hoss


Re: URL-import field type?

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
On Tue, Jan 27, 2009 at 4:47 AM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : I know I was able to imitate that in plain Lucene by crafting a particular
> : analyzer filter that was given only the URL as content and that passed on
> : the tokens of the stream.
>
> FWIW: while taking advantage of DIH and some of its plugin APIs to deal
> with this is probably a better way to go -- anything you could do in a
> TokenFilter with a homegrown Lucene app can also be done in a TokenFilter
> in Solr -- all you need is a simple TokenFilterFactory to initialize your
> TokenFilter.
>
> From a purist standpoint: the decision about where to hook in a feature
> like this depends on the mental model you have of your index vs the
> different ways you can get data into your index. If every document should
> have an "extendedText" field, and docs you post via xml or csv will have
> that field verbatim, but documents you index using DIH will get it by
> fetching a URL, then a DIH plugin is the way to go -- if you want every
> client sending you docs to provide a URL and you *always* fetch that URL
> to get the content, then a TokenFilter is the way to go.

Hoss, makes sense.

But we do not have an inbuilt TokenFilter which does that. Nor does
DIH support it now. I have opened an issue for DIH
(https://issues.apache.org/jira/browse/SOLR-980).
Is it desirable to have a TokenFilter which offers similar functionality?


>
>
>
>
> -Hoss
>
>



-- 
--Noble Paul

Re: URL-import field type?

Posted by Chris Hostetter <ho...@fucit.org>.
: I know I was able to imitate that in plain Lucene by crafting a particular
: analyzer filter that was given only the URL as content and that passed on
: the tokens of the stream.

FWIW: while taking advantage of DIH and some of its plugin APIs to deal
with this is probably a better way to go -- anything you could do in a
TokenFilter with a homegrown Lucene app can also be done in a TokenFilter
in Solr -- all you need is a simple TokenFilterFactory to initialize your
TokenFilter.
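
A bare-bones sketch of such a factory (package and class names are invented; UrlContentTokenFilter stands in for whatever filter he already has):

  package com.example;

  import org.apache.lucene.analysis.TokenStream;
  import org.apache.solr.analysis.BaseTokenFilterFactory;

  public class UrlContentTokenFilterFactory extends BaseTokenFilterFactory {
    public TokenStream create(TokenStream input) {
      // wrap the existing filter that fetches the URL and emits the content's tokens
      return new UrlContentTokenFilter(input);
    }
  }

which could then be wired into a field type in schema.xml, e.g. with a KeywordTokenizer so the whole field value (the URL) reaches the filter as a single token:

  <fieldType name="urltext" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="com.example.UrlContentTokenFilterFactory"/>
    </analyzer>
  </fieldType>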

From a purist standpoint: the decision about where to hook in a feature
like this depends on the mental model you have of your index vs the
different ways you can get data into your index. If every document should
have an "extendedText" field, and docs you post via xml or csv will have
that field verbatim, but documents you index using DIH will get it by
fetching a URL, then a DIH plugin is the way to go -- if you want every
client sending you docs to provide a URL and you *always* fetch that URL
to get the content, then a TokenFilter is the way to go.




-Hoss