Posted to solr-user@lucene.apache.org by Mike Thomsen <mi...@gmail.com> on 2016/09/16 15:03:45 UTC

Best way to generate multivalue fields from streaming API

Read this article and thought it could be interesting as a way to do
ingestion:

https://dzone.com/articles/solr-streaming-expressions-for-collection-auto-upd-1

Example from the article:

daemon(id="12345",
       runInterval="60000",
       update(users,
              batchSize=10,
              jdbc(connection="jdbc:mysql://localhost/users?user=root&password=solr",
                   sql="SELECT id, name FROM users",
                   sort="id asc",
                   driver="com.mysql.jdbc.Driver")))

What's the best way to handle a multivalue field using this API? Is
there a way to tokenize something returned in a database field?

Thanks,

Mike

Re: Best way to generate multivalue fields from streaming API

Posted by Gus Heck <gu...@gmail.com>.
Hi Mike,

Bit late on this, but just saw it...

Using streaming to ingest has occurred to me too, but I think it's not
really right for that except in fairly trivial cases. The very first big
problem you will have in the example you give is that you won't be able to
mark things as already ingested, so you have to read the whole table every
time. One could eventually add enough features to fix that, but doing so
would probably bloat it and shift the focus from processing data
originating in Solr to processing data from external sources. At that point
I think it's better for ingestion to be a separate system, set up in a
way that can be managed. Any non-trivial ingestion process using streaming
is going to be configured as a large, deeply nested streaming expression,
which I fear would be very hard to read and maintain. I did a talk a while
back that went through a wishlist for document ingestion; slides here:
https://docs.google.com/presentation/d/17NhL-nfYa-d2Vx_DleXo_JC1SwiBMlfP5Zm4IEiZOYY/pub?start=false&loop=false&delayms=5000


I do presently have a case where I use streaming to create summary records
for some data once it's in Solr.

-Gus
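
To illustrate the "mark things as already ingested" problem above, here is a
minimal sketch of a high-water-mark read done outside of Solr. It is not
something the streaming API provides. The in-memory SQLite table, the
checkpoint dict, and the function names are all assumptions made for the
example, and it only covers newly inserted rows with increasing ids (updates
or deletes would need something like a last-modified column).

```python
import sqlite3


def fetch_new_rows(conn, last_id):
    """Return (id, name) rows with id greater than last_id, ascending by id."""
    cur = conn.execute(
        "SELECT id, name FROM users WHERE id > ? ORDER BY id ASC", (last_id,)
    )
    return cur.fetchall()


def ingest_once(conn, checkpoint):
    """Read only rows not seen before and advance the checkpoint.

    `checkpoint` is a hypothetical stand-in for durable state (a file,
    a table, etc.); here it is just a dict held in memory.
    """
    rows = fetch_new_rows(conn, checkpoint["last_id"])
    if rows:
        checkpoint["last_id"] = rows[-1][0]
    return rows


# Demo data standing in for the users table from the original example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)", [(1, "alice"), (2, "bob")]
)

checkpoint = {"last_id": 0}
first = ingest_once(conn, checkpoint)    # both existing rows
conn.execute("INSERT INTO users (id, name) VALUES (3, 'carol')")
second = ingest_once(conn, checkpoint)   # only the newly added row
third = ingest_once(conn, checkpoint)    # nothing new
```

The rows returned by each run would then be sent to Solr by whatever client
is in use; the daemon/jdbc expression as written has no equivalent of the
checkpoint, which is why it re-reads everything on every run.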




-- 
http://www.the111shift.com

Re: Best way to generate multivalue fields from streaming API

Posted by Joel Bernstein <jo...@gmail.com>.
Unfortunately there currently isn't a way to split a field. But this would
be nice functionality to add.

The approach would be to add a split operation that would be used by the
select() function. It would look like this:

select(jdbc(...), split(fieldA, delim=","), ...)

This would make a good jira issue.
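
Until something like that exists, one possible workaround is to do the split
on the client before the documents reach Solr. The sketch below is only an
illustration of that idea, not part of the streaming API: the split_field
helper and the "roles" column are hypothetical, and the rows stand in for
whatever the JDBC query would return.

```python
import json


def split_field(row, field, delim=","):
    """Return a copy of row with row[field] split into a list of values.

    Values are trimmed and empty pieces are dropped. If the field is
    missing or None, it is left out of the resulting document.
    """
    doc = dict(row)
    raw = doc.get(field)
    if raw is None:
        doc.pop(field, None)
        return doc
    doc[field] = [part.strip() for part in str(raw).split(delim) if part.strip()]
    return doc


# Hypothetical rows, as a database query might return them.
rows = [
    {"id": "1", "name": "alice", "roles": "admin, editor"},
    {"id": "2", "name": "bob", "roles": "viewer"},
    {"id": "3", "name": "carol", "roles": None},
]

docs = [split_field(r, "roles") for r in rows]

# A JSON array of documents, where a list value maps to a multivalued field.
payload = json.dumps(docs)
```

The resulting payload is in the shape Solr's JSON update handler accepts,
assuming "roles" is defined as a multivalued field in the schema.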






Joel Bernstein
http://joelsolr.blogspot.com/
