Posted to dev@lucene.apache.org by "Joel Bernstein (JIRA)" <ji...@apache.org> on 2016/07/18 02:03:21 UTC
[jira] [Comment Edited] (SOLR-9240) Support parallel ETL with the topic expression

    [ https://issues.apache.org/jira/browse/SOLR-9240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371958#comment-15371958 ] 

Joel Bernstein edited comment on SOLR-9240 at 7/18/16 2:03 AM:
---------------------------------------------------------------

This ticket is looking fairly good. I did a round of manual testing with the expression below, which worked as expected.

{code}
parallel(
    workerCollection,
    workers="2",
    sort="DaemonOp desc",
    daemon(
        update(
            updateCollection,
            batchSize=200,
            topic(
                checkpointCollection,
                topicCollection,
                q=*:*,
                id="topic40",
                fl="id, to , from",
                partitionKeys="id",
                initialCheckpoint="0")),
        runInterval="1000",
        id="test3"))
{code}

This expression sends a daemon expression to two worker nodes. The daemon wraps an update() expression, which in turn wraps a topic() expression. The topic has the new *initialCheckpoint* parameter, so it starts pulling records from checkpoint 0, which includes every record in the index that matches the topic query. The topic also has the *partitionKeys* parameter, so each worker pulls its own partition of the records that match the topic query.
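As a rough illustration of what *partitionKeys* achieves, the sketch below assigns each record to exactly one worker by hashing its key. It is only a model of the idea: CRC32 is an assumption standing in for whatever hash Solr applies internally, and the names worker_for and partition are hypothetical.

```python
import zlib


def worker_for(key, workers):
    """Map a partition key to a worker index using a stable hash.

    Assumption: CRC32 stands in for Solr's internal hash function.
    """
    return zlib.crc32(key.encode("utf-8")) % workers


def partition(records, key_field, workers):
    """Group records by the worker that would own them."""
    buckets = {w: [] for w in range(workers)}
    for rec in records:
        buckets[worker_for(rec[key_field], workers)].append(rec)
    return buckets


# Ten made-up documents split across two workers, keyed on "id".
records = [{"id": f"doc{i}"} for i in range(10)]
buckets = partition(records, "id", workers=2)
```

Because the hash is stable, a given id always lands on the same worker, so no record is processed twice and none is skipped.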

The daemon function will run the update() function iteratively. Each run will update the topic checkpoints for each worker.

The effect of this is that each worker will iterate through its partition of the topic query, reindexing all the records that match the topic into another collection.
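The interplay between daemon(), update() and the topic checkpoint can be modelled with a small simulation. This is a sketch under simplifying assumptions, not how Solr implements it: a single integer "version" per record stands in for the checkpoint, the index is an in-memory list, and the loop stops once a run returns nothing (a real daemon keeps running at its runInterval). The function names are hypothetical.

```python
def run_topic(index, checkpoint, batch_size):
    """Return up to batch_size records newer than the checkpoint, plus the new checkpoint."""
    newer = sorted(
        (r for r in index if r["version"] > checkpoint),
        key=lambda r: r["version"],
    )
    batch = newer[:batch_size]
    new_checkpoint = batch[-1]["version"] if batch else checkpoint
    return batch, new_checkpoint


def run_daemon(index, target, batch_size, initial_checkpoint=0, max_runs=100):
    """Call the topic repeatedly, loading each batch into target (the update step)."""
    checkpoint = initial_checkpoint
    runs = 0
    while runs < max_runs:
        batch, checkpoint = run_topic(index, checkpoint, batch_size)
        runs += 1
        if not batch:
            break  # nothing new; the sketch stops here for demonstration
        target.extend(batch)
    return checkpoint, runs


# Five made-up records; starting from checkpoint 0 picks up all of them.
source = [{"id": f"doc{i}", "version": i + 1} for i in range(5)]
target = []
final_checkpoint, runs = run_daemon(source, target, batch_size=2)
```

Starting from an initial checkpoint of 0 makes the first run see every record, which mirrors the role of initialCheckpoint="0" in the expression above.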



was (Author: joel.bernstein):
This ticket is looking fairly good. I did a round of manual testing with the expression below, which worked as expected.

{code}
parallel(
    workerCollection,
    workers="2",
    sort="_version_ desc",
    daemon(
        update(
            updateCollection,
            batchSize=200,
            topic(
                checkpointCollection,
                topicCollection,
                q=*:*,
                id="topic40",
                fl="id, to , from",
                partitionKeys="id",
                initialCheckpoint="0")),
        runInterval="1000",
        id="test3"))
{code}

This expression sends a daemon expression to two worker nodes. The daemon wraps an update() expression, which in turn wraps a topic() expression. The topic has the new *initialCheckpoint* parameter, so it starts pulling records from checkpoint 0, which includes every record in the index that matches the topic query. The topic also has the *partitionKeys* parameter, so each worker pulls its own partition of the records that match the topic query.

The daemon function will run the update() function iteratively. Each run will update the topic checkpoints for each worker.

The effect of this is that each worker will iterate through its partition of the topic query, reindexing all the records that match the topic into another collection.


> Support parallel ETL with the topic expression
> ----------------------------------------------
>
>                 Key: SOLR-9240
>                 URL: https://issues.apache.org/jira/browse/SOLR-9240
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Joel Bernstein
>            Assignee: Joel Bernstein
>             Fix For: 6.2
>
>         Attachments: SOLR-9240.patch, SOLR-9240.patch
>
>
> It would be useful for SolrCloud to support large-scale *Extract, Transform and Load* workloads with streaming expressions. Instead of using MapReduce for ETL, the topic expression can be used, which allows SolrCloud to be treated like a distributed message queue filled with data to be processed. The topic expression works in batches and supports retrieval of stored fields, so large-scale *text ETL* should work well with this approach.
> This ticket makes two small changes to the topic() expression that make this possible:
> 1) Changes the topic expression so it can operate in parallel.
> 2) Adds the initialCheckpoint parameter to the topic expression so a topic can start pulling records from anywhere in the queue.
> Daemons can be sent to worker nodes that each work on processing a partition of the data from the same topic. The daemon() function's natural behavior is perfect for iteratively calling a topic until all records in the topic have been processed.
> The sample code below pulls all records from one collection and indexes them into another collection. A Transform function could be wrapped around the topic() to transform the records before loading. Custom functions can also be built to load the data in parallel to any outside system. 
> {code}
> parallel(
>     workerCollection,
>     workers="2",
>     sort="DaemonOp desc",
>     daemon(
>         update(
>             updateCollection,
>             batchSize=200,
>             topic(
>                 checkpointCollection,
>                 topicCollection,
>                 q=*:*,
>                 id="topic1",
>                 fl="id, to , from, body",
>                 partitionKeys="id",
>                 initialCheckpoint="0")),
>         runInterval="1000",
>         id="daemon1"))
> {code}
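For anyone wanting to try the sample expression, one way to submit it is as the expr parameter of a collection's /stream handler. The sketch below only builds the request URL and does not contact a server. The base URL and the choice of workerCollection as the receiving collection are assumptions for illustration, and stream_url is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical values: adjust the host, port, and collection for a real cluster.
SOLR_BASE_URL = "http://localhost:8983/solr"
COLLECTION = "workerCollection"

# The sample expression from the ticket description, collapsed onto one line.
EXPR = (
    'parallel(workerCollection, workers="2", sort="DaemonOp desc", '
    'daemon(update(updateCollection, batchSize=200, '
    'topic(checkpointCollection, topicCollection, q=*:*, id="topic1", '
    'fl="id, to , from, body", partitionKeys="id", initialCheckpoint="0")), '
    'runInterval="1000", id="daemon1"))'
)


def stream_url(base_url, collection, expr):
    """Build a GET URL for the /stream handler with the expression URL-encoded."""
    query = urlencode({"expr": expr})
    return f"{base_url}/{collection}/stream?{query}"


url = stream_url(SOLR_BASE_URL, COLLECTION, EXPR)

# Decoding the query string should give back the original expression.
decoded = parse_qs(urlsplit(url).query)["expr"][0]
```

The resulting URL can be passed to any HTTP client. Encoding the expression matters because it contains quotes, spaces and the *:* query.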


