Posted to solr-user@lucene.apache.org by Raymond Xie <xi...@gmail.com> on 2018/05/20 11:23:28 UTC

How to do parallel indexing on files (not on HDFS)

I know how to do indexing on a file system source such as a single file or a folder, but how do I do that in parallel? The data I need to index is of huge volume and can't be put on HDFS.

Thank you

*------------------------------------------------*
*Sincerely yours,*


*Raymond*

Re: How to do parallel indexing on files (not on HDFS)

Posted by Rahul Singh <ra...@anant.us>.
Right,
That’s why you need a place to persist the task list / graph. If you use a table, you can set a “processed” / “unprocessed” value; with a queue, each item is delivered only once. Otherwise you have to check the indexed date from Solr and waste a Solr call.
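The table-based version of this can be sketched in a few lines. The following uses SQLite purely for illustration (the table and column names are made up); note that with multiple worker processes the select-then-update below would need a transaction or `UPDATE ... RETURNING` to stay atomic.

```python
import sqlite3

# Hypothetical task table: one row per file, with a "processed" flag.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (path TEXT PRIMARY KEY, processed INTEGER DEFAULT 0)")
conn.executemany("INSERT INTO tasks (path) VALUES (?)",
                 [("/data/a.json",), ("/data/b.json",)])

def claim_next(conn):
    """Claim one unprocessed file, or return None when everything is done.

    Safe as written only for a single worker on one connection; concurrent
    workers would need real row locking so two of them never claim one file.
    """
    row = conn.execute("SELECT path FROM tasks WHERE processed = 0 LIMIT 1").fetchone()
    if row is None:
        return None
    conn.execute("UPDATE tasks SET processed = 1 WHERE path = ?", (row[0],))
    return row[0]

while (path := claim_next(conn)) is not None:
    print("indexing", path)  # a real worker would send the file to Solr here
```

Because the flag is persisted, a restarted worker picks up only the unprocessed rows, and no Solr round-trip is needed to decide what is left.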

--
Rahul Singh
rahul.singh@anant.us

Anant Corporation


Re: How to do parallel indexing on files (not on HDFS)

Posted by Adhyan Arizki <a....@gmail.com>.
You will still need to devise a way to partition the data source even if
you are scheduling multiple jobs; otherwise, you might end up ingesting the
same data again and again.
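One lightweight way to guard against re-ingesting the same files, sketched below under assumptions of my own (the state-file location and its JSON format are illustrative, not anything Solr provides):

```python
import json
import os
import tempfile

# Hypothetical state file recording which paths have already been indexed.
STATE = os.path.join(tempfile.mkdtemp(), "indexed_files.json")

def load_seen():
    """Load the set of already-indexed paths, or an empty set on first run."""
    if os.path.exists(STATE):
        with open(STATE) as f:
            return set(json.load(f))
    return set()

def mark_seen(seen, path):
    """Record that a path has been indexed, persisting after each file."""
    seen.add(path)
    with open(STATE, "w") as f:
        json.dump(sorted(seen), f)

seen = load_seen()
for path in ["/data/a.json", "/data/b.json", "/data/a.json"]:
    if path in seen:
        continue  # already ingested by this or an earlier job
    # ... index the file into Solr here ...
    mark_seen(seen, path)
```

Each scheduled job can share this state so overlapping folder scans never index a file twice; a database table or queue, as suggested elsewhere in the thread, is the more robust version of the same idea.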




-- 

Best regards,
Adhyan Arizki

Re: How to do parallel indexing on files (not on HDFS)

Posted by Raymond Xie <xi...@gmail.com>.
Thank you all for the suggestions. I'm now leaning away from traditional
parallel indexing. My data are JSON files with metadata extracted from raw
data received and archived into our data server cluster. Those data come in
various flows and reside in their respective folders; splitting them might
introduce unnecessary extra work and could end up causing trouble. So
instead, maybe it would be easier to simply schedule multiple indexing jobs
separately?

Thanks.

Raymond


Rahul Singh <ra...@gmail.com> 于 2018年5月24日周四 上午11:23写道:

> Resending to list to help more people..
>
> This is an architectural pattern to solve the same issue that arises over
> and over again.. The queue can be anything — a table in a database, even a
> collection solr.
>
> And yes I have implemented it —  I did it in C# before using a SQL Server
> table based queue -- (http://github.com/appleseed/search-stack) — and
> then made the indexer be able to write to lucene, elastic or solr depending
> config. Im not actively maintaining this right now ,but will consider
> porting it to Kafka + Spark + Kafka Connect based system when I find time.
>
> In Kafka however, you have a lot of potential with Kafka Connect . Here is
> an example using Cassandra..
> But the premise is the same Kafka Connect has libraries of connectors for
> different source / sinks … may not work for files but for pure raw data,
> Kafka Connect is good.
>
> Here’s a project that may guide you best.
>
>
> http://saumitra.me/blog/tweet-search-and-analysis-with-kafka-solr-cassandra/
>
> I dont know where this guys code went.. but the content is there with code
> samples.
>
>
>
>
> --
>
> On May 23, 2018, 8:37 PM -0500, Raymond Xie <xi...@gmail.com>, wrote:
>
> Thank you Rahul despite that's very high level.
>
> With no offense, do you have a successful implementation or it is just
> your unproven idea? I never used Rabbit nor Kafka before but would be very
> interested in knowing more detail on the Kafka idea as Kafka is available
> in my environment.
>
> Thank you again and look forward to hearing more from you or anyone in
> this Solr community.
>
>
> *------------------------------------------------*
> *Sincerely yours,*
>
>
> *Raymond*
>
> On Wed, May 23, 2018 at 8:15 AM, Rahul Singh <rahul.xavier.singh@gmail.com
> > wrote:
>
>> Enumerate the file locations (map) , put them in a queue like rabbit or
>> Kafka (Persist the map), have a bunch of threads , workers, containers,
>> whatever pop off the queue , process the item (reduce).
>>
>>
>> --
>> Rahul Singh
>> rahul.singh@anant.us
>>
>> Anant Corporation
>>
>> On May 20, 2018, 7:24 AM -0400, Raymond Xie <xi...@gmail.com>,
>> wrote:
>>
>> I know how to do indexing on file system like single file or folder, but
>> how do I do that in a parallel way? The data I need to index is of huge
>> volume and can't be put on HDFS.
>>
>> Thank you
>>
>> *------------------------------------------------*
>> *Sincerely yours,*
>>
>>
>> *Raymond*
>>
>>
>

Re: How to do parallel indexing on files (not on HDFS)

Posted by Rahul Singh <ra...@gmail.com>.
Resending to the list to help more people.

This is an architectural pattern that solves the same issue arising over and over again. The queue can be anything: a table in a database, even a Solr collection.

And yes, I have implemented it. I did it in C# using a SQL Server table-based queue (http://github.com/appleseed/search-stack), and then made the indexer able to write to Lucene, Elasticsearch, or Solr depending on config. I'm not actively maintaining it right now, but will consider porting it to a Kafka + Spark + Kafka Connect based system when I find time.

With Kafka, however, you have a lot of potential in Kafka Connect; the project linked below is an example using Cassandra. The premise is the same: Kafka Connect has libraries of connectors for different sources and sinks. It may not work for files, but for pure raw data, Kafka Connect is good.

Here's a project that may guide you best:

http://saumitra.me/blog/tweet-search-and-analysis-with-kafka-solr-cassandra/

I don't know where this guy's code went, but the content is there with code samples.




--


Re: How to do parallel indexing on files (not on HDFS)

Posted by Adhyan Arizki <a....@gmail.com>.
Raymond,

Running parallel indexing might be trickier than it looks if the scale is
big. For instance, you can easily partition your data (say, into 5 chunks)
and run 5 processes to index them. However, you will need to watch for
chokepoints in the pipeline along the way (e.g., I/O of the database, or
commits at the core). If you think your infrastructure can handle the load,
you can try the approach I just described.
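The partitioning step itself is the only real logic here; a minimal sketch (file names invented, the per-chunk indexer left as a placeholder):

```python
def partition(paths, n):
    """Split a list of file paths into n round-robin chunks of near-equal size."""
    return [paths[i::n] for i in range(n)]

files = [f"/data/flow{i}.json" for i in range(12)]
chunks = partition(files, 5)
# Each chunk could then be handed to its own indexing process, e.g. via
# subprocess or multiprocessing, so the 5 jobs never touch the same file.
for i, chunk in enumerate(chunks):
    print(f"job {i}: {len(chunk)} files")
```

Because every file lands in exactly one chunk, the jobs cannot ingest the same data twice; the remaining risk is the shared downstream bottleneck (disk I/O, Solr commits) noted above.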




-- 

Best regards,
Adhyan Arizki

Re: How to do parallel indexing on files (not on HDFS)

Posted by Raymond Xie <xi...@gmail.com>.
Thank you, Rahul, though that's very high level.

No offense, but do you have a successful implementation, or is it just an
unproven idea? I have never used Rabbit or Kafka before, but I would be very
interested in more detail on the Kafka idea, as Kafka is available in my
environment.

Thank you again; I look forward to hearing more from you or anyone in this
Solr community.


*------------------------------------------------*
*Sincerely yours,*


*Raymond*


Re: How to do parallel indexing on files (not on HDFS)

Posted by Rahul Singh <ra...@gmail.com>.
Enumerate the file locations (map), put them in a queue like Rabbit or Kafka (persist the map), and have a bunch of threads, workers, containers, whatever, pop items off the queue and process them (reduce).
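That map/queue/reduce pattern can be shown self-contained with Python's in-process queue standing in for Rabbit or Kafka (the enumeration and the indexing call are placeholders; a real setup would walk the directory tree and post each file to Solr):

```python
import queue
import threading

def enumerate_files():
    # "map" step: in reality this would walk the directory tree
    return [f"/data/file{i}.json" for i in range(20)]

tasks = queue.Queue()
for path in enumerate_files():
    tasks.put(path)

indexed = []
lock = threading.Lock()

def worker():
    """Pop paths off the shared queue until it is empty."""
    while True:
        try:
            path = tasks.get_nowait()
        except queue.Empty:
            return
        # "reduce" step: placeholder for the call that sends the file to Solr
        with lock:
            indexed.append(path)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Swapping the in-process queue for a broker like Rabbit or Kafka keeps the same shape but lets the workers be separate processes or machines, and persists the task list across restarts.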


--
Rahul Singh
rahul.singh@anant.us

Anant Corporation
