
Posted to user@spark.apache.org by "Daniel, Ronald (ELS-SDG)" <R....@elsevier.com> on 2014/09/03 19:33:24 UTC

Accessing neighboring elements in an RDD

Hi all,

Assume I have read the lines of a text file into an RDD:

    textFile = sc.textFile("SomeArticle.txt")

Also assume that the sentence breaks in SomeArticle.txt were made by machine and contain some errors, such as the break after "Fig." in the sample text below.

Index   Text
N       ...as shown in Fig.
N+1     1.
N+2     The figure shows...

What I want is an RDD with:

N       ...as shown in Fig. 1.
N+1     The figure shows...

Is there some way for a filter() to look at neighboring elements in an RDD? That would let me examine neighboring elements in parallel and produce a new RDD that may have a different number of elements. Or do I have to iterate through the RDD sequentially?

Thanks,
Ron
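
For reference, the merge described in the question can be written as a simple sequential pass. This is a minimal sketch in plain Python, not Spark code. The function name merge_broken_sentences and the rule "a line ending in Fig. absorbs the following line" are assumptions made for illustration; a real sentence splitter would need a longer abbreviation list.

```python
def merge_broken_sentences(lines, abbreviations=("Fig.",)):
    """Join a line onto the previous one when the previous line ends with
    an abbreviation that was mistaken for the end of a sentence."""
    merged = []
    for line in lines:
        # str.endswith accepts a tuple, so several abbreviations can be checked at once.
        if merged and merged[-1].rstrip().endswith(abbreviations):
            merged[-1] = merged[-1].rstrip() + " " + line.strip()
        else:
            merged.append(line)
    return merged


sample = ["...as shown in Fig.", "1.", "The figure shows..."]
print(merge_broken_sentences(sample))
```

Because each step depends on the line just before it, this version only shows the intended result. It is the sequential baseline the question hopes to avoid.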

RE: Accessing neighboring elements in an RDD

Posted by "Daniel, Ronald (ELS-SDG)" <R....@elsevier.com>.

Thanks Xiangrui, that looks very helpful.

Best regards,
Ron



Re: Accessing neighboring elements in an RDD

Posted by Xiangrui Meng <me...@gmail.com>.

There is a sliding method implemented in MLlib
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala),
which is used in computing Area Under Curve:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/AreaUnderCurve.scala#L45

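With it, after importing the implicit conversions from RDDFunctions, you can process neighboring lines with:

    import org.apache.spark.mllib.rdd.RDDFunctions._
    rdd.sliding(3).map { case Seq(l0, l1, l2) => ... }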

-Xiangrui
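
The window idea can be applied to the "Fig." example roughly as follows. This is a sketch in plain Python that only emulates the window semantics; it does not call Spark, and the helpers sliding, ends_open, and merge_with_windows are hypothetical names. It assumes that a broken line is never followed by another broken line, and it pads both ends because a window of three would otherwise never place the first or last line in the middle position.

```python
def sliding(items, size):
    """Return every run of `size` consecutive items as a list of windows."""
    items = list(items)
    return [items[start:start + size] for start in range(len(items) - size + 1)]


def ends_open(line):
    """True when a line ends with the abbreviation that caused a bad break."""
    return line is not None and line.rstrip().endswith("Fig.")


def merge_with_windows(lines):
    # Pad so that every real line appears once as the middle of a window.
    padded = [None] + list(lines) + [None]
    out = []
    for prev, cur, nxt in sliding(padded, 3):
        if ends_open(prev):
            if ends_open(cur):
                # A window of three cannot repair chains of broken lines.
                raise ValueError("consecutive broken lines need a wider window")
            continue  # cur is emitted as part of prev's merged line
        if ends_open(cur) and nxt is not None:
            out.append(cur.rstrip() + " " + nxt.strip())
        else:
            out.append(cur)
    return out


sample = ["...as shown in Fig.", "1.", "The figure shows..."]
print(merge_with_windows(sample))
```

Each window is handled independently of the others, which is what makes the approach suitable for a map over windows. The cost is that anything spanning more lines than the window size has to be detected and handled separately.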



RE: Accessing neighboring elements in an RDD

Posted by "Daniel, Ronald (ELS-SDG)" <R....@elsevier.com>.

Thanks for the pointer to that thread. Looks like there is some demand for this capability, but not a lot yet. Also doesn't look like there is an easy answer right now.

Thanks,
Ron


Re: Accessing neighboring elements in an RDD

Posted by Chris Gore <cd...@cdgore.com>.

There is support for Spark in ElasticSearch’s Hadoop integration package.

http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html

Maybe you could split your documents, index them all from Spark, and then run a “MoreLikeThis” query against the ElasticSearch index. I haven’t tried it, but maybe someone else has more experience using Spark with ElasticSearch. At some point, maybe there could be an information retrieval package for Spark with locality-sensitive hashing and other similar functions.


Re: Accessing neighboring elements in an RDD

Posted by Victor Tso-Guillen <vt...@paxata.com>.

Interestingly, there was an almost identical question posed on Aug 22 by
cjwang. Here's the link to the archive:
http://apache-spark-user-list.1001560.n3.nabble.com/Finding-previous-and-next-element-in-a-sorted-RDD-td12621.html#a12664

