Posted to dev@zeppelin.apache.org by anish singh <an...@gmail.com> on 2016/07/04 10:00:43 UTC

[GSoC - 2016][Zeppelin Notebooks] Issues with Common Crawl Datasets

Hello,

(everything outside Zeppelin)
I have started work on the Common Crawl datasets and first looked only at
the data for May 2016. Of the three formats available, I chose WET
(plain text). The May data alone is divided into 24492 segments; I
downloaded just the first segment and got 432MB of data. The problem is
that my laptop is a very modest machine (Core 2 Duo, 3GB of RAM): even
opening the downloaded file in LibreOffice Writer filled the RAM
completely and hung the machine, so bringing the data directly into
Zeppelin, or analyzing it inside Zeppelin, seems impossible. As far as I
can tell, there are two ways I can proceed:

1) Buying a new laptop with more RAM and processor.   OR
2) Choosing another dataset

I am fine with either of the above, or with anything else you might
suggest, but please let me know which way to proceed so that I can keep
working at speed. Meanwhile, I will read more papers and publications on
ways of analyzing Common Crawl data.

Thanks,
Anish.

Re: [GSoC - 2016][Zeppelin Notebooks] Issues with Common Crawl Datasets

Posted by Alexander Bezzubov <bz...@apache.org>.
That sounds great, Anish!

Please keep it up :)

--
Alex

On Wed, Jul 20, 2016, 18:07 anish singh <an...@gmail.com> wrote:

> Alex, some good news!
>
> I just tried the first option you mentioned in the previous mail, increased
> the driver memory to 16g, reduced caching space to 0.1% of total memory and
> additionally trimmed the warc content to include only three domains and its
> working (everything including reduceByKey()). Although, I had tried this
> earlier, few days ago but it had not worked then.
>
> I even understood the core problem : the original rdd( ~ 2GB) contained
> exactly 53307 rdd elements and when I ran 'flatMap(
> r => ExtractLinks(r.getUrl(), r.getContentstring())) on the this rdd it
> resulted in explosion of data extracted from these many elements(web pages)
> which the available memory was perhaps unable to handle. This also means
> that the rest of the analysis in the notebook must be done on domains
> extracted from the original warc files so it reduces the size of data to be
> processed. In case, more RAM is needed I will try to use m4.2xlarge (32GB)
> instance.
>
> Thrilled to have it working after struggling for so many days, so now I can
> proceed with the notebook.
>
> Thanks again,
> Anish.
>
> On Wed, Jul 20, 2016 at 7:08 AM, Alexander Bezzubov <bz...@apache.org>
> wrote:
>
> > Hi Anish,
> >
> > thank you for sharing your progress and totally know what you mean -
> that's
> > an expected pain of working with real BigData.
> >
> > I would advise to conduct a series of experiments:
> >
> > *1 moderate machine*, Spark 1.6 in local mode, 1 WARC input file (1Gb)
> >  - Spark in local mode is a single JVM process, so fine-tune it and make
> > sure it uses ALL available memory (i.e 16Gb)
> >  - We are not going to use in-memory caching, so storage part can be
> turned
> > off [1]  and [2]
> >  - AFIAK DataFrames use memory more efficient than RDDs but not sure if
> we
> > can benefit from it here
> >  - Start with something simple, like `val mayBegLinks =
> > mayBegData.keepValidPages().count()` and make sure it works
> >  - Proceed further until few more complex queries work
> >
> > *Cluster of N machines*, Spark 1.6 in standalone cluster mode
> >  - process fraction of the whole dataset i.e 1 segment
> >
> >
> > I know that is not easy, but it's worth to try for 1 more week and see if
> > the approach outlined above works.
> > Last, but not least - do not hesitate to reach out to CommonCrawl
> community
> > [3] for an advice, there are people using Apache Spark there as well.
> >
> > Please keep us posted!
> >
> >  1.
> >
> http://spark.apache.org/docs/latest/tuning.html#memory-management-overview
> >  2.
> >
> >
> http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
> >  3. https://groups.google.com/forum/#!forum/common-crawl
> >
> > --
> > Alex
> >
> >
> > On Wed, Jul 20, 2016 at 2:27 AM, anish singh <an...@gmail.com>
> wrote:
> >
> > > Hello,
> > >
> > > The last two weeks have been tough and full of learning, the code in
> the
> > > previous mail which performed only simple transformation and
> > reduceByKey()
> > > to count similar domain links did not work even on the first
> segment(1005
> > > MB) of data. So I studied and read extensively on the web :
> > blogs(cloudera,
> > > databricks and stack overflow) and books on Spark, tried all the
> options
> > > and configurations on memory and performance tuning but the code did
> not
> > > run. My current configurations to SPARK_SUBMIT_OPTIONS are set to
> > > "--driver-memory 9g --driver-java-options -XX:+UseG1GC
> > > -XX:+UseCompressedOops --conf spark.storage.memoryFraction=0.1" and
> even
> > > this does not work. Even simple operations such as rdd.count() after
> the
> > > transformations in the previous mail does not work. All this on an
> > > m4.xlarge machine.
> > >
> > > Moreover, in trying to set up standalone cluster on single machine by
> > > following instructions in the book 'Learning Spark', I messed with file
> > > '~/.ssh/authorized_keys' file which cut me out of the instance so I had
> > to
> > > terminate it and start all over again after losing all the work done in
> > one
> > > week.
> > >
> > > Today, I performed a comparison of memory and cpu load values using the
> > > size of data and the machine configurations between two conditions:
> > (when I
> > > worked on my local machine) vs. (m4.xlarge single instance), where
> > >
> > > memory load = (data size) / (memory available for processing),
> > > cpu load = (data size) / (cores available for processing)
> > >
> > > the results of the comparison indicate that with the amount of data,
> the
> > > AWS instance is 100 times more constrained than the analysis that I
> > > previously did on my machine (for calculations, please see sheet [0] ).
> > > This has completely stalled work as I'm unable to perform any further
> > > operations on the data sets. Further, choosing another instance (such
> as
> > 32
> > > GiB) may also not be sufficient (as per calculations in [0]). Please
> let
> > me
> > > know if I'm missing something or how to proceed with this.
> > >
> > > [0]. https://drive.google.com/open?id=0ByXTtaL2yHBuYnJSNGt6T2U2RjQ
> > >
> > > Thanks,
> > > Anish.
> > >
> > >
> > >
> > > On Tue, Jul 12, 2016 at 12:35 PM, anish singh <an...@gmail.com>
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > I had been able to setup zeppelin with spark on aws ec2 m4.xlarge
> > > instance
> > > > a few days ago. In designing the notebook, I was trying to visualize
> > the
> > > > link structure by the following code :
> > > >
> > > > val mayBegLinks = mayBegData.keepValidPages()
> > > >                             .flatMap(r => ExtractLinks(r.getUrl,
> > > > r.getContentString))
> > > >                             .map(r => (ExtractDomain(r._1),
> > > > ExtractDomain(r._2)))
> > > >                             .filter(r => (r._1.equals("
> > www.fangraphs.com
> > > ")
> > > > || r._1.equals("www.osnews.com") ||   r._1.equals("www.dailytech.com
> > ")))
> > > >
> > > > val linkWtMap = mayBegLinks.map(r => (r, 1)).reduceByKey((x, y) => x
> +
> > y)
> > > > linkWtMap.toDF().registerTempTable("LnkWtTbl")
> > > >
> > > > where 'mayBegData' is some 2GB of WARC for the first two segments of
> > May.
> > > > This paragraph runs smoothly but in the next paragraph using %sql and
> > the
> > > > following statement :-
> > > >
> > > > select W._1 as Links, W._2 as Weight from LnkWtTbl W
> > > >
> > > > I get errors which are always java.lang.OutOfMemoryError because of
> > > > Garbage Collection space exceeded or heap space exceeded and the most
> > > > recent one is the following:
> > > >
> > > > org.apache.thrift.transport.TTransportException at
> > > >
> > >
> >
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
> > > > at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
> > at
> > > >
> > >
> >
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
> > > > at
> > > >
> > >
> >
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
> > > > at
> > > >
> > >
> >
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
> > > > at
> org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
> > > at
> > > >
> > >
> >
> org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:261)
> > > > at
> > > >
> > >
> >
> org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:245)
> > > > at
> > > >
> > >
> >
> org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:312)
> > > > at
> > > >
> > >
> >
> org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
> > > > at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:271)
> at
> > > > org.apache.zeppelin.scheduler.Job.run(Job.java:176) at
> > > >
> > >
> >
> org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
> > > > at
> > > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> > > > at java.util.concurrent.FutureTask.run(FutureTask.java:266) at
> > > >
> > >
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> > > > at
> > > >
> > >
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> > > > at
> > > >
> > >
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> > > > at
> > > >
> > >
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> > > > at java.lang.Thread.run(Thread.java:745)
> > > >
> > > > I just wanted to know that even with m4.xlarge instance, is it not
> > > > possible to process such large(~ 2GB) of data because the above code
> is
> > > > relatively simple, I guess. This is restricting the flexibility with
> > > which
> > > > the notebook can be designed. Please provide some hints/suggestions
> > since
> > > > I'm stuck on this since yesterday.
> > > >
> > > > Thanks,
> > > > Anish.
> > > >
> > > >
> > > > On Tue, Jul 5, 2016 at 12:28 PM, Alexander Bezzubov <bz...@apache.org>
> > > > wrote:
> > > >
> > > >> That sounds great, Anish!
> > > >> Congratulations on getting a new machine.
> > > >>
> > > >> No worries, please take your time and keep us posted on your
> > > exploration!
> > > >> Quality is more important than quantity here.
> > > >>
> > > >> --
> > > >> Alex
> > > >>
> > > >> On Mon, Jul 4, 2016 at 10:40 PM, anish singh <an...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > Hello,
> > > >> >
> > > >> > Thanks Alex, I'm so glad that you helped. Here's update : I've
> > ordered
> > > >> new
> > > >> > machine with more RAM and processor that should come by tomorrow.
> I
> > > will
> > > >> > attempt to use it for the common crawl data and the AWS solution
> > that
> > > >> you
> > > >> > provided in the previous mail. I'm presently reading papers and
> > > >> > publications regarding analysis of common crawl data. Warcbase
> tool
> > > will
> > > >> > definitely be used. I understand that common crawl datasets are
> > > >> important
> > > >> > and I will do everything it takes to make notebooks on them, the
> > only
> > > >> > tension is that it may take more time than the previous notebooks.
> > > >> >
> > > >> > Anish.
> > > >> >
> > > >> > On Mon, Jul 4, 2016 at 6:30 PM, Alexander Bezzubov <
> bzz@apache.org>
> > > >> wrote:
> > > >> >
> > > >> > > Hi Anish,
> > > >> > >
> > > >> > > thanks for keeping us posted about a progress!
> > > >> > >
> > > >> > > CommonCrawl is important dataset and it would be awesome if we
> > could
> > > >> > > find a way for you to build some notebooks for it though this
> this
> > > >> > > years GSoC program.
> > > >> > >
> > > >> > > How about running Zeppelin on a single big enough node in AWS
> for
> > > the
> > > >> > > sake of this notebook?
> > > >> > > If you use spot instance you could get even big instances for
> > really
> > > >> > > affordable price of 2-4$ a day, just need to make sure your
> > persist
> > > >> > > notebooks on S3 [1] to avoid loosing the data and shut down it
> for
> > > the
> > > >> > > night.
> > > >> > >
> > > >> > > AFAIK We do not have free any AWS credits for now, even for a
> GSoC
> > > >> > > students. If somebody knows a way to provide\get some - please
> > feel
> > > >> > > free to chime in, I know there are some Amazonian people on the
> > list
> > > >> > > :)
> > > >> > >
> > > >> > > But so far AWS spot instances is the most cost-effective
> solution
> > I
> > > >> > > could imagine of. Bonus: if you host your instance in region
> > > us-east-1
> > > >> > > - transfer from\to S3 will be free, as that's where CommonCrawl
> > > >> > > dataset is living.
> > > >> > >
> > > >> > > One more thing - please check out awesome WarcBase library [2]
> > build
> > > >> > > by internet preservation community. I find it really helpful,
> > > working
> > > >> > > with web archives.
> > > >> > >
> > > >> > > On the notebook design:
> > > >> > >  - to understand the context of this dataset better - please do
> > some
> > > >> > > research how other people use it. What for, etc.
> > > >> > >    Would be a great material for the blog post
> > > >> > >  - try provide examples of all available formats: WARC, WET, WAT
> > (in
> > > >> > > may be in same or different notebooks, it's up to you)
> > > >> > >  - while using warcbase - mind that RDD persistence will not
> work
> > > >> > > until [3] is resolved, so avoid using if for now
> > > >> > >
> > > >> > > I understand that this can be a big task, so do not worry if
> that
> > > >> > > takes time (learning AWS, etc) - just keep us posted on your
> > > progress
> > > >> > > weekly and I'll be glad to help!
> > > >> > >
> > > >> > >
> > > >> > >  1.
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> http://zeppelin.apache.org/docs/0.6.0-SNAPSHOT/storage/storage.html#notebook-storage-in-s3
> > > >> > >  2. https://github.com/lintool/warcbase
> > > >> > >  3. https://github.com/lintool/warcbase/issues/227
> > > >> > >
> > > >> > > On Mon, Jul 4, 2016 at 7:00 PM, anish singh <
> anish18sun@gmail.com
> > >
> > > >> > wrote:
> > > >> > > > Hello,
> > > >> > > >
> > > >> > > > (everything outside Zeppelin)
> > > >> > > > I had started work on the common crawl datasets, and tried to
> > > first
> > > >> > have
> > > >> > > a
> > > >> > > > look at only the data for May 2016. Out of the three formats
> > > >> > available, I
> > > >> > > > chose the WET(plain text format). The data only for May is
> > divided
> > > >> into
> > > >> > > > segments and there are 24492 such segments. I downloaded only
> > the
> > > >> first
> > > >> > > > segment for May and got 432MB of data. Now the problem is that
> > my
> > > >> > laptop
> > > >> > > is
> > > >> > > > a very modest machine with core 2 duo processor and 3GB of RAM
> > > such
> > > >> > that
> > > >> > > > even opening the downloaded data file in LibreWriter filled
> the
> > > RAM
> > > >> > > > completely and hung the machine and bringing the data directly
> > > into
> > > >> > > > zeppelin or analyzing it inside zeppelin seems impossible. As
> > good
> > > >> as I
> > > >> > > > know, there are two ways in which I can proceed :
> > > >> > > >
> > > >> > > > 1) Buying a new laptop with more RAM and processor.   OR
> > > >> > > > 2) Choosing another dataset
> > > >> > > >
> > > >> > > > I have no problem with either of the above ways or anything
> that
> > > you
> > > >> > > might
> > > >> > > > suggest but please let me know which way to proceed so that I
> > may
> > > be
> > > >> > able
> > > >> > > > to work in speed. Meanwhile, I will read more papers and
> > > >> publications
> > > >> > on
> > > >> > > > possibilities of analyzing common crawl data.
> > > >> > > >
> > > >> > > > Thanks,
> > > >> > > > Anish.
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>

Re: [GSoC - 2016][Zeppelin Notebooks] Issues with Common Crawl Datasets

Posted by anish singh <an...@gmail.com>.
Alex, some good news!

I just tried the first option you mentioned in the previous mail: increased
the driver memory to 16g, reduced the storage (caching) fraction to 0.1 and
additionally trimmed the WARC content to just three domains, and it's
working (everything, including reduceByKey()). I had tried this a few days
ago, but it had not worked then.
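
For reference, those settings amount to roughly the following in
conf/zeppelin-env.sh (reconstructed from the values above rather than
copied verbatim, so treat it as a sketch):

export SPARK_SUBMIT_OPTIONS="--driver-memory 16g \
  --driver-java-options '-XX:+UseG1GC -XX:+UseCompressedOops' \
  --conf spark.storage.memoryFraction=0.1"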

I also understood the core problem: the original RDD (~2GB) contained
exactly 53307 elements, and when I ran
flatMap(r => ExtractLinks(r.getUrl, r.getContentString)) on this RDD it
caused an explosion of data extracted from that many elements (web pages),
which the available memory was apparently unable to handle. This also means
that the rest of the analysis in the notebook must be done on domains
extracted from the original WARC files, so that the size of the data to be
processed stays small. In case more RAM is needed, I will try an
m4.2xlarge (32GB) instance.
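
As a sketch of what I mean (reusing the warcbase helpers from my earlier
mail; the exact notebook code may end up slightly different):

val keptDomains = Set("www.fangraphs.com", "www.osnews.com",
                      "www.dailytech.com")

// Filter down to pages from the three domains *before* extracting links,
// so that flatMap only sees a small fraction of the 53307 pages.
val trimmed = mayBegData.keepValidPages()
                        .filter(r => keptDomains.contains(ExtractDomain(r.getUrl)))

val mayBegLinks = trimmed.flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
                         .map(r => (ExtractDomain(r._1), ExtractDomain(r._2)))

val linkWtMap = mayBegLinks.map(r => (r, 1)).reduceByKey(_ + _)
linkWtMap.toDF().registerTempTable("LnkWtTbl")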

I'm thrilled to have it working after struggling for so many days; now I
can proceed with the notebook.

Thanks again,
Anish.

On Wed, Jul 20, 2016 at 7:08 AM, Alexander Bezzubov <bz...@apache.org> wrote:

> Hi Anish,
>
> thank you for sharing your progress and totally know what you mean - that's
> an expected pain of working with real BigData.
>
> I would advise to conduct a series of experiments:
>
> *1 moderate machine*, Spark 1.6 in local mode, 1 WARC input file (1Gb)
>  - Spark in local mode is a single JVM process, so fine-tune it and make
> sure it uses ALL available memory (i.e 16Gb)
>  - We are not going to use in-memory caching, so storage part can be turned
> off [1]  and [2]
>  - AFIAK DataFrames use memory more efficient than RDDs but not sure if we
> can benefit from it here
>  - Start with something simple, like `val mayBegLinks =
> mayBegData.keepValidPages().count()` and make sure it works
>  - Proceed further until few more complex queries work
>
> *Cluster of N machines*, Spark 1.6 in standalone cluster mode
>  - process fraction of the whole dataset i.e 1 segment
>
>
> I know that is not easy, but it's worth to try for 1 more week and see if
> the approach outlined above works.
> Last, but not least - do not hesitate to reach out to CommonCrawl community
> [3] for an advice, there are people using Apache Spark there as well.
>
> Please keep us posted!
>
>  1.
> http://spark.apache.org/docs/latest/tuning.html#memory-management-overview
>  2.
>
> http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
>  3. https://groups.google.com/forum/#!forum/common-crawl
>
> --
> Alex
>
>
> On Wed, Jul 20, 2016 at 2:27 AM, anish singh <an...@gmail.com> wrote:
>
> > Hello,
> >
> > The last two weeks have been tough and full of learning, the code in the
> > previous mail which performed only simple transformation and
> reduceByKey()
> > to count similar domain links did not work even on the first segment(1005
> > MB) of data. So I studied and read extensively on the web :
> blogs(cloudera,
> > databricks and stack overflow) and books on Spark, tried all the options
> > and configurations on memory and performance tuning but the code did not
> > run. My current configurations to SPARK_SUBMIT_OPTIONS are set to
> > "--driver-memory 9g --driver-java-options -XX:+UseG1GC
> > -XX:+UseCompressedOops --conf spark.storage.memoryFraction=0.1" and even
> > this does not work. Even simple operations such as rdd.count() after the
> > transformations in the previous mail does not work. All this on an
> > m4.xlarge machine.
> >
> > Moreover, in trying to set up standalone cluster on single machine by
> > following instructions in the book 'Learning Spark', I messed with file
> > '~/.ssh/authorized_keys' file which cut me out of the instance so I had
> to
> > terminate it and start all over again after losing all the work done in
> one
> > week.
> >
> > Today, I performed a comparison of memory and cpu load values using the
> > size of data and the machine configurations between two conditions:
> (when I
> > worked on my local machine) vs. (m4.xlarge single instance), where
> >
> > memory load = (data size) / (memory available for processing),
> > cpu load = (data size) / (cores available for processing)
> >
> > the results of the comparison indicate that with the amount of data, the
> > AWS instance is 100 times more constrained than the analysis that I
> > previously did on my machine (for calculations, please see sheet [0] ).
> > This has completely stalled work as I'm unable to perform any further
> > operations on the data sets. Further, choosing another instance (such as
> 32
> > GiB) may also not be sufficient (as per calculations in [0]). Please let
> me
> > know if I'm missing something or how to proceed with this.
> >
> > [0]. https://drive.google.com/open?id=0ByXTtaL2yHBuYnJSNGt6T2U2RjQ
> >
> > Thanks,
> > Anish.
> >
> >
> >
> > On Tue, Jul 12, 2016 at 12:35 PM, anish singh <an...@gmail.com>
> > wrote:
> >
> > > Hello,
> > >
> > > I had been able to setup zeppelin with spark on aws ec2 m4.xlarge
> > instance
> > > a few days ago. In designing the notebook, I was trying to visualize
> the
> > > link structure by the following code :
> > >
> > > val mayBegLinks = mayBegData.keepValidPages()
> > >                             .flatMap(r => ExtractLinks(r.getUrl,
> > > r.getContentString))
> > >                             .map(r => (ExtractDomain(r._1),
> > > ExtractDomain(r._2)))
> > >                             .filter(r => (r._1.equals("
> www.fangraphs.com
> > ")
> > > || r._1.equals("www.osnews.com") ||   r._1.equals("www.dailytech.com
> ")))
> > >
> > > val linkWtMap = mayBegLinks.map(r => (r, 1)).reduceByKey((x, y) => x +
> y)
> > > linkWtMap.toDF().registerTempTable("LnkWtTbl")
> > >
> > > where 'mayBegData' is some 2GB of WARC for the first two segments of
> May.
> > > This paragraph runs smoothly but in the next paragraph using %sql and
> the
> > > following statement :-
> > >
> > > select W._1 as Links, W._2 as Weight from LnkWtTbl W
> > >
> > > I get errors which are always java.lang.OutOfMemoryError because of
> > > Garbage Collection space exceeded or heap space exceeded and the most
> > > recent one is the following:
> > >
> > > org.apache.thrift.transport.TTransportException at
> > >
> >
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
> > > at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
> at
> > >
> >
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
> > > at
> > >
> >
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
> > > at
> > >
> >
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
> > > at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
> > at
> > >
> >
> org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:261)
> > > at
> > >
> >
> org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:245)
> > > at
> > >
> >
> org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:312)
> > > at
> > >
> >
> org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
> > > at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:271) at
> > > org.apache.zeppelin.scheduler.Job.run(Job.java:176) at
> > >
> >
> org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
> > > at
> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> > > at java.util.concurrent.FutureTask.run(FutureTask.java:266) at
> > >
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> > > at
> > >
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> > > at
> > >
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> > > at
> > >
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> > > at java.lang.Thread.run(Thread.java:745)
> > >
> > > I just wanted to know that even with m4.xlarge instance, is it not
> > > possible to process such large(~ 2GB) of data because the above code is
> > > relatively simple, I guess. This is restricting the flexibility with
> > which
> > > the notebook can be designed. Please provide some hints/suggestions
> since
> > > I'm stuck on this since yesterday.
> > >
> > > Thanks,
> > > Anish.
> > >
> > >
> > > On Tue, Jul 5, 2016 at 12:28 PM, Alexander Bezzubov <bz...@apache.org>
> > > wrote:
> > >
> > >> That sounds great, Anish!
> > >> Congratulations on getting a new machine.
> > >>
> > >> No worries, please take your time and keep us posted on your
> > exploration!
> > >> Quality is more important than quantity here.
> > >>
> > >> --
> > >> Alex
> > >>
> > >> On Mon, Jul 4, 2016 at 10:40 PM, anish singh <an...@gmail.com>
> > >> wrote:
> > >>
> > >> > Hello,
> > >> >
> > >> > Thanks Alex, I'm so glad that you helped. Here's update : I've
> ordered
> > >> new
> > >> > machine with more RAM and processor that should come by tomorrow. I
> > will
> > >> > attempt to use it for the common crawl data and the AWS solution
> that
> > >> you
> > >> > provided in the previous mail. I'm presently reading papers and
> > >> > publications regarding analysis of common crawl data. Warcbase tool
> > will
> > >> > definitely be used. I understand that common crawl datasets are
> > >> important
> > >> > and I will do everything it takes to make notebooks on them, the
> only
> > >> > tension is that it may take more time than the previous notebooks.
> > >> >
> > >> > Anish.
> > >> >
> > >> > On Mon, Jul 4, 2016 at 6:30 PM, Alexander Bezzubov <bz...@apache.org>
> > >> wrote:
> > >> >
> > >> > > Hi Anish,
> > >> > >
> > >> > > thanks for keeping us posted about a progress!
> > >> > >
> > >> > > CommonCrawl is important dataset and it would be awesome if we
> could
> > >> > > find a way for you to build some notebooks for it though this this
> > >> > > years GSoC program.
> > >> > >
> > >> > > How about running Zeppelin on a single big enough node in AWS for
> > the
> > >> > > sake of this notebook?
> > >> > > If you use spot instance you could get even big instances for
> really
> > >> > > affordable price of 2-4$ a day, just need to make sure your
> persist
> > >> > > notebooks on S3 [1] to avoid loosing the data and shut down it for
> > the
> > >> > > night.
> > >> > >
> > >> > > AFAIK We do not have free any AWS credits for now, even for a GSoC
> > >> > > students. If somebody knows a way to provide\get some - please
> feel
> > >> > > free to chime in, I know there are some Amazonian people on the
> list
> > >> > > :)
> > >> > >
> > >> > > But so far AWS spot instances is the most cost-effective solution
> I
> > >> > > could imagine of. Bonus: if you host your instance in region
> > us-east-1
> > >> > > - transfer from\to S3 will be free, as that's where CommonCrawl
> > >> > > dataset is living.
> > >> > >
> > >> > > One more thing - please check out awesome WarcBase library [2]
> build
> > >> > > by internet preservation community. I find it really helpful,
> > working
> > >> > > with web archives.
> > >> > >
> > >> > > On the notebook design:
> > >> > >  - to understand the context of this dataset better - please do
> some
> > >> > > research how other people use it. What for, etc.
> > >> > >    Would be a great material for the blog post
> > >> > >  - try provide examples of all available formats: WARC, WET, WAT
> (in
> > >> > > may be in same or different notebooks, it's up to you)
> > >> > >  - while using warcbase - mind that RDD persistence will not work
> > >> > > until [3] is resolved, so avoid using if for now
> > >> > >
> > >> > > I understand that this can be a big task, so do not worry if that
> > >> > > takes time (learning AWS, etc) - just keep us posted on your
> > progress
> > >> > > weekly and I'll be glad to help!
> > >> > >
> > >> > >
> > >> > >  1.
> > >> > >
> > >> >
> > >>
> >
> http://zeppelin.apache.org/docs/0.6.0-SNAPSHOT/storage/storage.html#notebook-storage-in-s3
> > >> > >  2. https://github.com/lintool/warcbase
> > >> > >  3. https://github.com/lintool/warcbase/issues/227
> > >> > >
> > >> > > On Mon, Jul 4, 2016 at 7:00 PM, anish singh <anish18sun@gmail.com
> >
> > >> > wrote:
> > >> > > > Hello,
> > >> > > >
> > >> > > > (everything outside Zeppelin)
> > >> > > > I had started work on the common crawl datasets, and tried to
> > first
> > >> > have
> > >> > > a
> > >> > > > look at only the data for May 2016. Out of the three formats
> > >> > available, I
> > >> > > > chose the WET(plain text format). The data only for May is
> divided
> > >> into
> > >> > > > segments and there are 24492 such segments. I downloaded only
> the
> > >> first
> > >> > > > segment for May and got 432MB of data. Now the problem is that
> my
> > >> > laptop
> > >> > > is
> > >> > > > a very modest machine with core 2 duo processor and 3GB of RAM
> > such
> > >> > that
> > >> > > > even opening the downloaded data file in LibreWriter filled the
> > RAM
> > >> > > > completely and hung the machine and bringing the data directly
> > into
> > >> > > > zeppelin or analyzing it inside zeppelin seems impossible. As
> good
> > >> as I
> > >> > > > know, there are two ways in which I can proceed :
> > >> > > >
> > >> > > > 1) Buying a new laptop with more RAM and processor.   OR
> > >> > > > 2) Choosing another dataset
> > >> > > >
> > >> > > > I have no problem with either of the above ways or anything that
> > you
> > >> > > might
> > >> > > > suggest but please let me know which way to proceed so that I
> may
> > be
> > >> > able
> > >> > > > to work in speed. Meanwhile, I will read more papers and
> > >> publications
> > >> > on
> > >> > > > possibilities of analyzing common crawl data.
> > >> > > >
> > >> > > > Thanks,
> > >> > > > Anish.
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: [GSoC - 2016][Zeppelin Notebooks] Issues with Common Crawl Datasets

Posted by Alexander Bezzubov <bz...@apache.org>.
Hi Anish,

thank you for sharing your progress, and I totally know what you mean -
that's the expected pain of working with real BigData.

I would advise conducting a series of experiments:

*1 moderate machine*, Spark 1.6 in local mode, 1 WARC input file (1Gb)
 - Spark in local mode is a single JVM process, so fine-tune it and make
sure it uses ALL available memory (i.e. 16Gb)
 - We are not going to use in-memory caching, so the storage part can be
turned off [1] and [2]
 - AFAIK DataFrames use memory more efficiently than RDDs, but I am not
sure we can benefit from that here
 - Start with something simple, like `val mayBegLinks =
mayBegData.keepValidPages().count()`, and make sure it works
 - Proceed further until a few more complex queries work (see the sketch
after this list)

*Cluster of N machines*, Spark 1.6 in standalone cluster mode
 - process a fraction of the whole dataset, i.e. 1 segment
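
A minimal sketch of that progression, assuming mayBegData is loaded from a
single WARC file with warcbase as in Anish's earlier mails (the names and
the DataFrame step are illustrative only):

// Step 1: the cheapest query - just count the valid pages.
val pages = mayBegData.keepValidPages()
println(pages.count())

// Step 2: heavier - extract the links, but still only count them.
val links = pages.flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
println(links.count())

// Step 3: a DataFrame variant of the domain-pair weighting (whether it
// actually helps memory-wise is exactly what we are not sure about):
import sqlContext.implicits._
val linkWtDF = links.map(r => (ExtractDomain(r._1), ExtractDomain(r._2)))
                    .toDF("src", "dst")
                    .groupBy("src", "dst")
                    .count()
linkWtDF.registerTempTable("LnkWtTbl")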


I know this is not easy, but it's worth trying for 1 more week to see if
the approach outlined above works.
Last but not least - do not hesitate to reach out to the CommonCrawl
community [3] for advice; there are people using Apache Spark there as well.

Please keep us posted!

 1.
http://spark.apache.org/docs/latest/tuning.html#memory-management-overview
 2.
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
 3. https://groups.google.com/forum/#!forum/common-crawl

--
Alex


On Wed, Jul 20, 2016 at 2:27 AM, anish singh <an...@gmail.com> wrote:

> Hello,
>
> The last two weeks have been tough and full of learning, the code in the
> previous mail which performed only simple transformation and reduceByKey()
> to count similar domain links did not work even on the first segment(1005
> MB) of data. So I studied and read extensively on the web : blogs(cloudera,
> databricks and stack overflow) and books on Spark, tried all the options
> and configurations on memory and performance tuning but the code did not
> run. My current configurations to SPARK_SUBMIT_OPTIONS are set to
> "--driver-memory 9g --driver-java-options -XX:+UseG1GC
> -XX:+UseCompressedOops --conf spark.storage.memoryFraction=0.1" and even
> this does not work. Even simple operations such as rdd.count() after the
> transformations in the previous mail does not work. All this on an
> m4.xlarge machine.
>
> Moreover, in trying to set up standalone cluster on single machine by
> following instructions in the book 'Learning Spark', I messed with file
> '~/.ssh/authorized_keys' file which cut me out of the instance so I had to
> terminate it and start all over again after losing all the work done in one
> week.
>
> Today, I performed a comparison of memory and cpu load values using the
> size of data and the machine configurations between two conditions: (when I
> worked on my local machine) vs. (m4.xlarge single instance), where
>
> memory load = (data size) / (memory available for processing),
> cpu load = (data size) / (cores available for processing)
>
> the results of the comparison indicate that with the amount of data, the
> AWS instance is 100 times more constrained than the analysis that I
> previously did on my machine (for calculations, please see sheet [0] ).
> This has completely stalled work as I'm unable to perform any further
> operations on the data sets. Further, choosing another instance (such as 32
> GiB) may also not be sufficient (as per calculations in [0]). Please let me
> know if I'm missing something or how to proceed with this.
>
> [0]. https://drive.google.com/open?id=0ByXTtaL2yHBuYnJSNGt6T2U2RjQ
>
> Thanks,
> Anish.
>
>
>
> On Tue, Jul 12, 2016 at 12:35 PM, anish singh <an...@gmail.com>
> wrote:
>
> > Hello,
> >
> > I had been able to setup zeppelin with spark on aws ec2 m4.xlarge
> instance
> > a few days ago. In designing the notebook, I was trying to visualize the
> > link structure by the following code :
> >
> > val mayBegLinks = mayBegData.keepValidPages()
> >                             .flatMap(r => ExtractLinks(r.getUrl,
> > r.getContentString))
> >                             .map(r => (ExtractDomain(r._1),
> > ExtractDomain(r._2)))
> >                             .filter(r => (r._1.equals("www.fangraphs.com
> ")
> > || r._1.equals("www.osnews.com") ||   r._1.equals("www.dailytech.com")))
> >
> > val linkWtMap = mayBegLinks.map(r => (r, 1)).reduceByKey((x, y) => x + y)
> > linkWtMap.toDF().registerTempTable("LnkWtTbl")
> >
> > where 'mayBegData' is some 2GB of WARC for the first two segments of May.
> > This paragraph runs smoothly but in the next paragraph using %sql and the
> > following statement :-
> >
> > select W._1 as Links, W._2 as Weight from LnkWtTbl W
> >
> > I get errors which are always java.lang.OutOfMemoryError because of
> > Garbage Collection space exceeded or heap space exceeded and the most
> > recent one is the following:
> >
> > org.apache.thrift.transport.TTransportException at
> >
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
> > at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) at
> >
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
> > at
> >
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
> > at
> >
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
> > at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
> at
> >
> org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:261)
> > at
> >
> org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:245)
> > at
> >
> org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:312)
> > at
> >
> org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
> > at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:271) at
> > org.apache.zeppelin.scheduler.Job.run(Job.java:176) at
> >
> org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
> > at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> > at java.util.concurrent.FutureTask.run(FutureTask.java:266) at
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> > at
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> > at java.lang.Thread.run(Thread.java:745)
> >
> > I just wanted to know that even with m4.xlarge instance, is it not
> > possible to process such large(~ 2GB) of data because the above code is
> > relatively simple, I guess. This is restricting the flexibility with
> which
> > the notebook can be designed. Please provide some hints/suggestions since
> > I'm stuck on this since yesterday.
> >
> > Thanks,
> > Anish.
> >
> >
> > On Tue, Jul 5, 2016 at 12:28 PM, Alexander Bezzubov <bz...@apache.org>
> > wrote:
> >
> >> That sounds great, Anish!
> >> Congratulations on getting a new machine.
> >>
> >> No worries, please take your time and keep us posted on your
> exploration!
> >> Quality is more important than quantity here.
> >>
> >> --
> >> Alex
> >>
> >> On Mon, Jul 4, 2016 at 10:40 PM, anish singh <an...@gmail.com>
> >> wrote:
> >>
> >> > Hello,
> >> >
> >> > Thanks Alex, I'm so glad that you helped. Here's update : I've ordered
> >> new
> >> > machine with more RAM and processor that should come by tomorrow. I
> will
> >> > attempt to use it for the common crawl data and the AWS solution that
> >> you
> >> > provided in the previous mail. I'm presently reading papers and
> >> > publications regarding analysis of common crawl data. Warcbase tool
> will
> >> > definitely be used. I understand that common crawl datasets are
> >> important
> >> > and I will do everything it takes to make notebooks on them, the only
> >> > tension is that it may take more time than the previous notebooks.
> >> >
> >> > Anish.
> >> >
> >> > On Mon, Jul 4, 2016 at 6:30 PM, Alexander Bezzubov <bz...@apache.org>
> >> wrote:
> >> >
> >> > > Hi Anish,
> >> > >
> >> > > thanks for keeping us posted about a progress!
> >> > >
> >> > > CommonCrawl is important dataset and it would be awesome if we could
> >> > > find a way for you to build some notebooks for it though this this
> >> > > years GSoC program.
> >> > >
> >> > > How about running Zeppelin on a single big enough node in AWS for
> the
> >> > > sake of this notebook?
> >> > > If you use spot instance you could get even big instances for really
> >> > > affordable price of 2-4$ a day, just need to make sure your persist
> >> > > notebooks on S3 [1] to avoid loosing the data and shut down it for
> the
> >> > > night.
> >> > >
> >> > > AFAIK We do not have free any AWS credits for now, even for a GSoC
> >> > > students. If somebody knows a way to provide\get some - please feel
> >> > > free to chime in, I know there are some Amazonian people on the list
> >> > > :)
> >> > >
> >> > > But so far AWS spot instances is the most cost-effective solution I
> >> > > could imagine of. Bonus: if you host your instance in region
> us-east-1
> >> > > - transfer from\to S3 will be free, as that's where CommonCrawl
> >> > > dataset is living.
> >> > >
> >> > > One more thing - please check out awesome WarcBase library [2] build
> >> > > by internet preservation community. I find it really helpful,
> working
> >> > > with web archives.
> >> > >
> >> > > On the notebook design:
> >> > >  - to understand the context of this dataset better - please do some
> >> > > research how other people use it. What for, etc.
> >> > >    Would be a great material for the blog post
> >> > >  - try provide examples of all available formats: WARC, WET, WAT (in
> >> > > may be in same or different notebooks, it's up to you)
> >> > >  - while using warcbase - mind that RDD persistence will not work
> >> > > until [3] is resolved, so avoid using if for now
> >> > >
> >> > > I understand that this can be a big task, so do not worry if that
> >> > > takes time (learning AWS, etc) - just keep us posted on your
> progress
> >> > > weekly and I'll be glad to help!
> >> > >
> >> > >
> >> > >  1.
> >> > >
> >> >
> >>
> http://zeppelin.apache.org/docs/0.6.0-SNAPSHOT/storage/storage.html#notebook-storage-in-s3
> >> > >  2. https://github.com/lintool/warcbase
> >> > >  3. https://github.com/lintool/warcbase/issues/227
> >> > >
> >> > > On Mon, Jul 4, 2016 at 7:00 PM, anish singh <an...@gmail.com>
> >> > wrote:
> >> > > > Hello,
> >> > > >
> >> > > > (everything outside Zeppelin)
> >> > > > I had started work on the common crawl datasets, and tried to
> first
> >> > have
> >> > > a
> >> > > > look at only the data for May 2016. Out of the three formats
> >> > available, I
> >> > > > chose the WET(plain text format). The data only for May is divided
> >> into
> >> > > > segments and there are 24492 such segments. I downloaded only the
> >> first
> >> > > > segment for May and got 432MB of data. Now the problem is that my
> >> > laptop
> >> > > is
> >> > > > a very modest machine with core 2 duo processor and 3GB of RAM
> such
> >> > that
> >> > > > even opening the downloaded data file in LibreWriter filled the
> RAM
> >> > > > completely and hung the machine and bringing the data directly
> into
> >> > > > zeppelin or analyzing it inside zeppelin seems impossible. As good
> >> as I
> >> > > > know, there are two ways in which I can proceed :
> >> > > >
> >> > > > 1) Buying a new laptop with more RAM and processor.   OR
> >> > > > 2) Choosing another dataset
> >> > > >
> >> > > > I have no problem with either of the above ways or anything that
> you
> >> > > might
> >> > > > suggest but please let me know which way to proceed so that I may
> be
> >> > able
> >> > > > to work in speed. Meanwhile, I will read more papers and
> >> publications
> >> > on
> >> > > > possibilities of analyzing common crawl data.
> >> > > >
> >> > > > Thanks,
> >> > > > Anish.
> >> > >
> >> >
> >>
> >
> >
>

Re: [GSoC - 2016][Zeppelin Notebooks] Issues with Common Crawl Datasets

Posted by anish singh <an...@gmail.com>.
Hello,

The last two weeks have been tough and full of learning. The code in the
previous mail, which performed only simple transformations and a
reduceByKey() to count similar domain links, did not work even on the first
segment (1005 MB) of data. So I studied and read extensively on the web
(Cloudera and Databricks blogs, Stack Overflow) and in books on Spark, and
tried all the options and configurations for memory and performance tuning,
but the code still did not run. My current SPARK_SUBMIT_OPTIONS is set to
"--driver-memory 9g --driver-java-options -XX:+UseG1GC
-XX:+UseCompressedOops --conf spark.storage.memoryFraction=0.1" and even
this does not work. Even simple operations such as rdd.count() after the
transformations in the previous mail do not work. All of this is on an
m4.xlarge machine.

Moreover, while trying to set up a standalone cluster on a single machine
by following the instructions in the book 'Learning Spark', I messed up the
'~/.ssh/authorized_keys' file, which locked me out of the instance, so I
had to terminate it and start all over again after losing a week's work.

Today I compared memory and CPU load values, using the size of the data and
the machine configurations, between two conditions: (working on my local
machine) vs. (a single m4.xlarge instance), where

memory load = (data size) / (memory available for processing),
cpu load = (data size) / (cores available for processing)

The results indicate that, relative to the amount of data, the AWS instance
is 100 times more constrained than the analysis I previously did on my own
machine (for the calculations, please see sheet [0]). This has completely
stalled the work, as I'm unable to perform any further operations on the
datasets. Further, choosing another instance type (such as one with 32 GiB)
may also not be sufficient (as per the calculations in [0]). Please let me
know if I'm missing something, or how to proceed with this.

[0]. https://drive.google.com/open?id=0ByXTtaL2yHBuYnJSNGt6T2U2RjQ

Thanks,
Anish.



On Tue, Jul 12, 2016 at 12:35 PM, anish singh <an...@gmail.com> wrote:

> Hello,
>
> I had been able to setup zeppelin with spark on aws ec2 m4.xlarge instance
> a few days ago. In designing the notebook, I was trying to visualize the
> link structure by the following code :
>
> val mayBegLinks = mayBegData.keepValidPages()
>                             .flatMap(r => ExtractLinks(r.getUrl,
> r.getContentString))
>                             .map(r => (ExtractDomain(r._1),
> ExtractDomain(r._2)))
>                             .filter(r => (r._1.equals("www.fangraphs.com")
> || r._1.equals("www.osnews.com") ||   r._1.equals("www.dailytech.com")))
>
> val linkWtMap = mayBegLinks.map(r => (r, 1)).reduceByKey((x, y) => x + y)
> linkWtMap.toDF().registerTempTable("LnkWtTbl")
>
> where 'mayBegData' is some 2GB of WARC for the first two segments of May.
> This paragraph runs smoothly but in the next paragraph using %sql and the
> following statement :-
>
> select W._1 as Links, W._2 as Weight from LnkWtTbl W
>
> I get errors which are always java.lang.OutOfMemoryError because of
> Garbage Collection space exceeded or heap space exceeded and the most
> recent one is the following:
>
> org.apache.thrift.transport.TTransportException at
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
> at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) at
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
> at
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
> at
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
> at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69) at
> org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:261)
> at
> org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:245)
> at
> org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:312)
> at
> org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
> at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:271) at
> org.apache.zeppelin.scheduler.Job.run(Job.java:176) at
> org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> I just wanted to know that even with m4.xlarge instance, is it not
> possible to process such large(~ 2GB) of data because the above code is
> relatively simple, I guess. This is restricting the flexibility with which
> the notebook can be designed. Please provide some hints/suggestions since
> I'm stuck on this since yesterday.
>
> Thanks,
> Anish.
>
>
> On Tue, Jul 5, 2016 at 12:28 PM, Alexander Bezzubov <bz...@apache.org>
> wrote:
>
>> That sounds great, Anish!
>> Congratulations on getting a new machine.
>>
>> No worries, please take your time and keep us posted on your exploration!
>> Quality is more important than quantity here.
>>
>> --
>> Alex
>>
>> On Mon, Jul 4, 2016 at 10:40 PM, anish singh <an...@gmail.com>
>> wrote:
>>
>> > Hello,
>> >
>> > Thanks Alex, I'm so glad that you helped. Here's update : I've ordered
>> new
>> > machine with more RAM and processor that should come by tomorrow. I will
>> > attempt to use it for the common crawl data and the AWS solution that
>> you
>> > provided in the previous mail. I'm presently reading papers and
>> > publications regarding analysis of common crawl data. Warcbase tool will
>> > definitely be used. I understand that common crawl datasets are
>> important
>> > and I will do everything it takes to make notebooks on them, the only
>> > tension is that it may take more time than the previous notebooks.
>> >
>> > Anish.
>> >
>> > On Mon, Jul 4, 2016 at 6:30 PM, Alexander Bezzubov <bz...@apache.org>
>> wrote:
>> >
>> > > Hi Anish,
>> > >
>> > > thanks for keeping us posted about a progress!
>> > >
>> > > CommonCrawl is important dataset and it would be awesome if we could
>> > > find a way for you to build some notebooks for it though this this
>> > > years GSoC program.
>> > >
>> > > How about running Zeppelin on a single big enough node in AWS for the
>> > > sake of this notebook?
>> > > If you use spot instance you could get even big instances for really
>> > > affordable price of 2-4$ a day, just need to make sure your persist
>> > > notebooks on S3 [1] to avoid loosing the data and shut down it for the
>> > > night.
>> > >
>> > > AFAIK We do not have free any AWS credits for now, even for a GSoC
>> > > students. If somebody knows a way to provide\get some - please feel
>> > > free to chime in, I know there are some Amazonian people on the list
>> > > :)
>> > >
>> > > But so far AWS spot instances is the most cost-effective solution I
>> > > could imagine of. Bonus: if you host your instance in region us-east-1
>> > > - transfer from\to S3 will be free, as that's where CommonCrawl
>> > > dataset is living.
>> > >
>> > > One more thing - please check out awesome WarcBase library [2] build
>> > > by internet preservation community. I find it really helpful, working
>> > > with web archives.
>> > >
>> > > On the notebook design:
>> > >  - to understand the context of this dataset better - please do some
>> > > research how other people use it. What for, etc.
>> > >    Would be a great material for the blog post
>> > >  - try provide examples of all available formats: WARC, WET, WAT (in
>> > > may be in same or different notebooks, it's up to you)
>> > >  - while using warcbase - mind that RDD persistence will not work
>> > > until [3] is resolved, so avoid using if for now
>> > >
>> > > I understand that this can be a big task, so do not worry if that
>> > > takes time (learning AWS, etc) - just keep us posted on your progress
>> > > weekly and I'll be glad to help!
>> > >
>> > >
>> > >  1.
>> > >
>> >
>> http://zeppelin.apache.org/docs/0.6.0-SNAPSHOT/storage/storage.html#notebook-storage-in-s3
>> > >  2. https://github.com/lintool/warcbase
>> > >  3. https://github.com/lintool/warcbase/issues/227
>> > >
>> > > On Mon, Jul 4, 2016 at 7:00 PM, anish singh <an...@gmail.com>
>> > wrote:
>> > > > Hello,
>> > > >
>> > > > (everything outside Zeppelin)
>> > > > I had started work on the common crawl datasets, and tried to first
>> > have
>> > > a
>> > > > look at only the data for May 2016. Out of the three formats
>> > available, I
>> > > > chose the WET(plain text format). The data only for May is divided
>> into
>> > > > segments and there are 24492 such segments. I downloaded only the
>> first
>> > > > segment for May and got 432MB of data. Now the problem is that my
>> > laptop
>> > > is
>> > > > a very modest machine with core 2 duo processor and 3GB of RAM such
>> > that
>> > > > even opening the downloaded data file in LibreWriter filled the RAM
>> > > > completely and hung the machine and bringing the data directly into
>> > > > zeppelin or analyzing it inside zeppelin seems impossible. As good
>> as I
>> > > > know, there are two ways in which I can proceed :
>> > > >
>> > > > 1) Buying a new laptop with more RAM and processor.   OR
>> > > > 2) Choosing another dataset
>> > > >
>> > > > I have no problem with either of the above ways or anything that you
>> > > might
>> > > > suggest but please let me know which way to proceed so that I may be
>> > able
>> > > > to work in speed. Meanwhile, I will read more papers and
>> publications
>> > on
>> > > > possibilities of analyzing common crawl data.
>> > > >
>> > > > Thanks,
>> > > > Anish.
>> > >
>> >
>>
>
>

Re: [GSoC - 2016][Zeppelin Notebooks] Issues with Common Crawl Datasets

Posted by anish singh <an...@gmail.com>.
Hello,

I was able to set up Zeppelin with Spark on an AWS EC2 m4.xlarge instance
a few days ago. In designing the notebook, I was trying to visualize the
link structure with the following code:

// Imports assumed from warcbase (not shown in the original mail):
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

// Extract (source domain, target domain) pairs, keeping only links whose
// source is one of three sites.
val mayBegLinks = mayBegData.keepValidPages()
  .flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
  .map(r => (ExtractDomain(r._1), ExtractDomain(r._2)))
  .filter(r => r._1.equals("www.fangraphs.com")
            || r._1.equals("www.osnews.com")
            || r._1.equals("www.dailytech.com"))

// Count how often each domain pair occurs and expose the result to %sql.
val linkWtMap = mayBegLinks.map(r => (r, 1)).reduceByKey((x, y) => x + y)
linkWtMap.toDF().registerTempTable("LnkWtTbl")

where 'mayBegData' is about 2GB of WARC data for the first two segments of
May. This paragraph runs smoothly, but in the next paragraph, using %sql
with the following statement:

select W._1 as Links, W._2 as Weight from LnkWtTbl W

I get errors, which are always java.lang.OutOfMemoryError (GC overhead
limit exceeded or Java heap space). The most recent one is the following:

org.apache.thrift.transport.TTransportException
  at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
  at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
  at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
  at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
  at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
  at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:261)
  at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:245)
  at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:312)
  at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
  at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:271)
  at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
  at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)

I just wanted to know: even with an m4.xlarge instance, is it not possible
to process this much (~2GB) data? The above code is relatively simple, I
think. This is restricting the flexibility with which the notebook can be
designed. Please provide some hints/suggestions, since I've been stuck on
this since yesterday.

Thanks,
Anish.


On Tue, Jul 5, 2016 at 12:28 PM, Alexander Bezzubov <bz...@apache.org> wrote:

> That sounds great, Anish!
> Congratulations on getting a new machine.
>
> No worries, please take your time and keep us posted on your exploration!
> Quality is more important than quantity here.
>
> --
> Alex
>
> On Mon, Jul 4, 2016 at 10:40 PM, anish singh <an...@gmail.com> wrote:
>
> > Hello,
> >
> > Thanks Alex, I'm so glad that you helped. Here's update : I've ordered
> new
> > machine with more RAM and processor that should come by tomorrow. I will
> > attempt to use it for the common crawl data and the AWS solution that you
> > provided in the previous mail. I'm presently reading papers and
> > publications regarding analysis of common crawl data. Warcbase tool will
> > definitely be used. I understand that common crawl datasets are important
> > and I will do everything it takes to make notebooks on them, the only
> > tension is that it may take more time than the previous notebooks.
> >
> > Anish.
> >
> > On Mon, Jul 4, 2016 at 6:30 PM, Alexander Bezzubov <bz...@apache.org>
> wrote:
> >
> > > Hi Anish,
> > >
> > > thanks for keeping us posted about a progress!
> > >
> > > CommonCrawl is important dataset and it would be awesome if we could
> > > find a way for you to build some notebooks for it though this this
> > > years GSoC program.
> > >
> > > How about running Zeppelin on a single big enough node in AWS for the
> > > sake of this notebook?
> > > If you use spot instance you could get even big instances for really
> > > affordable price of 2-4$ a day, just need to make sure your persist
> > > notebooks on S3 [1] to avoid loosing the data and shut down it for the
> > > night.
> > >
> > > AFAIK We do not have free any AWS credits for now, even for a GSoC
> > > students. If somebody knows a way to provide\get some - please feel
> > > free to chime in, I know there are some Amazonian people on the list
> > > :)
> > >
> > > But so far AWS spot instances is the most cost-effective solution I
> > > could imagine of. Bonus: if you host your instance in region us-east-1
> > > - transfer from\to S3 will be free, as that's where CommonCrawl
> > > dataset is living.
> > >
> > > One more thing - please check out awesome WarcBase library [2] build
> > > by internet preservation community. I find it really helpful, working
> > > with web archives.
> > >
> > > On the notebook design:
> > >  - to understand the context of this dataset better - please do some
> > > research how other people use it. What for, etc.
> > >    Would be a great material for the blog post
> > >  - try provide examples of all available formats: WARC, WET, WAT (in
> > > may be in same or different notebooks, it's up to you)
> > >  - while using warcbase - mind that RDD persistence will not work
> > > until [3] is resolved, so avoid using if for now
> > >
> > > I understand that this can be a big task, so do not worry if that
> > > takes time (learning AWS, etc) - just keep us posted on your progress
> > > weekly and I'll be glad to help!
> > >
> > >
> > >  1.
> > >
> >
> http://zeppelin.apache.org/docs/0.6.0-SNAPSHOT/storage/storage.html#notebook-storage-in-s3
> > >  2. https://github.com/lintool/warcbase
> > >  3. https://github.com/lintool/warcbase/issues/227
> > >
> > > On Mon, Jul 4, 2016 at 7:00 PM, anish singh <an...@gmail.com>
> > wrote:
> > > > Hello,
> > > >
> > > > (everything outside Zeppelin)
> > > > I had started work on the common crawl datasets, and tried to first
> > have
> > > a
> > > > look at only the data for May 2016. Out of the three formats
> > available, I
> > > > chose the WET(plain text format). The data only for May is divided
> into
> > > > segments and there are 24492 such segments. I downloaded only the
> first
> > > > segment for May and got 432MB of data. Now the problem is that my
> > laptop
> > > is
> > > > a very modest machine with core 2 duo processor and 3GB of RAM such
> > that
> > > > even opening the downloaded data file in LibreWriter filled the RAM
> > > > completely and hung the machine and bringing the data directly into
> > > > zeppelin or analyzing it inside zeppelin seems impossible. As good
> as I
> > > > know, there are two ways in which I can proceed :
> > > >
> > > > 1) Buying a new laptop with more RAM and processor.   OR
> > > > 2) Choosing another dataset
> > > >
> > > > I have no problem with either of the above ways or anything that you
> > > might
> > > > suggest but please let me know which way to proceed so that I may be
> > able
> > > > to work in speed. Meanwhile, I will read more papers and publications
> > on
> > > > possibilities of analyzing common crawl data.
> > > >
> > > > Thanks,
> > > > Anish.
> > >
> >
>

Re: [GSoC - 2016][Zeppelin Notebooks] Issues with Common Crawl Datasets

Posted by Alexander Bezzubov <bz...@apache.org>.
That sounds great, Anish!
Congratulations on getting a new machine.

No worries, please take your time and keep us posted on your exploration!
Quality is more important than quantity here.

--
Alex

On Mon, Jul 4, 2016 at 10:40 PM, anish singh <an...@gmail.com> wrote:

> Hello,
>
> Thanks Alex, I'm so glad that you helped. Here's update : I've ordered new
> machine with more RAM and processor that should come by tomorrow. I will
> attempt to use it for the common crawl data and the AWS solution that you
> provided in the previous mail. I'm presently reading papers and
> publications regarding analysis of common crawl data. Warcbase tool will
> definitely be used. I understand that common crawl datasets are important
> and I will do everything it takes to make notebooks on them, the only
> tension is that it may take more time than the previous notebooks.
>
> Anish.
>
> On Mon, Jul 4, 2016 at 6:30 PM, Alexander Bezzubov <bz...@apache.org> wrote:
>
> > Hi Anish,
> >
> > thanks for keeping us posted about a progress!
> >
> > CommonCrawl is important dataset and it would be awesome if we could
> > find a way for you to build some notebooks for it though this this
> > years GSoC program.
> >
> > How about running Zeppelin on a single big enough node in AWS for the
> > sake of this notebook?
> > If you use spot instance you could get even big instances for really
> > affordable price of 2-4$ a day, just need to make sure your persist
> > notebooks on S3 [1] to avoid loosing the data and shut down it for the
> > night.
> >
> > AFAIK We do not have free any AWS credits for now, even for a GSoC
> > students. If somebody knows a way to provide\get some - please feel
> > free to chime in, I know there are some Amazonian people on the list
> > :)
> >
> > But so far AWS spot instances is the most cost-effective solution I
> > could imagine of. Bonus: if you host your instance in region us-east-1
> > - transfer from\to S3 will be free, as that's where CommonCrawl
> > dataset is living.
> >
> > One more thing - please check out awesome WarcBase library [2] build
> > by internet preservation community. I find it really helpful, working
> > with web archives.
> >
> > On the notebook design:
> >  - to understand the context of this dataset better - please do some
> > research how other people use it. What for, etc.
> >    Would be a great material for the blog post
> >  - try provide examples of all available formats: WARC, WET, WAT (in
> > may be in same or different notebooks, it's up to you)
> >  - while using warcbase - mind that RDD persistence will not work
> > until [3] is resolved, so avoid using if for now
> >
> > I understand that this can be a big task, so do not worry if that
> > takes time (learning AWS, etc) - just keep us posted on your progress
> > weekly and I'll be glad to help!
> >
> >
> >  1.
> >
> http://zeppelin.apache.org/docs/0.6.0-SNAPSHOT/storage/storage.html#notebook-storage-in-s3
> >  2. https://github.com/lintool/warcbase
> >  3. https://github.com/lintool/warcbase/issues/227
> >
> > On Mon, Jul 4, 2016 at 7:00 PM, anish singh <an...@gmail.com>
> wrote:
> > > Hello,
> > >
> > > (everything outside Zeppelin)
> > > I had started work on the common crawl datasets, and tried to first
> have
> > a
> > > look at only the data for May 2016. Out of the three formats
> available, I
> > > chose the WET(plain text format). The data only for May is divided into
> > > segments and there are 24492 such segments. I downloaded only the first
> > > segment for May and got 432MB of data. Now the problem is that my
> laptop
> > is
> > > a very modest machine with core 2 duo processor and 3GB of RAM such
> that
> > > even opening the downloaded data file in LibreWriter filled the RAM
> > > completely and hung the machine and bringing the data directly into
> > > zeppelin or analyzing it inside zeppelin seems impossible. As good as I
> > > know, there are two ways in which I can proceed :
> > >
> > > 1) Buying a new laptop with more RAM and processor.   OR
> > > 2) Choosing another dataset
> > >
> > > I have no problem with either of the above ways or anything that you
> > might
> > > suggest but please let me know which way to proceed so that I may be
> able
> > > to work in speed. Meanwhile, I will read more papers and publications
> on
> > > possibilities of analyzing common crawl data.
> > >
> > > Thanks,
> > > Anish.
> >
>

Re: [GSoC - 2016][Zeppelin Notebooks] Issues with Common Crawl Datasets

Posted by anish singh <an...@gmail.com>.
Hello,

Thanks Alex, I'm so glad that you helped. Here's an update: I've ordered
a new machine with more RAM and a better processor that should arrive by
tomorrow. I will try it on the common crawl data alongside the AWS
approach you suggested in the previous mail. I'm presently reading papers
and publications on the analysis of common crawl data. The warcbase tool
will definitely be used. I understand that the common crawl datasets are
important and I will do everything it takes to make notebooks on them;
my only concern is that it may take more time than the previous notebooks.

Anish.

On Mon, Jul 4, 2016 at 6:30 PM, Alexander Bezzubov <bz...@apache.org> wrote:

> Hi Anish,
>
> thanks for keeping us posted about a progress!
>
> CommonCrawl is important dataset and it would be awesome if we could
> find a way for you to build some notebooks for it though this this
> years GSoC program.
>
> How about running Zeppelin on a single big enough node in AWS for the
> sake of this notebook?
> If you use spot instance you could get even big instances for really
> affordable price of 2-4$ a day, just need to make sure your persist
> notebooks on S3 [1] to avoid loosing the data and shut down it for the
> night.
>
> AFAIK We do not have free any AWS credits for now, even for a GSoC
> students. If somebody knows a way to provide\get some - please feel
> free to chime in, I know there are some Amazonian people on the list
> :)
>
> But so far AWS spot instances is the most cost-effective solution I
> could imagine of. Bonus: if you host your instance in region us-east-1
> - transfer from\to S3 will be free, as that's where CommonCrawl
> dataset is living.
>
> One more thing - please check out awesome WarcBase library [2] build
> by internet preservation community. I find it really helpful, working
> with web archives.
>
> On the notebook design:
>  - to understand the context of this dataset better - please do some
> research how other people use it. What for, etc.
>    Would be a great material for the blog post
>  - try provide examples of all available formats: WARC, WET, WAT (in
> may be in same or different notebooks, it's up to you)
>  - while using warcbase - mind that RDD persistence will not work
> until [3] is resolved, so avoid using if for now
>
> I understand that this can be a big task, so do not worry if that
> takes time (learning AWS, etc) - just keep us posted on your progress
> weekly and I'll be glad to help!
>
>
>  1.
> http://zeppelin.apache.org/docs/0.6.0-SNAPSHOT/storage/storage.html#notebook-storage-in-s3
>  2. https://github.com/lintool/warcbase
>  3. https://github.com/lintool/warcbase/issues/227
>
> On Mon, Jul 4, 2016 at 7:00 PM, anish singh <an...@gmail.com> wrote:
> > Hello,
> >
> > (everything outside Zeppelin)
> > I had started work on the common crawl datasets, and tried to first have
> a
> > look at only the data for May 2016. Out of the three formats available, I
> > chose the WET(plain text format). The data only for May is divided into
> > segments and there are 24492 such segments. I downloaded only the first
> > segment for May and got 432MB of data. Now the problem is that my laptop
> is
> > a very modest machine with core 2 duo processor and 3GB of RAM such that
> > even opening the downloaded data file in LibreWriter filled the RAM
> > completely and hung the machine and bringing the data directly into
> > zeppelin or analyzing it inside zeppelin seems impossible. As good as I
> > know, there are two ways in which I can proceed :
> >
> > 1) Buying a new laptop with more RAM and processor.   OR
> > 2) Choosing another dataset
> >
> > I have no problem with either of the above ways or anything that you
> might
> > suggest but please let me know which way to proceed so that I may be able
> > to work in speed. Meanwhile, I will read more papers and publications on
> > possibilities of analyzing common crawl data.
> >
> > Thanks,
> > Anish.
>

Re: [GSoC - 2016][Zeppelin Notebooks] Issues with Common Crawl Datasets

Posted by Alexander Bezzubov <bz...@apache.org>.
Hi Anish,

thanks for keeping us posted about your progress!

CommonCrawl is an important dataset, and it would be awesome if we could
find a way for you to build some notebooks for it through this year's
GSoC program.

How about running Zeppelin on a single, big enough node in AWS for the
sake of this notebook?
If you use a spot instance you can get even big instances for a really
affordable price of 2-4$ a day; just make sure you persist your notebooks
on S3 [1] to avoid losing the data, and shut the instance down for the
night.

AFAIK we do not have any free AWS credits for now, even for GSoC
students. If somebody knows a way to provide/get some - please feel
free to chime in; I know there are some Amazonian people on the list
:)

But so far AWS spot instances are the most cost-effective solution I
can think of. Bonus: if you host your instance in region us-east-1,
transfer from/to S3 will be free, as that's where the CommonCrawl
dataset lives.

One more thing - please check out the awesome WarcBase library [2]
built by the internet preservation community. I find it really helpful
when working with web archives.

On the notebook design:
 - to understand the context of this dataset better, please do some
   research on how other people use it, what for, etc.
   It would be great material for a blog post
 - try to provide examples of all available formats: WARC, WET, WAT
   (maybe in the same or in different notebooks, it's up to you)
 - while using warcbase, mind that RDD persistence will not work until
   [3] is resolved, so avoid using it for now (a short loading sketch
   follows this list)
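
Just to make the WARC case concrete, something along the following lines
should get you started. I am writing the import paths from memory of the
warcbase Spark tutorials and the S3 path is only a placeholder bucket, so
please double-check both against the warcbase README:

// Illustrative only: package names follow the warcbase Spark examples,
// and the S3 location is a placeholder - verify both before running.
import org.warcbase.spark.matchbox.{ExtractDomain, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

val pages = RecordLoader.loadArchives("s3n://your-bucket/path/*.warc.gz", sc)
val topDomains = pages.keepValidPages()
  .map(r => (ExtractDomain(r.getUrl), 1))
  .reduceByKey((x, y) => x + y)
  .sortBy(r => -r._2)       // most frequently crawled domains first
  .take(10)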

I understand that this can be a big task, so do not worry if it takes
time (learning AWS, etc.) - just keep us posted on your progress weekly
and I'll be glad to help!


 1. http://zeppelin.apache.org/docs/0.6.0-SNAPSHOT/storage/storage.html#notebook-storage-in-s3
 2. https://github.com/lintool/warcbase
 3. https://github.com/lintool/warcbase/issues/227

On Mon, Jul 4, 2016 at 7:00 PM, anish singh <an...@gmail.com> wrote:
> Hello,
>
> (everything outside Zeppelin)
> I had started work on the common crawl datasets, and tried to first have a
> look at only the data for May 2016. Out of the three formats available, I
> chose the WET(plain text format). The data only for May is divided into
> segments and there are 24492 such segments. I downloaded only the first
> segment for May and got 432MB of data. Now the problem is that my laptop is
> a very modest machine with core 2 duo processor and 3GB of RAM such that
> even opening the downloaded data file in LibreWriter filled the RAM
> completely and hung the machine and bringing the data directly into
> zeppelin or analyzing it inside zeppelin seems impossible. As good as I
> know, there are two ways in which I can proceed :
>
> 1) Buying a new laptop with more RAM and processor.   OR
> 2) Choosing another dataset
>
> I have no problem with either of the above ways or anything that you might
> suggest but please let me know which way to proceed so that I may be able
> to work in speed. Meanwhile, I will read more papers and publications on
> possibilities of analyzing common crawl data.
>
> Thanks,
> Anish.