Posted to user@accumulo.apache.org by Arvind Shyamsundar <ar...@microsoft.com> on 2020/04/03 16:54:35 UTC
RE: [EXTERNAL] Re: Accumulo on S3

Hi Josh - I do have a recording of your talk from Nov 12, 2019. Let me work separately with Marc Parisi and you on an appropriate way to share it broadly, and then we can update this thread.

Thanks.

Arvind Shyamsundar

-----Original Message-----
From: Josh Elser <el...@apache.org> 
Sent: Friday, April 3, 2020 9:10 AM
To: user@accumulo.apache.org
Subject: [EXTERNAL] Re: Accumulo on S3

It sounds like you're running into the known S3 consistency issues. However, I don't know whether EMRFS is supposed to support all of the things that Accumulo requires. I would assume that EMRFS should be bridging the gap from S3 (a blobstore) to the consistent, distributed FileSystem that Accumulo requires. Their summary[1] indicates that consistent listings and read-after-write are solved, which addresses a big problem. I'm not sure whether you are also supposed to get atomic rename from it.
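
To illustrate why atomic rename is a separate concern from consistent listings, here is a toy sketch. It assumes a flat key/value blob store with no native rename, so a "rename" has to be emulated as a copy followed by a delete. The class and function names (ToyBlobStore, rename_steps, observe_rename) and the key names are hypothetical and are not the S3 or EMRFS API; this only models the general idea, not how EMRFS actually behaves.

```python
class ToyBlobStore:
    """In-memory stand-in for a flat blob store (hypothetical, not the S3 API)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def exists(self, key):
        return key in self._objects

    def rename_steps(self, src, dst):
        """Emulate rename as copy-then-delete, pausing after each step."""
        if src not in self._objects:
            raise FileNotFoundError(src)
        self._objects[dst] = self._objects[src]  # step 1: copy to the new key
        yield "copied"
        del self._objects[src]                   # step 2: delete the old key
        yield "deleted"


def observe_rename(store, src, dst):
    """Record what a concurrent reader could see after each step."""
    seen = []
    for step in store.rename_steps(src, dst):
        seen.append((step, store.exists(src), store.exists(dst)))
    return seen


store = ToyBlobStore()
store.put("tables/a/A1.rf_tmp", b"rfile bytes")
states = observe_rename(store, "tables/a/A1.rf_tmp", "tables/a/A1.rf")
for step, src_visible, dst_visible in states:
    print(step, "src visible:", src_visible, "dst visible:", dst_visible)
```

After the copy step both keys are visible, so a reader (or a crash at that point) can observe a state that a true atomic rename would never expose. On a filesystem with atomic rename, the file is seen under exactly one name at any moment.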

This presentation[2], which I put together earlier this year on cloud storage for BigTables, should be a good primer and may help you understand what's going on. I gave it at a meetup here in MD a couple of months back, but I don't think we were recording it.

[1] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html
[2] https://drive.google.com/file/d/1Or1s-X0JjiLM87HKIOWlh3WlkdUQfYH9/view?usp=sharing

On 4/2/20 3:56 PM, Kevin Hobbs wrote:
> Accumulo Users,
> 
> Is AWS EMR's "EMRFS consistent view" useful or required for Accumulo2 
> on S3? Has anyone else tried EMR + Accumulo2 on S3?
> 
> I have incorporated *most* of the steps in the blog post
> 
> https://accumulo.apache.org/blog/2019/09/10/accumulo-S3-notes.html
> 
> into an AWS EMR bootstrap action that creates an Accumulo cluster 
> running on emr-6.0.0-beta2. I have not used the hadoop-aws-relocated 
> jar, as the EMR jars are available.
> 
> I am able to use a GeoMesa snapshot to ingest and retrieve data on the
> s3 volume. However, I just tried an ingest of about 10GB, which 
> progressed smoothly for a while until the master's web UI reported 
> "MajC Failed, extent = a<;":
> 
> java.io.IOException: Rename s3://THEBUCKET/accumulo/tables/a/default_tablet/A00000ci.rf_tmp to s3://THEBUCKET/accumulo/tables/a/default_tablet/A00000ci.rf returned false
>      at org.apache.accumulo.tserver.tablet.DatafileManager.rename(DatafileManager.java:85)
>      at org.apache.accumulo.tserver.tablet.DatafileManager.bringMajorCompactionOnline(DatafileManager.java:533)
>      at org.apache.accumulo.tserver.tablet.Tablet._majorCompact(Tablet.java:2051)
>      at org.apache.accumulo.tserver.tablet.Tablet.majorCompact(Tablet.java:2164)
>      at org.apache.accumulo.tserver.tablet.CompactionRunner.run(CompactionRunner.java:37)
>      at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
>      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>      at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>      at java.lang.Thread.run(Thread.java:748)
> 
> 
> A bit later it reported:
> 
> java.io.FileNotFoundException: No such file or directory 's3://THEBUCKET/accumulo/tables/c/t-0000090/F00000nz.rf'
>      at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:808)
>      at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:1212)
>      at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:902)
>      at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:207)
>      at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$CachableBuilder.lambda$fsPath$0(CachableBlockFile.java:91)
>      at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBCFile(CachableBlockFile.java:172)
>      at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:400)
>      at org.apache.accumulo.core.file.rfile.RFile$Reader.<init>(RFile.java:1156)
>      at org.apache.accumulo.core.file.rfile.RFile$Reader.<init>(RFile.java:1251)
>      at org.apache.accumulo.core.file.rfile.RFileOperations.getReader(RFileOperations.java:53)
>      at org.apache.accumulo.core.file.rfile.RFileOperations.openReader(RFileOperations.java:68)
>      at org.apache.accumulo.core.file.DispatchingFileFactory.openReader(DispatchingFileFactory.java:83)
>      at org.apache.accumulo.core.file.FileOperations$ReaderBuilder.build(FileOperations.java:478)
>      at org.apache.accumulo.tserver.tablet.Compactor.openMapDataFiles(Compactor.java:299)
>      at org.apache.accumulo.tserver.tablet.Compactor.compactLocalityGroup(Compactor.java:344)
>      at org.apache.accumulo.tserver.tablet.Compactor.call(Compactor.java:225)
>      at org.apache.accumulo.tserver.tablet.Tablet._majorCompact(Tablet.java:2039)
>      at org.apache.accumulo.tserver.tablet.Tablet.majorCompact(Tablet.java:2164)
>      at org.apache.accumulo.tserver.tablet.CompactionRunner.run(CompactionRunner.java:37)
>      at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
>      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>      at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>      at java.lang.Thread.run(Thread.java:748)
> 
> 
> These seem like the same sort of problems HBase on EMR can have when 
> EMRFS isn't functioning properly.
> 
> --Kevin
> 
> On 3/3/20 1:57 PM, Jim Hughes wrote:
>> Hi all,
>>
>> The next major release of GeoMesa is aimed at supporting Accumulo 2.x. 
>> As part of testing, my coworker Kevin and I are trying out Accumulo
>> 2.0 on S3.
>>
>> Keith's blog post[1] is great.  As people have tested Accumulo 2.0 in 
>> AWS, has anyone tried using EMR for the underlying HDFS cluster (and 
>> then installing Accumulo via bootstrap actions)?  Is there a 
>> preferred/suggested deployment strategy?
>>
>> Cheers,
>>
>> Jim
>>
>> 1. https://accumulo.apache.org/blog/2019/09/10/accumulo-S3-notes.html
>>