Posted to issues@hbase.apache.org by GitBox <gi...@apache.org> on 2019/12/11 16:14:02 UTC

[GitHub] [hbase] busbey commented on a change in pull request #785: HBASE-23239 Reporting on status of backing MOB files from client-facing cells

busbey commented on a change in pull request #785: HBASE-23239 Reporting on status of backing MOB files from client-facing cells
URL: https://github.com/apache/hbase/pull/785#discussion_r356692380
 
 

 ##########
 File path: hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mob/mapreduce/MobRefReporter.java
 ##########
 @@ -0,0 +1,509 @@
+/**
+ *
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.mob.mapreduce;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Base64;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hbase.Cell;
+import org.apache.hadoop.hbase.HBaseConfiguration;
+import org.apache.hadoop.hbase.HConstants;
+import org.apache.hadoop.hbase.TableName;
+import org.apache.hadoop.hbase.client.Admin;
+import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
+import org.apache.hadoop.hbase.client.Connection;
+import org.apache.hadoop.hbase.client.ConnectionFactory;
+import org.apache.hadoop.hbase.client.Result;
+import org.apache.hadoop.hbase.client.Scan;
+import org.apache.hadoop.hbase.client.TableDescriptor;
+import org.apache.hadoop.hbase.io.HFileLink;
+import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
+import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
+import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
+import org.apache.hadoop.hbase.mapreduce.TableMapper;
+import org.apache.hadoop.hbase.mob.MobConstants;
+import org.apache.hadoop.hbase.mob.MobUtils;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.apache.hadoop.hbase.util.EnvironmentEdgeManager;
+import org.apache.hadoop.hbase.util.FSUtils;
+import org.apache.hadoop.hbase.util.HFileArchiveUtil;
+import org.apache.hadoop.hbase.util.Pair;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
+import org.apache.hadoop.security.UserGroupInformation;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.apache.yetus.audience.InterfaceAudience;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+
+/**
+ * Scans a given table + CF for all mob reference cells to get the list of backing mob files.
+ * For each referenced file we attempt to verify that said file is on the FileSystem in a place
+ * that the MOB system will look when attempting to resolve the actual value.
+ *
+ * The job includes counters that can help provide a rough sketch of the mob data.
+ *
+ * <pre>
+ * Map-Reduce Framework
+ *         Map input records=10000
+ * ...
+ *         Reduce output records=99
+ * ...
+ * CELLS PER ROW
+ *         Number of rows with 1s of cells per row=10000
+ * MOB
+ *         NUM_CELLS=52364
+ * PROBLEM
+ *         Affected rows=338
+ *         Problem MOB files=2
+ * ROWS WITH PROBLEMS PER FILE
+ *         Number of HFiles with 100s of affected rows=2
+ * SIZES OF CELLS
+ *         Number of cells with size in the 10,000s of bytes=627
+ *         Number of cells with size in the 100,000s of bytes=51392
+ *         Number of cells with size in the 1,000,000s of bytes=345
+ * SIZES OF ROWS
+ *         Number of rows with total size in the 100,000s of bytes=6838
+ *         Number of rows with total size in the 1,000,000s of bytes=3162
+ * </pre>
+ *
+ *   * Map-Reduce Framework:Map input records - the number of rows with mob references
+ *   * Map-Reduce Framework:Reduce output records - the number of unique hfiles referenced
+ *   * MOB:NUM_CELLS - the total number of mob reference cells
+ *   * PROBLEM:Affected rows - the number of rows that reference hfiles with an issue
+ *   * PROBLEM:Problem MOB files - the number of unique hfiles that have an issue
+ *   * CELLS PER ROW: - this counter group gives a histogram of the order of magnitude of the
+ *         number of cells in a given row by grouping by the number of digits used in each count.
+ *         This allows us to see more about the distribution of cells than what we can determine
+ *         with just the cell count and the row count. In this particular example we can see that
+ *         all of our rows have somewhere between 1 - 9 cells.
+ *   * ROWS WITH PROBLEMS PER FILE: - this counter group gives a histogram of the order of
+ *         magnitude of the number of rows in each of the hfiles with a problem. e.g. in the
+ *         example there are 2 hfiles and they each have the same order of magnitude number of rows,
+ *         specifically between 100 and 999.
+ *   * SIZES OF CELLS: - this counter group gives a histogram of the order of magnitude of
+ *         the size of mob values according to our reference cells. e.g. in the example above we
+ *         have cell sizes that are all between 10,000 bytes and 9,999,999 bytes. From this
+ *         histogram we can also see that _most_ cells are 100,000 - 999,999 bytes and the smaller
+ *         and bigger ones are outliers making up less than 2% of mob cells.
+ *   * SIZES OF ROWS: - this counter group gives a histogram of the order of magnitude of the
+ *         size of mob values across each row according to our reference cells. In the example above
+ *         we have rows that are between 100,000 bytes and 9,999,999 bytes. We can also see that
+ *         about two thirds of our rows are 100,000 - 999,999 bytes.
+ *
+ * Generates a report that gives one file status per line, with tabs dividing fields.
+ *
+ * <pre>
+ * RESULT OF LOOKUP	FILE REF	comma separated, base64 encoded rows when there's a problem
+ * </pre>
+ *
+ * e.g.
+ *
+ * <pre>
+ * MOB DIR	09c576e28a65ed2ead0004d192ffaa382019110184b30a1c7e034573bf8580aef8393402
+ * MISSING FILE    28e252d7f013973174750d483d358fa020191101f73536e7133f4cd3ab1065edf588d509        MmJiMjMyYzBiMTNjNzc0OTY1ZWY4NTU4ZjBmYmQ2MTUtNTIz,MmEzOGE0YTkzMTZjNDllNWE4MzM1MTdjNDVkMzEwNzAtODg=
+ * </pre>
+ *
+ * Possible results are listed; the first three indicate things are working properly.
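
The "order of magnitude" counter groups in the javadoc above boil down to bucketing each count or size by its number of digits. A minimal standalone sketch of that idea (the class and method names are illustrative only, not MobRefReporter internals):

```java
// Sketch only: groups a value by its digit count, the way the counter groups
// above describe. Names are illustrative and not part of MobRefReporter.
import java.text.NumberFormat;

public class MagnitudeBuckets {
  static String bucketFor(long value) {
    int digits = String.valueOf(value).length();
    long lowerBound = (long) Math.pow(10, digits - 1); // e.g. 5 digits -> 10,000
    return NumberFormat.getIntegerInstance().format(lowerBound) + "s";
  }

  public static void main(String[] args) {
    // prints labels shaped like the sample counters above
    System.out.println("Number of cells with size in the " + bucketFor(52364) + " of bytes");
    System.out.println("Number of rows with " + bucketFor(7) + " of cells per row");
  }
}
```

That digit-count grouping is why the sample output reports buckets like the 10,000s of bytes rather than exact sizes.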
 
 Review comment:
   The space shouldn't impact how consumable it is; we use tabs to delimit the pieces one needs to pull out.
   
   I'm already using this in production directly with Linux tools without any trouble. I'm writing a Java tool to parse it next and I don't expect to have an issue there either.
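   
   For anyone else consuming the report, here's a rough sketch in plain Java of pulling one line apart, reusing the MISSING FILE example line from the javadoc above (the class name is made up for illustration):
   
   ```java
   // Sketch only: splits one report line on tabs and decodes the base64 row keys
   // that problem lines carry. Not part of the tool itself.
   import java.nio.charset.StandardCharsets;
   import java.util.Base64;
   
   public class MobReportLineParser {
     public static void main(String[] args) {
       String line = "MISSING FILE\t"
           + "28e252d7f013973174750d483d358fa020191101f73536e7133f4cd3ab1065edf588d509\t"
           + "MmJiMjMyYzBiMTNjNzc0OTY1ZWY4NTU4ZjBmYmQ2MTUtNTIz,"
           + "MmEzOGE0YTkzMTZjNDllNWE4MzM1MTdjNDVkMzEwNzAtODg=";
   
       // tabs delimit the fields: lookup result, file ref, and (only on problem
       // lines) the comma separated, base64 encoded row keys
       String[] fields = line.split("\t");
       System.out.println(fields[0] + " -> " + fields[1]);
   
       if (fields.length > 2) {
         for (String encodedRow : fields[2].split(",")) {
           byte[] row = Base64.getDecoder().decode(encodedRow);
           // UTF-8 keeps the example self-contained; row keys are arbitrary bytes
           System.out.println("  affected row: " + new String(row, StandardCharsets.UTF_8));
         }
       }
     }
   }
   ```
   
   In a real consumer I'd render the decoded keys with something like Bytes.toStringBinary instead, since row keys aren't guaranteed to be printable.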

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services