Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2019/11/17 14:11:30 UTC

[GitHub] [drill] cgivre commented on issue #1778: Drill-7233: Format Plugin for HDF5

URL: https://github.com/apache/drill/pull/1778#issuecomment-554749532
 
 
   @Arina, I'm still doing some code cleanup... I'll finish by tomorrow.
   
   > On Nov 17, 2019, at 9:10 AM, Arina Ielchiieva <no...@github.com> wrote:
   > 
   > @arina-ielchiieva commented on this pull request.
   > 
    > @cgivre <https://github.com/cgivre> in terms of Drill hygiene the PR is not ready. Some of @paul-rogers' <https://github.com/paul-rogers> comments are not addressed, including a new one from me. Too many test files is also an issue. So far the PR is not ready to be committed.
   > 
   > In contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5Column.java <https://github.com/apache/drill/pull/1778#discussion_r347138108>:
   > 
   > > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   > + * See the License for the specific language governing permissions and
   > + * limitations under the License.
   > + */
   > +
   > +package org.apache.drill.exec.store.hdf5;
   > +
   > +import ch.systemsx.cisd.hdf5.HDF5DataSetInformation;
   > +import ch.systemsx.cisd.hdf5.HDF5DataTypeInformation;
   > +
   > +/*
   > +This class is used in Drill to build a HashMap of the available columns
   > + */
   > +public class HDF5Column {
   > +
   > +  private String columnName;
   > Please make fields final...
   > 
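A minimal sketch of what the reviewer is asking for, using a trimmed-down, hypothetical version of HDF5Column (only the columnName field appears in the diff above; the constructor and getter are assumptions):

```java
public class HDF5Column {
  // Declared final: assigned exactly once, in the constructor,
  // so the field cannot be reassigned later.
  private final String columnName;

  public HDF5Column(String columnName) {
    this.columnName = columnName;
  }

  public String getColumnName() {
    return columnName;
  }
}
```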
   > In contrib/format-hdf5/pom.xml <https://github.com/apache/drill/pull/1778#discussion_r347138260>:
   > 
   > > +  <repositories>
   > +    <repository>
   > +      <id>jhdf5-repo</id>
   > +      <name>ImageJ Repository</name>
   > +      <url>https://maven.imagej.net/content/repositories/public/</url>
   > +    </repository>
   > +  </repositories>
   > +
   > +  <dependencies>
   > +    <dependency>
   > +      <groupId>org.apache.drill.exec</groupId>
   > +      <artifactId>drill-java-exec</artifactId>
   > +      <version>${project.version}</version>
   > +    </dependency>
   > +    <dependency>
   > +      <groupId>cisd</groupId>
    > Which license is this lib using? Is it approved by Apache?
   > 
   > In contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java <https://github.com/apache/drill/pull/1778#discussion_r347138289>:
   > 
   > > +
   > +  private static final String FLOAT_COLUMN_NAME = "float_data";
   > +
   > +  private static final String DOUBLE_COLUMN_NAME = "double_data";
   > +
   > +  private static final String LONG_COLUMN_NAME = "long_data";
   > +
   > +  private static final String STRING_COLUMN_NAME = "string_data";
   > +
   > +  private static final String BOOLEAN_COLUMN_NAME = "boolean_data";
   > +
   > +  private static final String PATH_PATTERN_REGEX = "/*.*/(.+?)$";
   > +
   > +  private static final Pattern PATH_PATTERN = Pattern.compile(PATH_PATTERN_REGEX);
   > +
   > +  private FileSplit split;
   > Some of the fields can be final...
   > 
   > In contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java <https://github.com/apache/drill/pull/1778#discussion_r347138316>:
   > 
   > > +
   > +    //Write attributes if present
   > +    if (metadataRow.getAttributes().size() > 0) {
   > +      writeAttributes(rowWriter, metadataRow);
   > +    }
   > +
   > +    if (metadataRow.getDataType().equalsIgnoreCase("DATASET")) {
   > +      projectDataset(rowWriter, metadataRow.getPath());
   > +    }
   > +    rowWriter.save();
   > +    return !metadataIterator.hasNext();
   > +  }
   > +
   > +  /**
   > +   * This helper function returns the name of a HDF5 record from a data path
   > +   *
    > Please provide an example.
   > 
   > In contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java <https://github.com/apache/drill/pull/1778#discussion_r347138328>:
   > 
   > > +   * @return String name of data
   > +   */
   > +  private String getNameFromPath(String path) {
   > +    // Now create matcher object.
   > +    Matcher m = PATH_PATTERN.matcher(path);
   > +    if (m.find()) {
   > +      return m.group(1);
   > +    } else {
   > +      return "";
   > +    }
   > +  }
   > +
   > +  /**
   > +   * This function gets the file metadata from a given HDF5 file.  It will extract the file name the path, and adds any information to the
   > +   * metadata List.
   > +   * @param members
    > Add a description to avoid warnings in the IDE.
   > 
   > In contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java <https://github.com/apache/drill/pull/1778#discussion_r347138342>:
   > 
   > > +   *
   > +   * @param path The path for which you wish to retrieve attributes
   > +   * @return HashMap The attributes for the given path.  Empty HashMap if no attributes present
   > +   */
   > +  private HashMap getAttributes(String path) {
   > +    HashMap<String, HDF5Attribute> attributes = new HashMap<>();
   > +
   > +    long attrCount = hdf5Reader.object().getObjectInformation(path).getNumberOfAttributes();
   > +    if (attrCount > 0) {
   > +      List<String> attrNames = hdf5Reader.object().getAllAttributeNames(path);
   > +      for (String name : attrNames) {
   > +        try {
   > +          HDF5Attribute attribute = HDF5Utils.getAttribute(path, name, hdf5Reader);
   > +          attributes.put(attribute.getKey(), attribute);
   > +        } catch (Exception e) {
   > +          logger.info("Couldn't add attribute: " + path + " " + name);
   > ⬇️ Suggested change
   > -          logger.info("Couldn't add attribute: " + path + " " + name);
   > +          logger.info("Couldn't add attribute: {} {}", path, name);
   > In contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5DrillMetadata.java <https://github.com/apache/drill/pull/1778#discussion_r347138397>:
   > 
   > > + * distributed under the License is distributed on an "AS IS" BASIS,
   > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   > + * See the License for the specific language governing permissions and
   > + * limitations under the License.
   > + */
   > +
   > +package org.apache.drill.exec.store.hdf5;
   > +
   > +import java.util.HashMap;
   > +
   > +public class HDF5DrillMetadata {
   > +  private String path;
   > +
   > +  private String dataType;
   > +
   > +  private HashMap attributes;
   > Map? Generics?
   > 
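What the reviewer is suggesting, as a sketch: declare the field against the Map interface with explicit type parameters instead of a raw HashMap. The value type in the PR is HDF5Attribute; String is used here only to keep the sketch self-contained.

```java
import java.util.HashMap;
import java.util.Map;

public class HDF5DrillMetadataSketch {
  // Interface type on the left, explicit generics, no raw HashMap.
  private final Map<String, String> attributes = new HashMap<>();

  public void setAttribute(String key, String value) {
    attributes.put(key, value);
  }

  public String getAttribute(String key) {
    return attributes.get(key);
  }
}
```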
   > In contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5FormatPlugin.java <https://github.com/apache/drill/pull/1778#discussion_r347138419>:
   > 
   > > +import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader;
   > +import org.apache.drill.exec.proto.UserBitShared;
   > +import org.apache.drill.exec.server.DrillbitContext;
   > +import org.apache.drill.exec.server.options.OptionManager;
   > +import org.apache.drill.exec.store.dfs.easy.EasySubScan;
   > +import org.apache.hadoop.conf.Configuration;
   > +import org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin;
   > +import org.apache.drill.exec.store.hdf5.HDF5BatchReader.HDF5ReaderConfig;
   > +
   > +public class HDF5FormatPlugin extends EasyFormatPlugin<HDF5FormatConfig> {
   > +
   > +  protected static final String DEFAULT_NAME = "hdf5";
   > +
   > +  private HDF5FormatConfig config;
   > +
   > +  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(HDF5FormatPlugin.class);
   > Imports
   > 
   > In contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5Utils.java <https://github.com/apache/drill/pull/1778#discussion_r347138427>:
   > 
   > > +import ch.systemsx.cisd.hdf5.HDF5EnumerationValue;
   > +import ch.systemsx.cisd.hdf5.IHDF5Reader;
   > +import ch.systemsx.cisd.hdf5.HDF5DataSetInformation;
   > +import ch.systemsx.cisd.hdf5.HDF5DataTypeInformation;
   > +import org.apache.commons.lang3.ArrayUtils;
   > +import org.apache.drill.common.types.TypeProtos.MinorType;
   > +
   > +import java.io.IOException;
   > +import java.util.BitSet;
   > +import java.util.List;
   > +import java.util.Map;
   > +import java.util.regex.Matcher;
   > +import java.util.regex.Pattern;
   > +
   > +public class HDF5Utils {
   > +  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(HDF5Utils.class);
   > Imports
   > 
   > In contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5Utils.java <https://github.com/apache/drill/pull/1778#discussion_r347138433>:
   > 
   > > +import java.io.IOException;
   > +import java.util.BitSet;
   > +import java.util.List;
   > +import java.util.Map;
   > +import java.util.regex.Matcher;
   > +import java.util.regex.Pattern;
   > +
   > +public class HDF5Utils {
   > +  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(HDF5Utils.class);
   > +
   > +  private static final boolean caseSensitive = false;
   > +
   > +  /**
   > +   * This function returns and HDF5Attribute object for use when Drill maps the attributes.
   > +   *
   > +   * @param pathName
   > Description
   > 
   > In contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5Utils.java <https://github.com/apache/drill/pull/1778#discussion_r347138454>:
   > 
   > > +        }
   > +      }
   > +      if (path.equalsIgnoreCase(resolvedPath)) {
   > +        return resolvedPath;
   > +      }
   > +    }
   > +    return null;
   > +  }
   > +
   > +  /**
   > +   * This helper function returns the name of a HDF5 record from a data path
   > +   *
   > +   * @param path Path to HDF5 data
   > +   * @return String name of data
   > +   */
   > +  public static String getNameFromPath(String path) {
    > I think you already have a similar method; extract it to a util class to use in both places.
   > 
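The extracted util class might look like the following sketch. The regex is the PATH_PATTERN quoted from HDF5BatchReader earlier in this review; the class and method names are placeholders, and the null check is an assumption based on the assertNull case in TestHDF5Utils quoted below.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class HDF5PathUtils {
  // Same pattern as PATH_PATTERN in HDF5BatchReader:
  // captures the last segment of the path after the final '/'.
  private static final Pattern PATH_PATTERN = Pattern.compile("/*.*/(.+?)$");

  private HDF5PathUtils() {
  }

  /**
   * Returns the name of an HDF5 record from a data path,
   * e.g. "/group1/group2/group3" yields "group3".
   */
  public static String getNameFromPath(String path) {
    if (path == null) {
      return null;
    }
    Matcher m = PATH_PATTERN.matcher(path);
    return m.find() ? m.group(1) : "";
  }
}
```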
   > In contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/writers/HDF5DataWriter.java <https://github.com/apache/drill/pull/1778#discussion_r347138469>:
   > 
   > > + * See the License for the specific language governing permissions and
   > + * limitations under the License.
   > + */
   > +
   > +package org.apache.drill.exec.store.hdf5.writers;
   > +
   > +import ch.systemsx.cisd.hdf5.IHDF5Reader;
   > +import org.apache.drill.exec.physical.resultSet.RowSetLoader;
   > +
   > +import java.util.ArrayList;
   > +import java.util.List;
   > +
   > +public class HDF5DataWriter {
   > +  final RowSetLoader columnWriter;
   > +
   > +  final IHDF5Reader reader;
    > Why are the fields not private?
   > 
   > In contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/writers/HDF5DataWriter.java <https://github.com/apache/drill/pull/1778#discussion_r347138489>:
   > 
   > > +    return false;
   > +  }
   > +
   > +  public int currentRowCount() {
   > +    return counter;
   > +  }
   > +
   > +  public List<Object> getColumn(int columnIndex) {
   > +    ArrayList<Object> result = new ArrayList<Object>();
   > +    for (Object[] compoundDatum : compoundData) {
   > +      result.add(compoundDatum[columnIndex]);
   > +    }
   > +    return result;
   > +  }
   > +
   > +  public int getDatasize() {
    > Why is the data size always 0?
   > 
   > In contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/writers/HDF5MapDataWriter.java <https://github.com/apache/drill/pull/1778#discussion_r347138515>:
   > 
   > > +import org.apache.drill.common.exceptions.UserException;
   > +import org.apache.drill.exec.physical.resultSet.RowSetLoader;
   > +import org.slf4j.Logger;
   > +import org.slf4j.LoggerFactory;
   > +
   > +import java.util.ArrayList;
   > +import java.util.List;
   > +
   > +public class HDF5MapDataWriter extends HDF5DataWriter {
   > +  private static final Logger logger = LoggerFactory.getLogger(HDF5MapDataWriter.class);
   > +
   > +  private final String UNSAFE_SPACE_SEPARATOR = " ";
   > +
   > +  private final String SAFE_SPACE_SEPARATOR = "_";
   > +
   > +  private ArrayList<HDF5DataWriter> dataWriters;
   > List
   > 
   > In contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/writers/HDF5MapDataWriter.java <https://github.com/apache/drill/pull/1778#discussion_r347138520>:
   > 
   > > +import org.slf4j.Logger;
   > +import org.slf4j.LoggerFactory;
   > +
   > +import java.util.ArrayList;
   > +import java.util.List;
   > +
   > +public class HDF5MapDataWriter extends HDF5DataWriter {
   > +  private static final Logger logger = LoggerFactory.getLogger(HDF5MapDataWriter.class);
   > +
   > +  private final String UNSAFE_SPACE_SEPARATOR = " ";
   > +
   > +  private final String SAFE_SPACE_SEPARATOR = "_";
   > +
   > +  private ArrayList<HDF5DataWriter> dataWriters;
   > +
   > +  private ArrayList<String> fieldNames;
   > List
   > 
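The "List" comments above ask for the fields to be typed against the List interface rather than the concrete ArrayList, roughly as in this hypothetical sketch (class and method names are placeholders):

```java
import java.util.ArrayList;
import java.util.List;

public class WriterFieldsSketch {
  // Field typed as List; ArrayList stays an implementation detail
  // that can be swapped without touching callers.
  private final List<String> fieldNames = new ArrayList<>();

  public void addField(String name) {
    fieldNames.add(name);
  }

  public List<String> fieldNames() {
    return fieldNames;
  }
}
```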
   > In contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/writers/HDF5MapDataWriter.java <https://github.com/apache/drill/pull/1778#discussion_r347138533>:
   > 
   > > +  private ArrayList<HDF5DataWriter> dataWriters;
   > +
   > +  private ArrayList<String> fieldNames;
   > +
   > +
   > +  public HDF5MapDataWriter(IHDF5Reader reader, RowSetLoader columnWriter, String datapath) {
   > +    super(reader, columnWriter, datapath);
   > +    fieldNames = new ArrayList<>();
   > +
   > +    compoundData = reader.compounds().readArray(datapath, Object[].class);
   > +    dataWriters = new ArrayList<>();
   > +    fieldNames = getFieldNames();
   > +    try {
   > +      getDataWriters();
   > +    } catch (Exception e) {
   > +      throw UserException.dataReadError().addContext("Error writing Compound Field: ").addContext(e.getMessage()).build(logger);
    > Put chained calls on new lines.
   > 
   > In contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/writers/HDF5TimestampDataWriter.java <https://github.com/apache/drill/pull/1778#discussion_r347138565>:
   > 
   > > + */
   > +
   > +package org.apache.drill.exec.store.hdf5.writers;
   > +
   > +import ch.systemsx.cisd.hdf5.IHDF5Reader;
   > +import org.apache.drill.common.types.TypeProtos;
   > +import org.apache.drill.exec.physical.resultSet.RowSetLoader;
   > +import org.apache.drill.exec.record.metadata.ColumnMetadata;
   > +import org.apache.drill.exec.record.metadata.MetadataUtils;
   > +import org.apache.drill.exec.store.hdf5.HDF5Utils;
   > +import org.apache.drill.exec.vector.accessor.ScalarWriter;
   > +import org.joda.time.Instant;
   > +
   > +public class HDF5TimestampDataWriter extends HDF5DataWriter {
   > +
   > +  private long[] data;
   > final
   > 
   > In contrib/format-hdf5/src/main/resources/bootstrap-format-plugins.json <https://github.com/apache/drill/pull/1778#discussion_r347138571>:
   > 
   > > +        }
   > +      }
   > +    },
   > +    "s3": {
   > +      "type": "file",
   > +      "formats": {
   > +        "hdf5": {
   > +          "type": "hdf5",
   > +          "extensions": [
   > +            "h5"
   > +          ]
   > +        }
   > +      }
   > +    }
   > +  }
   > +}
   > New line
   > 
   > In contrib/format-hdf5/src/test/java/org/apache/drill/exec/store/hdf5/TestHDF5Utils.java <https://github.com/apache/drill/pull/1778#discussion_r347138681>:
   > 
   > > +  @Test
   > +  public void testGetNameFromPath() {
   > +    String path1 = "/group1";
   > +    String path2 = "/group1/group2/group3";
   > +    String emptyPath = "";
   > +    String nullPath = null;
   > +    String rootPath = "/";
   > +
   > +    assertEquals(HDF5Utils.getNameFromPath(path1), "group1");
   > +    assertEquals(HDF5Utils.getNameFromPath(path2), "group3");
   > +    assertEquals(HDF5Utils.getNameFromPath(emptyPath), "");
   > +    assertNull(HDF5Utils.getNameFromPath(nullPath));
   > +    assertEquals(HDF5Utils.getNameFromPath(rootPath), "");
   > +  }
   > +
   > +}
    > I think you are adding too many test files and some of them are 1 MB large; please reduce the number of files as well as their size.
   > 
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services