Posted to dev@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2018/08/04 23:49:00 UTC

[jira] [Created] (DRILL-6667) Include internal data sets in Documentation Sample Datasets

Paul Rogers created DRILL-6667:
----------------------------------

             Summary: Include internal data sets in Documentation Sample Datasets
                 Key: DRILL-6667
                 URL: https://issues.apache.org/jira/browse/DRILL-6667
             Project: Apache Drill
          Issue Type: Improvement
          Components: Documentation
    Affects Versions: 1.13.0
            Reporter: Paul Rogers
            Assignee: Bridget Bevens


The Drill documentation provides the "Sample Datasets" section, which is very handy. However, this section does not discuss the two datasets provided with Drill itself.

* Julian Hyde's [FoodMart data set|https://github.com/julianhyde/foodmart-data-hsqldb], available on the class path.
* TPC-H data set.

The "FoodMart" data set is available directly under {{cp}}. In fact, the Drill sample query (see below) references a FoodMart table. To see the list of tables (at development time), find the {{foodmart-data-json-0.4.jar}} file in the Maven dependencies for {{drill-java-exec}}. The table names here are simplified relative to those in the ER diagram in the above link. Perhaps include a simple table of the names with their mapping to the original names, and link to (or just embed) the FoodMart ER image. The data is available in JSON format.

TPC-H data is available in Parquet format as cp.`tpch/*.parquet`. The schema is described in the [TPC-H specification|http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp].
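
As a rough sketch of the kind of example the documentation could show for the TPC-H files: the query below assumes that {{nation.parquet}} is one of the bundled {{tpch/*.parquet}} files and that its columns follow the names in the TPC-H specification. Both are assumptions to verify against the actual class path contents.

{code}
-- Assumption: nation.parquet is among the bundled tpch/*.parquet files.
-- Assumption: column names follow the TPC-H specification (n_nationkey, n_name).
SELECT n_nationkey, n_name
FROM cp.`tpch/nation.parquet`
LIMIT 5
{code}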

Further, in the "Tutorials" section, "Analyzing the Yelp Academic Dataset", we mention the Yelp data set, but we don't mention it in the "Sample Datasets" section. We should, both to be consistent and to save the reader time when going back and saying, "Hey, didn't Drill provide some kind of Yelp data? Let me look in Sample Datasets. Wait... no Yelp?"

These are very handy, but hard to find: I find I must keep searching the source code to remember file names and directory paths. End users won't have this luxury.

Suggestion: Describe the files available in the class path data source.

Along these same lines, in "Connect a Data Source", there is no mention of the class path data source. Yet, we reference that data source in the Web Console where we suggest a sample query to run:

{code}
Sample SQL query: SELECT * FROM cp.`employee.json` LIMIT 20
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)