You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by "vtlim (via GitHub)" <gi...@apache.org> on 2023/02/10 21:14:45 UTC

[GitHub] [druid] vtlim commented on a diff in pull request #13485: adds a Jupyter visualization tutorial

vtlim commented on code in PR #13485:
URL: https://github.com/apache/druid/pull/13485#discussion_r1103252163


##########
examples/quickstart/jupyter-notebooks/druid-visualization.ipynb:
##########
@@ -0,0 +1,610 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ad4e60b6",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "# Tutorial: Basic visualizations using the Druid API\n",
+    "\n",
+    "<!--\n",
+    "  ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+    "  ~ or more contributor license agreements.  See the NOTICE file\n",
+    "  ~ distributed with this work for additional information\n",
+    "  ~ regarding copyright ownership.  The ASF licenses this file\n",
+    "  ~ to you under the Apache License, Version 2.0 (the\n",
+    "  ~ \"License\"); you may not use this file except in compliance\n",
+    "  ~ with the License.  You may obtain a copy of the License at\n",
+    "  ~\n",
+    "  ~   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "  ~\n",
+    "  ~ Unless required by applicable law or agreed to in writing,\n",
+    "  ~ software distributed under the License is distributed on an\n",
+    "  ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "  ~ KIND, either express or implied.  See the License for the\n",
+    "  ~ specific language governing permissions and limitations\n",
+    "  ~ under the License.\n",
+    "  -->\n",
+    "  \n",
+    "This tutorial introduces basic visualization options you can use with the Druid API. It focuses on two Python modules: [pandas](https://pandas.pydata.org/) and [Bokeh](https://bokeh.org/). This tutorial builds on [Learn the basics of the Druid API](api-tutorial.ipynb).\n",
+    "\n",
+    "\n",
+    "## Table of contents\n",
+    "\n",
+    "- [Prerequisites](#Prerequisites)\n",
+    "- [Display data in a DataFrame](#Display-data-in-a-DataFrame)\n",
+    "- [Display data with a bar graph](#Display-data-with-a-bar-graph)\n",
+    "- [Display data with a line graph](#Display-data-with-a-line-graph)\n",
+    "- [Next steps](#Next-steps)\n",
+    "\n",
+    "For the best experience, use JupyterLab so that you can always access the table of contents."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8d6bbbcb",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "If you haven't already, you'll need install the Requests library for Python before you start. For example:\n",
+    "\n",
+    "```bash\n",
+    "pip3 install requests\n",
+    "```\n",
+    "\n",
+    "Additionally, install the pandas and Bokeh libraries. For example:\n",
+    "```bash\n",
+    "pip3 install pandas\n",
+    "pip3 install bokeh\n",
+    "```\n",
+    "\n",
+    "Next, you'll need a Druid cluster with data. This tutorial uses the `wikipedia_api` datasource from [Learn the basics of the Druid API](api-tutorial.ipynb).\n",
+    "\n",
+    "\n",
+    "Finally, you'll need either JupyterLab (recommended) or Jupyter Notebook. Both the quickstart Druid cluster and Jupyter are deployed at `localhost:8888` by default, so you'll \n",
+    "need to change the port for Jupyter. To do so, stop Jupyter and start it again with the `port` parameter included. For example, you can use the following command to start Jupyter on port `3001`:\n",
+    "\n",
+    "```bash\n",
+    "# If you're using JupyterLab\n",
+    "jupyter lab --port 3001\n",
+    "# If you're using Jupyter notebook\n",
+    "jupyter notebook --port 3001 \n",
+    "```\n",
+    "\n",
+    "To start this tutorial, run the next cell. It imports the Python packages you'll need and defines the Druid host where the Druid Router service listens. The quickstart deployment configures the to listen on port `8888` by default, so you'll be making API calls against `http://localhost:8888`. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b7f08a52",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests, json\n",
+    "import pandas as pd\n",
+    "from bokeh.palettes import Spectral10\n",
+    "from bokeh.io import output_notebook, show\n",
+    "from bokeh.plotting import figure\n",
+    "\n",
+    "# druid_host is the hostname and port for your Druid deployment.\n",
+    "# In a distributed environment, use the Router service as the `druid_host`.\n",
+    "druid_host = \"http://localhost:8888\"\n",
+    "dataSourceName = \"wikipedia_api\"\n",
+    "\n",
+    "# Set basic output formatting.\n",
+    "bold = '\\033[1m'\n",
+    "standard = '\\033[0m'\n",
+    "print(f'{bold}Druid host{standard}: {druid_host}')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2093ecf0-fb4b-405b-a216-094583580e0a",
+   "metadata": {},
+   "source": [
+    "In the rest of this tutorial, the `endpoint`, `http_method`, and `payload` variables are updated in code cells to call a different Druid endpoint to accomplish a task."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3b55af57-9c79-4e45-a22c-438c1b94112e",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Display data with Pandas\n",
+    "\n",
+    "By default, when you query Druid using the API, Druid returns the results as JSON. The JSON output is great for programmatic operations, but it is not easy to scan the data visually. This section shows you how to use the Python pandas module to transform JSON query results to tabular format. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "622f2158-75c9-4b12-bd8a-c92d30994c1f",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "The following cell selects the top 10 channels by the number of additions and outputs the data as JSON."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b7dd08f7-cba4-4f14-acd3-b5384dae7d88",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "endpoint = \"/druid/v2/sql\"\n",
+    "print(f'{bold}Query endpoint{standard}: {druid_host}{endpoint}')\n",
+    "\n",
+    "http_method = \"POST\"\n",
+    "query = f'SELECT channel, SUM(added) AS additions FROM {dataSourceName} GROUP BY channel ORDER BY additions DESC LIMIT 10'\n",
+    "\n",
+    "payload = json.dumps({\n",
+    "  \"query\" : query\n",
+    "})\n",
+    "headers = {'Content-Type': 'application/json'}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n",
+    "\n",
+    "print(f'{bold}Query{standard}: {query}')\n",
+    "print(f'{bold}Query results{standard}:')\n",
+    "print(json.dumps(response.json(), indent=4))\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "950b2cc4-9935-497d-a3f5-e89afcc85965",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "The following cell takes the JSON results from the Druid API and uses the pandas [`read_json`](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html) method to convert the results to a pandas DataFrame object.\n",
+    "You can display the pandas ojbect in a tabular format.\n",
+    "The `orient` parameter sets pandas to accept data as a list of records in the format: `[{column: value, ...}, {column: value, ...}]`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c3860d64-fba6-43bc-80e2-404f5b3b9baa",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "additions_pd = pd.read_json(json.dumps(response.json()), orient='records')\n",
+    "\n",
+    "additions_pd"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9488cbee-63bf-414a-aff3-207adf39e115",
+   "metadata": {},
+   "source": [
+    "One benefit of using a DataFrame is that you can access all the pandas `DataFrame` object methods. For example, `DataFrame.max()`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4d0a4ada-c9cf-4eee-a55f-b524d2dfad64",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "additions_pd.max()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a1720a88-cf1b-4022-981a-69441fac54fd",
+   "metadata": {},
+   "source": [
+    "Check out the [pandas docs](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html) for more ideas about how to use pandas and DataFrames \n",
+    "with your Druid data."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "07a16047-dba1-4f45-9fcb-6130c37eca61",
+   "metadata": {},
+   "source": [
+    "## Display data with a bar graph\n",
+    "\n",
+    "Tabular format is OK for scanning data; however, you can use visualization and plotting tools that work with Jupyter Notebooks to visualize data as plots or graphs. This section uses [Bokeh](https://bokeh.org/) to illustrate some basic plots using Druid data.\n",
+    "\n",
+    "In this section, you create a simple bar chart showing the channels with the most additions during the time range."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8420df27-0e80-48d1-a806-43f6bf81dd16",
+   "metadata": {},
+   "source": [
+    "First, call `output_notebook()` to set Bokeh to output plots inline within the notebook. This lets you view your plot inline. You can also use Bokeh's `output_file` function to write your plot to an HTML file. If you want to experiment with `output_file`, add it to the list of imports. For example:\n",
+    "```\n",
+    "from bokeh.io import output_notebook, show, output_file\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d824a82f-9e71-43c4-b289-78d4970c5cfc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "output_notebook()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5859d743-a5a7-445c-a06f-89b8c30856ed",
+   "metadata": {},
+   "source": [
+    "There are several ways to use Bokeh with a DataFrame. In this case, make a list of channels to serve as the x-axis of our plot. For the y-axis,  divide the total additions by 100000 for ease of display."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "70b40bfc-106d-462c-a5a1-51d0da4c4498",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "channels = additions_pd.channel.to_list()\n",
+    "total_additions = [x / 100000 for x in additions_pd.additions.to_list()]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "efc5f142-8b68-4430-a666-2ed961d732b9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "channels"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9f619fd3-6a01-4dc6-b8eb-81e4b4ae1a31",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "total_additions"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6b430b4c-df81-419d-a81e-63bb0150031b",
+   "metadata": {},
+   "source": [
+    "Next, initialize the Bokeh plot (figure) with some basic configurations."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "07644ccd-1f0c-4a50-8929-82559f9bf803",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    " # Create a new plot with a title. Set the size in pixels\n",
+    "bar_plot = figure(height=500, width=750, x_range=channels,  title=\"Additions by channel\",\n",
+    "           toolbar_location=None)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e0a01f4c-6f35-4eb7-9263-4f6ed6c68223",
+   "metadata": {},
+   "source": [
+    "Now, configure the renderer for the vertical bars on the plot:\n",
+    "- Set the x-axis to `channels`, the list of channels.\n",
+    "- Set the `top` coordinate that determines the bar height to `total_additions`.\n",
+    "- For a splash of color, set the `color` to `Spectral10`. \n",
+    "\n",
+    "Note that palettes in Bokeh are lists of colors. Bokeh expects the list length to equal the list length of the data dimensions -- in this case 10 colors.\n",
+    "\n",
+    "See the Bokeh docs for more information on [vertical bars](https://docs.bokeh.org/en/latest/docs/reference/plotting/figure.html#bokeh.plotting.figure.vbar) and [palettes](https://docs.bokeh.org/en/latest/docs/reference/palettes.html). Note the expected output for the cell: `GlyphRenderer(\tid = 'p1054', …)`"

Review Comment:
   ```suggestion
       "See the Bokeh docs for more information on [vertical bars](https://docs.bokeh.org/en/latest/docs/reference/plotting/figure.html#bokeh.plotting.figure.vbar) and [palettes](https://docs.bokeh.org/en/latest/docs/reference/palettes.html). Note the expected output for the cell: `GlyphRenderer(id = 'p1054', …)`"
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org