You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@impala.apache.org by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org> on 2017/04/07 01:52:42 UTC

[Impala-ASF-CR] IMPALA-5181: Extract PYPI metadata from a webpage

Taras Bobrovytsky has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/6579

Change subject: IMPALA-5181: Extract PYPI metadata from a webpage
......................................................................

IMPALA-5181: Extract PYPI metadata from a webpage

There were some build failures due to a failure to download a JSON file
containing package metadata from PYPI. We need to switch to downloading
this from a PYPI mirror. In order to be able to download the metadata
from a PYPI mirror, we need be able to extract the data from a web page,
because PYPI mirrors do not always have a JSON interface.

We implement a regex based html parser in this patch. Also, we increase
the number of download attempts and randomly vary the amount of time
between each attempt.

Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
---
M infra/python/deps/pip_download.py
1 file changed, 63 insertions(+), 33 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/79/6579/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>

[Impala-ASF-CR] IMPALA-5181: Extract PYPI metadata from a webpage

Posted by "David Knupp (Code Review)" <ge...@cloudera.org>.

David Knupp has posted comments on this change.

Change subject: IMPALA-5181: Extract PYPI metadata from a webpage
......................................................................


Patch Set 1:

I'm totally in agreement.

-- 
To view, visit http://gerrit.cloudera.org:8080/6579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5181: Extract PYPI metadata from a webpage

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5181: Extract PYPI metadata from a webpage
......................................................................


Patch Set 1:

*Not suggesting we should try to do that now. We should put out the fire first by supporting mirrors before thinking about redesigning the building.

-- 
To view, visit http://gerrit.cloudera.org:8080/6579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5181: Extract PYPI metadata from a webpage

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5181: Extract PYPI metadata from a webpage
......................................................................


Patch Set 1:

One thing to keep in mind is that we only have system pip available at this point, and the lowest-common denominator is pretty low. On CentOS 5, for example, "pip download" didn't exist, but pip install --download does. 

I actually made got that to work a while back, (with a hack to work around a problem with kudu-python): https://gerrit.cloudera.org/#/c/3217/1/infra/python/deps/download_requirements

I only added pip_download.py in the first place to avoid that hack but I agree the amount of work invested into this wasn't justified. Use JSON they said, it'll be easy they said :). That said, I think given the work we've sunk into it and the fact that it works, it seems easier to reason about than the behaviour of any possible version of pip from the last 10 years.

-- 
To view, visit http://gerrit.cloudera.org:8080/6579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5181: Extract PYPI metadata from a webpage

Posted by "David Knupp (Code Review)" <ge...@cloudera.org>.

David Knupp has posted comments on this change.

Change subject: IMPALA-5181: Extract PYPI metadata from a webpage
......................................................................


Patch Set 1:

There may well be some historical context that I'm missing here, or just some
complication that I'm overlooking, so sincere apologies if these comments are
not relevant, but one thing I've noticed is that even though we create a python
virtualenv, instead of ever activating it, we use wrapper scripts to construct
elaborate paths to the enclosed binaries (not just Impala does this, but other
Cloudera teams do this too.)

E.g., if I'm in my system evironment:

  dknupp@dknupp-desktop:0:~ $ which python pip virtualenv
  /usr/bin/python
  /usr/local/bin/pip
  /usr/local/bin/virtualenv
  dknupp@dknupp-desktop:0:~ $ pip --version
  pip 8.1.2 from /usr/local/lib/python2.7/dist-packages (python 2.7)
  dknupp@dknupp-desktop:0:~ $ virtualenv --version
  15.0.2  

But if I _activate_ the Impala python virtualenv first...

  dknupp@dknupp-desktop:0:~ $ source ${IMPALA_HOME}/infra/python/env/bin/activate
  (env)dknupp@dknupp-desktop:0:~ $ which python pip virtualenv
  /home/dknupp/Impala/infra/python/env/bin/python
  /home/dknupp/Impala/infra/python/env/bin/pip
  /home/dknupp/Impala/infra/python/env/bin/virtualenv

...the system versions no longer matter, and I can upgrade these at will.

  (env)dknupp@dknupp-desktop:0:~ $ pip install -U pip virtualenv
  Collecting pip
    Using cached pip-9.0.1-py2.py3-none-any.whl
  Collecting virtualenv
    Downloading virtualenv-15.1.0-py2.py3-none-any.whl (1.8MB)
      100% |################################| 1.8MB 1.0MB/s
  Installing collected packages: pip, virtualenv
    Found existing installation: pip 8.1.2
      Uninstalling pip-8.1.2:
        Successfully uninstalled pip-8.1.2
    Found existing installation: virtualenv 13.1.0
      Uninstalling virtualenv-13.1.0:
        Successfully uninstalled virtualenv-13.1.0
  Successfully installed pip-9.0.1 virtualenv-15.1.0

  (env)dknupp@dknupp-desktop:0:~ $ pip --version
  pip 9.0.1 from /home/dknupp/Impala/infra/python/env/local/lib/python2.7/site-packages (python 2.7)
  (env)dknupp@dknupp-desktop:0:~ $ virtualenv --version
  15.1.0

When I deactivate the virtualenv, the system versions are untouched.

  dknupp@dknupp-desktop:0:~ $ pip --version
  pip 8.1.2 from /usr/local/lib/python2.7/dist-packages (python 2.7)
  dknupp@dknupp-desktop:0:~ $ virtualenv --version
  15.0.2

So a possible workflow would be to:

1. Create the Impala virtualenv in ${IMPALA_HOME}/infra/python/env first
2. Activate the virtualenv
3. "pip install -U pip" to upgrade ${IMPALA_HOME}/infra/python/env/bin/pip
4. Using the upgraded pip to download & install the various things
5. Deactivate the virtualenv

-- 
To view, visit http://gerrit.cloudera.org:8080/6579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5181: Extract PYPI metadata from a webpage

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has uploaded a new patch set (#2).

Change subject: IMPALA-5181: Extract PYPI metadata from a webpage
......................................................................

IMPALA-5181: Extract PYPI metadata from a webpage

There were some build failures due to a failure to download a JSON file
containing package metadata from PYPI. We need to switch to downloading
this from a PYPI mirror. In order to be able to download the metadata
from a PYPI mirror, we need be able to extract the data from a web page,
because PYPI mirrors do not always have a JSON interface.

We implement a regex based html parser in this patch. Also, we increase
the number of download attempts and randomly vary the amount of time
between each attempt.

Testing:
- Tested locally against PYPI and a PYPI mirror.
- Ran a private build that passed (which used a PYPI mirror).

Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
---
M infra/python/deps/pip_download.py
1 file changed, 57 insertions(+), 33 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/79/6579/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5181: Extract PYPI metadata from a webpage

Posted by "David Knupp (Code Review)" <ge...@cloudera.org>.

David Knupp has posted comments on this change.

Change subject: IMPALA-5181: Extract PYPI metadata from a webpage
......................................................................


Patch Set 1:

I have to confess, I've long felt like there's lot of over engineering in the way that we set up the Impala python environment. E.g., pip allows for this usage:

   pip download -i <pypi_index> -r <requirements_file> --no-binary=:all:

It doesn't download wheels, and it won't download packages that are already present, which I think is the main purpose of this file? It might mean that we have to update our version of pip in infra/python/env/bin as a first step, but if it works, it would obviate the need for a lot of this logic where bugs can hide.

If this is the fastest solution to getting build working again so we can unlock the repo and start checking in code again, that's fine. But I think that there might be an easier way to do this.

-- 
To view, visit http://gerrit.cloudera.org:8080/6579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5181: Extract PYPI metadata from a webpage

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Hello Lars Volker,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6579

to look at the new patch set (#2).

Change subject: IMPALA-5181: Extract PYPI metadata from a webpage
......................................................................

IMPALA-5181: Extract PYPI metadata from a webpage

There were some build failures due to a failure to download a JSON file
containing package metadata from PYPI. We need to switch to downloading
this from a PYPI mirror. In order to be able to download the metadata
from a PYPI mirror, we need be able to extract the data from a web page,
because PYPI mirrors do not always have a JSON interface.

We implement a regex based html parser in this patch. Also, we increase
the number of download attempts and randomly vary the amount of time
between each attempt.

Testing:
- Tested locally against PYPI and a PYPI mirror.
- Ran a private build that passed (which used a PYPI mirror).

Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
---
M infra/python/deps/pip_download.py
1 file changed, 57 insertions(+), 33 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/79/6579/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5181: Extract PYPI metadata from a webpage

Posted by "David Knupp (Code Review)" <ge...@cloudera.org>.

David Knupp has posted comments on this change.

Change subject: IMPALA-5181: Extract PYPI metadata from a webpage
......................................................................


Patch Set 1:

Again, just wanted to reiterate that I think if this is the quickest path to unlocking 
the repo, this is an acceptable approach. I don't mean to hold this up. Discussions as to whether we should revisit the whole virtualenv workflow can happen elsewhere. In light of that...

1. I think Lars' point about adding a comment in the code as to why lxml (etc.) can't be used is still a good one.

2. Can you add a testing comment in the commit msg describing how this was tested. Did you try it against the default public PYPI, and also non-default mirror?

-- 
To view, visit http://gerrit.cloudera.org:8080/6579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5181: Extract PYPI metadata from a webpage

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Hello Lars Volker,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6579

to look at the new patch set (#2).

Change subject: IMPALA-5181: Extract PYPI metadata from a webpage
......................................................................

IMPALA-5181: Extract PYPI metadata from a webpage

There were some build failures due to a failure to download a JSON file
containing package metadata from PYPI. We need to switch to downloading
this from a PYPI mirror. In order to be able to download the metadata
from a PYPI mirror, we need be able to extract the data from a web page,
because PYPI mirrors do not always have a JSON interface.

We implement a regex based html parser in this patch. Also, we increase
the number of download attempts and randomly vary the amount of time
between each attempt.

Testing:
- Tested locally against PYPI and a PYPI mirror.
- Ran a private build that passed (which used a PYPI mirror).

Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
---
M infra/python/deps/pip_download.py
1 file changed, 57 insertions(+), 33 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/79/6579/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5181: Extract PYPI metadata from a webpage

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5181: Extract PYPI metadata from a webpage
......................................................................


Patch Set 1:

Actually "pip download" doesn't even exist on recent distributions. On ubuntu 16.04:

  tarmstrong@tarmstrong-box:~$ pip --help

  Usage:   
  pip <command> [options]

Commands:
  install                     Install packages.
  uninstall                   Uninstall packages.
  freeze                      Output installed packages in requirements format.
  list                        List installed packages.
  show                        Show information about installed packages.
  search                      Search PyPI for packages.
  wheel                       Build wheels from your requirements.
  help                        Show help for commands.

General Options:
  -h, --help                  Show help.
  --isolated                  Run pip in an isolated mode, ignoring
                              environment variables and user configuration.
  -v, --verbose               Give more output. Option is additive, and can be
                              used up to 3 times.
  -V, --version               Show version and exit.
  -q, --quiet                 Give less output.
  --log <path>                Path to a verbose appending log.
  --proxy <proxy>             Specify a proxy in the form
                              [user:passwd@]proxy.server:port.
  --retries <retries>         Maximum number of retries each connection should
                              attempt (default 5 times).
  --timeout <sec>             Set the socket timeout (default 15 seconds).
  --exists-action <action>    Default action when a path already exists:
                              (s)witch, (i)gnore, (w)ipe, (b)ackup.
  --trusted-host <hostname>   Mark this host as trusted, even though it does
                              not have valid or any HTTPS.
  --cert <path>               Path to alternate CA bundle.
  --client-cert <path>        Path to SSL client certificate, a single file
                              containing the private key and the certificate
                              in PEM format.
  --cache-dir <dir>           Store the cache data in <dir>.
  --no-cache-dir              Disable the cache.
  --disable-pip-version-check
                              Don't periodically check PyPI to determine
                              whether a new version of pip is available for
                              download. Implied with --no-index.

-- 
To view, visit http://gerrit.cloudera.org:8080/6579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5181: Extract PYPI metadata from a webpage

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-5181: Extract PYPI metadata from a webpage
......................................................................


Patch Set 1:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/6579/1/infra/python/deps/pip_download.py
File infra/python/deps/pip_download.py:

Line 73:   if GET_METADATA_FROM_JSON:
> Do we need to keep the JSON API version? The JSON code is cleaner but it se
Removed.


Line 74:     REQUIRED_PKG_TYPE = 'sdist' # Wheel archives are not supported.
> Perhaps a nit, but if we're considering REQUIRED_PKG_TYPE to be a "constant
This is no longer relevant because this section of the code is removed.


Line 83:     url = '{0}/simple/{1}/'.format(PYPI_MIRROR, pkg_name)
> Mention that this is the PEP 503 format in case people want to look at the 
Done


Line 86:     regex = r'<a href=\".*packages/(.*)#md5=(.*?)\".*>(.*)<\/a>'
> I wasn't aware how early in the setup process this runs. In this case I agr
Done. Made all qualifiers non-greedy. Added a comment explaining why we parse with regex.


Line 91:       if (
> nit: formatting of this seems weird. I'd probably prefer this (although I'l
Done.

I formatted it that way to satisfy flake8 formatter, but after researching it a bit, it's correct to write it the way you suggested too (even though flake8 complains).


-- 
To view, visit http://gerrit.cloudera.org:8080/6579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5181: Extract PYPI metadata from a webpage

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5181: Extract PYPI metadata from a webpage
......................................................................


Patch Set 1:

The problem is that this script is the one that downloads the virtualenv that we're going to use. 

I think there's a valid immediate reason for all of the things that have been added but maybe the whole bootstrapping process could be simplified if we rethought some of the earlier decisions.

-- 
To view, visit http://gerrit.cloudera.org:8080/6579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5181: Extract PYPI metadata from a webpage

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.

Lars Volker has posted comments on this change.

Change subject: IMPALA-5181: Extract PYPI metadata from a webpage
......................................................................


Patch Set 1: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6579/1/infra/python/deps/pip_download.py
File infra/python/deps/pip_download.py:

Line 86:     regex = r'<a href=\".*packages/(.*)#md5=(.*?)\".*>(.*)<\/a>'
> I considered using beautifulsoup, but the problem is that we have to downlo
I wasn't aware how early in the setup process this runs. In this case I agree that using a regex seems like the best thing to do.


-- 
To view, visit http://gerrit.cloudera.org:8080/6579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5181: Extract PYPI metadata from a webpage

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-5181: Extract PYPI metadata from a webpage
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6579/1/infra/python/deps/pip_download.py
File infra/python/deps/pip_download.py:

Line 86:     regex = r'<a href=\".*packages/(.*)#md5=(.*?)\".*>(.*)<\/a>'
> Can you add a comment explaining why we're using regexes to parse the HTML,
I considered using beautifulsoup, but the problem is that we have to download and install it first before using it in this script. Let me know if you have some ideas how we can do this (I think it's definitely a better solution).

Since the html is guaranteed to be structured a certain way according to the PEP 503 documentation, I think it's ok to use regex to parse.


-- 
To view, visit http://gerrit.cloudera.org:8080/6579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5181: Extract PYPI metadata from a webpage

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5181: Extract PYPI metadata from a webpage
......................................................................


Patch Set 1:

(4 comments)

Thanks for solving this. I tried this out locally and it worked nicely for me. Only minor comments.

http://gerrit.cloudera.org:8080/#/c/6579/1/infra/python/deps/pip_download.py
File infra/python/deps/pip_download.py:

Line 73:   if GET_METADATA_FROM_JSON:
Do we need to keep the JSON API version? The JSON code is cleaner but it seems easier to only maintain one version - there's always a chance we accidentally break the non-JSON version and don't catch it on checkin.


Line 83:     url = '{0}/simple/{1}/'.format(PYPI_MIRROR, pkg_name)
Mention that this is the PEP 503 format in case people want to look at the docs: https://www.python.org/dev/peps/pep-0503/


Line 86:     regex = r'<a href=\".*packages/(.*)#md5=(.*?)\".*>(.*)<\/a>'
I think this would be more robust if all of the * quantifiers were non-greedy, i.e. *? 

Otherwise it relies on there being only one <a>...</a> per line, which seems to be true in practice but isn't mentioned in the PEP 503 spec.


Line 91:       if (
nit: formatting of this seems weird. I'd probably prefer this (although I'll defer to you if this is standard python style).

      if (file_name.endswith('-{0}.tar.gz'.format(pkg_version)) or
          file_name.endswith('-{0}.tar.bz2'.format(pkg_version)) or
          file_name.endswith('-{0}.zip'.format(pkg_version))):


-- 
To view, visit http://gerrit.cloudera.org:8080/6579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5181: Extract PYPI metadata from a webpage

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.

Lars Volker has posted comments on this change.

Change subject: IMPALA-5181: Extract PYPI metadata from a webpage
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6579/1/infra/python/deps/pip_download.py
File infra/python/deps/pip_download.py:

Line 86:     regex = r'<a href=\".*packages/(.*)#md5=(.*?)\".*>(.*)<\/a>'
> I think this would be more robust if all of the * quantifiers were non-gree
Can you add a comment explaining why we're using regexes to parse the HTML, i.e. why we can't use beautifulsoup or lxml or the like?


-- 
To view, visit http://gerrit.cloudera.org:8080/6579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5181: Extract PYPI metadata from a webpage

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5181: Extract PYPI metadata from a webpage
......................................................................


Patch Set 2: Code-Review+2

(1 comment)

Thank you!

http://gerrit.cloudera.org:8080/#/c/6579/1/infra/python/deps/pip_download.py
File infra/python/deps/pip_download.py:

Line 91:   if not candidates:
> Done.
Oh ok, I'm fine leaving it that way if it works nicer wiht auto-formatting.


-- 
To view, visit http://gerrit.cloudera.org:8080/6579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5181: Extract PYPI metadata from a webpage

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.

Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-5181: Extract PYPI metadata from a webpage
......................................................................


IMPALA-5181: Extract PYPI metadata from a webpage

There were some build failures due to a failure to download a JSON file
containing package metadata from PYPI. We need to switch to downloading
this from a PYPI mirror. In order to be able to download the metadata
from a PYPI mirror, we need be able to extract the data from a web page,
because PYPI mirrors do not always have a JSON interface.

We implement a regex based html parser in this patch. Also, we increase
the number of download attempts and randomly vary the amount of time
between each attempt.

Testing:
- Tested locally against PYPI and a PYPI mirror.
- Ran a private build that passed (which used a PYPI mirror).

Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Reviewed-on: http://gerrit.cloudera.org:8080/6579
Reviewed-by: Tim Armstrong <ta...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M infra/python/deps/pip_download.py
1 file changed, 57 insertions(+), 33 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Tim Armstrong: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/6579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5181: Extract PYPI metadata from a webpage

Posted by "David Knupp (Code Review)" <ge...@cloudera.org>.

David Knupp has posted comments on this change.

Change subject: IMPALA-5181: Extract PYPI metadata from a webpage
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6579/1/infra/python/deps/pip_download.py
File infra/python/deps/pip_download.py:

Line 74:     REQUIRED_PKG_TYPE = 'sdist' # Wheel archives are not supported.
Perhaps a nit, but if we're considering REQUIRED_PKG_TYPE to be a "constant," I think I'd prefer to see it at the top of the file, rather than declared within this function.


-- 
To view, visit http://gerrit.cloudera.org:8080/6579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5181: Extract PYPI metadata from a webpage

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.

Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5181: Extract PYPI metadata from a webpage
......................................................................


Patch Set 2: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/6579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5181: Extract PYPI metadata from a webpage

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.

Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5181: Extract PYPI metadata from a webpage
......................................................................


Patch Set 2:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/443/

-- 
To view, visit http://gerrit.cloudera.org:8080/6579
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If3845a0d5f568d4352e3cc4883596736974fd7de
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: David Knupp <dk...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No