Posted to dev@arrow.apache.org by Valentyn Tymofieiev <va...@google.com.INVALID> on 2020/05/28 21:46:19 UTC

Why does downloading sources of pyarrow and its requirements take several minutes?

Hi Arrow dev community,

Do you have any insight into why

  python -m pip download --dest /tmp pyarrow==0.16.0 --no-binary :all:

takes several minutes to execute? From the output we can see that pip gets
stuck on:

  File was already downloaded /tmp/pyarrow-0.16.0.tar.gz
  Installing build dependencies ... |

There is a significant increase in runtime between 0.15.1 and 0.16.0. I
suspect some build dependencies need to be installed before pip can
determine the dependencies of pyarrow. Is there some inefficiency in
Arrow's setup.py that is causing this?

Thanks,
Valentyn

Re: Why does downloading sources of pyarrow and its requirements take several minutes?

Posted by Antoine Pitrou <an...@python.org>.
PyArrow has always required Numpy, so this sounds like a red herring.
If Numpy wasn't downloaded as part of source dependencies before, it was
certainly a bug.

Regards

Antoine.


On 29/05/2020 at 18:29, Wes McKinney wrote:
> It's possible it's related to
> 
> https://github.com/apache/arrow/commit/6a583e553de28e3341987911bb63fc19f99a6fb0#diff-23eeeb4347bdd26bfc6b7ee9a3b755dd
> 
> Is the issue still present with 0.17.0 or 0.17.1? In any case please
> do open an issue if it is not resolved in master and/or the latest
> releases.

Re: Why does downloading sources of pyarrow and its requirements take several minutes?

Posted by Wes McKinney <we...@gmail.com>.
It's possible it's related to

https://github.com/apache/arrow/commit/6a583e553de28e3341987911bb63fc19f99a6fb0#diff-23eeeb4347bdd26bfc6b7ee9a3b755dd

Is the issue still present with 0.17.0 or 0.17.1? In any case please
do open an issue if it is not resolved in master and/or the latest
releases.

On Fri, May 29, 2020 at 10:41 AM Brian Hulette <bh...@apache.org> wrote:
>
> +1 for a JIRA to track this. I looked into it a little bit just out of
> curiosity.
>
> I passed --verbose to pip to get insight into what's going on in the
> "Installing build dependencies..." step. I did this for both 0.15.1 and
> 0.16.0. They took 4:10 and 5:57 respectively. It looks like 0.16.0 spent
> 2:43 installing numpy, which is absent from the 0.15.1 log. I'm not sure
> what changed to cause this.
>
> I collected logs with the following command (note it relies on ts in
> moreutils for adding timestamps):
>
>   python -m pip download --dest /tmp pyarrow==0.16.0 --no-binary :all: --verbose 2>&1 | ts | tee /tmp/0.16.0.log
>
> I found the numpy difference and measured its runtime by grepping for
> "Running setup.py" in these logs.
>
> The logs are uploaded to google drive:
> https://drive.google.com/drive/folders/1rPoYAsVul3HGdrviiCLGPf_P8dOlBCd1?usp=sharing

Re: Why does downloading sources of pyarrow and its requirements take several minutes?

Posted by Brian Hulette <bh...@apache.org>.
+1 for a JIRA to track this. I looked into it a little bit just out of
curiosity.

I passed --verbose to pip to get insight into what's going on in the
"Installing build dependencies..." step. I did this for both 0.15.1 and
0.16.0. They took 4:10 and 5:57 respectively. It looks like 0.16.0 spent
2:43 installing numpy, which is absent from the 0.15.1 log. I'm not sure
what changed to cause this.

I collected logs with the following command (note it relies on ts in
moreutils for adding timestamps):

  python -m pip download --dest /tmp pyarrow==0.16.0 --no-binary :all: --verbose 2>&1 | ts | tee /tmp/0.16.0.log

I found the numpy difference and measured its runtime by grepping for
"Running setup.py" in these logs.
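
A minimal sketch of that kind of measurement, assuming each log line carries
the default ts prefix (for example "May 29 10:00:05") and that the time spent
in a step can be approximated by the gap between one "Running setup.py" line
and the next one (or the end of the log). The function names and the sample
log lines are made up for illustration; they are not taken from the uploaded
logs.

```python
from datetime import datetime

MARKER = "Running setup.py"


def parse_line(line, year):
    """Split a ts-prefixed line into (timestamp, message).

    Assumes the default ts format "%b %d %H:%M:%S", which has no year,
    so the caller supplies one.
    """
    parts = line.split(None, 3)
    stamp = " ".join(parts[:3])
    message = parts[3] if len(parts) > 3 else ""
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, message


def marker_gaps(lines, marker=MARKER, year=2020):
    """Return [(message, seconds)] for every line containing the marker.

    The duration is the time until the next marker line, or until the
    last line of the log for the final marker.
    """
    stamped = [parse_line(line, year) for line in lines if line.strip()]
    if not stamped:
        return []
    hits = [i for i, (_, msg) in enumerate(stamped) if marker in msg]
    end = stamped[-1][0]
    gaps = []
    for n, i in enumerate(hits):
        start = stamped[i][0]
        stop = stamped[hits[n + 1]][0] if n + 1 < len(hits) else end
        gaps.append((stamped[i][1], (stop - start).total_seconds()))
    return gaps


# Hypothetical log excerpt, invented for this example.
SAMPLE = """\
May 29 10:00:00 Collecting numpy
May 29 10:00:05 Running setup.py install for numpy: started
May 29 10:01:05 Running setup.py install for pyarrow: started
May 29 10:01:35 Done
""".splitlines()

if __name__ == "__main__":
    for message, seconds in marker_gaps(SAMPLE):
        print(f"{seconds:7.0f}s  {message}")
```

To run it on a real log, something like marker_gaps(open("/tmp/0.16.0.log"))
should work, provided the timestamps do not cross a year boundary.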

The logs are uploaded to google drive:
https://drive.google.com/drive/folders/1rPoYAsVul3HGdrviiCLGPf_P8dOlBCd1?usp=sharing

On Fri, May 29, 2020 at 5:49 AM Wes McKinney <we...@gmail.com> wrote:

> hi Valentyn,
>
> This is the first I've ever heard of anyone doing what you are doing,
> so safe to say that we've given little to no consideration to this use
> case. We have been focused on providing binary packages for pip and
> conda. Could you please open a JIRA and provide more detailed
> information about what you are seeing?
>
> Thanks
> Wes

Re: Why does downloading sources of pyarrow and its requirements take several minutes?

Posted by Wes McKinney <we...@gmail.com>.
hi Valentyn,

This is the first I've ever heard of anyone doing what you are doing,
so safe to say that we've given little to no consideration to this use
case. We have been focused on providing binary packages for pip and
conda. Could you please open a JIRA and provide more detailed
information about what you are seeing?

Thanks
Wes


Re: Why does downloading sources of pyarrow and its requirements take several minutes?

Posted by Joris Van den Bossche <jo...@gmail.com>.
I think this is because numpy has shipped a pyproject.toml file since
1.18 (https://github.com/numpy/numpy/pull/14053).
Apparently, when a package includes a pyproject.toml, pip creates a
build environment just to get the metadata (in the case of numpy, this
means creating an environment with the setuptools, wheel and cython
packages installed). That is what takes the extra time compared to older
versions of numpy.
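
As a rough way to check which downloaded sdists would trigger this, here is
a minimal sketch that looks for a top-level pyproject.toml inside a source
archive. It assumes the usual sdist layout (a single "name-version/"
directory at the root) and handles both .tar.gz and .zip archives; the
function name and the example path in the note below are hypothetical.

```python
import tarfile
import zipfile


def archive_names(path):
    """List member names of a .zip or tar-based source archive."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            return archive.namelist()
    with tarfile.open(path) as archive:
        return archive.getnames()


def sdist_has_pyproject(path):
    """Return True if the sdist ships a pyproject.toml next to setup.py.

    Only the top-level project directory is inspected, e.g.
    "numpy-1.18.0/pyproject.toml"; copies nested deeper are ignored.
    """
    for name in archive_names(path):
        parts = name.strip("/").split("/")
        if len(parts) == 2 and parts[1] == "pyproject.toml":
            return True
    return False
```

For example, sdist_has_pyproject("/tmp/numpy-1.18.0.zip") could be compared
against an older numpy archive in the same download directory.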

On Fri, 29 May 2020 at 20:02, Valentyn Tymofieiev
<va...@google.com.invalid> wrote:

> Thanks for the input. Opened
> https://issues.apache.org/jira/browse/ARROW-8983, we can continue the
> conversation there.

Re: Why does downloading sources of pyarrow and its requirements take several minutes?

Posted by Valentyn Tymofieiev <va...@google.com.INVALID>.
Thanks for the input. Opened
https://issues.apache.org/jira/browse/ARROW-8983, we can continue the
conversation there.
