Posted to issues@arrow.apache.org by "Steven Anton (JIRA)" <ji...@apache.org> on 2017/08/19 00:39:00 UTC

[jira] [Updated] (ARROW-1374) Compatibility with xgboost

     [ https://issues.apache.org/jira/browse/ARROW-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Anton updated ARROW-1374:
--------------------------------
    Description: 
Traditionally I work with CSVs and suffer from slow read/write times. Parquet and the Arrow project obviously give us huge speedups.

One thing I've noticed, however, is that there is a serious bottleneck when converting a DataFrame read through pyarrow into the DMatrix that xgboost uses. For example, I'm building a model with about 180k rows and 6k float64 columns. Reading into a pandas DataFrame takes about 20 seconds on my machine, but converting that DataFrame to a DMatrix takes well over 10 minutes.

Interestingly, it takes about 10 minutes to read that same data from a CSV into a pandas DataFrame. Then, it takes less than a minute to convert to a DMatrix.

I'm sure there's a good technical explanation for why this happens (e.g. row- vs. column-oriented memory layout). Still, I imagine many others will hit this use case, and it would be great to improve these times if possible.


{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import xgboost as xgb

# Reading from parquet:
table = pq.read_table('/path/to/parquet/files')  # 20 seconds
variables = table.to_pandas()  # 1 second
dtrain = xgb.DMatrix(variables.drop(['tag'], axis=1), label=variables['tag'])  # takes 10-15 minutes

# Reading from CSV:
variables = pd.read_csv('/path/to/file.csv', ...)  # takes about 10 minutes
dtrain = xgb.DMatrix(variables.drop(['tag'], axis=1), label=variables['tag'])  # less than 1 minute
{code}
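
If the cost really is in copying out of a column-oriented DataFrame, one workaround I've been meaning to try (just a sketch, untested; the path and the 'tag' column are the same placeholders as above) is to hand DMatrix a single contiguous NumPy array instead of the DataFrame itself:

{code:python}
import numpy as np
import pyarrow.parquet as pq
import xgboost as xgb

table = pq.read_table('/path/to/parquet/files')
variables = table.to_pandas()

# Pull the label out, then force the features into one contiguous float32
# block, so DMatrix is given a single dense array rather than having to walk
# a block-fragmented DataFrame column by column (assumption on my part, not
# verified against the xgboost internals).
labels = variables['tag'].values
features = np.ascontiguousarray(variables.drop(['tag'], axis=1).values,
                                dtype=np.float32)

dtrain = xgb.DMatrix(features, label=labels)
{code}

Going through a plain array drops the column names, so if feature names matter they would have to be passed to DMatrix separately, and whether this actually avoids the slow path is something I still need to profile.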



> Compatibility with xgboost
> --------------------------
>
>                 Key: ARROW-1374
>                 URL: https://issues.apache.org/jira/browse/ARROW-1374
>             Project: Apache Arrow
>          Issue Type: Wish
>            Reporter: Steven Anton
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)