Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2016/09/16 17:31:21 UTC
[jira] [Updated] (MADLIB-1019) Scalability of SVEC

     [ https://issues.apache.org/jira/browse/MADLIB-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan updated MADLIB-1019:
------------------------------------
    Priority: Major  (was: Minor)

> Scalability of SVEC
> -------------------
>
>                 Key: MADLIB-1019
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1019
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Sparse Vectors
>            Reporter: Frank McQuillan
>
> Entered on behalf of a user doing text analytics work...
> We're testing some MADlib functions (our install is madlib-ossv1.9_pv1.9.5_gpdb4.3orca-rhel5-x86_64). While testing, we are running into performance issues as we try to scale up our data set.
> We took a subset of our data and ran the query on a varying number of rows, with each row between 900 and 1000 bytes long. The following list shows the number of rows in our base dataset (which feeds into SVEC) and the time the run took:
> {code}
> 1,000 rows   -> 0.5 sec
> 10,000 rows  -> 20 sec
> 100,000 rows -> killed the process at 15 min
> {code}
> This is nowhere near linear scaling, and it also shows severe skew: only one postgres process on one node is used during this processing. The query that we're running is:
> {code}
> CREATE TABLE public.tfidf AS (
>     SELECT
>         doc_id AS document_id,
>         madlib.svec_mult(sparse_vector, logidf) AS tf_idf
>     FROM
>         public.corpus,
>         (
>             SELECT madlib.svec_log(
>                 madlib.svec_div(
>                     count(sparse_vector)::madlib.svec,
>                     madlib.svec_count_nonzero(sparse_vector)
>                 )
>             ) AS logidf
>             FROM public.corpus
>         ) foo
>     --ORDER BY document_id
> ) DISTRIBUTED BY (document_id)
> {code}
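For readers unfamiliar with the SVEC functions, the arithmetic of the query above can be sketched in plain Python. This is only an illustration on dense lists, not MADlib's implementation: the function name `tf_idf` is hypothetical, and the handling of terms that appear in no document (weight 0.0) is an assumption of the sketch. It shows that the log-IDF vector is computed once over the whole corpus and then multiplied element-wise into every document row, which is the step performed by svec_mult().

```python
import math

def tf_idf(corpus):
    """corpus: list of equal-length term-count vectors (dense lists)."""
    n_docs = len(corpus)  # corresponds to count(sparse_vector)
    dims = len(corpus[0])
    # Per term, the number of documents with a nonzero count
    # (the role played by svec_count_nonzero in the query).
    doc_freq = [sum(1 for doc in corpus if doc[j] != 0) for j in range(dims)]
    # log(N / df), the role of svec_log(svec_div(...)).
    # Assumption of this sketch: terms with df == 0 get weight 0.0.
    log_idf = [math.log(n_docs / df) if df else 0.0 for df in doc_freq]
    # Element-wise product per document, the role of svec_mult.
    return [[tf * w for tf, w in zip(doc, log_idf)] for doc in corpus]

weights = tf_idf([[1, 0, 2], [0, 0, 3]])
```

In this toy corpus only the first term is discriminating: it occurs in one of the two documents, so it gets weight log(2), while the third term occurs in both and gets weight 0.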
> After some investigation, we determined that madlib.svec_mult() is the performance bottleneck here. The inner select that calls svec_div() and svec_count_nonzero() runs relatively quickly. We're looking for guidance on (a) the skew issue and (b) performance in general, because ultimately we need to scale up to 45M base table rows, which will greatly increase the SVEC size.
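To put a number on "not linear", the two completed timings reported in the description can be fitted to a power law t = c * n^k. This is a rough sketch from only two data points, and the power-law form itself is an assumption; the helper names `scaling_exponent` and `extrapolate` are hypothetical.

```python
import math

# Completed runs as reported in the issue: (rows, seconds).
timings = [(1_000, 0.5), (10_000, 20.0)]

def scaling_exponent(a, b):
    """Exponent k in t = c * n**k implied by two (rows, seconds) points."""
    (n1, t1), (n2, t2) = a, b
    return math.log(t2 / t1) / math.log(n2 / n1)

k = scaling_exponent(*timings)

def extrapolate(rows, ref=timings[1], exponent=k):
    """Predicted seconds at `rows`, scaling from the reference point."""
    n_ref, t_ref = ref
    return t_ref * (rows / n_ref) ** exponent

print(f"exponent k = {k:.2f}")                        # about 1.60
print(f"100,000 rows: {extrapolate(100_000):.0f} s")  # about 800 s
```

Linear scaling would give k = 1. An exponent near 1.6 predicts roughly 13 minutes for 100,000 rows, so the run that was killed at 15 minutes was scaling at least this badly.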


