Posted to dev@sqoop.apache.org by "Attila Szabo (JIRA)" <ji...@apache.org> on 2016/05/11 14:15:13 UTC

[jira] [Issue Comment Deleted] (SQOOP-2906) Optimization of AvroUtil.toAvroIdentifier

     [ https://issues.apache.org/jira/browse/SQOOP-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Attila Szabo updated SQOOP-2906:
--------------------------------
    Comment: was deleted

(was: Hi Joeri,

I joined the Sqoop community only a few weeks ago, so maybe I don't see all of the pitfalls, but let me raise a few suggestions/concerns:
Your fix seems to be okay, but I would suggest a few more changes processing-wise:
- First of all, I would not do the conversion for every column name on every call, but rather create a Map<String, String> that holds the "original" vs. "converted" names. In most cases we would then only have to look up the name in O(1) time, rather than doing the conversion each time (even if it's now much faster and cheaper).
- I would also not convert those entry.getKey() values that get a hit in the schema (schema.getField returns non-null), as in that case they're already valid names. This optimization may be negligible, though, if you implement the first proposal.
- I was also considering doing the mapping in advance, before the import (once we have the DB metadata and the Avro schema). For non-RDBMS systems this might cause problems (e.g. a different set of columns for each row), so I'm not sure it would help, but from an algorithmic/clean code POV that would be the cleanest solution if possible.
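
To make the first proposal concrete, here is a minimal sketch of such a lookup map. The IdentifierCache class and the converter passed to it are hypothetical names used only for illustration; in Sqoop the converter would presumably be AvroUtil.toAvroIdentifier, but the stand-in used in main below is not its actual logic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical helper: remembers "original" -> "converted" column names. */
final class IdentifierCache {
    private final Map<String, String> converted = new ConcurrentHashMap<>();
    private final Function<String, String> converter;

    /** The converter is assumed to be pure and to never return null. */
    IdentifierCache(Function<String, String> converter) {
        this.converter = converter;
    }

    /** Runs the converter only the first time a given name is seen. */
    String lookup(String original) {
        return converted.computeIfAbsent(original, converter);
    }

    /** Number of distinct names converted so far. */
    int size() {
        return converted.size();
    }

    public static void main(String[] args) {
        // Stand-in conversion for the demo only, not Sqoop's real rule.
        IdentifierCache cache = new IdentifierCache(s -> s.replace(' ', '_'));
        System.out.println(cache.lookup("first name")); // converted once
        System.out.println(cache.lookup("first name")); // served from the map
    }
}
```

The map only pays off if the same column names recur across rows, which is the usual case for an RDBMS import. For sources where column sets differ per row the map still works, it just grows with the number of distinct names.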

Could you tell me what you think about these suggestions?
My 2cents,
Attila (Maugli))

> Optimization of AvroUtil.toAvroIdentifier
> -----------------------------------------
>
>                 Key: SQOOP-2906
>                 URL: https://issues.apache.org/jira/browse/SQOOP-2906
>             Project: Sqoop
>          Issue Type: Improvement
>            Reporter: Joeri Hermans
>            Assignee: Joeri Hermans
>              Labels: avro, hadoop, optimization
>         Attachments: diff.txt
>
>
> Hi all
> Our distributed profiler indicated some inefficiencies in the AvroUtil.toAvroIdentifier method, more specifically, the use of Regex patterns. This can be directly observed from the FlameGraph generated by this profiler (https://jhermans.web.cern.ch/jhermans/sqoop_avro_flamegraph.svg). We implemented an optimization, and compared this with the original method. On our testing machine, the optimization by itself is about 500% (on average) more efficient compared to the original implementation. We have yet to test how this optimization will influence the performance of user jobs.
> Any suggestions or remarks are welcome.
> Kind regards,
> Joeri
> https://github.com/apache/sqoop/pull/18
> Writeup:
> https://db-blog.web.cern.ch/blog/joeri-hermans/2016-04-hadoop-performance-troubleshooting-stack-tracing-introduction
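
For readers who want a feel for what a regex-free conversion can look like, below is a minimal single-pass sketch. It is not the code from the pull request or from AvroUtil; the class name AvroNameSketch and the exact replacement rules are assumptions chosen for illustration. The only fact relied on is that an Avro name must start with [A-Za-z_] and continue with [A-Za-z0-9_].

```java
/** Hypothetical sketch of a single-pass, regex-free name sanitizer. */
final class AvroNameSketch {

    /**
     * Replaces every character outside [A-Za-z0-9_] with '_' and
     * prefixes '_' when the input starts with a digit.
     * Assumption: an empty input is returned unchanged.
     */
    static String toIdentifier(String candidate) {
        StringBuilder out = new StringBuilder(candidate.length() + 1);
        for (int i = 0; i < candidate.length(); i++) {
            char c = candidate.charAt(i);
            boolean letter = (c >= 'A' && c <= 'Z')
                    || (c >= 'a' && c <= 'z')
                    || c == '_';
            boolean digit = c >= '0' && c <= '9';
            if (i == 0 && digit) {
                out.append('_'); // a name may not start with a digit
            }
            out.append(letter || digit ? c : '_');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toIdentifier("1st col"));
        System.out.println(toIdentifier("first-name"));
    }
}
```

The loop visits each character once and allocates a single StringBuilder, so it avoids the regex matching work that the description above points to as the source of the overhead.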



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)