Posted to issues@tez.apache.org by "Christophe Préaud (Jira)" <ji...@apache.org> on 2022/05/23 13:13:00 UTC

[jira] [Commented] (TEZ-4415) Hadoop archives created with Tez miss index files

    [ https://issues.apache.org/jira/browse/TEZ-4415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540943#comment-17540943 ] 

Christophe Préaud commented on TEZ-4415:
----------------------------------------

By contrast, Hadoop archives created with MapReduce do not have this issue; the _index and _masterindex files are written and the archive is readable:
{code:java}
# create hadoop archive with MapReduce
hadoop archive -D mapreduce.framework.name=yarn -archiveName data.har -p /user/preaudc/data /user/preaudc
(...)
22/05/23 13:10:19 INFO mapreduce.JobSubmitter: number of splits:1
(...)

# _index and _masterindex files are created
hdfs dfs -ls /user/preaudc/data.har
Found 4 items
-rw-r--r--   3 preaudc preaudc          0 2022-05-23 13:11 /user/preaudc/data.har/_SUCCESS
-rw-r--r--   3 preaudc preaudc       8104 2022-05-23 13:11 /user/preaudc/data.har/_index
-rw-r--r--   3 preaudc preaudc         24 2022-05-23 13:11 /user/preaudc/data.har/_masterindex
-rw-r--r--   3 preaudc preaudc 2537147461 2022-05-23 13:11 /user/preaudc/data.har/part-0

# the hadoop archive is perfectly readable
hdfs dfs -ls har:/user/preaudc/data.har
Found 12 items
-rw-r--r--   3 preaudc preaudc   11289211 2021-04-19 15:37 har:///user/preaudc/data.har/sp-pos-quot-dep-2021-04-18-19h10.csv
drwxr-xr-x   - preaudc preaudc          0 2021-04-20 14:40 har:///user/preaudc/data.har/sp-pos-quot-dep-2021-04-18-19h10.parquet
-rw-r--r--   3 preaudc preaudc   11390298 2021-04-21 16:22 har:///user/preaudc/data.har/sp-pos-quot-dep-2021-04-21-18h11.csv
-rw-r--r--   3 preaudc preaudc   24262903 2022-05-03 07:56 har:///user/preaudc/data.har/sp-pos-quot-dep-2022-05-01-19h01.csv
drwxr-xr-x   - preaudc preaudc          0 2022-05-03 13:00 har:///user/preaudc/data.har/sp-pos-quot-dep-en.parquet
drwxr-xr-x   - preaudc preaudc          0 2022-05-03 08:36 har:///user/preaudc/data.har/sp-pos-quot-dep.parquet
drwxr-xr-x   - preaudc preaudc          0 2022-05-05 13:32 har:///user/preaudc/data.har/sp-pos-quot-en.parquet
-rw-r--r--   3 preaudc preaudc    3386674 2021-04-21 16:22 har:///user/preaudc/data.har/sp-pos-quot-reg-2021-04-21-18h11.csv
-rw-r--r--   3 preaudc preaudc    7327904 2022-05-04 14:54 har:///user/preaudc/data.har/sp-pos-quot-reg-2022-05-03-19h01.csv
drwxr-xr-x   - preaudc preaudc          0 2021-04-21 16:33 har:///user/preaudc/data.har/sp-pos-quot-reg.parquet
drwxr-xr-x   - preaudc preaudc          0 2021-04-26 07:38 har:///user/preaudc/data.har/sp-pos-quot.parquet
drwxr-xr-x   - preaudc preaudc          0 2021-04-22 15:30 har:///user/preaudc/data.har/sp-pos-quot.parquet2{code}
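
The difference between the two runs comes down to whether _index and _masterindex appear in the archive directory listing. As a rough sketch only, that check could be scripted as below. The function name missing_har_index_files is hypothetical (it is not part of Hadoop or Tez), and it assumes the `hdfs dfs -ls` output format shown in this report, with the path as the last field of each line.

```shell
#!/bin/sh
# Hypothetical helper, not part of Hadoop or Tez.
# Reads the output of `hdfs dfs -ls <archive>.har` on stdin and prints the
# name of each required HAR index file that is absent from the listing.
missing_har_index_files() {
    listing=$(cat)
    for name in _index _masterindex; do
        # Assumption: the path is the last whitespace-separated field of each
        # listing line, so stripping everything up to the final "/" leaves
        # the file name.
        if ! printf '%s\n' "$listing" | awk -v n="$name" '
            { sub(/.*\//, "", $NF); if ($NF == n) found = 1 }
            END { exit (found ? 0 : 1) }'; then
            printf '%s\n' "$name"
        fi
    done
}
```

Piping a listing into it, for example `hdfs dfs -ls /user/preaudc/data.har | missing_har_index_files`, prints nothing when both index files are present and prints the missing names otherwise. It only inspects file names; it does not verify that the index contents are valid.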

> Hadoop archives created with Tez miss index files
> -------------------------------------------------
>
>                 Key: TEZ-4415
>                 URL: https://issues.apache.org/jira/browse/TEZ-4415
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.9.2
>            Reporter: Christophe Préaud
>            Priority: Minor
>
> When a hadoop archive is created with Tez, the _index and _masterindex files are not created:
> {code:java}
> # create hadoop archive with Tez
> hadoop archive -D mapreduce.framework.name=yarn-tez -archiveName data.har -p /user/preaudc/data /user/preaudc 
> (...)
> 22/05/23 13:04:39 INFO client.TezClient: Tez Client Version: [ component=tez-api, version=0.9.2, revision=10cb3519bd34389210e6511a2ba291b52dcda081, SCM-URL=scm:git:https://gitbox.apache.org/repos/asf/tez.git, buildTime=2019-03-19T20:44:07Z ]
> (...)
> # _index and _masterindex files are not created
> hdfs dfs -ls /user/preaudc/data.har
> Found 2 items
> -rw-r--r--   3 preaudc preaudc          0 2022-05-23 13:06 /user/preaudc/data.har/_SUCCESS
> -rw-r--r--   3 preaudc preaudc 2537147461 2022-05-23 13:06 /user/preaudc/data.har/part-0
> # the hadoop archive is thus unreadable
> hdfs dfs -ls har:/user/preaudc/data.har
> ls: Invalid path for the Har Filesystem. No index file in har:/user/preaudc/data.har{code}


