
Hive: optimizing inserts into dynamic partition tables and dynamic-partition HDFS directories with DISTRIBUTE BY

Published: 2023-01-14 11:32:08 | Category: Big Data | Source: unknown

Inserting data into dynamic partitions can create a large number of partitions in a short time (each map task may try to open many partitions, so the total attempted can far exceed the number of distinct partition-column values), consuming excessive resources. Hive therefore provides the following three protective parameters.
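The three safety parameters referred to above are Hive's standard limits on dynamic-partition creation. The values shown here are the common defaults, but defaults vary across Hive versions, so check your deployment before relying on them:

```sql
-- Maximum dynamic partitions a single mapper or reducer may create
SET hive.exec.max.dynamic.partitions.pernode=100;
-- Maximum dynamic partitions allowed across the whole statement
SET hive.exec.max.dynamic.partitions=1000;
-- Maximum HDFS files a job may create (each partition implies at least one file)
SET hive.exec.max.created.files=100000;
```

When any of these limits is exceeded, the job is killed with a fatal error like the one shown below.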

Troubleshooting and best practices:

    beeline> set hive.exec.dynamic.partition.mode=nonstrict;
    beeline> FROM page_view_stg pvs
          INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
                 SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip,
                        from_unixtime(pvs.viewTime, 'yyyy-MM-dd') dt, pvs.country;
...
2010-05-07 11:10:19,816 Stage-1 map = 0%,  reduce = 0%
[Fatal Error] Operator FS_28 (id=41): fatal error. Killing the job.
Ended Job = job_201005052204_28178 with errors
...

The problem with this is that each mapper receives a random set of rows, so it is very likely that the number of distinct (dt, country) pairs it sees will exceed the limit of hive.exec.max.dynamic.partitions.pernode. One way around this is to group the rows by the dynamic partition columns in the mappers and distribute them to the reducers, where the dynamic partitions are then created. This significantly reduces the number of distinct dynamic partitions each task must handle. The above example query could be rewritten to:

beeline> set hive.exec.dynamic.partition.mode=nonstrict;
beeline> FROM page_view_stg pvs
      INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
             SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip,
                    from_unixtime(pvs.viewTime, 'yyyy-MM-dd') dt, pvs.country
             DISTRIBUTE BY dt, country;

This query generates a MapReduce job rather than a map-only job. The SELECT clause is converted into a plan executed by the mappers, and their output is distributed to the reducers based on the value of the (dt, country) pairs. The INSERT clause becomes the plan in the reducers, which write out the dynamic partitions.
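The effect of DISTRIBUTE BY can be illustrated outside Hive: rows are routed to reducers by hashing the distribution columns, so all rows sharing one (dt, country) pair land on a single reducer, and each reducer creates only the partitions for the keys it owns. A minimal Python sketch of this routing (the hash function, reducer count, and row shape are illustrative, not Hive's actual internals):

```python
from collections import defaultdict

def distribute(rows, num_reducers):
    """Route each row to a reducer by hashing its (dt, country) key,
    mimicking what DISTRIBUTE BY dt, country does between map and reduce."""
    reducers = defaultdict(list)
    for row in rows:
        key = (row["dt"], row["country"])
        reducers[hash(key) % num_reducers].append(row)
    return reducers

rows = [
    {"dt": "2010-05-07", "country": "US", "url": "/a"},
    {"dt": "2010-05-07", "country": "US", "url": "/b"},
    {"dt": "2010-05-07", "country": "CA", "url": "/c"},
    {"dt": "2010-05-08", "country": "US", "url": "/d"},
]

reducers = distribute(rows, num_reducers=4)

# Each distinct (dt, country) pair is handled by exactly one reducer,
# so no reducer opens more dynamic partitions than the keys routed to it.
for rid, bucket in sorted(reducers.items()):
    partitions = {(r["dt"], r["country"]) for r in bucket}
    print(rid, sorted(partitions))
```

Note that hashing balances keys, not rows: if one (dt, country) pair dominates the data, its reducer becomes a hotspot, which is the usual trade-off of DISTRIBUTE BY.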

In practice the situation is often simpler than this, and DISTRIBUTE BY is not needed. For example, a daily scheduled job that processes the previous day's data for a single province, with date and province as the two partition columns, only ever writes to one partition per run, so neither the mappers nor the reducers create too many partitions.

