Monday, 18 June 2018

How to create Hive Partition Table

Hive Partitions is a way to organizes tables into partitions by dividing tables into different parts based on partition keys.

Here I have a data file "employee_ctc_data.txt" file which has following data:


1)Creation of Table employee_ctc


create table employee_ctc(Empid int, Name string,Age int, Gender string, Department string, ctc int)
row format delimited
fields terminated by ',';




2)Loading data into created table employee_ctc


Load data local inpath '/home/amir/employee_ctc_data.txt' into table employee_ctc;

3)Creation of partition table


create table dept_part(Empid int, Name string,Age int, Gender string, ctc int) PARTITIONED BY(Department string);




4)For partition we have to set this property


set hive.exec.dynamic.partition.mode=nonstrict

5)Loading data into partition table


INSERT OVERWRITE TABLE dept_part PARTITION(Department)
SELECT Empid,Name,Age,Gender,ctc,Department from  employee_ctc;




Now you can check whether partitions have been created on HDFS System. Actual processing and formation of partition tables based on Department as partition key



Here you can see 4 partition tables have been created as we have 4 different types of Department in the data file.

6) Querying the data from partition tables:


select * from dept_part where Department = 'Finance';
select count(*) from dept_part where Department = 'Finance';

NOTE: The main purpose of using partition is to save the Query time. Instead of going through the entire dataset to find the required data we can just query the data based on partition table.

Example:

i) select * from employee_ctc where Department = 'Finance';



ii) select * from dept_part where Deparment = 'Finance';



As you can clearly see the time difference between execution of both the queries.

That is why, partition in Hive table is important.















No comments:

Post a Comment