Apache Hive is a data warehouse infrastructure tool built to process structured data in Hadoop. It facilitates data summarization, the analysis of large datasets, and ad-hoc queries. It provides an easy way to impose structure on large volumes of raw data and to execute SQL-like queries against that data, and it integrates easily with traditional data center technologies through its JDBC/ODBC interfaces.

  • In Hive, data is divided through the bucketing process.
  • Hive is designed for managing and querying structured data stored in tables.
  • Hive provides features such as UDFs, and it can increase database performance through effective query optimization.
  • The SQL-inspired queries of Hive reduce the complexity of MapReduce programming and reuse familiar relational database concepts such as rows, columns, schemas, and tables, which makes Hive easy to learn.
  • Hive can use directories to partition data, since Hadoop works on flat files. This improves the performance of database queries.

Q1. Explain the difference between Apache Pig and Hive.

Ans. Following are the differences between Apache Pig and Hive:

Apache Pig                                    | Apache Hive
It is a procedural data flow language         | It is a declarative, SQL-like language (HiveQL)
It supports the Avro file format              | It does not support the Avro file format natively
It is used by researchers and programmers     | It is mainly used by data analysts
It works on the client side of a cluster      | It is used for creating reports
It does not have a fixed metadata database    | It uses a variant of SQL DDL to define tables

Q2. What is the Hive variable? How do we set a variable?

Ans. Hive variables are similar to variables in other programming languages. They are set within the Hive environment and can be referenced by Hive scripts and queries.

A variable is set by using the SET command:

set foo=bar;

set system:foo=bar;

Similarly, you can set a variable on the command line for the hiveconf namespace:

    beeline --hiveconf foo=bar
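
A variable set in the hiveconf namespace can then be substituted into a query with the ${hiveconf:...} syntax. A minimal sketch, assuming an existing table named employee (the table and variable names are illustrative):

set hiveconf:tbl=employee;

SELECT * FROM ${hiveconf:tbl};   -- expands to: SELECT * FROM employee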

Q3. What are the different modes of Hive?

Ans. There are three metastore modes in Hive:

  • Embedded Metastore
  • Local Metastore
  • Remote Metastore

Q4. Explain the difference between HBase and Hive.

Ans. Following are the differences between HBase and Hive:

HBase                                         | Hive
It does not allow execution of SQL queries    | It allows execution of most SQL queries (via HiveQL)
It is a NoSQL database                        | It is a data warehouse framework
It runs on top of HDFS                        | It runs on top of Hadoop MapReduce
It is free from the schema model              | It has a schema model
It follows real-time processing               | It does not follow real-time processing (it is batch-oriented)

Q5. What is the use of partition in Hive?

Ans. A Hive partition is a division of a table's data based on the values of one or more partition keys; Hive organizes tables into partitions so that similar data is grouped together.

Partitioning is only helpful when the table has partition keys. These keys determine how the data is stored: each partition becomes a subdirectory inside the table's directory.
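
A minimal sketch of a partitioned table (the table, column names, and partition value are illustrative):

CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (country STRING);

-- each distinct country value becomes its own subdirectory, e.g. .../sales/country=IN/
INSERT INTO sales PARTITION (country='IN') VALUES (1, 100.0);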

Q6. What are the data types supported by Hive?

Ans. Hive supports two categories of data types:

  1. Primitive data types
      • Numeric data types
      • String data types
      • Date/Time data types
      • Miscellaneous data types
  2. Complex data types
      • Arrays
      • Maps
      • Structs
      • Union
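
A brief sketch showing complex types in a table definition (all names are illustrative):

CREATE TABLE user_profile (
  name STRING,
  phones ARRAY<STRING>,
  attrs MAP<STRING, STRING>,
  address STRUCT<city:STRING, zip:STRING>
);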

 

Q7. What is the precedence order in Hive configuration?

Ans. There is a precedence hierarchy for setting properties:

  • The Hive SET command
  • The command line --hiveconf option
  • hive-site.xml
  • hive-default.xml
  • hadoop-site.xml (or, equivalently, hdfs-site.xml, core-site.xml, and mapred-site.xml)
  • hadoop-default.xml (or, equivalently, hdfs-default.xml, core-default.xml, and mapred-default.xml)
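
For instance, a property passed with --hiveconf overrides the value in hive-site.xml, while SET inside the session overrides both (the property below is just an example):

hive --hiveconf hive.exec.parallel=true

hive> SET hive.exec.parallel=false;   -- takes precedence for the rest of this session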

 

Q8. Is it possible to change the default location of a managed table?

Ans. Yes, it is possible to change the default location of a managed table. We can specify it at creation time with the LOCATION '<hdfs_path>' clause.
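
A minimal sketch (the table name and HDFS path are illustrative):

CREATE TABLE managed_emp (id INT, name STRING)
LOCATION '/user/hive/custom_warehouse/managed_emp';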

Q9. Explain the mechanism for connecting applications when we run Hive as a server.

Ans. Applications can connect through the following mechanisms:

  • Thrift client: Using the Thrift client, we can call Hive commands from different programming languages such as Java, Python, C++, and Ruby
  • JDBC driver: It enables access to data and supports the Type 4 (pure Java) JDBC driver
  • ODBC driver: It allows applications that support the ODBC protocol to connect to Hive
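
For example, a JDBC connection can be opened from Beeline (the host, port, and user below are placeholders):

beeline -u jdbc:hive2://localhost:10000/default -n hive_user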

Q10. How to remove header rows from a table in Hive?

Ans. By using the TBLPROPERTIES clause, we can skip N rows from the top or bottom of a text file without loading them into Hive. The TBLPROPERTIES clause can set several table-level features as per our needs, and it is useful when files are generated with additional header or footer records.

For example, suppose a file begins with the following header records:

System=…

Version=…

Sub-version=…

To skip these header lines when querying the table, we can use the following table property (the comma delimiter below is an assumption for illustration):

CREATE EXTERNAL TABLE employee (
  name STRING,
  job STRING,
  dob STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="3");   -- "3" matches the three header records above

Q11. Explain the need for buckets in Apache Hive.

Ans. The concept of bucketing provides a way of dividing Hive table data into multiple files or directories. Partitioning alone gives effective results only if –

  • There are a limited number of partitions
  • Partitions are of almost similar sizes

When these conditions do not hold, Hive provides the bucketing concept to solve the problem. It is an effective way to decompose tables into manageable, roughly equal parts.
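
A minimal sketch of a bucketed table (the names and bucket count are illustrative):

SET hive.enforce.bucketing=true;   -- required on older Hive versions so inserts honor the bucket definition

CREATE TABLE orders_bucketed (id INT, amount DOUBLE)
CLUSTERED BY (id) INTO 4 BUCKETS;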

Q12. How can you recursively access subdirectories?

Ans. We can access subdirectories recursively by using the following command:

hive> SET mapred.input.dir.recursive=true;

hive> SET hive.mapred.supports.subdirectories=true;

A Hive table can then point at a higher-level directory, which suits a directory structure such as:

/data/country/state/city/

Q13. How does Hive distribute rows into buckets?

Ans. Rows can be divided into buckets by using:

hash_function (bucketing_column) modulo (num_of_buckets)

Here, the result of this expression is the bucket number for the row.

The hash_function used depends on the data type of the bucketing column. For an integer column, the hash is simply the column value:

hash_function (int_type_column) = value of int_type_column
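
A short worked example (the values are illustrative):

-- with num_of_buckets = 4 and an INT bucketing column equal to 13:
-- hash_function(13) modulo 4 = 13 modulo 4 = 1, so the row goes to bucket 1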

Q14. What are the commonly used Hive services?

Ans. Following are the commonly used Hive services:

  • Command-line interface (cli)
  • rcfilecat: a tool for printing the contents of an RCFile
  • HiveServer (hiveserver)
  • Hive Web Interface (hwi)
  • Metastore (metastore)
  • jar: runs a Java application with both Hadoop and Hive classes on the classpath

Q15. Is it possible to change the settings within the Hive session?

Ans. Yes, it is possible to change the settings within a Hive session by using the SET command. It helps change Hive job settings for a specific query.

For example, the following command ensures that tables are populated according to their bucketing definition:

hive> SET hive.enforce.bucketing=true;

By using the SET command, we can also see the current value of any property:

hive> SET hive.enforce.bucketing;

hive.enforce.bucketing=true

This form does not show Hadoop's own defaults, so to list every property, including the Hadoop defaults, we can use:

SET -v

Q16. Mention the components used in the Hive query processor.

Ans. Following are the components used in the Hive query processor:

  • Parse and Semantic Analysis (ql/parse)
  • Map/Reduce Execution Engine (ql/exec)
  • Optimizer (ql/optimizer)
  • Sessions (ql/session)
  • Hive Function Framework (ql/udf)
  • Plan Components (ql/plan)
  • Type Interfaces (ql/typeinfo)
  • Metadata Layer (ql/metadata)
  • Tools (ql/tools)

Q17. What are the Trim and reverse functions in Hive?

Ans. The TRIM function removes the leading and trailing spaces from a string.

Example:

TRIM(' NAUKRI ');

Output:

NAUKRI

To remove only the leading spaces:

LTRIM(' NAUKRI');

To remove only the trailing spaces:

RTRIM('NAUKRI ');

The REVERSE function reverses the characters of a string.

Example:

REVERSE('NAUKRI');

Output:

IRKUAN

Q18. Explain the default metastore of Hive.

Ans. By default, Hive provides an embedded Derby database instance that can be used by only one user at a time, and this is where it stores its metadata. If you run a Hive query using the default Derby database, a subdirectory named metastore_db is created in your current directory, and the metastore is created there if it does not already exist. The property of interest here is javax.jdo.option.ConnectionURL.

Its default value is jdbc:derby:;databaseName=metastore_db;create=true. This value indicates that you are using embedded Derby as your Hive metastore and that the metastore's location is metastore_db.
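
To check which metastore a session is actually using, the property can be inspected from within Hive:

hive> SET javax.jdo.option.ConnectionURL;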

Q19. Is multiline comment supported in Hive?

Ans. No. Hive supports only single-line comments, which begin with --.
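
For example:

-- this is a valid single-line comment
SELECT 1;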

Q20. What is the possible way to improve the performance with ORC format tables in Hive?

Ans. We can improve performance by storing Hive data in the ORC (Optimized Row Columnar) file format, which stores the data in a highly efficient manner. Performance also improves when ORC files are used while writing, reading, and processing data.

SET hive.compute.query.using.stats=true;

SET hive.stats.dbclass=fs;

CREATE TABLE orc_table (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ':'
LINES TERMINATED BY '\n'
STORED AS ORC;