Big Data interviews may be conducted on general lines (wherein you must have a general idea about the popular Big Data frameworks and tools) or they may be focused on a particular framework or tool. Today, we are going to focus on one widely used Big Data framework – Apache Hive.
We have created this list of Apache Hive interview questions to help you get a better idea about the kind of questions that employers usually ask during Hadoop interviews pertaining to Hive.
So, if you want to nail your Hive interview, keep reading till the end!
- What is Apache Hive?
Apache Hive is a data warehousing framework built on top of Hadoop. It is primarily used for analyzing structured and semi-structured data. Hive is designed to project structure onto the data and execute queries written in HQL (Hive Query Language), a language similar to SQL. The Hive compiler then transforms these queries into MapReduce jobs (or, in later versions, jobs for other engines such as Tez or Spark).
- What kind of applications can Hive support?
Hive exposes a Thrift-based server, so it can support client applications written in languages such as Python, Java, C++, Ruby, and PHP.
- What do you mean by a Metastore? Why does Hive not store the metadata in HDFS?
The Metastore is a repository in Hive that stores metadata. It does so by using an RDBMS together with an open-source ORM (Object-Relational Mapping) layer called DataNucleus, which converts the object representation into the relational schema and vice versa.
Hive stores metadata in an RDBMS rather than in HDFS because read/write operations on HDFS are time-consuming. An RDBMS has the advantage here because it offers low-latency access.
- Differentiate between Local and Remote Metastore.
A local metastore runs in the same JVM as the Hive service. It connects to a database running in a separate JVM, either on the same machine or on a remote machine. In contrast, a remote metastore runs in its own JVM, separate from the one in which the Hive service runs.
- What do you mean by a Partition in Hive? What is its importance?
In Hive, tables are organized into partitions that group similar data together according to the value of a column, known as the partition key. Physically, a partition is a sub-directory in the table directory. A table may have more than one partition key, in which case each partition is identified by a combination of key values.
Partitioning gives a Hive table finer granularity. This reduces query latency because a query that filters on the partition key scans only the relevant partitions instead of the whole dataset.
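To make the directory layout and the pruning idea concrete, here is a minimal Python sketch. It is a toy model, not Hive code: the table location, the partition keys (`country`, `dt`), and the helper names `partition_path` and `prune` are all assumptions made for illustration. It only shows how key=value sub-directories are formed and how a filter on partition keys narrows the set of directories to scan.

```python
from pathlib import PurePosixPath

# Hypothetical table location, chosen for illustration only.
TABLE_DIR = PurePosixPath("/user/hive/warehouse/sales")

def partition_path(table_dir, spec):
    """Build the sub-directory for a partition spec such as
    {"country": "IN", "dt": "2024-01-01"} using key=value segments."""
    path = table_dir
    for key, value in spec.items():
        path = path / f"{key}={value}"
    return path

def prune(partitions, **wanted):
    """Keep only partitions whose keys match every requested value,
    mimicking how a filter on partition columns limits the scan."""
    return [p for p in partitions
            if all(p.get(k) == v for k, v in wanted.items())]

# Hypothetical partitions of the toy table.
partitions = [
    {"country": "IN", "dt": "2024-01-01"},
    {"country": "IN", "dt": "2024-01-02"},
    {"country": "US", "dt": "2024-01-01"},
]

# A filter on country = 'IN' leaves two of the three directories to scan.
to_scan = [str(partition_path(TABLE_DIR, p))
           for p in prune(partitions, country="IN")]
for path in to_scan:
    print(path)
```

In this toy run, the US partition directory is never touched, which is the effect that lowers query latency on a partitioned table.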
- What is a Hive Variable?
A Hive variable is a variable created in the Hive environment that Hive scripts can reference. It passes values into Hive queries when the query starts executing, for example in a script that is run with the source command.
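The substitution step can be pictured with a short Python sketch. This is only a model of textual replacement under stated assumptions, not Hive's implementation: the function name `substitute`, the variable names `tbl` and `run_date`, and the decision to leave unknown references untouched are all choices made for this example. The `${hivevar:name}` reference form is the one Hive uses for variables in its `hivevar` namespace.

```python
import re

# Matches references of the form ${hivevar:name}.
_VAR = re.compile(r"\$\{hivevar:(\w+)\}")

def substitute(query, variables):
    """Replace each ${hivevar:name} reference with its value.
    Unknown names are left untouched (a choice made for this sketch)."""
    def repl(match):
        name = match.group(1)
        return variables.get(name, match.group(0))
    return _VAR.sub(repl, query)

# Hypothetical query and variable values.
query = "SELECT * FROM ${hivevar:tbl} WHERE dt = '${hivevar:run_date}'"
resolved = substitute(query, {"tbl": "sales", "run_date": "2024-01-01"})
print(resolved)
```

The point of the sketch is that the values are spliced into the query text before the query runs, so the same script can be reused with different inputs.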
- What kind of data warehouse applications is Hive suitable for?
The design constraints of Hadoop and HDFS put certain limitations on Hive’s abilities. Hive also lacks the features required for OLTP (Online Transaction Processing). It is best suited for data warehouse applications over massive data sets where:
- The data being analyzed is relatively static.
- Fast response times are not required.
- The data does not change rapidly.
- What is a Hive Index?
A Hive index is a query optimization technique. It is used to speed up access to a specific column or set of columns in a Hive database. With an index in place, the database system does not need to read every row in a table to find the requested data.
- Why do you need HCatalog?
HCatalog is required for sharing data structures with external systems. It provides access to the Hive metastore, so that external tools can read data from and write data to the Hive data warehouse.
- Name the components of a Hive query processor.
The components of a Hive query processor are:
- Logical Plan Generation.
- Physical Plan Generation.
- Execution Engine.
- UDFs and UDAFs.
- Semantic Analyzer.
- Type Checking.
- How do ORC format tables help Hive to enhance the performance?
The ORC (Optimized Row Columnar) file format stores Hive data efficiently and was designed to overcome many limitations of the other Hive file formats. Using ORC files improves performance when Hive reads, writes, and processes data.
- What is the function of the Object-Inspector?
In Hive, the Object-Inspector helps to analyze the internal structure of a row object and individual structure of columns. Furthermore, it also offers ways to access complex objects that can be stored in different formats in memory.
- What’s the difference between Hive and HBase?
The key differentiating points between Hive and HBase are:
- Hive is a data warehouse framework whereas HBase is a NoSQL database.
- Hive can run most SQL-style queries through HQL, whereas HBase has no native SQL interface.
- Hive does not support record-level insert, update, and delete operations on ordinary (non-transactional) tables, whereas HBase supports these operations.
- Hive runs on top of MapReduce, but HBase runs on top of HDFS.
- What is a Managed Table and an External Table?
When you drop a managed table, both the metadata and the table data are deleted from the Hive warehouse directory. When you drop an external table, only the metadata associated with the table is deleted, while the table data is retained in HDFS.
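The difference in drop behavior can be sketched with a small in-memory model in Python. Everything here is hypothetical and for illustration only: the `Table` and `ToyWarehouse` classes, the table names, and the paths are assumptions, and two dictionaries stand in for the metastore and for HDFS. It is not how Hive is implemented.

```python
from dataclasses import dataclass

@dataclass
class Table:
    name: str
    location: str
    external: bool

class ToyWarehouse:
    """In-memory stand-in for the metastore (metadata) and HDFS (data)."""

    def __init__(self):
        self.metastore = {}  # table name -> Table metadata
        self.files = {}      # location -> list of rows

    def create(self, table, rows):
        self.metastore[table.name] = table
        self.files[table.location] = list(rows)

    def drop(self, name):
        table = self.metastore.pop(name)    # metadata is always removed
        if not table.external:
            self.files.pop(table.location)  # data is removed only if managed

wh = ToyWarehouse()
wh.create(Table("managed_t", "/warehouse/managed_t", external=False), [1, 2])
wh.create(Table("external_t", "/data/external_t", external=True), [3, 4])

wh.drop("managed_t")
wh.drop("external_t")

print(sorted(wh.files))  # only the external table's data location remains
print(wh.metastore)      # both metadata entries are gone
```

After both drops, the metadata for both tables is gone, but the rows at the external table's location are still present, which mirrors the behavior described above.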
- Name the different components of a Hive architecture.
There are 5 components of a Hive Architecture:
- User Interface – It allows the user to submit queries and other operations to the Hive system. The user interface supports the Hive web UI, the Hive command line, and Hive HDInsight.
- Driver – It creates a session handle for the queries and then sends the queries to the compiler, which creates an execution plan for them.
- Metastore – It stores the structural information (metadata) for the tables and partitions in the warehouse, including their attributes. On receiving a metadata request, it sends the metadata to the compiler, which uses it to plan the queries.
- Compiler – It parses the queries, performs semantic analysis on the different query blocks and query expressions, and generates the execution plan.
- Execution Engine – While the compiler creates the execution plan, the execution engine executes it. It also manages the dependencies between the various stages of the plan.
Obviously, there is more to Hive than just these 15 questions. These are just the basic concepts that’ll help you ease into learning about Hive.
If you are interested in learning more about Big Data, check out our PG Diploma in Software Development Specialization in Big Data program, which is designed for working professionals and provides 7+ case studies & projects, coverage of 14 programming languages & tools, practical hands-on workshops, more than 400 hours of rigorous learning, and job placement assistance with top firms.
How is Hive different from Pig?
Hive offers a declarative, SQL-like language that is primarily used to generate reports. Pig is a procedural data flow language that is used mainly for programming. Hive is used extensively by data analysts and executes on the server side of the cluster. In contrast, Pig is generally used by programmers and researchers and executes on the client side. Pig has no metadata database; data types and schemas are all defined in the script itself. Hive uses a variation of SQL DDL (Data Definition Language) to define tables in advance and stores the table and schema details in a database. Since Hive is very similar to SQL, it is easier to learn than Pig.
What are the prerequisites to learning Hadoop?
Hadoop is one of the most popular data analytics platforms today. You can learn Hadoop quickly if you have a fundamental knowledge of programming. To understand the principal concepts of the Hadoop ecosystem, you need to know Linux and Java. You do not need to be an expert in either, but a basic understanding helps you grasp Hadoop concepts, because Hadoop is typically configured and run on a Linux-based operating system and is itself written in Java. Java skills can also help you build a lucrative career in Hadoop and Big Data. Additionally, knowing Perl, C, Python, or Ruby will help you write Hadoop functions.
Why do you need Hive?
The primary concept of Big Data lies in dealing with enormous volumes of data that traditional software applications cannot support. So applications that can handle the sheer volume and support scalability are needed for Big Data storage and analytics. Apache Hive is a fault-tolerant, distributed platform that allows data analytics operations on a massive scale. It allows users to write, read and efficiently manage petabytes of data with the help of SQL. Hive is built on Hadoop to store and process massive datasets and uses a SQL-like interface that makes learning easier.