Apache Hive – Definition & Overview

What is Apache Hive?

Apache Hive is open-source software built for use in data warehousing. It allows for the analyzing and querying of vast amounts of data. It was created to use with Hadoop and has become one of the most popular methods for SQL queries over petabytes of data. Data analysts can then query and analyze data through Hive to turn that data into actionable insights for a business. Apache Hive is optimized for performing standard data warehousing tasks. These include extract/transform/load (ETL) and reporting.

Apache Hive creates a SQL-like interface, which uses HiveQL to query data stored in Hadoop. Hive’s three main functions are summarizing, querying, and analyzing data. With regards to data warehouses it helps with reading, writing, and managing large datasets. These are generally located in distributed storage.

Hive was first created because of the difficulty users had with using Java programming to query data. Apache Hive set out to make query development easier. This would make Apache’s Hadoop easier to use, especially for organizations using unstructured data. Another advantage is that response times using Hive are decreased. This is because of its use of indexing and compressed data. Query time is also reduced by storing metadata in a relational database management system. Recent versions of Hive have reported analytics processing of 100 million rows per second, per node.

The flexibility of Hive is one of its most attractive features. This means that there is not a set format that must be used for storing data. Instead, many connectors for different formats are built into the software. It can then put a uniform structure on data arriving in varied formats.

Hive has a number of component modules which perform different tasks. These include the driver, compiler, and executor. These define the different stages of carrying out a task. Another module is the metastore, which holds metadata in order to speed up query requests.