    Glossary

    What is Hadoop?

    Hadoop is a big data processing framework that provides a reliable, scalable system for storing and processing data.

    Hadoop has changed how we process, store, and use data. Compared with traditional tools such as an RDBMS, Hadoop has proven able to deal with the core Big Data challenges:

    • Variety of data: Hadoop can store, process, and visualize structured, semi-structured, and unstructured data.
    • Volume of data: Hadoop was specifically designed to handle petabytes of data.
    • Velocity of data: A major advantage of Hadoop is its ability to process petabytes of data quickly; the same workload takes far less time in Hadoop than in tools such as an RDBMS.

    What are some advantages of Hadoop?

    1. Open Source

    Apache Hadoop is an open-source project, so its source code is freely available and can be modified to fit business requirements. Hadoop is also offered in commercial distributions, such as Cloudera and Hortonworks.

    2. Easily scalable

    A Hadoop cluster consists of a number of machines, and scalability is one of Hadoop's key features. The cluster can grow as needed, because new nodes can be added without any downtime. In horizontal scaling, new machines are added to the cluster; in vertical scaling, existing machines are upgraded with more resources such as RAM and hard disks.
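
    As an illustration, the minimal sketch below uses the Java HDFS client to list the DataNodes the NameNode currently knows about, which is one way to confirm that a newly added node has joined the cluster. It assumes core-site.xml and hdfs-site.xml are on the classpath, and getDataNodeStats() may require administrator privileges.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.hdfs.DistributedFileSystem;
        import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

        public class ListDataNodes {
            public static void main(String[] args) throws Exception {
                // Picks up the cluster address from the config files on the classpath.
                FileSystem fs = FileSystem.get(new Configuration());
                if (fs instanceof DistributedFileSystem) {
                    DistributedFileSystem dfs = (DistributedFileSystem) fs;
                    // Reports every DataNode registered with the NameNode,
                    // so a freshly added node should show up here.
                    for (DatanodeInfo node : dfs.getDataNodeStats()) {
                        System.out.println(node.getHostName()
                                + " capacity=" + node.getCapacity());
                    }
                }
                fs.close();
            }
        }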

    3. A fault-tolerant system

    Fault tolerance is the most salient feature of Hadoop. By default, HDFS assigns a Replication Factor of 3 to each and every block, so each block is stored as three copies on different machines in the cluster. If one copy is lost due to machine failure, two copies remain and the data is still usable. This is how Hadoop achieves fault tolerance.
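
    A minimal sketch of inspecting and changing the replication factor through the Java FileSystem API is shown below; the file path /data/example.txt is hypothetical.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class ReplicationCheck {
            public static void main(String[] args) throws Exception {
                // Assumes core-site.xml / hdfs-site.xml are on the classpath.
                FileSystem fs = FileSystem.get(new Configuration());
                Path file = new Path("/data/example.txt"); // hypothetical file

                // Read the replication factor currently assigned to the file.
                short replication = fs.getFileStatus(file).getReplication();
                System.out.println("Replication factor: " + replication);

                // Request a different replication factor for this file only;
                // the cluster-wide default (3) is set by dfs.replication.
                fs.setReplication(file, (short) 5);
                fs.close();
            }
        }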

    4. Independent Schema

    Hadoop can process many different types of data. It stores a variety of file formats and works with structured, semi-structured, and unstructured data alike, because no schema has to be declared before the data is written.
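
    The sketch below illustrates this schema-on-read idea: HDFS accepts all three kinds of data as plain bytes, with no schema declared at write time. The paths and file contents are made up for the example.

        import java.nio.charset.StandardCharsets;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class StoreAnyFormat {
            public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                // HDFS stores files as raw bytes; no schema is required up front.
                write(fs, "/raw/users.csv",  "id,name\n1,Alice\n");              // structured
                write(fs, "/raw/event.json", "{\"type\":\"click\",\"ts\":1}\n"); // semi-structured
                write(fs, "/raw/notes.txt",  "free-form log text\n");            // unstructured
                fs.close();
            }

            private static void write(FileSystem fs, String path, String body)
                    throws Exception {
                try (FSDataOutputStream out = fs.create(new Path(path))) {
                    out.write(body.getBytes(StandardCharsets.UTF_8));
                }
            }
        }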

    5. Low latency and high throughput

    Throughput is the amount of work done per unit of time; low latency means processing data with little or no delay. Because Hadoop is built on distributed storage and parallel processing, each block of data is processed independently and simultaneously. Additionally, the code is moved to the data in the cluster rather than the data to the code. High throughput and low latency are the results of these two factors.
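
    The classic word-count MapReduce program illustrates this parallel model: every input split is mapped independently, and the framework runs map tasks on the nodes that already hold the data. Below is a minimal sketch of the mapper and reducer.

        import java.io.IOException;
        import java.util.StringTokenizer;

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        public class WordCount {

            // Each map task processes one block-sized input split, in parallel
            // with every other map task in the cluster.
            public static class TokenizerMapper
                    extends Mapper<LongWritable, Text, Text, IntWritable> {
                private static final IntWritable ONE = new IntWritable(1);
                private final Text word = new Text();

                @Override
                protected void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                    StringTokenizer tokens = new StringTokenizer(value.toString());
                    while (tokens.hasMoreTokens()) {
                        word.set(tokens.nextToken());
                        context.write(word, ONE); // emit (word, 1)
                    }
                }
            }

            // Reduce tasks receive all counts for a given word and sum them.
            public static class SumReducer
                    extends Reducer<Text, IntWritable, Text, IntWritable> {
                @Override
                protected void reduce(Text key, Iterable<IntWritable> values,
                        Context context) throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable v : values) {
                        sum += v.get();
                    }
                    context.write(key, new IntWritable(sum));
                }
            }
        }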

    What components does Hadoop consist of?

    Hadoop uses distributed storage and parallel processing to store and manage Big Data, and it is one of the most widely used tools for doing so. Hadoop consists of three components:

    • Hadoop HDFS - Hadoop Distributed File System (HDFS) is Hadoop's storage system.
    • Hadoop MapReduce - Hadoop MapReduce is its processing unit.
    • Hadoop YARN - Hadoop YARN (Yet Another Resource Negotiator) is its resource management unit.
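
    The three components come together when a job is submitted: the driver sketch below reads input from HDFS, runs the word-count mapper and reducer from the earlier sketch as a MapReduce job, and leaves it to YARN to schedule the tasks across the cluster.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCountDriver {
            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "word count");
                job.setJarByClass(WordCountDriver.class);

                // Mapper and reducer from the word-count sketch above.
                job.setMapperClass(WordCount.TokenizerMapper.class);
                job.setCombinerClass(WordCount.SumReducer.class);
                job.setReducerClass(WordCount.SumReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);

                // Input and output paths live in HDFS.
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));

                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }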
