Hadoop Interview Questions and Answers

Find 100+ Hadoop interview questions and answers to assess candidates’ skills in HDFS, MapReduce, YARN, Hive, and big data ecosystem fundamentals.
By WeCP Team

As organizations process and analyze massive volumes of structured and unstructured data, recruiters must identify Hadoop professionals who can build and manage scalable big data ecosystems. Hadoop remains a foundational technology for distributed storage, batch processing, and large-scale analytics across enterprise data platforms.

This resource, "100+ Hadoop Interview Questions and Answers," is tailored for recruiters to simplify the evaluation process. It covers a wide range of topics—from Hadoop fundamentals to advanced ecosystem components, including HDFS, YARN, MapReduce, and Hadoop performance tuning.

Whether you're hiring Big Data Engineers, Hadoop Developers, Data Engineers, or Analytics Engineers, this guide enables you to assess a candidate’s:

  • Core Hadoop Knowledge: HDFS architecture, NameNode/DataNode roles, block storage, replication, YARN components, and MapReduce basics.
  • Advanced Skills: Hadoop ecosystem tools (Hive, HBase, Pig, Sqoop, Oozie), resource management, cluster optimization, and handling fault tolerance.
  • Real-World Proficiency: Designing data pipelines, processing large datasets, integrating Hadoop with Spark/Kafka, and maintaining reliable, scalable big data clusters.

For a streamlined assessment process, consider platforms like WeCP, which allow you to:

  • Create customized Hadoop assessments tailored to data engineering and big data analytics roles.
  • Include hands-on tasks such as writing MapReduce jobs, querying data with Hive, or troubleshooting cluster issues.
  • Proctor exams remotely while ensuring integrity.
  • Evaluate results with AI-driven analysis for faster, more accurate decision-making.

Save time, enhance your hiring process, and confidently hire Hadoop professionals who can build scalable, fault-tolerant, and production-ready big data systems from day one.

Hadoop Interview Questions

Hadoop – Beginner (1–40)

  1. What is Hadoop, and what problem does it solve?
  2. Explain the core components of Hadoop.
  3. What is HDFS, and why is it used?
  4. Explain the concept of blocks in HDFS.
  5. What is replication in HDFS?
  6. What is the default HDFS block size?
  7. Explain the role of NameNode.
  8. Explain the role of DataNode.
  9. What happens if a NameNode fails?
  10. What is Secondary NameNode, and what does it do?
  11. What is MapReduce?
  12. Explain Mapper and Reducer.
  13. What is the difference between InputSplit and Block?
  14. What is YARN in Hadoop?
  15. Explain the functions of ResourceManager.
  16. Explain NodeManager in YARN.
  17. What is a JobTracker?
  18. What is a TaskTracker?
  19. Explain the Hadoop ecosystem.
  20. What is Apache Pig?
  21. What is Apache Hive?
  22. What is a Hive external table?
  23. What is a Hive partition?
  24. What is a Hive bucket?
  25. What is Apache HBase?
  26. What is Zookeeper used for?
  27. What is Sqoop?
  28. What is Flume?
  29. What is the difference between SQL and HiveQL?
  30. What is Hadoop Streaming?
  31. What is SerDe in Hive?
  32. Explain NameNode metadata.
  33. What is a Job in Hadoop?
  34. What is shuffle and sort in MapReduce?
  35. Explain speculative execution.
  36. What is a combiner, and when do we use it?
  37. Explain rack awareness.
  38. What is schema-on-read?
  39. What is schema-on-write?
  40. What are the main Hadoop configuration files?

Hadoop – Intermediate (1–40)

  1. Explain the complete HDFS write path.
  2. Explain the complete HDFS read path.
  3. How does NameNode handle failover (HA architecture)?
  4. What is JournalNode?
  5. Explain Hadoop Federation.
  6. What is the difference between NameNode HA vs Federation?
  7. What is the role of ApplicationMaster in YARN?
  8. Explain ResourceManager Scheduler types (FIFO, Capacity, Fair).
  9. What is a NodeLabel in YARN?
  10. Explain MapReduce execution workflow.
  11. What is InputFormat in MapReduce?
  12. Explain OutputFormat in MapReduce.
  13. What is RecordReader?
  14. What is Writable in Hadoop?
  15. Explain the difference between MapReduce 1 and MapReduce 2 (YARN).
  16. What is a distributed cache in Hadoop?
  17. How does Hadoop ensure data locality?
  18. Explain the difference between HDFS block and InputSplit.
  19. How does speculative execution work internally?
  20. Explain the small file problem in Hadoop.
  21. How to solve the small file problem?
  22. What is sequence file format?
  23. What is Avro, and why is it used?
  24. What is Parquet, and why is it used?
  25. Compare Avro vs Parquet.
  26. What are Hadoop compression codecs?
  27. Difference between Snappy, GZip, and LZO.
  28. What is Hadoop balancer?
  29. What is distcp, and how does it work?
  30. Explain Hadoop security architecture.
  31. What is Kerberos authentication in Hadoop?
  32. What is HDFS snapshot?
  33. What is HDFS fsck?
  34. Explain block report and heartbeat.
  35. What is checkpointing in Hadoop?
  36. Explain the role of Checkpoint Node vs Backup Node.
  37. What is NodeManager local resource management?
  38. Explain YARN container lifecycle.
  39. How does Hadoop handle corrupted blocks?
  40. What is Hadoop capacity planning?

Hadoop – Experienced (1–40)

  1. Explain the internal architecture of HDFS in depth.
  2. Describe the lifecycle of a MapReduce job internally.
  3. Explain the HDFS write pipeline with acknowledgement protocol.
  4. Explain the HDFS read pipeline with checksum verification.
  5. How does the NameNode store metadata in memory (FSImage, EditLogs)?
  6. How does NameNode merge EditLogs during checkpointing?
  7. Explain Zookeeper’s role in NameNode HA failover.
  8. Explain internals of YARN ResourceManager failover.
  9. How does Hadoop ensure consistency (WORM model)?
  10. Explain how MapReduce handles fault tolerance.
  11. What are advanced YARN scheduling strategies?
  12. Explain container allocation algorithm in YARN.
  13. Discuss advanced data locality optimization.
  14. Describe Hadoop’s speculative execution algorithm.
  15. Explain combiner vs reducer internal differences.
  16. Describe custom partitioner design strategies.
  17. Explain the shuffle and sort architecture with spill mechanism.
  18. Explain MapReduce task memory tuning.
  19. How do you optimize MapReduce jobs for performance?
  20. Explain multi-stage MapReduce pipelines.
  21. Describe advanced Hive query optimization on Hadoop.
  22. Explain vectorization in Hive.
  23. Explain ORC file internals (stripes, indexes, metadata).
  24. Explain Parquet file internals (row groups, pages, encodings).
  25. Compare ORC vs Parquet for analytical workloads.
  26. Explain HBase read/write path internally.
  27. Explain LSM tree and compaction in HBase.
  28. What is region splitting and region balancing in HBase?
  29. Explain distributed coordination using Zookeeper in Hadoop.
  30. Explain Erasure Coding in HDFS and its advantages.
  31. Describe HDFS encryption zones.
  32. What is Ranger/Knox, and how do they secure Hadoop?
  33. How does Hadoop manage millions of small files at scale?
  34. Explain performance bottlenecks in Hadoop clusters.
  35. Describe Hadoop cluster monitoring strategies.
  36. What is Hadoop auto-scaling, and how is it implemented?
  37. Explain Hadoop multi-tenant architecture.
  38. How to perform disaster recovery for HDFS across datacenters?
  39. Explain HDFS distcp internals for large migration.
  40. Describe Big Data architecture integrating Hadoop, Spark, Kafka, and NoSQL.

Hadoop Interview Questions and Answers

Beginner (Q&A)

1. What is Hadoop, and what problem does it solve?

Hadoop is an open-source, distributed computing framework developed by the Apache Software Foundation. It is designed to store and process massive volumes of structured, semi-structured, and unstructured data across clusters of commodity hardware.

Traditional data systems like RDBMS fail when data becomes too large (terabytes to petabytes), too fast, or too unstructured. They struggle due to:

  • Limited scalability
  • High storage costs
  • Inability to parallelize workloads
  • Performance bottlenecks
  • Schema rigidity

Hadoop solves these challenges through horizontal scaling, where thousands of cheap machines work together in parallel. Its advantages include:

  • Scalable storage via HDFS
  • Distributed processing via MapReduce
  • Fault tolerance using replication
  • Cost-effectiveness using commodity servers
  • Schema-on-read flexibility for all data formats

In short, Hadoop enables organizations to store and process huge datasets efficiently, reliably, and cost-effectively.

2. Explain the core components of Hadoop.

Hadoop’s architecture consists of four major components, each with a specific role:

1. Hadoop Distributed File System (HDFS)

  • A distributed, scalable storage system.
  • Stores large datasets across multiple nodes.
  • Provides fault tolerance using replication.

2. Yet Another Resource Negotiator (YARN)

  • A cluster resource management layer.
  • Handles scheduling and job execution.
  • Allocates CPU, memory, and compute resources to applications.

3. MapReduce

  • A distributed programming and processing model.
  • Processes large-scale data in parallel across clusters.
  • Divides tasks into Map (filtering, sorting) and Reduce (aggregating, summarizing).

4. Hadoop Common

  • Shared utilities and libraries.
  • Provides essential Java framework, scripts, and configuration files.

Together, these components allow Hadoop to operate as a powerful, distributed big data platform.

3. What is HDFS, and why is it used?

HDFS (Hadoop Distributed File System) is the primary storage layer of Hadoop. It is engineered for storing very large datasets across a distributed cluster in a fault-tolerant way.

HDFS is used because:

  • It stores huge files reliably by splitting them into blocks and distributing them across cluster nodes.
  • High throughput: designed for streaming read/write rather than random access.
  • Fault tolerance: data is replicated, so node failures do not cause data loss.
  • Horizontal scalability: capacity increases simply by adding more nodes.
  • Optimized for batch processing with large datasets.

HDFS ensures that even if multiple nodes fail, data remains safe and accessible.

4. Explain the concept of blocks in HDFS.

HDFS stores files by dividing them into fixed-size blocks (e.g., 128 MB). Each block is stored independently on different DataNodes.

Key characteristics:

  • A file larger than the block size is broken into multiple blocks (for example, a 500 MB file becomes three full 128 MB blocks plus one 116 MB block).
  • A file smaller than the block size still occupies one block entry, but it consumes only its actual size on disk.
  • Blocks enable parallel processing, as different nodes can process different blocks.
  • Blocks allow HDFS to scale effortlessly across thousands of nodes.

Benefits of HDFS block architecture:

  • Simplified storage management
  • Efficient replication
  • High throughput
  • Supports massive files (TB to PB)

Block-based architecture is core to Hadoop’s ability to process data in a distributed fashion.

5. What is replication in HDFS?

Replication is the mechanism by which HDFS ensures data reliability and fault tolerance. Every block stored in HDFS is copied to multiple DataNodes.

Important aspects:

  • Default replication factor is 3 (can be configured).
  • Blocks are stored on different nodes and racks for high availability.
  • If a DataNode fails, HDFS automatically creates additional replicas to maintain the target replication factor.
  • Replication helps avoid data loss, maintain cluster performance, and enable parallel read operations.

Replication ensures data remains safe even if nodes crash, disks fail, or racks go offline.
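
To make this concrete, here is a minimal Java sketch using the standard HDFS FileSystem API (the file path and target factor are illustrative, not from this article) that asks the NameNode to change the replication factor of an existing file; the same thing can be done from the shell with hdfs dfs -setrep.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    try (FileSystem fs = FileSystem.get(conf)) {
      // Raise the replication factor of one (hypothetical) file to 4;
      // the NameNode schedules the extra copies in the background.
      boolean scheduled = fs.setReplication(new Path("/data/important/events.log"), (short) 4);
      System.out.println("Re-replication scheduled: " + scheduled);
    }
  }
}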

6. What is the default HDFS block size?

The default HDFS block size is:

  • 128 MB in Hadoop 2.x and later
  • 64 MB in Hadoop 1.x and earlier

Large block sizes provide several benefits:

  • Reduced overhead for metadata storage in NameNode
  • Fewer blocks to manage, improving performance
  • Higher throughput for large-scale data processing
  • Better utilization of sequential disk reads

The block size can be changed based on workload, file size, and cluster configuration.

7. Explain the role of NameNode.

The NameNode is the master server in HDFS. It manages the filesystem namespace and controls all file operations.

Responsibilities of NameNode:

  • Stores filesystem metadata (file names, directories, permissions).
  • Maintains a mapping of file blocks → DataNodes.
  • Handles client requests such as open(), close(), rename(), delete().
  • Coordinates block placement and replication.
  • Manages cluster health through heartbeats from DataNodes.

Critical insight:

The NameNode does not store actual data—only metadata. Without NameNode metadata, HDFS cannot locate or access any blocks.

Thus, the NameNode is the “brain” of the Hadoop cluster.

8. Explain the role of DataNode.

DataNodes are the worker nodes in HDFS that actually store blocks of data. They are responsible for performing low-level read/write operations.

Key functions:

  • Store HDFS blocks on local disks.
  • Send heartbeat signals to NameNode to confirm availability.
  • Send block reports containing block metadata.
  • Serve client read/write requests.
  • Handle block replication, deletion, and recovery tasks as instructed by the NameNode.

If a DataNode fails:

  • The NameNode schedules re-replication of its blocks on other DataNodes to restore the target replication factor.

DataNodes form the physical storage backbone of Hadoop.

9. What happens if a NameNode fails?

A NameNode failure is critical because it stores the metadata required to locate blocks and manage the filesystem.

Impact of failure (in non-HA setups):

  • Entire HDFS becomes unavailable.
  • Clients cannot read/write data.
  • Cluster operations freeze until NameNode is restored.

Modern Hadoop uses a High Availability (HA) architecture with:

  • Active NameNode
  • Standby NameNode

Metadata is synchronized between them via JournalNodes.

In case of failure:

  • Standby NameNode automatically becomes Active.
  • Services continue without downtime.

Thus, NameNode HA eliminates the single point of failure.

10. What is Secondary NameNode, and what does it do?

The Secondary NameNode is not a backup NameNode. Instead, it performs checkpointing to prevent the NameNode metadata from growing too large.

Functions:

  • Periodically merges FSImage (file system snapshot) and EditLogs (record of filesystem changes).
  • Creates a new, compact FSImage file.
  • Transfers this FSImage back to the NameNode.
  • Helps reduce NameNode restart time after a crash.

Why it is important:

  • NameNode metadata grows continuously.
  • Without checkpointing, NameNode startup becomes slow or impossible.

Misconception clarification:

  • It does not replace the NameNode during failure.
  • It is only used for housekeeping and metadata cleanup.

Modern Hadoop HA setups replace Secondary NameNode with:

  • Checkpoint Node
  • Backup Node

But the fundamental concept of checkpointing remains.

11. What is MapReduce?

MapReduce is a distributed data processing framework in Hadoop that enables large-scale data computation across clusters of commodity hardware. It breaks complex data processing tasks into smaller, parallelizable sub-tasks and executes them efficiently.

MapReduce works in two phases:

  1. Map Phase
    • Takes input data and transforms it into intermediate key-value pairs.
    • Performs filtering, grouping, sorting, or data extraction.
  2. Reduce Phase
    • Takes intermediate key-value pairs and aggregates them.
    • Performs tasks like counting, summing, joining, or merging.

Key features:

  • Horizontal scalability: runs across thousands of nodes.
  • Fault tolerance: failed tasks are automatically re-executed.
  • Data locality: computation moves to the node where data resides, reducing network traffic.
  • Parallel processing: boosts performance for large datasets.

MapReduce is ideal for batch processing workloads such as log analysis, indexing, ETL, and data aggregation.

12. Explain Mapper and Reducer.

Mapper

The Mapper processes raw input data and generates intermediate key-value pairs.

Responsibilities:

  • Reads input splits.
  • Transforms data through filtering, parsing, or computation.
  • Emits key-value pairs for further processing.

Example:
Input: "apple banana apple"
Mapper output:
(apple, 1), (banana, 1), (apple, 1)

Reducer

The Reducer collects intermediate key-value pairs and performs aggregation.

Responsibilities:

  • Receives values grouped by key.
  • Applies functions like sum, count, join, average, merge.
  • Writes final output to HDFS.

Example:
Reducer receives:
apple: [1, 1], banana: [1]
Reducer output:
apple: 2, banana: 1

Mapper = data extraction
Reducer = data aggregation
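
A minimal word-count sketch in Java shows the two roles side by side (class names are illustrative; it assumes the standard Hadoop MapReduce client libraries are on the classpath).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);                // e.g. (apple, 1)
      }
    }
  }

  // Reducer: sums the counts for each word.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum)); // e.g. (apple, 2)
    }
  }
}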

13. What is the difference between InputSplit and Block?

Though often confused, InputSplit and Block serve completely different purposes in Hadoop.

HDFS Block

  • A physical chunk of data stored on DataNodes.
  • Default size: 128 MB.
  • Used for storage and replication.
  • Represents how data is distributed across the cluster.

InputSplit

  • A logical piece of data used by MapReduce to assign work to Mapper tasks.
  • InputSplit does not store data.
  • It tells the framework which portion of the data (a byte range that may span block boundaries) a single Mapper should process.

Key Differences:

Feature | HDFS Block | InputSplit
Nature | Physical | Logical
Purpose | Storage | Input for Mapper
Size | Fixed | Flexible
Replication | Yes | No
Determines | Data distribution | Mapper count

In short:
Block = storage
InputSplit = processing

14. What is YARN in Hadoop?

YARN stands for Yet Another Resource Negotiator. It is the cluster resource management layer of Hadoop that improves scalability and multi-tenancy.

Before YARN (Hadoop 1.x), MapReduce handled both processing and resource management, causing bottlenecks.
YARN decouples computation from resource management.

Core responsibilities:

  • Manage and allocate cluster resources (CPU, memory).
  • Schedule jobs based on priority and availability.
  • Handle failures and re-execution of tasks.
  • Allow multiple processing engines (Spark, Flink, Storm, Hive, etc.) to run on Hadoop.

YARN transformed Hadoop into a general-purpose big data platform, enabling real-time, batch, and interactive workloads simultaneously.

15. Explain the functions of ResourceManager.

ResourceManager (RM) is the central authority in YARN responsible for cluster-wide resource management and scheduling.

Key responsibilities:

1. Resource Allocation

  • Allocates CPU, memory, and containers to applications.
  • Ensures fair distribution among users and queues.

2. Job Scheduling

  • Uses schedulers like FIFO, Capacity, or Fair Scheduler.
  • Prioritizes jobs based on policies and cluster load.

3. Application Coordination

  • Communicates with ApplicationMaster for launching tasks.
  • Ensures applications get required resources.

4. Cluster Health Monitoring

  • Monitors NodeManagers via heartbeats.
  • Handles unhealthy node decommissioning.

5. Failover Handling

  • In HA mode, a standby ResourceManager automatically takes over if the active one fails.

ResourceManager = brain of YARN, controlling all resource decisions across the cluster.

16. Explain NodeManager in YARN.

NodeManager (NM) runs on every node in the Hadoop cluster and acts as the agent responsible for managing node-level resources.

Primary functions:

1. Resource Monitoring

  • Tracks CPU, memory, and disk usage.
  • Reports node health to ResourceManager.

2. Container Management

  • Launches and terminates containers as instructed by the ResourceManager.
  • Ensures each container receives the allocated resources.

3. Logging & Task Monitoring

  • Collects logs for containers.
  • Provides logs to ResourceManager and ApplicationMaster.

4. Failure Handling

  • If a node becomes unhealthy, NM marks it offline.
  • Kills running containers safely.

Essentially:

  • ResourceManager = global manager
  • NodeManager = local manager

They work together to maintain cluster efficiency and stability.

17. What is a JobTracker?

JobTracker is part of the Hadoop 1.x MapReduce framework (before YARN existed). It handled two major roles:

1. Job Scheduling

  • Accepted new jobs from clients.
  • Assigned tasks (Map/Reduce) to TaskTrackers.

2. Resource Management

  • Monitored TaskTrackers.
  • Rescheduled tasks on failure.
  • Maintained job progress and status.

Limitations:

  • Single point of failure (SPOF)
  • Scalability constraints
  • Could not support multiple processing engines

Hadoop 2.x replaced JobTracker with:

  • ResourceManager (resource management)
  • ApplicationMaster (job-specific scheduling)

18. What is a TaskTracker?

TaskTracker ran on each node in Hadoop 1.x and worked under JobTracker.

Responsibilities:

1. Execute Tasks

  • Runs Map and Reduce tasks assigned by the JobTracker.
  • Uses task slots for managing concurrency.

2. Send Heartbeats

  • Reports task status and node health to JobTracker.
  • Helps detect node failures.

3. Handle Local Data

  • Reads/writes data stored in local DataNode.
  • Ensures data locality optimization.

With Hadoop 2.x:

TaskTracker → NodeManager
JobTracker → ResourceManager + ApplicationMaster

Thus, the system became more fault-tolerant and scalable.

19. Explain the Hadoop ecosystem.

The Hadoop Ecosystem is a collection of tools and frameworks built around Hadoop for handling storage, processing, ingestion, analysis, and orchestration of large-scale data.

Main components include:

1. Storage

  • HDFS – Distributed storage
  • HBase – NoSQL column-family database
  • HCatalog – Table management for Hive/Pig

2. Processing

  • MapReduce – Batch processing
  • Apache Spark – In-memory processing
  • Tez – DAG-based execution for Hive/Pig

3. Data Ingestion

  • Sqoop – RDBMS ↔ Hadoop data transfer
  • Flume – Log and streaming data ingestion
  • Kafka – Streaming platform

4. Data Query & Analysis

  • Hive – SQL-like analytics
  • Pig – Scripting-based ETL
  • Impala/Presto – Low-latency SQL engines

5. Coordination & Management

  • ZooKeeper – Coordination
  • Oozie – Workflow scheduler
  • Ambari – Cluster provisioning & monitoring

The ecosystem allows Hadoop to handle diverse workloads across multiple industries and use cases.

20. What is Apache Pig?

Apache Pig is a high-level data processing framework built on top of Hadoop. It allows developers to write complex data transformations using a simplified scripting language called Pig Latin.

Why Pig is used:

  • Reduces the complexity of writing MapReduce jobs.
  • Provides concise syntax for ETL workflows.
  • Automatically compiles scripts into optimized MapReduce jobs.
  • Supports complex data types like tuples, bags, and maps.

Key features:

  • Ease of use: Minimal coding compared to Java MR.
  • Extensibility: Users can write UDFs (User Defined Functions).
  • Schema flexibility: Works well with semi-structured data.
  • Optimized execution: Uses logical and physical optimization.

Pig is widely used for:

  • Data cleaning
  • ETL pipelines
  • Log analysis
  • Data preparation for machine learning

21. What is Apache Hive?

Apache Hive is a data warehousing and SQL-like query system built on top of Hadoop. It allows users to analyze and query large datasets stored in HDFS using a SQL-style language called HiveQL (HQL) rather than writing complex MapReduce programs.

Key features:

  • SQL-like interface: Enables non-programmers to work with Hadoop.
  • Schema-on-read: Data is interpreted at query time, providing flexibility.
  • Supports large-scale analytical queries: Ideal for BI and reporting.
  • Automatic MapReduce generation: Hive converts HQL queries into MapReduce, Tez, or Spark jobs.
  • Highly scalable: Processes petabytes of data.

Hive is widely used for:

  • Data analysis
  • Reporting
  • ETL (Extract, Transform, Load) pipelines
  • Creating structured views on top of unstructured data

In summary, Hive bridges the gap between traditional SQL analysts and the distributed processing capabilities of Hadoop.

22. What is a Hive external table?

A Hive external table is a table in Hive whose data is stored outside the Hive warehouse directory. Hive only stores metadata about the table; the actual data remains in a specified HDFS directory.

Characteristics:

  • When an external table is dropped, only the metadata is removed; the underlying data remains intact.

Use cases:

  • When data must be shared across multiple tools.
  • When you don’t want Hive to take ownership of the data.
  • When working with data ingested by Flume, Kafka, or external applications.
  • For scenarios where datasets should not be deleted automatically.

Example:

CREATE EXTERNAL TABLE logs (...) 
LOCATION '/data/logs/';

External tables offer maximum flexibility and prevent accidental data loss.

23. What is a Hive partition?

A Hive partition divides a table into subdirectories based on partition keys such as date, country, or category. Partitioning helps Hive scan only the relevant subset of data, improving query performance.

Example table partitioning by year and month:

/sales/year=2023/month=01
/sales/year=2023/month=02

Benefits:

  • Faster query execution
  • Reduces disk scanning
  • Supports efficient filtering through partition pruning

Ideal use cases:

  • Time-series data
  • Log files
  • Country or region-based datasets

Partitioning is essential for optimizing big data queries in Hive.

24. What is a Hive bucket?

A Hive bucket divides data into multiple files (buckets) based on the hashing of a column. Bucketing helps in sampling, efficient joins, and sorting.

Example:

CLUSTERED BY (user_id) INTO 8 BUCKETS;

How bucketing works:

  • Hive computes a hash on the bucket column.
  • Rows with the same hash go into the same file.
  • Number of buckets is fixed.

Benefits:

  • Improves join performance (bucket map join).
  • Reduces shuffle time.
  • Enables efficient data sampling.
  • Provides structured file organization.

Bucketing is especially useful when working with large datasets requiring frequent joins.

25. What is Apache HBase?

Apache HBase is a distributed, scalable, NoSQL column-family database built on top of HDFS. It supports real-time read/write operations for massive datasets.

Key characteristics:

  • Modeled after Google BigTable.
  • Supports billions of rows and millions of columns.
  • Provides random read/write access (unlike Hive which is batch-oriented).
  • Stores data in column families, not tables and rows like RDBMS.
  • Strongly consistent reads and writes.

Use cases:

  • Real-time applications
  • Time-series data
  • IoT sensor data
  • Online recommendation systems
  • User profile storage

HBase complements Hadoop by providing low-latency access to big data.

26. What is Zookeeper used for?

ZooKeeper is a centralized coordination and configuration service used by distributed systems like Hadoop, HBase, Kafka, etc.

Responsibilities:

1. Distributed synchronization

  • Helps manage locks and leader selection.

2. Configuration management

  • Stores system configuration in a consistent, fault-tolerant manner.

3. Naming and metadata service

  • Maintains metadata for applications like HBase.

4. High availability

  • Supports NameNode and ResourceManager failover in Hadoop.

Why Zookeeper is important:

  • Ensures consistency across distributed systems.
  • Prevents conflicts and race conditions.
  • Provides reliable coordination for cluster operations.

Without ZooKeeper, many of Hadoop’s HA and coordination features would not work.

27. What is Sqoop?

Sqoop (SQL-to-Hadoop) is a tool designed to transfer data between relational databases and Hadoop.

Key functions:

1. Import data into HDFS

From databases like MySQL, Oracle, SQL Server, Postgres.

2. Export data from HDFS to RDBMS

Useful for reporting and analytics pipelines.

3. Integration with Hive and HBase

Directly imports tables into Hive or loads data into HBase.

Benefits:

  • Parallel data transfer for speed.
  • Automatic code generation for mappers.
  • Supports incremental imports (append or last-modified).

Sqoop is essential for moving structured data to and from Hadoop ecosystems efficiently.

28. What is Flume?

Apache Flume is a distributed data ingestion tool used for collecting, aggregating, and transporting large volumes of log and event data into Hadoop.

Key components:

  • Source: Receives incoming data (HTTP, syslogs, custom apps).
  • Channel: Temporary storage (memory or file).
  • Sink: Writes data to HDFS or HBase.

Use cases:

  • Log collection (web servers, applications).
  • Clickstream ingestion.
  • IoT and sensor data pipelines.
  • Streaming data transfer to HDFS.

Advantages:

  • Highly scalable and fault tolerant.
  • Supports real-time and batch ingestion.
  • Configurable with simple property files.

Flume helps bring high-speed, unstructured logs into Hadoop reliably.

29. What is the difference between SQL and HiveQL?

Though HiveQL is similar to SQL, they differ in purpose, execution model, and capabilities.

1. Execution Engine

  • SQL runs on relational databases.
  • HiveQL is executed on Hadoop (MapReduce, Tez, or Spark).

2. Latency

  • SQL is optimized for low-latency queries.
  • HiveQL is designed for batch processing; queries are slower.

3. Schema Handling

  • SQL uses schema-on-write (strict schemas).
  • HiveQL uses schema-on-read (flexible schemas).

4. Transaction Support

  • SQL: Full ACID transactional support.
  • Hive: Limited ACID support, mostly for batch operations.

5. Data Type Flexibility

  • SQL: Works with structured data.
  • HiveQL: Works with structured, semi-structured, and large unstructured datasets.

6. Use Cases

  • SQL: OLTP, interactive queries.
  • HiveQL: OLAP, big data analytics.

Conclusion:
SQL is for real-time operations; HiveQL is for large-scale batch analytics.

30. What is Hadoop Streaming?

Hadoop Streaming is a utility that allows users to write MapReduce programs in any programming language, not just Java.

It works by passing data through stdin and stdout, making it language-agnostic.

Supported languages include:

  • Python
  • Perl
  • Ruby
  • Bash
  • Go
  • C++

How it works:

  • User writes mapper and reducer scripts in any language.
  • Hadoop Streaming wraps them and executes them as MapReduce tasks.

Example command:

hadoop jar hadoop-streaming.jar \
  -input input.txt \
  -output output \
  -mapper mapper.py \
  -reducer reducer.py \
  -file mapper.py \
  -file reducer.py

Advantages:

  • Quick prototyping
  • Easier than writing Java MapReduce
  • Supports existing scripts and tools

Hadoop Streaming makes Hadoop accessible to developers who prefer scripting languages.

31. What is SerDe in Hive?

SerDe stands for Serializer/Deserializer. It is a core component in Hive that enables Hive to read data from HDFS into table rows and write table rows back into HDFS.

Role of SerDe:

  1. Serialization
    Converts Hive table rows into a format suitable for storage (e.g., text, binary, JSON).
  2. Deserialization
    Converts raw file data (stored in HDFS) into Hive table rows when executing queries.

Why SerDe is important:

  • Hive does not understand raw data directly.
  • SerDe defines how Hive should interpret and format the data.
  • Allows users to work with various formats such as:
    • CSV
    • JSON
    • Avro
    • ORC
    • Parquet

Custom SerDe:

Hive allows developers to write custom SerDes to support proprietary or complex data formats used in real-world applications.

Example:

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde';

In summary, SerDe is what makes Hive flexible enough to read and write diverse data formats in Hadoop.

32. Explain NameNode metadata.

NameNode metadata represents all the information about the HDFS filesystem, excluding the actual data blocks.

Metadata stored by NameNode includes:

  1. File and directory structure
    All file names, folder hierarchy, permissions, and ownership.
  2. Block mapping
    Which blocks belong to which file and on which DataNode they are stored.
  3. Replication information
    Replication factor of each file and location of each replica.
  4. Namespace information
    Used for managing directories and file operations.

Metadata Storage Mechanisms:

NameNode stores metadata in two main files:

  • FSImage
    A snapshot of the entire filesystem at a point in time.
  • EditLogs
    A log of all filesystem changes since the last snapshot.

During startup, NameNode merges both files to reconstruct the complete metadata state. This merging prevents EditLogs from growing indefinitely.

Since metadata is stored in memory for fast access, NameNode requires sufficient RAM.

33. What is a Job in Hadoop?

In Hadoop, a Job represents a complete MapReduce task submitted by a user. A job contains:

  • The logic of Mapper and Reducer
  • Configuration settings
  • Input and output paths
  • Optional combiner and partitioner
  • Job parameters (memory, replication, etc.)

Once submitted, a job is broken into smaller units called tasks:

  • Map tasks
  • Reduce tasks

YARN executes these tasks in parallel across the cluster.

Job Lifecycle:

  1. User submits the job
  2. Job is accepted by ResourceManager
  3. Containers are allocated
  4. Tasks execute
  5. Output is written to HDFS
  6. Job completes with success or failure

Jobs are the basic unit of computation in Hadoop’s batch-processing model.

34. What is shuffle and sort in MapReduce?

Shuffle and sort is a critical phase between Map and Reduce tasks in MapReduce.

It ensures that all values belonging to the same key are grouped and sent to the same reducer.

1. Shuffle

Shuffle refers to the movement of data from map outputs to reduce inputs.

Steps:

  • Map tasks write output to local disk
  • Output is partitioned by key (using Partitioner)
  • Reduce tasks fetch map outputs over the network

Shuffle is network-heavy and affects performance significantly.

2. Sort

The reducer sorts the intermediate key-value pairs:

  • Keys are sorted
  • Values for each key are grouped together
  • Data is prepared for reduce() function

Importance:

  • Ensures deterministic ordering
  • Reduces computational work for Reducer
  • Guarantees all data for a key is processed together

Shuffle and sort is often considered the most complex and expensive phase in MapReduce.
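
Because the Partitioner decides which reducer each key is shuffled to, a small sketch helps; this hypothetical partitioner routes words by their first letter, assuming Text keys and IntWritable values as in a word count. It would be wired in with job.setPartitionerClass(...); the default is HashPartitioner.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// All words starting with the same letter go to the same reducer,
// so each reducer's sorted output is grouped by first letter.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    String word = key.toString();
    char first = word.isEmpty() ? 'a' : Character.toLowerCase(word.charAt(0));
    // char values are never negative in Java, so this is a valid partition index.
    return first % numReduceTasks;
  }
}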

35. Explain speculative execution.

Speculative execution is a performance optimization technique in Hadoop to handle slow or “straggler” tasks.

A straggler is a task that runs abnormally slow due to:

  • Hardware issues
  • Network bottlenecks
  • Disk slowdowns
  • Node overload

How it works:

  • Hadoop runs duplicate copies of slow tasks on different nodes.
  • Whichever finishes first is accepted.
  • The slower one is killed.

Benefits:

  • Improves overall job completion time
  • Reduces risk of delays caused by a single slow machine
  • Increases cluster reliability

Speculative execution is especially useful in large clusters where failures or slowness are more common.

36. What is a combiner, and when do we use it?

A combiner is a mini-reducer that runs on the output of the Mapper before sending data across the network.

Its purpose is to reduce the amount of data transferred during the shuffle phase.

How it works:

  • The combiner receives mapper output
  • Performs partial aggregation
  • Sends smaller, aggregated data to reducers

Example use case: Word Count
Map output might be:

apple 1  
apple 1  
apple 1

Combiner output:

apple 3

When to use a combiner:

  • When the reduce function is commutative and associative
    Example: sum, count, max, min

When NOT to use:

  • When order matters
  • When partial aggregation would change the logic

A combiner can significantly reduce network traffic and speed up processing.
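
A minimal driver sketch shows where the combiner is plugged in. It reuses the hypothetical WordCount mapper and reducer sketched under question 12: because a sum is commutative and associative, the reducer class can double as the combiner.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count with combiner");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.SumReducer.class);   // partial sums on the map side
    job.setReducerClass(WordCount.SumReducer.class);    // final sums on the reduce side
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input path from the command line
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}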

37. Explain rack awareness.

Rack awareness is Hadoop’s strategy for placing data replicas across different racks in the cluster.

Goals:

  1. Fault tolerance:
    If one rack fails, replicas on other racks remain available.
  2. Efficient bandwidth usage:
    Local rack communication is faster than remote.

Replica Placement Policy:

Default Hadoop policy:

  • First replica: same rack as writing client
  • Second replica: different rack
  • Third replica: same rack as second but different node

Benefits:

  • Increases data availability
  • Reduces inter-rack traffic
  • Enhances performance by utilizing data locality

Rack awareness makes Hadoop resilient to rack-level failures.

38. What is schema-on-read?

Schema-on-read means that data is stored without a predefined schema, and the structure is applied only when the data is read.

Used in systems like:

  • Hadoop
  • Hive
  • Spark
  • NoSQL databases

Advantages:

  • Highly flexible
  • Accepts raw data of any format
  • Ideal for unstructured and semi-structured data
  • Enables rapid ingestion
  • Supports evolving and dynamic schemas

Schema-on-read is perfect for data lakes where analytical requirements change frequently.

39. What is schema-on-write?

Schema-on-write means that data must conform to a predefined schema before being stored.

Used in systems like:

  • Traditional RDBMS (MySQL, Oracle, PostgreSQL)
  • Data warehouses

Advantages:

  • Ensures data quality and consistency
  • Enables optimized query performance
  • Enforces strict validation rules

Disadvantages:

  • Requires significant design effort upfront
  • Less flexible
  • Slower ingestion of diverse data

Schema-on-write is ideal for transactional systems where data integrity is crucial.

40. What are the main Hadoop configuration files?

Hadoop contains several XML-based configuration files that control its behavior.

1. core-site.xml

Defines core Hadoop properties:

  • Default filesystem (HDFS)
  • I/O settings
  • Hadoop temp directory

2. hdfs-site.xml

Contains HDFS-specific configurations:

  • Block size
  • Replication factor
  • NameNode & DataNode settings
  • Permissions

3. mapred-site.xml

Configures MapReduce components:

  • Job history
  • Speculative execution
  • Shuffle settings

4. yarn-site.xml

Contains YARN resource management properties:

  • ResourceManager settings
  • NodeManager memory
  • Scheduler configuration

Optional Important Files:

  • hadoop-env.sh – environment variables (Java home, heap sizes)
  • slaves (renamed workers in Hadoop 3) – list of worker nodes
  • masters – NameNode or ResourceManager nodes

Together, these files control system-level behavior, cluster performance, and application configuration.

Intermediate (Q&A)

1. Explain the complete HDFS write path.

The HDFS write path describes how data flows from a client to the Hadoop cluster when writing a file. It ensures fault tolerance, data integrity, and parallelism through a multi-step pipeline mechanism.

Step-by-Step HDFS Write Workflow:

  1. Client contacts NameNode
    • Client requests permission to write a file.
    • NameNode checks:
      • Whether the file already exists
      • Directory permissions
    • If valid, NameNode responds with metadata and block details.
  2. NameNode allocates DataNodes
    • For each block, NameNode selects a pipeline of DataNodes based on:
      • Rack awareness
      • Available storage
      • Load balancing
    • Example (replication factor = 3):
      DN1 → DN2 → DN3
  3. Client splits data into packets
    • Usually 64 KB packets.
    • These are pushed sequentially into the pipeline.
  4. Data flows through the pipeline
    • Client sends packet to DN1.
    • DN1 writes the packet to disk, then forwards to DN2.
    • DN2 forwards to DN3.
    • This ensures pipelined, parallel replication.
  5. Acknowledgement chain
    • DN3 → DN2 → DN1 → Client
    • All replicas must acknowledge to complete the write.
  6. Error handling (automatic re-replication)
    • If a DataNode in pipeline fails:
      • NameNode selects a new DataNode
      • Pipeline is rebuilt dynamically
      • Client retries sending unacknowledged packets
  7. Write completion
    • After all blocks are written and acknowledged, client informs NameNode to finalize the file.

Key Guarantees:

  • Data durability through replication
  • Consistency through acknowledgment chain
  • Fault tolerance through automatic pipeline recovery
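
From the application's point of view, all of the pipeline mechanics above are hidden behind the FileSystem API. A minimal client-side sketch (the path is illustrative) looks like this; the packet splitting, pipelined replication, and acknowledgements happen inside the stream returned by create().

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();             // reads core-site.xml / hdfs-site.xml
    try (FileSystem fs = FileSystem.get(conf);
         FSDataOutputStream out = fs.create(new Path("/data/demo/hello.txt"))) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }                                                      // close() finalizes the file with the NameNode
  }
}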

2. Explain the complete HDFS read path.

The HDFS read path shows how a client retrieves data from the cluster with high throughput and data locality.

Step-by-Step HDFS Read Workflow:

  1. Client contacts NameNode
    • Sends a request to read a file.
    • NameNode returns:
      • Block metadata
      • List of DataNodes for each block (nearest one first)
  2. Client selects closest DataNode
    • Based on:
      • Rack locality
      • Node locality
      • Network distance
  3. Client reads block directly from DataNode
    • Data is read in chunks (packets)
    • DataNode streams bytes to the client
  4. Checksum verification
    • DataNode sends checksums
    • Client verifies integrity
    • If checksum fails:
      • Client requests the block from another DataNode
  5. Sequential block reading
    • Client reads blocks in order until the entire file is downloaded.
  6. No NameNode involvement in actual data transfer
    • NameNode only provides metadata
    • Data transfer occurs DataNode ↔ Client directly

Key Features:

  • High throughput
  • Low network cost due to data locality
  • Automatic recovery from corrupt block replicas
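
On the client side the read path is equally simple; block lookup, DataNode selection, and checksum verification all happen inside the stream returned by open(). A minimal sketch (the path is illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(fs.open(new Path("/data/demo/hello.txt")),
                                   StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {   // blocks are fetched in order behind the scenes
        System.out.println(line);
      }
    }
  }
}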

3. How does NameNode handle failover (HA architecture)?

Hadoop NameNode HA solves the biggest issue in early Hadoop:
Single Point of Failure (SPOF).

HA architecture includes:

Two NameNodes:

  • Active NameNode – handles all client operations
  • Standby NameNode – synchronized backup ready to take over

How failover works:

  1. Shared edit logs via JournalNodes
    • Both Active and Standby write/read EditLogs through a set of JournalNodes (typically 3 or 5).
    • Standby constantly reads new edits to keep metadata in sync.
  2. Heartbeat and health monitoring
    • ZooKeeper monitors Active NameNode.
    • If Active fails, ZooKeeper triggers failover.
  3. Failover process
    • Standby NameNode:
      • Loads latest metadata
      • Promotes itself to Active
      • Starts serving client requests
  4. Automatic client failover
    • Clients use a logical nameservice URI like:
hdfs://mycluster/
    • Hadoop routes clients to the new Active NameNode automatically.

This ensures continuous HDFS availability even during NameNode crashes.

4. What is JournalNode?

A JournalNode is a component in Hadoop’s HA architecture responsible for storing edit logs from the NameNode.

Purpose:

  • Acts as a distributed edit log storage
  • Synchronizes metadata between Active and Standby NameNodes

How it works:

  1. Active NameNode writes edits
    • Every filesystem change is written to a majority of JournalNodes.
  2. Standby NameNode reads edits
    • Continuously pulls new edits
    • Updates its metadata to match Active node
  3. Minimum number of JournalNodes
    • Typically 3 or 5
    • Follows quorum-based replication:
      • Majority must acknowledge write

Why JournalNodes are essential:

  • Enable fast failover
  • Remove dependency on NFS (used in older versions)
  • Ensure metadata consistency

JournalNodes are the backbone of NameNode HA synchronization.

5. Explain Hadoop Federation.

Hadoop Federation allows multiple NameNodes to operate simultaneously within a single cluster, each managing a portion of the namespace.

Motivation:

  • NameNode memory limits restrict cluster size
  • Single NameNode creates performance bottlenecks
  • Organizations want to isolate data (e.g., analytics vs. operations)

How Federation works:

  1. Multiple independent NameNodes
    • Each NameNode manages a separate namespace (set of directories).
    • But all use the same DataNode pool.
  2. DataNodes register with all NameNodes
    • Store blocks for all namespaces
    • Maintain block reports per namespace
  3. Clients interact with multiple namespaces
    Example:
hdfs://nn1/user/data
hdfs://nn2/logs

Benefits:

  • Horizontal scalability
  • Performance isolation
  • No interdependence between namespaces
  • Allows massive clusters (thousands of nodes)

Federation improves scalability but does NOT provide automatic failover.

6. What is the difference between NameNode HA vs Federation?

Feature | NameNode HA | HDFS Federation
Goal | Removes SPOF | Increases scalability
Number of NameNodes | 1 Active + 1 Standby | Multiple independent NameNodes
Failover | Yes (automatic) | No
Namespace | Single | Multiple
DataNodes | Register to two NameNodes | Register to all NameNodes
When used | High availability needed | Very large clusters

Simple Explanation:

  • HA = Reliability
    Ensures HDFS never goes down.
  • Federation = Scalability
    Allows extremely large clusters with multiple NameNodes.

Together, both can be combined in enterprise deployments:
Multiple federated NameNodes, each with HA enabled.

7. What is the role of ApplicationMaster in YARN?

The ApplicationMaster (AM) manages the lifecycle of a single application in YARN.

Responsibilities:

  1. Resource Negotiation
    • Requests containers from ResourceManager.
  2. Task Scheduling
    • Decides how many map/reduce tasks to launch.
    • Assigns tasks to available nodes.
  3. Task Monitoring
    • Tracks container health and progress.
    • Reschedules failed tasks.
  4. Data Locality Optimization
    • Tries to launch tasks on nodes containing input data.
  5. Reporting
    • Updates ResourceManager with job status.
  6. Finalization
    • Cleans up resources after application completion.

Key Point:

Every MapReduce job gets its own ApplicationMaster, ensuring job-level independence.

8. Explain ResourceManager Scheduler types (FIFO, Capacity, Fair).

ResourceManager uses schedulers to allocate cluster resources among multiple users.

1. FIFO Scheduler (First-In-First-Out)

  • Jobs are queued based on submission order.
  • First job gets all available resources until completion.
  • Simple but not suitable for multi-tenant clusters.

Best for: Small clusters, simple workloads.

2. Capacity Scheduler

  • Cluster is divided into queues with fixed capacity percentages.
  • Guarantees resources for organizations or teams.
  • Supports hierarchical queues and priority-based scheduling.

Best for: Large enterprises with multi-team requirements.

3. Fair Scheduler

  • Resources are distributed fairly among all running jobs.
  • If one job finishes, others take unused resources.
  • Provides preemption and minimum guaranteed resources.

Best for: Shared clusters needing equitable resource distribution.

9. What is a NodeLabel in YARN?

NodeLabels allow administrators to categorize nodes in a YARN cluster, controlling which applications run on which nodes.

Use cases:

  1. Resource Isolation
    • Premium nodes for high-priority jobs
    • Standard nodes for normal workloads
  2. Multi-tenancy
    • Assign nodes to specific teams or departments
  3. Hardware Specialization
    • GPU nodes
    • High-memory nodes

Example:

Nodes labeled:

  • gpu
  • highmem
  • default

Applications can request:

yarn.node-label-expression = highmem

NodeLabels ensure the right jobs run on the right hardware.

10. Explain MapReduce execution workflow.

The MapReduce execution workflow describes how a job progresses from submission to completion.

Step 1: Job Submission

  • Client submits job to ResourceManager.
  • Job configuration and jar files are uploaded to HDFS.

Step 2: Launch ApplicationMaster

  • ResourceManager allocates a container.
  • NodeManager launches the ApplicationMaster (AM).

Step 3: Input Splitting

  • AM splits input files into logical InputSplits.
  • Number of map tasks = number of InputSplits.

Step 4: Resource Negotiation

  • AM requests containers for map tasks.
  • ResourceManager allocates containers based on scheduling.

Step 5: Map Phase Execution

  • NodeManagers launch mapper containers.
  • Mappers process InputSplits and write output to local disk.
  • Output is partitioned for reducers.

Step 6: Shuffle & Sort

  • Reducers fetch map outputs.
  • Data is merged, sorted, and grouped by key.

Step 7: Reduce Phase Execution

  • Reduce tasks process grouped key-value pairs.
  • Final results are written to HDFS.

Step 8: Completion

  • AM notifies ResourceManager of job success.
  • Containers are freed.
  • Client receives job status.

Summary:

The MapReduce workflow is a distributed pipeline consisting of job submission → resource allocation → map tasks → shuffle/sort → reduce tasks → output to HDFS.

11. What is InputFormat in MapReduce?

InputFormat is a crucial component in the MapReduce framework that defines how input data is logically split and fed to the Mapper.

Key responsibilities of InputFormat:

  1. Generate InputSplits
    • Splits input data into logical chunks.
    • Determines how many mapper tasks will run.
  2. Provide RecordReader
    • Converts raw data inside InputSplit into key-value pairs for Mapper.
  3. Define data location
    • Ensures InputSplits are mapped to the closest DataNodes for data locality.

Common InputFormats:

  1. TextInputFormat (default)
    • Treats each line as a record.
    • Key = byte offset, Value = line text.
  2. KeyValueTextInputFormat
    • Splits each line into key and value based on a delimiter.
  3. SequenceFileInputFormat
    • For binary key-value storage.
  4. NLineInputFormat
    • Defines exactly N lines per map task.
  5. WholeFileInputFormat (typically written as a custom InputFormat)
    • Treats entire file as a single record
    • Useful for processing images, PDFs, XML.

Why InputFormat is important:

  • Determines parallelism
  • Impacts data locality
  • Influences MapReduce performance
  • Enables processing of custom file formats

InputFormat is essentially the gateway between HDFS files and MapReduce logic.
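
As an illustration, the sketch below (paths and the line count are illustrative) configures a job to use NLineInputFormat so each mapper receives exactly 1,000 lines, which directly controls the parallelism described above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NLineJobConfig {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "nline example");
    job.setJarByClass(NLineJobConfig.class);
    job.setInputFormatClass(NLineInputFormat.class);    // one mapper per N lines instead of per block
    NLineInputFormat.setNumLinesPerSplit(job, 1000);    // 1,000 lines per InputSplit
    FileInputFormat.addInputPath(job, new Path("/data/input"));
    FileOutputFormat.setOutputPath(job, new Path("/data/output"));
    // mapper, reducer, and output key/value classes would be set here as usual
  }
}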

12. Explain OutputFormat in MapReduce.

OutputFormat defines how Mapper and Reducer outputs are formatted and written to storage (usually HDFS).

Responsibilities of OutputFormat:

  1. Define output location
    • Determines where final results are stored.
  2. Validate output paths
    • Ensures output directory does not already exist (avoids overwrites).
  3. Serialize key-value outputs
    • Converts reducer output into a storage-friendly format.
  4. Create RecordWriter
    • Writes output key-value pairs to files.

Common OutputFormats:

  1. TextOutputFormat (default)
    • Writes key and value as text lines.
  2. SequenceFileOutputFormat
    • Stores binary key-value pairs.
    • Efficient for chaining MR jobs.
  3. MultipleOutputs (a helper class used alongside an OutputFormat)
    • Allows a job to write to multiple output files dynamically.
  4. NullOutputFormat
    • Produces no output; used in testing.

Why OutputFormat matters:

  • Controls how results are stored
  • Enables custom output formats
  • Influences compression and performance

OutputFormat = bridge between MapReduce results and long-term storage.

13. What is RecordReader?

A RecordReader converts the raw data in an InputSplit into key-value pairs usable by the Mapper.

Functions of RecordReader:

  1. Parse InputSplit data
    • Reads line-by-line, row-by-row, or custom logic.
  2. Generate key and value objects
    • Example (TextInputFormat):
      • Key: byte offset
      • Value: text line
  3. Track progress
    • Helps YARN understand how much work is complete.
  4. Ensure correct data boundaries
    • Avoids cutting records across block boundaries.

Why RecordReader is essential:

MapReduce does not understand raw files; it only works with key-value pairs.

RecordReader ensures:

  • Clean data parsing
  • Proper record boundaries
  • Efficient data reading

Without it, mappers cannot process data correctly.

14. What is Writable in Hadoop?

Writable is Hadoop’s serialization framework used for sending data across the network between map and reduce tasks.

Serialization means:

Converting objects into byte streams that can be:

  • Sent over the network
  • Written to disk
  • Reconstructed later

Characteristics of Writable:

  • Optimized for Hadoop’s distributed environment
  • Much faster than Java’s native serialization
  • Lightweight, compact binary format

Common Writable Types:

  • IntWritable
  • LongWritable
  • Text
  • BooleanWritable
  • ArrayWritable
  • MapWritable

Why Writable is needed:

MapReduce tasks run on different nodes; data must be transferred efficiently. Java serialization is slow and heavy, making Writable the preferred mechanism.

Writable = fast, compact, Hadoop-native serialization.
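
A custom Writable is simply a class that knows how to write and read its own fields. A minimal illustrative sketch (the field names are hypothetical):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// A pair of counters that can be shuffled between map and reduce tasks.
public class TrafficWritable implements Writable {
  private long pageViews;
  private long bytesServed;

  public TrafficWritable() { }                  // no-arg constructor required by Hadoop

  public TrafficWritable(long pageViews, long bytesServed) {
    this.pageViews = pageViews;
    this.bytesServed = bytesServed;
  }

  @Override
  public void write(DataOutput out) throws IOException {       // serialize
    out.writeLong(pageViews);
    out.writeLong(bytesServed);
  }

  @Override
  public void readFields(DataInput in) throws IOException {    // deserialize, same field order
    pageViews = in.readLong();
    bytesServed = in.readLong();
  }

  public long getPageViews()   { return pageViews; }
  public long getBytesServed() { return bytesServed; }
}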

15. Explain the difference between MapReduce 1 and MapReduce 2 (YARN).

MapReduce 1 (MRv1) and MapReduce 2 (MRv2/YARN) differ fundamentally in architecture and scalability.

MapReduce 1 (MRv1):

  • Resource management + job scheduling done by JobTracker.
  • Task execution done by TaskTrackers.
  • JobTracker = Single point of failure.
  • Scalability limited to ~4000 nodes.

MapReduce 2 (YARN/MRv2):

  • ResourceManager handles cluster resources.
  • NodeManager supervises tasks on each node.
  • ApplicationMaster manages lifecycle of each job.
  • Highly scalable (10,000+ nodes).
  • Enables non-MapReduce engines (Spark, Flink, Tez).

Key Differences Table:

Feature | MR1 | MR2/YARN
Fault Tolerance | Weak | Strong
Architecture | Monolithic | Modular
Single Point of Failure | Yes | No
Scalability | Limited | Very High
Supported Engines | Only MapReduce | Spark, Hive, Tez, MR, etc.
Task Execution | TaskTracker | NodeManager

YARN made Hadoop a true multi-engine distributed platform, not just a MapReduce system.

16. What is a distributed cache in Hadoop?

The distributed cache is a mechanism to distribute files (like jars, text files, lookup tables) to all nodes executing a MapReduce job.

Use Cases:

  • Distributing machine learning models
  • Distributing lookup dictionaries
  • Distributing libraries (jar files)
  • Side-data required by all mappers/reducers

How it works:

  1. Files are copied to HDFS.
  2. Each worker node downloads the file locally.
  3. Cached files are available in the local filesystem.

Example:

job.addCacheFile(new URI("/config/lookup.txt"));

Benefits:

  • High performance—local read instead of HDFS read
  • Ensures all nodes have identical reference data
  • Avoids distributing data manually

Distributed cache = shared read-only files for MapReduce tasks.
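
On the consumer side, a mapper typically loads the cached file once in setup(). A minimal sketch, assuming the lookup.txt file added with job.addCacheFile() above is exposed under its base name in the task's working directory (YARN normally creates such a symlink):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> lookup = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    // Read the locally cached copy once per task, not once per record.
    try (BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] parts = line.split("\t", 2);   // assumes tab-separated key/value lines
        if (parts.length == 2) {
          lookup.put(parts[0], parts[1]);
        }
      }
    }
  }

  @Override
  protected void map(LongWritable offset, Text value, Context context)
      throws IOException, InterruptedException {
    String enriched = lookup.getOrDefault(value.toString(), "UNKNOWN");
    context.write(value, new Text(enriched));
  }
}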

17. How does Hadoop ensure data locality?

Data locality ensures computation happens on the node where the data resides, reducing network traffic and improving performance.

Techniques Hadoop uses:

  1. InputSplit assignment
    • Mappers run on nodes holding the block whenever possible.
  2. Rack-awareness
    • Scheduler prefers assigning tasks within the same rack.
  3. Delay scheduling
    • Scheduler waits briefly for a preferred node to become free before assigning tasks elsewhere.

Levels of locality:

  1. Data-local
    Task runs on the same node as the data. (Fastest)
  2. Rack-local
    Task runs on same rack but different node.
  3. Off-rack
    Worst-case scenario, increases network traffic.

Benefits:

  • Reduced data movement
  • Lower network congestion
  • Higher overall throughput

Data locality is a cornerstone of Hadoop’s performance.

18. Explain the difference between HDFS block and InputSplit.

These two are often misunderstood but serve completely different purposes.

HDFS Block:

  • A physical division of data stored on DataNodes.
  • Default size: 128 MB.
  • Used for storage, replication, and fault tolerance.
  • Blocks are chunks on disks.

InputSplit:

  • A logical chunk of data used by MapReduce.
  • Defines the scope of one Mapper task.
  • Does NOT store data.
  • Can span multiple blocks.

Key Differences:

Feature | HDFS Block | InputSplit
Purpose | Storage | Processing
Type | Physical | Logical
Size | Fixed (128 MB/256 MB) | Flexible
Replication | Yes | No
Created by | HDFS | InputFormat

Simple analogy:

  • HDFS Block = how data is stored
  • InputSplit = how data is processed

19. How does speculative execution work internally?

Speculative execution handles slow-running tasks (stragglers) by launching duplicate copies to finish the job faster.

When does it trigger?

  • A task runs significantly slower than the average.
  • Cluster has idle resources.
  • Task progress is below a threshold.

Internal mechanism:

  1. Progress monitoring
    • Hadoop compares progress of tasks of the same stage.
  2. Identify slow tasks
    • If a task lags abnormally, Hadoop marks it as a straggler.
  3. Launch duplicate task
    • Another node runs the same task.
  4. Task completion
    • First successful output is accepted.
    • Other task is killed.

Benefits:

  • Avoids long waits caused by slow machines
  • Improves job reliability and speed

Drawbacks:

  • Can waste resources if too many speculative tasks launch
  • Not suitable for non-idempotent tasks

Speculative execution is especially helpful in heterogeneous clusters.
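
Speculative execution is controlled per job through standard properties; a minimal sketch of turning it off for tasks with side effects (the job name is arbitrary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Disable speculative execution for non-idempotent tasks
        // (e.g. tasks that write to an external system); leave it on otherwise.
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);

        Job job = Job.getInstance(conf, "no-speculation-job");
        System.out.println("Map speculation enabled: "
                + job.getConfiguration().getBoolean("mapreduce.map.speculative", true));
    }
}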

20. Explain the small file problem in Hadoop.

Hadoop is designed for large files, not small ones. When many small files exist, it creates several performance issues.

Why small files are a problem:

1. NameNode memory overload

  • NameNode stores metadata for each file.
  • Millions of small files overwhelm NameNode RAM.

2. Inefficient block usage

  • Even a 10 KB file consumes its own block and a full file/block metadata entry in the NameNode, even though it is nowhere near the 128 MB block size.

3. Poor MapReduce performance

  • Each small file creates one InputSplit → one mapper.
  • Too many mappers cause scheduling overhead.

4. Increased job latency

  • Cluster spends more time setting up tasks than processing data.

Solutions to the small file problem:

  1. HAR Files (Hadoop Archives)
    • Combines small files into larger archive files.
  2. SequenceFile
    • Stores small files as key-value entries.
  3. CombineInputFormat
    • Processes multiple files with a single mapper.
  4. Use HBase
    • Designed for storing millions of small records.
  5. Use object stores like S3 / GCS
    • Metadata does not overload a single node.

Summary:

Small files stress NameNode memory and degrade MapReduce performance. The best practice is to combine, archive, or redesign data ingestion to avoid millions of small files.

21. How to solve the small file problem?

The small file problem occurs when Hadoop stores millions of tiny files, causing NameNode memory overload and poor MapReduce performance. Hadoop is optimized for large files (GB–TB), not millions of KB-sized files.

Solutions to mitigate small file issues:

1. Hadoop Archive (HAR Files)

HAR merges many small files into a single large archive while maintaining the directory structure.

  • Reduces NameNode metadata load
  • Ideal for long-term storage (read-heavy workloads)

2. SequenceFile Format

Stores small files as key-value pairs.

  • Key = filename
  • Value = file content
  • Great for MR processing of many small files

3. CombineFileInputFormat / CombineTextInputFormat

Creates one mapper for multiple small files instead of one mapper per file.

  • Boosts MapReduce performance
  • Often used in jobs that read logs or small datasets

4. Use HBase

HBase is optimized for storing millions of small records.

  • Avoids filesystem overhead
  • Provides fast random read/write

5. Use Object Storage (S3, GCS, ADLS)

Object stores do not keep all filesystem metadata in a single server's memory the way the HDFS NameNode does.

  • Suitable for large quantities of small files
  • Reduces NameNode pressure

6. Data Ingestion Strategies

Fix small file creation at the source:

  • Batch data before uploading
  • Merge logs hourly/daily
  • Compress and combine files programmatically

7. Use Spark Structured Streaming or Kafka

Stream small event records into large parquet/ORC data files.

Summary:

The small file problem is solved by combining files, using optimized formats (SequenceFile/ORC/Parquet), using HBase, or shifting to object storage.
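
As a sketch of option 3 above, a driver can switch the job to CombineTextInputFormat so that one mapper reads many small files; the input path and the 128 MB split cap are illustrative placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CombineSmallFilesJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        FileInputFormat.addInputPath(job, new Path("/logs/small-files"));  // placeholder path

        // Pack multiple small files into each split so a single mapper
        // processes several files instead of one mapper per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        // ... set mapper/reducer/output as usual, then job.waitForCompletion(true);
    }
}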

22. What is sequence file format?

A SequenceFile is a binary key-value file format used in Hadoop to efficiently store large numbers of small files or large datasets requiring fast serialization.

Characteristics:

  • Stores data as (key, value) pairs
  • Binary format → more compact than plain text
  • Splittable → suitable for MapReduce
  • Supports compression: record-level, block-level

Why SequenceFile is used:

  1. Efficient storage of small files
    • Stores entire file content as value
    • File path as key
  2. Fast serialization
    • Uses Writable types
    • Much faster than text parsing
  3. Ideal for chaining MapReduce jobs
    • Eliminates overhead of multiple small text files

Use cases:

  • Storage of millions of small images, logs, XML files
  • Intermediate output for MR pipelines
  • Machine learning training data

SequenceFile is a foundational format for performance-optimized Hadoop applications.
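
A minimal sketch of packing local files into a block-compressed SequenceFile, using the filename as key and the raw bytes as value (the /data/packed.seq output path is a placeholder):

import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Key = original filename, Value = raw file bytes (a common convention).
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/data/packed.seq")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
            for (String name : args) {
                byte[] bytes = Files.readAllBytes(Paths.get(name));
                writer.append(new Text(name), new BytesWritable(bytes));
            }
        }
    }
}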

23. What is Avro, and why is it used?

Apache Avro is a row-oriented, binary data serialization format used for efficient data exchange, storage, and RPC communication in Hadoop ecosystems.

Key characteristics:

  • Compact binary format → smaller storage footprint
  • Schema defined in JSON and stored with the data (in the file header) → highly flexible
  • Supports schema evolution → fields can be added/removed
  • Interoperable across languages: Java, Python, C, Ruby

Why Avro is used:

  1. Schema evolution
    • Allows forward and backward compatibility
    • Perfect for long-running data pipelines
  2. Row-based serialization
    • Good for write-heavy workloads
    • Ideal for Kafka message formats
  3. Fast RPC communication
    • Built-in Avro RPC framework
  4. Great for Big Data ingestion
    • Efficient for streaming data
    • Used by Kafka, Flume, NiFi, Spark

Typical use cases:

  • Kafka event data
  • ETL pipelines
  • Metadata serialization
  • Log ingestion

Avro is the backbone of schema-based big data streaming.
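
A small sketch using Avro's generic API; the Event schema below is a made-up example, and the schema is written into the file header alongside the records, which is what enables schema evolution and cross-language reads:

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical two-field record schema (Avro schemas are JSON).
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},"
          + "{\"name\":\"message\",\"type\":\"string\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", 1L);
        record.put("message", "hello");

        // The writer embeds the schema in the file header, then appends rows.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("events.avro"));
            writer.append(record);
        }
    }
}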

24. What is Parquet, and why is it used?

Parquet is a columnar storage file format optimized for analytical workloads in Hadoop and modern big data engines.

Key characteristics:

  • Columnar format → reads only required columns
  • Highly compressed using encodings like RLE, dictionary encoding
  • Splittable → works well with distributed processing
  • Schema stored in metadata
  • Works with Spark, Hive, Impala, Presto, AWS Athena

Why Parquet is used:

  1. High compression ratio
    • Column-wise compression reduces storage cost drastically
  2. Faster analytical queries
    • Only query needed columns, reducing I/O
  3. Efficient for Spark & Hive
    • Accelerates SQL workloads
    • Column pruning and predicate pushdown cut down I/O
  4. Efficient scanning and vectorization

Use cases:

  • Data lakes (S3, ADLS, GCS)
  • BI analytics
  • Machine learning feature stores
  • ETL transformations

Parquet is one of the most efficient formats for modern analytical pipelines.
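
A hedged sketch of writing Parquet from Java via the parquet-avro bridge (assumes the parquet-avro module is on the classpath; the Metric schema and output path are illustrative):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema describing the records to be written.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Metric\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"value\",\"type\":\"double\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("name", "cpu_load");
        record.put("value", 0.73);

        // Columnar layout plus Snappy compression: a typical choice for
        // analytics tables consumed by Hive, Spark, Presto, etc.
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path("metrics.parquet"))
                     .withSchema(schema)
                     .withCompressionCodec(CompressionCodecName.SNAPPY)
                     .build()) {
            writer.write(record);
        }
    }
}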

25. Compare Avro vs Parquet.

Feature                  | Avro                             | Parquet
Storage Type             | Row-based                        | Column-based
Best For                 | Write-heavy workflows, streaming | Read-heavy analytics
Schema Evolution         | Excellent                        | Good (but more complex)
Compression              | Good                             | Excellent
Use Cases                | Kafka, logs, ETL                 | SQL analytics, ML, BI
Performance              | Good for sequential reads        | Best for selective reads
Spark/Hive Compatibility | Yes                              | Yes (preferred)

Summary:

  • Use Avro for data ingestion, logs, event streams, row-based storage.
  • Use Parquet for analytics, reporting, machine learning, and SQL queries.

Together, many pipelines store raw data as Avro and processed data as Parquet.

26. What are Hadoop compression codecs?

Compression codecs define how Hadoop compresses and decompresses files. Compression helps reduce:

  • Storage costs
  • Network transfer time
  • I/O overhead

Hadoop supports both:

  • Splittable codecs (work with MapReduce input splits)
  • Non-splittable codecs (do not allow parallel reads)

Common Hadoop codecs:

Codec   | Splittable       | Remarks
GZip    | No               | High compression, slow
BZip2   | Yes              | Slow but splittable
Snappy  | No               | Very fast, moderate compression
LZO     | Yes (with index) | Fast and splittable
LZ4     | No               | Very fast, low compression
Deflate | No               | Same underlying algorithm as GZip (zlib), without the GZip header

Compression Types in Hadoop:

  1. Record-level compression
    Compresses each record individually.
  2. Block-level compression
    Compresses groups of records.
    → Generally the better trade-off; columnar formats such as Parquet and ORC likewise compress data block-wise (per column chunk).

Compression improves performance when used properly in a Hadoop cluster.
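
As a minimal sketch, compression is usually enabled through job properties and output-format helpers; Snappy is shown here and assumes the native Snappy libraries are available on the cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressionConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output (the shuffled data) with Snappy.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed-output");
        // Compress the final job output as well; block compression is used
        // when the output format is SequenceFileOutputFormat.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job,
                SequenceFile.CompressionType.BLOCK);
    }
}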

27. Difference between Snappy, GZip, and LZO.

Feature           | Snappy                   | GZip             | LZO
Compression Speed | Very fast                | Slowest          | Fast
Compression Ratio | Low                      | Highest          | Medium
Splittable        | No                       | No               | Yes (with index)
Use Cases         | Real-time, Hadoop, Spark | Archival storage | MapReduce performance
CPU Usage         | Low                      | High             | Low

Which one to use?

  • Snappy → Fastest runtime (Spark, Hive)
  • GZip → Best compression (storage savings)
  • LZO → Best balance + splittable (MapReduce jobs)

28. What is Hadoop balancer?

The Hadoop Balancer is a cluster-wide data rebalancing tool used to evenly distribute HDFS data across DataNodes.

Why rebalancing is needed?

Uneven distribution happens when:

  • New nodes are added
  • Nodes recover from downtime
  • Large files are deleted/added

What the balancer does:

  • Moves data blocks from heavily loaded DataNodes to under-loaded ones
  • Ensures uniform disk usage
  • Improves performance and fault tolerance

How it works:

hdfs balancer -threshold 10

  • The threshold defines the acceptable imbalance (percent deviation from the cluster-average disk utilization).
  • The balancer moves blocks slowly and throttles its bandwidth to avoid impacting running jobs.

29. What is distcp, and how does it work?

distcp (Distributed Copy) is a Hadoop tool used for large-scale, parallel copying of directories/files across HDFS clusters or between HDFS and cloud storage.

How distcp works:

  1. Uses MapReduce to parallelize file copying
  2. Each mapper copies a chunk of files
  3. Ensures high-speed, fault-tolerant transfer
  4. Supports resuming interrupted transfers

Use cases:

  • Migrating data between Hadoop clusters
  • Copying petabytes of data across data centers
  • Replicating data to the cloud (S3, ADLS)
  • Disaster recovery setups

Example:

hadoop distcp hdfs://cluster1/data hdfs://cluster2/data

distcp is essential for large-scale data movement at enterprise scale.

30. Explain Hadoop security architecture.

Hadoop security architecture provides mechanisms to secure data access, authentication, and communication across the cluster.

Core Components of Hadoop Security:

1. Kerberos Authentication (Primary)

Hadoop uses Kerberos to ensure only authenticated users access the cluster.

  • Prevents impersonation attacks
  • Ensures secure identity verification

2. HDFS Permissions & ACLs

HDFS supports:

  • POSIX-like permissions
  • Access Control Lists

Allows fine-grained access control on files and directories.

3. Service-Level Authorization

Each Hadoop daemon (NameNode, ResourceManager, etc.) validates access requests.

4. Data Encryption

Hadoop supports:

  • Encryption at rest (HDFS Transparent Encryption Zones)
  • Encryption in transit (TLS/SSL)

Ensures data confidentiality.

5. Hadoop Auditing

Tracks all access events:

  • File read/write events
  • Resource access attempts
  • Authentication logs

Tools like Ranger and Sentry enhance policy management and auditing.

6. Firewall & Network-level Security

  • Isolate Hadoop nodes using VLANs
  • Secure NameNode/UI endpoints

7. Proxy Users & Delegation Tokens

Enable secure submission of jobs on behalf of other users.

Overall Flow:

  1. User authenticates via Kerberos
  2. Authorization checked using ACL/permissions
  3. Tokens issued for MR/YARN operations
  4. Data encrypted and transferred
  5. Audit logs stored

Summary:

Hadoop security combines Kerberos + permissions + encryption + auditing to provide enterprise-grade data protection.

31. What is Kerberos authentication in Hadoop?

Kerberos authentication is the primary security mechanism used in Hadoop to verify user and service identities in a secure and distributed manner. It prevents unauthorized access in Hadoop clusters, which often run across many machines.

Why Hadoop Needs Kerberos:

Hadoop is a distributed environment with multiple services (NameNode, DataNode, ResourceManager, NodeManager, JobHistoryServer). Without strong authentication, attackers could:

  • Impersonate legitimate users
  • Gain unauthorized access to data
  • Submit malicious jobs
  • Disrupt cluster operations

Kerberos ensures mutual authentication—both the client and server verify each other's identity.

How Kerberos Works in Hadoop:

  1. User requests a Ticket-Granting Ticket (TGT)
    • Sends username & password to Key Distribution Center (KDC)
    • KDC validates credentials and issues TGT.
  2. User requests service ticket
    • Uses TGT to request a ticket for a Hadoop service (e.g., NameNode).
  3. User communicates with service
    • Presents service ticket to NameNode / ResourceManager
    • Service verifies the ticket before granting access.
  4. Delegation Tokens
    • Once authenticated, MapReduce/YARN jobs use delegation tokens for subsequent communication.

Benefits of Kerberos in Hadoop:

  • Protects from impersonation attacks
  • Secures all Hadoop services
  • Works seamlessly across distributed nodes
  • Mandatory for production Hadoop clusters

Kerberos is the foundation of Hadoop’s authentication and identity security.
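
For a Java client, Kerberos login is typically done through UserGroupInformation before touching HDFS or YARN; a minimal sketch, where the principal and keytab path are hypothetical values supplied by your KDC administrator:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureHdfsClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Normally picked up from core-site.xml on a secured cluster.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Hypothetical principal and keytab path.
        UserGroupInformation.loginUserFromKeytab(
                "etl-user@EXAMPLE.COM", "/etc/security/keytabs/etl-user.keytab");

        // Subsequent HDFS calls are authenticated with the Kerberos credentials.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Home directory: " + fs.getHomeDirectory());
    }
}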

32. What is HDFS snapshot?

An HDFS snapshot is a read-only, point-in-time copy of a directory or entire filesystem. It enables data protection and quick recovery from accidental deletions or corruption.

Key Features:

  • Extremely space-efficient—uses copy-on-write.
  • Can store thousands of snapshots with minimal overhead.
  • Supports rollback to previous states.

How HDFS Snapshot Works:

  1. An administrator enables snapshots on a directory:

hdfs dfsadmin -allowSnapshot /data

  2. A snapshot is then created:

hdfs dfs -createSnapshot /data snapshot1

  3. When a file is later modified:
    • The original blocks are retained for the snapshot
    • New blocks are written for the new version
  4. Snapshots do not duplicate data unless files change (copy-on-write).

Use Cases:

  • Protecting data from accidental deletes
  • Recovering corrupted datasets
  • Versioning large datasets
  • Backup strategies for ETL pipelines

Snapshots are an essential part of Hadoop’s data protection capabilities.

33. What is HDFS fsck?

fsck (File System Check) is a diagnostic tool for checking the health and integrity of files stored in HDFS.

What HDFS fsck does:

  • Detects corrupted blocks
  • Reports under-replicated or missing blocks
  • Displays block locations
  • Identifies files with replication issues
  • Helps administrators monitor cluster health

Usage Example:

hdfs fsck /user/data -files -blocks -locations

What fsck does NOT do:

  • It does NOT modify or fix corrupted files
  • It does NOT repair the filesystem like Unix fsck
  • It only reports issues (NameNode handles repairs automatically)

Why fsck is important:

fsck helps admins ensure that:

  • All files are safe
  • All blocks are properly replicated
  • No blocks are missing or corrupted

It is a vital operational tool for HDFS maintenance.

34. Explain block report and heartbeat.

DataNodes continually communicate with NameNode to maintain cluster health through:

1. Block Report

A block report is a complete list of all HDFS blocks stored on a DataNode.

  • Sent to NameNode when DataNode starts
  • Re-sent periodically thereafter (interval set by dfs.blockreport.intervalMsec; six hours by default in Hadoop 2.x/3.x)
  • Helps NameNode maintain metadata accuracy

Contains:

  • Block ID
  • Generation stamp
  • Block length
  • Replica state

Purpose:
Ensure NameNode metadata reflects the actual storage layout.

2. Heartbeat

Heartbeat is a periodic signal (every 3 seconds) sent by each DataNode to NameNode.

Why it's important:

  • Confirms that DataNode is alive
  • Reports capacity, used space, and load
  • Helps NameNode schedule tasks on active nodes

If NameNode does not receive heartbeat:

  • After roughly 10 minutes without a heartbeat (10 minutes 30 seconds with default settings) → DataNode is marked dead
  • Replication of lost blocks starts on other nodes

Summary:

  • Heartbeat → Node health
  • Block report → Block inventory

Both are essential for reliability and fault tolerance of HDFS.

35. What is checkpointing in Hadoop?

Checkpointing is the process of merging fsimage and edit logs to create a new filesystem snapshot (fsimage) in the NameNode.

Why checkpointing is important:

  1. Prevents EditLogs from growing too large
    • EditLogs store every HDFS operation
    • Can become huge, slowing NameNode startup
  2. Improves NameNode restart time
    • Smaller logs → faster recovery
  3. Reduces risk of metadata corruption

How checkpointing works:

  • The Checkpoint Node downloads fsimage and edits
  • Merges edits into fsimage
  • Uploads new fsimage to NameNode
  • NameNode replaces old metadata files

This keeps metadata compact and manageable.

36. Explain the role of Checkpoint Node vs Backup Node.

Hadoop introduced two special nodes for metadata management.

1. Checkpoint Node

  • Periodically merges fsimage + edit logs
  • Uploads new fsimage to NameNode
  • Does not keep in-memory namespace
  • Used for reducing NameNode recovery time

Primary purpose: metadata maintenance.

2. Backup Node

  • Same as Checkpoint Node plus:
    • Keeps an in-memory copy of the filesystem namespace
    • Receives edit logs in real-time
    • Acts as a hot standby for NameNode restart (NOT failover)

Primary purpose: faster NameNode recovery after restart.

Key Differences:

Feature                   | Checkpoint Node     | Backup Node
Keeps namespace in memory | No                  | Yes
Real-time edit log sync   | No                  | Yes
Acts as hot standby       | No                  | Yes
Purpose                   | Metadata checkpoint | Faster NN restart

Both are replaced by Standby NameNode in modern HA setups.

37. What is NodeManager local resource management?

The NodeManager is responsible for managing local node resources for YARN containers.

Resources managed:

  1. Memory
  2. CPU cores
  3. Disk spills and logs
  4. Network bandwidth
  5. Local container directories

NodeManager tasks:

  1. Launch containers sent by ApplicationMaster
  2. Monitor container resource usage
  3. Kill containers exceeding resource limits
  4. Report node health to ResourceManager
  5. Manage logs for each container
  6. Handle local caching for files/jars

NodeManager ensures fair resource sharing and prevents rogue tasks from destabilizing the cluster.

38. Explain YARN container lifecycle.

A container is a bundle of resources (CPU + memory) allocated by YARN to execute a task.

Container Lifecycle Steps:

1. Resource request

  • ApplicationMaster requests containers from ResourceManager.

2. Allocation

  • ResourceManager grants containers based on scheduling policies.

3. Launch

  • NodeManager launches a container with:
    • Environment variables
    • Localized resources (jars, configs)
    • Execution commands

4. Execution

  • Actual task (map, reduce, Spark executor) runs inside the container.

5. Monitoring

  • NodeManager tracks resource usage:
    • Memory
    • CPU
    • Health
  • AM monitors task progress.

6. Completion

  • Task finishes, writes results
  • NodeManager releases container resources

7. Cleanup

  • Temporary files removed
  • Logs preserved for history server

The container lifecycle enables resource isolation and efficient job execution.
