Here is a list of questions and answers that can help you prepare for your Big Data Hadoop job interview. Check this page regularly, as it is updated with more questions and answers.
Q. Explain how Hadoop is different from other parallel computing solutions.
Answer:
Hadoop vs Parallel Computing
Hadoop | Parallel Computing Systems |
Has a Master-Slave architecture. | Massively Parallel architecture. |
Fault-Tolerant, Shared Memory. | Independent Memory and Processor Space. |
Centralized Job Distribution. | Random Job Distribution. |
Coordinated resource management. | Self-managed resources and workers. |
Process Structured and Semi-Structured Data. | Process Unstructured Data. |
In parallel computing, different activities happen at the same time, i.e. a single application is spread across multiple processes so that it finishes faster. A smartphone is a familiar example of parallel computing, as it has multiple cores and specialized computing chips. Hadoop is an implementation of the “Map and Reduce” idea, in which a calculation consists of a parallel “map” step followed by a “reduce” step that gathers the results. Hadoop is usually applied to very large volumes of data in a distributed context. The differences between Hadoop and other parallel computing solutions may not be immediately obvious, and the boundary between them is thin.
Q. What are the modes Hadoop can run in?
Answer:
Hadoop can run in three different modes:
Local/Standalone Mode
- This is the default mode, in which Hadoop runs as a single process and no daemons are running.
- This mode is useful for testing and debugging.
Pseudo Distributed Mode
- This mode simulates fully distributed mode on a single machine: all the Hadoop daemons run on that machine, each as a separate process.
- This mode is useful for development.
Fully Distributed Mode
- This mode requires two or more machines set up as a cluster.
- NameNode, DataNode and the other processes run on different machines in the cluster.
- This mode is useful for the production environment.
Q. What will a Hadoop job do if developers try to run it with an output directory that is already present?
Answer:
By default, a Hadoop job will not run if the output directory already exists; it fails with an error. This protects the user: if a new job were accidentally pointed at an existing directory, it would overwrite the output of earlier work and waste all the time and effort spent on it. This is not a real limitation, though, because there are workarounds depending on the requirements, such as writing to a new directory or deleting the old one before the job starts.
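The fail-fast check and one possible workaround can be sketched in plain Python. This is only an analogy on the local file system, not Hadoop code: `run_job` and `run_job_overwriting` are hypothetical helpers invented for the illustration.

```python
import os
import shutil
import tempfile


def run_job(output_dir):
    """Hypothetical job runner: refuses to start if output_dir exists."""
    if os.path.exists(output_dir):
        # Fail fast instead of silently overwriting earlier results.
        raise FileExistsError(f"Output directory {output_dir} already exists")
    os.makedirs(output_dir)
    with open(os.path.join(output_dir, "part-00000"), "w") as f:
        f.write("result\n")


def run_job_overwriting(output_dir):
    """Workaround (assumption): explicitly delete old output, then run."""
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    run_job(output_dir)


base = tempfile.mkdtemp()
out = os.path.join(base, "output")
run_job(out)                      # first run succeeds
try:
    run_job(out)                  # second run hits the existing directory
    second_run_failed = False
except FileExistsError:
    second_run_failed = True
run_job_overwriting(out)          # succeeds after removing the old output
```

The important design point is that deletion is an explicit decision by the developer, never a side effect of submitting a job.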
Q. How can you debug your Hadoop code?
Answer:
Hadoop has a web interface for debugging, and counters can also be used to debug Hadoop code. There are some other simple ways to debug Hadoop code:
The simplest way is to use System.out.println() or System.err.println() statements, available in Java. The easiest way to view the stdout logs is to go to the JobTracker page, click on the completed job, then on the map or reduce task (the task number comes in handy here), then on the task ID, then on the task logs, and finally on the stdout logs.
If your code produces a huge number of logs, there are other ways to debug it:
i) To check the details of failed tasks, set the property keep.failed.task.files in the configuration. Once that is done, you can go to the failed task's directory and run that particular task in isolation, in a single JVM.
ii) Another option is to run the same job on a small cluster with the same input. This keeps all the logs in one place, but we need to make sure that the logging level is set to INFO.
Q. Give some examples of companies that are using Hadoop architecture extensively.
Answer:
- PayPal
- JPMorgan Chase
- Walmart
- Yahoo
- AirBnB
Q. Explain the functioning of the Master-Slave architecture in Hadoop.
Answer:
Hadoop applies the Master-Slave architecture at both the storage level and the computation level.
Master: NameNode
- Manages the file system namespace and controls client access to files.
- Executes file system namespace operations such as opening, closing and renaming files and directories.
Slave: DataNode
- A cluster usually has one DataNode per node. The file system namespace is exposed to clients, which allows them to store files. Each file is split into one or more blocks, which are stored on the DataNodes. A DataNode is responsible for creating, deleting and replicating blocks as and when instructed by the NameNode.
Q. What is distributed cache and what are its benefits?
Answer:
The execution of a job often requires various files, jars and archives. Distributed Cache is a mechanism provided by the MapReduce framework that copies the required files to the slave nodes before any task of the job starts executing there. All the required files are copied only once per job.
The efficiency we gain from Distributed Cache is that the necessary files are already on a node before any task of the job runs on it. In addition, all the cluster machines can use this cache as a local file system. However, in rare circumstances it is better for the tasks to use standard HDFS I/O to copy the files instead of depending on the Distributed Cache. For instance, if an application has very few reducers and requires huge artifacts (larger than 512 MB) in the distributed cache, it is better to opt for HDFS I/O.
Q. How do you benchmark your Hadoop Cluster with Hadoop tools?
Answer:
There are a couple of benchmarking tools that come with Hadoop distributions, such as TestDFSIO, nnbench, mrbench, TeraGen and TeraSort.
- TestDFSIO can be used to stress test the cluster for network bottlenecks and I/O-related performance issues.
- NNBench can be used to load test the NameNode by creating, deleting and renaming files on HDFS.
- Once the cluster passes the TestDFSIO tests, the TeraSort benchmarking tool can be used to test the configuration. Yahoo used TeraSort to set a record by sorting 1 PB of data in 16 hours on a cluster of 3800 nodes.
Q. Explain the major difference between an HDFS block and an InputSplit.
Answer:
An HDFS block is a physical chunk of data on disk. For example, if the block size is 64 MB and each record is 50 MB, then record 1 fits in block 1, but record 2 does not fit completely in the remaining space and ends in block 2. An InputSplit is a logical chunk: it is a Java class that points to the start and end location of the data in the blocks, so that a record spanning two blocks can still be read as a whole.
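The arithmetic behind this example can be checked with a small sketch, assuming 64 MB blocks and records of exactly 50 MB each stored back to back. The function name and the fixed record size are assumptions made for the illustration; real records vary in size.

```python
BLOCK_SIZE_MB = 64    # block size used in the example
RECORD_SIZE_MB = 50   # assumed size of every record


def blocks_for_record(record_no, record_size=RECORD_SIZE_MB,
                      block_size=BLOCK_SIZE_MB):
    """Return the 1-based block numbers that hold record `record_no` (1-based)."""
    start = (record_no - 1) * record_size   # first MB occupied by the record
    end = start + record_size - 1           # last MB occupied by the record
    first_block = start // block_size + 1
    last_block = end // block_size + 1
    return list(range(first_block, last_block + 1))


record_1 = blocks_for_record(1)   # fits entirely in block 1
record_2 = blocks_for_record(2)   # starts in block 1, ends in block 2
```

A logical split that covers record 2 therefore has to reference data in two physical blocks.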
Q. Does HDFS make block boundaries between records?
Answer:
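No. HDFS splits a file into blocks purely by size and has no knowledge of record boundaries, so a record can start in one block and end in the next. It is the InputSplit and the RecordReader that make sure each record is read as a whole.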
Q. What is streaming access?
Answer:
With streaming access, data is not fetched as small chunks at arbitrary positions but flows in at a steady rate. The application starts reading data from the beginning of the file and continues in a sequential manner.
Q. What do you mean by “Heartbeat” in HDFS?
Answer:
All slaves periodically send a signal to their respective master node to show that they are alive, i.e. each DataNode sends a signal to the NameNode, and each TaskTracker sends a signal to the JobTracker. These signals are the “heartbeat”.
Q. Suppose 10 HDFS blocks are to be copied from one machine to another, but the other machine has space for only 7.5 blocks. Is there a possibility of the blocks being broken down during replication?
Answer:
No, the blocks cannot be broken down. It is the responsibility of the master node to calculate the space required and allocate the blocks accordingly. The master node monitors the number of blocks in use and keeps track of the available space.
Q. What is Speculative execution in Hadoop?
Answer:
Hadoop works by dividing a job into smaller tasks, which are then spread over the nodes for processing. Some of the nodes may be slow, and a few slow tasks can hold up the whole job. To counter this, as most of the tasks finish, Hadoop creates redundant copies of the remaining tasks and assigns them to nodes that are not executing any other task, so that multiple copies of the same task run on different nodes. This process is referred to as Speculative Execution. As soon as one copy of a task finishes, Hadoop stops all the other nodes that are still processing that task.
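The effect on job completion time can be illustrated with a toy model. It is a sketch under simplifying assumptions, not Hadoop's actual scheduling policy: a task counts as slow when it needs more than `slow_factor` times the median duration, and its backup attempt is assumed to start once a typical task has finished.

```python
from statistics import median


def job_time(task_durations, backup_duration=None, slow_factor=1.5):
    """Return the job's completion time in seconds.

    task_durations: seconds needed by each task's original attempt.
    backup_duration: seconds a backup attempt needs on an idle node;
                     None disables speculation (hypothetical parameter).
    """
    typical = median(task_durations)
    finish_times = []
    for duration in task_durations:
        if backup_duration is not None and duration > slow_factor * typical:
            # The backup starts at time `typical`; whichever attempt finishes
            # first wins and the other attempt is killed.
            duration = min(duration, typical + backup_duration)
        finish_times.append(duration)
    return max(finish_times)   # the job ends when its last task ends


durations = [10, 11, 12, 60]                           # one straggler
without_speculation = job_time(durations)
with_speculation = job_time(durations, backup_duration=10)
```

In this model the straggler no longer dictates the job's finish time, at the cost of running one task twice.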
Q. What is WebDAV in Hadoop?
Answer:
WebDAV is a set of extensions to HTTP that supports editing and updating files. On most operating systems, WebDAV shares can be mounted as file systems, so exposing HDFS over WebDAV allows HDFS to be accessed just like any other standard file system.
Q. What is fault tolerance in HDFS?
Answer:
HDFS assumes that the data stored on any individual node is unreliable, so it keeps copies of each block on several nodes; if one node fails, the data can still be read from another copy. On the processing side, when a node fails, the work that ran on it has to be executed again, even if the fault occurs after the map phase. Keeping a backup of the intermediate key-value pairs improves performance at failure time, because the job can be resumed from them instead of being restarted from the beginning.
Q. How are HDFS blocks replicated?
Answer:
The default replication factor is 3, which means that three copies of each block are kept. HDFS follows a simple procedure to place them. The first replica is stored on the machine on which the client is running, the second replica is created on a machine in a randomly chosen rack (different from the first rack), and the third replica is created on a randomly selected machine in the same remote rack as the second replica.
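The placement procedure described above can be sketched as follows. The topology dictionary, node names and the `place_replicas` helper are hypothetical, and the sketch assumes that the third replica goes to a different machine than the second and that the client runs on a node of the cluster.

```python
import random


def place_replicas(client_node, topology, rng=random):
    """Pick three (rack, node) locations for a block.

    topology: {rack_name: [node_name, ...]} (illustrative structure).
    """
    local_rack = next(rack for rack, nodes in topology.items()
                      if client_node in nodes)
    first = (local_rack, client_node)            # replica 1: the client's node

    remote_rack = rng.choice([r for r in topology if r != local_rack])
    second_node = rng.choice(topology[remote_rack])
    second = (remote_rack, second_node)          # replica 2: a different rack

    third_node = rng.choice([n for n in topology[remote_rack]
                             if n != second_node])
    third = (remote_rack, third_node)            # replica 3: same remote rack
    return [first, second, third]


topology = {
    "rack1": ["n1", "n2"],
    "rack2": ["n3", "n4"],
    "rack3": ["n5", "n6"],
}
replicas = place_replicas("n1", topology, random.Random(0))
```

Keeping two replicas on one remote rack limits cross-rack traffic while still surviving the loss of an entire rack.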
Q. Which command is used to do a file system check in HDFS?
Answer:
hadoop fsck <path> (for example, hadoop fsck / checks the entire file system)
Q. Explain the different types of “writes” in HDFS.
Answer:
There are two different types of asynchronous writes: posted and non-posted. In posted writes we do not wait for an acknowledgement after we write, whereas in non-posted writes we wait for the acknowledgement after we write, which is more expensive in terms of I/O and network bandwidth.
Q. What is a NameNode and what is a DataNode?
Answer:
A DataNode is where the data actually resides before any processing takes place. The NameNode is the master node that holds the file system metadata: which file maps to which blocks, and which blocks are stored on which DataNode.
Q. What is Shuffling in MapReduce?
Answer:
The process of transferring data (outputs) from mapper to the reducers is known as Shuffling.
Q. How does NameNode tackle DataNode failures?
Answer:
All the DataNodes periodically send a notification, a.k.a. a Heartbeat signal, to the NameNode, which implies that the DataNode is alive. Apart from the Heartbeat, the NameNode also receives a Block report from each DataNode, which lists all the blocks on that DataNode. If the NameNode stops receiving heartbeats from a DataNode, it marks that DataNode as dead. As soon as a DataNode is marked as dead, the NameNode initiates the transfer of the blocks it held to other DataNodes, copying them from the remaining replicas.
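A minimal sketch of this bookkeeping, assuming a simple fixed timeout, is shown below. `HeartbeatMonitor`, `under_replicated`, the 30-second timeout and the node and block names are all made up for the illustration and do not reflect the NameNode's real data structures or defaults.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time of each DataNode (toy model)."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def dead_nodes(self, now):
        # A node is considered dead once it has been silent for too long.
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout)


def under_replicated(block_locations, dead, replication=3):
    """Return {block: number of missing replicas} after removing dead nodes."""
    missing = {}
    for block, nodes in block_locations.items():
        live = [n for n in nodes if n not in dead]
        if len(live) < replication:
            missing[block] = replication - len(live)
    return missing


monitor = HeartbeatMonitor(timeout_seconds=30)
for node in ("dn1", "dn2", "dn3", "dn4"):
    monitor.heartbeat(node, now=0)
for node in ("dn1", "dn3", "dn4"):
    monitor.heartbeat(node, now=25)        # dn2 stays silent

dead = monitor.dead_nodes(now=40)
blocks = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn1", "dn3", "dn4"]}
to_copy = under_replicated(blocks, dead)
```

Blocks listed in `to_copy` are the ones that would need a new replica created from a surviving copy.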
Q. What is InputFormat in Hadoop?
Answer:
As the name suggests, InputFormat specifies how data is read from files into an instance of the Mapper. There are various implementations of InputFormat, for example for reading text files and binary data, and we can even create our own custom InputFormat implementation. Another important job of InputFormat is to split the data and provide the inputs to the map tasks.
Q. What is the purpose of RecordReader in Hadoop?
Answer:
RecordReader is a class that loads the data from files and converts it into the key-value pair format required by the mapper. The InputFormat class instantiates the RecordReader after it splits the data.
Q. What is InputSplit in MapReduce?
Answer:
InputSplit is a Java class that represents the logical chunk of data processed by one map task; it points to the start and end location of that data in the block.
Q. In Hadoop, if a custom partitioner is not defined, how is data partitioned before it is sent to the reducer?
Answer:
In this case the default partitioner (HashPartitioner) is used, which hashes each key and uses the hash value to assign the record to a reducer.
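The idea of hash partitioning can be sketched in Python. Hadoop's default HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks` in Java; the sketch below imitates that for string keys by emulating Java's `String.hashCode()`. The helper names are invented for the illustration.

```python
def java_string_hash(s):
    """Emulate Java's String.hashCode() using 32-bit signed arithmetic."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF     # keep the low 32 bits
    return h - 0x100000000 if h >= 0x80000000 else h


def hash_partition(key, num_reducers):
    """Return the reducer index (0-based) for a string key."""
    # Clearing the sign bit guarantees a non-negative value before the modulus.
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


partitions = {key: hash_partition(key, 3) for key in ["apple", "banana", "apple"]}
```

Because the partition depends only on the key, every record with the same key reaches the same reducer.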
Q. What is the replication factor in Hadoop, and what is the default replication factor Hadoop comes with?
Answer:
To be fault tolerant, HDFS replicates each data block on several nodes in the cluster. The replication factor is the property that decides how many copies of each block are created. With a replication factor of n, HDFS keeps n copies of each block, i.e. n - 1 copies in addition to the original. The default replication factor is 3, which means it keeps 2 extra copies of the data, for 3 in total.
Q. What is a SequenceFile in Hadoop? Explain its importance.
Answer:
A SequenceFile can be treated as a container, similar to a zip archive, for storing small files. For example, if we have metadata (filename, path, size etc.), we can store it in a SequenceFile as key/value pairs. It is useful when the data consists of a large number of small files, because creating mappers and reducers for each small file would be an overhead. Storing the files individually would also increase the memory overhead on the NameNode, as it has to keep information about a huge number of small files.
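Q. Explain the different parameters of the mapper and reducer functions.
Answer:
The Mapper and Reducer classes each take four basic type parameters:
- KEYIN: the type of the input key
- VALUEIN: the type of the input value
- KEYOUT: the type of the output key
- VALUEOUT: the type of the output value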
Q. How can you set an arbitrary number of mappers and reducers for a Hadoop job?
Answer:
The number of mappers is calculated by Hadoop from the input splits, which are based on the DFS block size. The conf.setNumMapTasks(int num) function can be used to ask for more mappers, but it is not possible to set the number to a lower value than the one calculated by Hadoop. The number of reducers can be set with conf.setNumReduceTasks(int num).
When running the jar from the command line, use the following options to set the number of mappers and reducers: "-D mapred.map.tasks=4" and "-D mapred.reduce.tasks=3"
These options request 4 mappers and 3 reducers for the job.
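How the calculated number of mappers follows from file and block sizes can be approximated with a short sketch. It assumes one map task per split, a split size equal to a 64 MB block, and that each file is split independently; it deliberately ignores finer details of the real input format implementation, so treat the result as an estimate.

```python
import math


def estimated_map_tasks(file_sizes_mb, split_size_mb=64):
    """Estimate the number of map tasks for a list of input file sizes (MB)."""
    return sum(math.ceil(size / split_size_mb)
               for size in file_sizes_mb if size > 0)


single_file = estimated_map_tasks([200])        # splits of 64 + 64 + 64 + 8 MB
two_files = estimated_map_tasks([64, 10])       # one split per file
```

A small file still occupies a whole map task, which is one reason many small input files are inefficient.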
Q. How many Daemon processes run on a Hadoop System?
Answer:
Five daemon processes run on a Hadoop system, of which three run on the master node and two run on each slave node.
On the master node, the three daemon processes are:
1) NameNode: maintains the metadata for HDFS.
2) JobTracker: manages MapReduce jobs.
3) Secondary NameNode: performs housekeeping functions for the NameNode.
On a slave node, the two daemon processes are:
4) DataNode: stores the actual HDFS data blocks before any processing is done.
5) TaskTracker: instantiates and monitors each MapReduce task.
Three more daemon processes were added in Hadoop 2.x.
Q. What happens if the number of reducers is 0?
Answer:
When the number of reducers is set to 0, no reducers run and there is no sort and shuffle phase. The output of each mapper is written directly to a separate file on HDFS.
Q. What is meant by Map-side and Reduce-side join in Hadoop?
Answer:
Map-Side Join
As the name suggests, a join performed on the mapper side is termed a Map-Side Join. To perform this join, the datasets have to be partitioned equally and sorted by the same key, and all records for a given key must be in the same partition.
Reduce-Side Join
It is much simpler than a Map-Side Join, because the input does not have to be sorted and partitioned in advance; the reducers receive the data already grouped by key. Reduce-Side Joins are not as efficient as Map-Side Joins, because all the data has to go through the sort and shuffle phase.
The join implementation depends on the size of the datasets and how they are partitioned. If a dataset is too large to be distributed across all the nodes in the cluster, or too small to be worth distributing, the Side Data Distribution technique is used.
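The reduce-side approach can be sketched in plain Python by simulating the three phases in memory. This is an illustrative model of an inner join, not Hadoop code; the function name and sample datasets are made up.

```python
from collections import defaultdict


def reduce_side_join(left, right):
    """Inner-join two lists of (key, value) pairs, reduce-side style."""
    # Map phase: tag every record with its source dataset.
    tagged = [(k, ("L", v)) for k, v in left] + [(k, ("R", v)) for k, v in right]

    # Shuffle phase: group all tagged records by join key.
    groups = defaultdict(list)
    for key, tagged_value in tagged:
        groups[key].append(tagged_value)

    # Reduce phase: pair every left value with every right value per key.
    joined = []
    for key in sorted(groups):
        lefts = [v for tag, v in groups[key] if tag == "L"]
        rights = [v for tag, v in groups[key] if tag == "R"]
        joined.extend((key, l, r) for l in lefts for r in rights)
    return joined


users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
rows = reduce_side_join(users, orders)
```

Every record from both datasets passes through the shuffle here, which is the cost the answer above refers to.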
Q. How can the NameNode be restarted?
Answer:
- Run /etc/init.d/hadoop-namenode stop
- Then run /etc/init.d/hadoop-namenode start
Q. Hadoop attains parallelism by spreading tasks across various nodes, so a few slow nodes can rate-limit the rest of the program and slow it down. What method does Hadoop provide to combat this?
Answer:
This can be achieved using the Speculative Execution mechanism.
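Q. What is the significance of conf.setMapperClass?
Answer:
conf.setMapperClass sets the mapper class for the job, and with it everything related to the map phase: how the data is read and how key-value pairs are generated by the mapper.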
Q. What are combiners and when are these used in a MapReduce job?
Answer:
When the output of a map task is huge, transferring it over the network can slow down the whole job. Hadoop has the concept of combiners, also known as semi-reducers, to address this. A combiner summarizes the map output records that share the same key and passes the summary to the reducers for further processing.
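The saving can be seen in a small word-count sketch. It is a plain-Python simulation in which `map_words` and `combine` are hypothetical stand-ins for a mapper and a combiner; it assumes the reduce operation is a sum, which can safely be applied to partial results.

```python
from collections import Counter


def map_words(line):
    """Mapper stand-in: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner stand-in: sum the counts per key before the shuffle."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


map_output = map_words("to be or not to be")   # 6 records
combined = combine(map_output)                 # 4 records sent to reducers
```

Only the combined records would cross the network, and the reducers still arrive at the same final counts.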
Q. How does a DataNode know the location of the NameNode in Hadoop cluster?
Answer:
The NameNode location is specified in the configuration files (conf/*-site.xml) on each DataNode.
Q. How can you check whether the NameNode is working or not?
Answer:
Use the command /etc/init.d/hadoop-0.20-namenode status, or use the 'jps' command.
Q. When doing a join in Hadoop, you notice that one reducer is running for a very long time. How will you address this problem in Pig?
Answer:
To address this problem, we can use the Skew Join of Pig. A skew join identifies the keys that have a disproportionately large number of records, splits the records for those keys, and passes them through different reducers on the cluster. For the rest of the dataset, a regular standard join is performed.
Q. Are there any problems that can be solved only by MapReduce and not by Pig? In which scenarios are MR jobs more useful than Pig?
Answer:
There are several scenarios in which MapReduce is the better choice. Basically, wherever a custom partitioner is required, MapReduce is more useful than Pig, as early versions of Pig do not allow one.
Q. Give an example scenario on the usage of counters.
Answer:
The power of Hadoop lies in splitting a job and executing it on different nodes in the cluster, where the cluster can comprise any number of nodes. If records fail in any of the MapReduce phases, it becomes very difficult to check how many records have failed. Although Hadoop provides exhaustive logging, it is difficult to check the logs on individual nodes. In these scenarios Counters come in handy: whenever a record fails, the developer just has to increment a counter. The main advantage of using a counter is that it provides the total value for the whole job.
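The scenario can be simulated in a few lines of Python. The counter names, the `run_map_task` helper and the input splits are invented for the illustration; in a real job the framework performs the aggregation across tasks.

```python
from collections import Counter


def run_map_task(records):
    """Hypothetical map task: parse integers and count bad records."""
    counters = Counter()
    parsed = []
    for record in records:
        try:
            parsed.append(int(record))
            counters["GOOD_RECORDS"] += 1
        except ValueError:
            counters["BAD_RECORDS"] += 1   # increment instead of logging
    return parsed, counters


# Three splits, imagined as being processed on three different nodes.
splits = [["1", "2", "x"], ["3", "oops", "?"], ["4"]]

job_counters = Counter()
for split in splits:
    _, task_counters = run_map_task(split)
    job_counters.update(task_counters)     # sum per-task counters job-wide
```

The job-wide totals answer "how many records failed?" without anyone opening the logs on individual nodes.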
Q. Explain the difference between ORDER BY and SORT BY in Hive?
Answer:
ORDER BY
This operation is performed on the complete query result set. The whole data set has to be passed to a single reducer to perform this operation, which may slow down the whole process.
SORT BY
We can treat this as a local ORDER BY operation within each reducer. SORT BY sorts the data inside each reducer, but the dataset as a whole is not ordered.
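The difference can be illustrated with a small simulation. It is a sketch in plain Python, not Hive: rows are handed to the reducers round-robin, which is an assumption made only to keep the example deterministic.

```python
def sort_by(rows, num_reducers):
    """SORT BY stand-in: each reducer sorts only the rows it receives."""
    buckets = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        buckets[i % num_reducers].append(row)   # assumed round-robin spread
    return [sorted(bucket) for bucket in buckets]


def order_by(rows):
    """ORDER BY stand-in: a single reducer sorts the complete result set."""
    return sorted(rows)


rows = [5, 3, 8, 1, 9, 2]
per_reducer = sort_by(rows, num_reducers=2)
total = order_by(rows)
concatenated = [row for bucket in per_reducer for row in bucket]
```

Each reducer's output is sorted, but concatenating the outputs does not give a globally sorted result, whereas the single-reducer ORDER BY does.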