10 Essential Hadoop Interview Questions

Essential questions, comprehensively sourced, that the best Hadoop developers and engineers can answer. Driven by our community, we encourage experts to submit questions and provide feedback.


Interview Questions

1.

How can one define custom input and output data formats for MapReduce jobs?


Hadoop MapReduce has built-in support for many common file formats, such as SequenceFile. To support custom formats, one has to implement the InputFormat and OutputFormat Java interfaces for reading and writing, respectively.

A class implementing InputFormat (and similarly OutputFormat) should implement the logic for splitting the data, as well as the logic for reading records out of each split. The latter should be an implementation of the RecordReader (and RecordWriter) interfaces.

Implementations of InputFormat and OutputFormat may retrieve data by means other than from files on HDFS. For instance, Apache Cassandra ships with implementations of InputFormat and RecordReader.
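The division of labor between splitting and record reading can be illustrated outside Hadoop entirely. The toy Python below is not the Hadoop API (the function names are invented); it mirrors the two responsibilities: producing splits, and reading whole records out of a single split even when a record crosses a split boundary:

```python
# Toy sketch (not the Hadoop API): a "format" has two jobs --
# produce splits, and read records out of one split.

def get_splits(data: bytes, split_size: int):
    """Divide the input into fixed-size byte ranges (analogous to getSplits)."""
    return [(start, min(start + split_size, len(data)))
            for start in range(0, len(data), split_size)]

def read_records(data: bytes, split):
    """Read newline-delimited records from one split (analogous to a RecordReader).

    A record that starts inside the split is read to completion even if it
    crosses the split boundary; a record that started before the split is
    skipped, because the previous reader owns it.
    """
    start, end = split
    if start > 0:  # skip the partial record at the head of the split
        nl = data.find(b"\n", start)
        start = len(data) if nl == -1 else nl + 1
    pos = start
    while pos < end:
        nl = data.find(b"\n", pos)
        record = data[pos:] if nl == -1 else data[pos:nl]
        yield record.decode()
        pos = len(data) if nl == -1 else nl + 1

data = b"alpha\nbeta\ngamma\n"
records = [r for s in get_splits(data, 7) for r in read_records(data, s)]
# Every record is read exactly once, despite splits cutting records mid-line.
```

The skip-then-read-past-the-boundary convention is the same idea that makes Hadoop's line-based text input work across block boundaries.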

2.

What is HDFS?


HDFS (Hadoop Distributed File System) is a distributed file system and a core part of the Hadoop software collection. HDFS attempts to abstract away the complexities of a distributed file system, including replication, high availability, and hardware heterogeneity.

The two major components of HDFS are the NameNode and a set of DataNodes. The NameNode exposes the filesystem API, persists metadata, and orchestrates replication amongst the DataNodes.

MapReduce natively uses HDFS's data-locality API to schedule MapReduce tasks to run where the data resides.
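The scheduling idea behind data locality is simple to sketch: given the DataNodes that hold a block's replicas, prefer running the task on one of them. The Python below is a hypothetical illustration (the block and node names are made up), not Hadoop's actual scheduler:

```python
# Hypothetical sketch of locality-aware task placement.
block_locations = {
    # block id -> DataNodes holding a replica (made-up names)
    "part-0000/block-1": ["node-a", "node-b", "node-c"],
}

def pick_task_host(block: str, nodes_with_free_slots: set) -> str:
    """Prefer a node that holds a replica of the block (node-local read)."""
    for host in block_locations.get(block, []):
        if host in nodes_with_free_slots:
            return host                        # task reads from local disk
    return sorted(nodes_with_free_slots)[0]    # fall back to a remote read

assert pick_task_host("part-0000/block-1", {"node-b", "node-z"}) == "node-b"
```

When no replica host has a free slot, real schedulers also consider rack-local placement before falling back to an arbitrary node; that middle tier is elided here.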

3.

What read and write consistency guarantees does HDFS provide?


Even though data is distributed amongst multiple DataNodes, the NameNode is the central authority for file metadata and replication (and thus a single point of failure). The configuration parameter dfs.namenode.replication.min defines how many replicas a block must be written to before the write is reported as successful.
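For illustration, the replication target and the minimum acknowledged replica count might be set in hdfs-site.xml like this (the values shown are examples, not recommendations):

```xml
<!-- hdfs-site.xml (illustrative values) -->
<property>
  <name>dfs.replication</name>
  <value>3</value> <!-- target number of replicas per block -->
</property>
<property>
  <name>dfs.namenode.replication.min</name>
  <value>1</value> <!-- a write succeeds once this many replicas are on disk -->
</property>
```

With these values, a write returns successfully once one replica is durable, and HDFS replicates the block up to three copies asynchronously.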

4.

What is the MapReduce programming paradigm, and how can it be used to design parallel programs?

MapReduce is a programming model for implementing parallel programs that run over a set of distributed machines. "Hadoop MapReduce" is an implementation of the MapReduce model.

Input and output data in MapReduce are modeled as records of key-value pairs.

Central to MapReduce are map and reduce programs, reminiscent of map and reduce in functional programming. They transform the data in two phases, each of which runs in parallel and scales linearly.

The map function takes each key-value pair and outputs a list of key-value pairs. The reduce function takes, for each key, the collection of all values emitted across the outputs of all map invocations and reduces them to a single final value.

MapReduce integrates with HDFS to provide data locality for the data it processes. For sufficiently large data, it is better to send the map or reduce program to run where the data resides than to bring the data to the program.

Hadoop's implementation of MapReduce provides native support for JVM runtimes, and extends support to other runtimes that communicate via standard input/output.
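The model itself can be simulated in a few lines of ordinary Python. The word count below is a single-process conceptual sketch, not Hadoop code; the map phase, the shuffle (grouping by key), and the reduce phase correspond to the stages described above:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    # Input record: (line_number, line); emit a (word, 1) pair per word.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # All counts emitted for one word are reduced to a single total.
    return (key, sum(values))

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase (parallelizable per record).
    intermediate = [kv for k, v in records for kv in map_fn(k, v)]
    # Shuffle: group intermediate pairs by key.
    intermediate.sort(key=itemgetter(0))
    grouped = groupby(intermediate, key=itemgetter(0))
    # Reduce phase (parallelizable per key).
    return [reduce_fn(k, [v for _, v in pairs]) for k, pairs in grouped]

lines = [(0, "the quick brown fox"), (1, "the lazy dog")]
result = run_mapreduce(lines, map_fn, reduce_fn)
# → [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

In Hadoop, each stage would run across many machines and the shuffle would move data over the network; the dataflow, however, is exactly this.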

5.

What are common data serialization formats for storing data in HDFS, and what are their properties?

HDFS can store any type of file regardless of format; however, certain properties make some file formats better suited to distributed computation.

HDFS organises and distributes files in blocks of fixed size. For example, given a 128MB block size, a 257MB file is split into three blocks. As a result, records at block boundaries may be split. File formats designed to be consumed split-wise, also called "splittable," include "sync markers" between groups of records so that any contiguous chunk of the file can be consumed. Furthermore, compression may be desired in conjunction with splittability.
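The block arithmetic behind the example is worth making explicit; a quick sketch:

```python
import math

# A 257 MB file with a 128 MB block size occupies three blocks,
# the last of which is only partially full.
block_size = 128 * 1024 * 1024
file_size = 257 * 1024 * 1024

num_blocks = math.ceil(file_size / block_size)            # 3
last_block_bytes = file_size - (num_blocks - 1) * block_size  # 1 MB
```

Note that a partially filled last block consumes only as much disk as the data it holds; the block size is an upper bound, not an allocation unit.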

Support for compression is especially important because it trades off IO against CPU resources. A compressed file is quicker to load from disk but takes extra CPU time to decompress.

CSV files, for example, are splittable because they contain a line delimiter between records. However, they are ill-suited to binary data and do not support compression.

The SequenceFile format, native to the Hadoop ecosystem, is a binary format that stores key-value records, is splittable, and supports compression at the block and record levels.

Apache Avro, a data serialization and RPC framework, defines the Avro Object Container File format that stores Avro-encoded records. It is both splittable and compressible. Having also a flexible schema definition language, it’s widely used.

The Parquet file format, another Apache project, supports columnar data, where fields belonging to each column are stored efficiently together.

6.

What availability guarantees does HDFS provide?


HDFS relies on the NameNode to store the metadata about which DataNodes store which blocks. Since the NameNode runs on a single node, it is a single point of failure, and its failure makes HDFS unavailable.

A Standby NameNode can be configured as a failover target to achieve high availability. To make this possible, the Active NameNode streams a log of mutations to a group of JournalNodes, from which the Standby NameNode receives the latest changes to the filesystem metadata.

Automatic failover between the Active and Standby NameNodes can be configured by maintaining an ephemeral lock on a quorum of a ZooKeeper cluster. A failover controller process on each NameNode is responsible for checking the health of its NameNode, for maintaining the ephemeral lock, and for executing a fencing mechanism that ensures that, upon failover, the previous Active NameNode does indeed act passively.
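The ephemeral-lock mechanics can be caricatured in plain Python. ZooKeeper, sessions, quorums, and fencing are all elided here; the class and node names are invented. The key property being modeled is that the lock vanishes when its holder's session dies, which lets the standby take over:

```python
class EphemeralLock:
    """Toy stand-in for a ZooKeeper ephemeral znode used for leader election."""

    def __init__(self):
        self.holder = None

    def try_acquire(self, node: str) -> bool:
        # First caller wins; everyone else sees the lock as taken.
        if self.holder is None:
            self.holder = node
        return self.holder == node

    def holder_died(self):
        # An ephemeral lock disappears when its owner's session ends.
        self.holder = None

lock = EphemeralLock()
assert lock.try_acquire("nn-1")        # nn-1's controller wins: nn-1 is Active
assert not lock.try_acquire("nn-2")    # nn-2 stays Standby
lock.holder_died()                     # Active NameNode (or its session) dies
assert lock.try_acquire("nn-2")        # nn-2 acquires the lock and fails over
```

In the real system the failover controller additionally fences the old Active (e.g., by revoking its ability to write to the JournalNodes) before promoting the Standby.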

7.

What’s the purpose of Hadoop Streaming and how does it work?


Hadoop Streaming is an extension of Hadoop's MapReduce API that makes it possible to write map and reduce programs in runtimes other than the JVM. Hadoop Streaming defines an interface in which data flows through the standard output and standard input streams provided by operating systems (hence the name).
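A streaming mapper is just a program that reads lines on stdin and writes tab-separated key-value lines on stdout. The sketch below shows a word-count mapper in that style; for testability it operates on in-memory streams, but the same function wired to sys.stdin/sys.stdout is what Hadoop Streaming would invoke (the exact launch command depends on your distribution):

```python
import io

def mapper(stdin, stdout):
    """Streaming-style mapper: emit one 'word<TAB>1' line per input word."""
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

# In a real job this script would read sys.stdin and write sys.stdout;
# here we demonstrate with in-memory streams instead.
out = io.StringIO()
mapper(io.StringIO("to be or not to be\n"), out)
# out.getvalue() → "to\t1\nbe\t1\nor\t1\nnot\t1\nto\t1\nbe\t1\n"
```

The companion reducer would read these lines back (sorted by key, courtesy of the shuffle) and sum the counts per word.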

8.

What is speculative execution and when can it be used?


A MapReduce program may translate into multiple invocations of mapper and reducer tasks on different HDFS DataNodes. If a task is slow to respond, MapReduce "speculatively" runs the same task on another replica, as the first node might be overloaded or faulty.

For speculative execution to work correctly, tasks need to have no side effects, or, if they do, the side effects need to be "idempotent." A side-effect-free task is one that, besides producing the expected output, does not mutate any external state (such as writing into a database). Idempotence here means that even if a side effect is applied repeatedly (due to speculative execution), it does not change the end result. Nevertheless, side effects are generally undesirable in MapReduce tasks, speculative execution aside.
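The contrast between a non-idempotent and an idempotent side effect can be shown with the "database" modeled as plain Python containers (a deliberately simplified stand-in):

```python
append_log = []   # non-idempotent sink: each run adds another entry
kv_store = {}     # idempotent sink: re-running overwrites with the same value

def task(record):
    append_log.append(record)        # duplicated under speculative execution
    kv_store[record["id"]] = record  # safe to apply twice

task({"id": 7, "value": "x"})
task({"id": 7, "value": "x"})        # the speculative duplicate
# append_log now holds 2 entries, but kv_store still holds exactly one.
```

Keyed overwrites (and, in real Hadoop, committing task output atomically only for the attempt that finishes first) are what make duplicated execution harmless.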

9.

What is the “small files problem” with Hadoop?


The NameNode is the registry for all metadata in HDFS. Although journaled on disk, the metadata is served from memory and is therefore bounded by it. Because every file, directory, and block occupies an in-memory entry regardless of its size, a large number of small files exhausts NameNode memory long before raw storage capacity runs out. Moreover, the NameNode, being a Java application, runs on a JVM runtime, which cannot operate efficiently with very large heap allocations.
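A back-of-the-envelope estimate makes the problem concrete. A commonly cited rule of thumb is roughly 150 bytes of NameNode heap per namespace object (file or block); treat that constant as an approximation, not an exact figure:

```python
BYTES_PER_OBJECT = 150  # rough rule of thumb, not an exact figure

def namenode_bytes(num_files, blocks_per_file=1):
    # Each file contributes one file object plus its block objects.
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# ~100 GB of data as 100 million 1 KB files vs. 800 large single-block files:
small_files_heap = namenode_bytes(100_000_000)  # ~30 GB of NameNode heap
large_files_heap = namenode_bytes(800)          # ~240 KB of NameNode heap
```

Same data volume, five orders of magnitude apart in metadata footprint: that is the small files problem.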

10.

Explain rack awareness in Hadoop.


HDFS replicates blocks onto multiple machines. To achieve better fault tolerance against rack failures (network or physical), HDFS is able to distribute replicas across multiple racks.

Hadoop obtains network topology information either by invoking a user-defined script or by loading a Java class, which should be an implementation of the DNSToSwitchMapping interface. It is the administrator's responsibility to choose the method, to set the right configuration, and to provide the implementation of said method.
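As a sketch of the script-based approach: Hadoop invokes the script configured under net.topology.script.file.name with one or more host names or IP addresses as arguments and reads one rack path per argument from its stdout. The mapping below is made up for illustration:

```python
import sys

RACKS = {  # hypothetical host-to-rack mapping
    "10.1.1.11": "/dc1/rack1",
    "10.1.2.22": "/dc1/rack2",
}

def rack_of(host: str) -> str:
    # Unknown hosts fall back to a default rack, as Hadoop expects.
    return RACKS.get(host, "/default-rack")

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(rack_of(host))
```

Any language works for the topology script; the only contract is arguments in, one rack path per line out.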

There is more to interviewing than tricky technical questions, so these are intended merely as a guide. Not every “A” candidate worth hiring will be able to answer them all, nor does answering them all guarantee an “A” candidate. At the end of the day, hiring remains an art, a science — and a lot of work.


Our Exclusive Network of Hadoop Developers


Adrian Dominiczak

Freelance Hadoop Developer
Poland | Toptal Member Since July 21, 2020

Adrian is a senior big data engineer with almost a decade of professional experience. Adrian started his career as a software engineer at Samsung's R&D center. At Santander and Lingaro, he has worked on a range of projects, from machine learning and big data engineering in the banking and pharmaceutical industries to big data and cloud architecture. Adrian's areas of expertise lie mainly with Hadoop and Spark.


Selahattin Gungormus

Freelance Hadoop Developer
Turkey | Toptal Member Since May 4, 2021

Selahattin is a data engineer with years of hands-on experience building scalable data integration solutions using open-source technologies. He specializes in developing data applications using distributed processing platforms such as Hadoop, Spark, and Kafka. Selahattin also has hands-on experience with cloud architectures such as AWS and Azure, as well as with developing microservices using Python and JavaScript frameworks.


Dmitry Kozlov

Freelance Hadoop Developer
Canada | Toptal Member Since February 24, 2021

Dmitry is a senior big data architect with 16+ years of experience in data warehousing, BI, ETL, analytics, and the cloud. He's led teams in the delivery of 24 projects in the industries of finance, insurance, telecommunications, government, education, mining, manufacturing, and retail. Dmitry thrives in high-paced environments, has demonstrated the ability to effectively lead, manage, and support teams, and has consulted on several projects as a BI, data warehouse, and big data expert.

