Big Data has driven business growth across industries by supplying powerful insight for the decision-making process. Of all the tools that process Big Data, Hadoop MapReduce and Apache Spark attract the most attention from data experts and companies. In this article, we'll learn the key differences between Hadoop and Spark, and when we should choose one or the other or use them together.
Hadoop & Spark: Definitions and Numbers
Apache Hadoop is an open source framework that is used in cloud computing to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data. Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets in parallel more quickly.
Hadoop consists of four main modules:
- Hadoop Distributed File System (HDFS) – A distributed file system that runs on standard or low-end hardware. HDFS provides better data throughput than traditional file systems, in addition to high fault tolerance and native support for large datasets.
- Yet Another Resource Negotiator (YARN) – Manages compute resources in clusters and uses them to schedule users' applications, jobs and tasks.
- MapReduce – A programming model for large-scale data processing. Using distributed and parallel computation algorithms, MapReduce lets developers express processing logic as map and reduce steps, and helps them write applications that transform big datasets into one manageable result set.
- Hadoop Common – Includes the libraries and utilities used and shared by other Hadoop modules.
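To make the MapReduce model described above more concrete, here is a minimal sketch of a word count in plain Python. It is only an illustration of the map, shuffle and reduce steps on a single machine; it does not use Hadoop's API, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical names chosen for this example.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key (a real framework does this across nodes)."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data big clusters", "spark and hadoop process big data"]
    print(word_count(sample))
```

In a real cluster, the map and reduce functions run in parallel on many nodes and the framework handles the grouping step between them; the sketch only shows how the logic is divided.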
Apache Spark is an open-source, unified analytics engine for large-scale, distributed data processing. It does not have its own storage system, but runs analytics on other storage systems such as HDFS, Amazon Redshift, Amazon S3, Couchbase, Cassandra and others. It can also run under cluster managers such as Kubernetes. Spark on Hadoop leverages YARN to share a common cluster and dataset with other Hadoop engines, ensuring consistent levels of service and response. Data engineers use Spark for coding and building data processing jobs, with the option to program in an expanded language set.
Both are open-source projects from the Apache Software Foundation, and they are the leading products for Big Data analytics. Hadoop has been the leading tool for Big Data analytics for 5 years. Recent market research has shown that Hadoop has been installed by 50,000+ customers, while Apache Spark has only 10,000+ installations. However, the popularity of Apache Spark skyrocketed in 2013, overtaking that of Hadoop in only one year.
Language support
Hadoop is developed in Java. MapReduce applications are written natively in Java, and can also be written in other languages such as R, C++ and Python. Apache Spark is developed in Scala and also supports Java, Python and R. Python and R in particular are very simple to use.
Apache Spark is well known for its speed. It runs up to 100 times faster in memory and up to 10 times faster on disk than Hadoop MapReduce. The reason is that Apache Spark processes data in memory (RAM), while Hadoop MapReduce has to persist data back to disk after every Map or Reduce action.
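The difference between persisting every intermediate result and keeping it in memory can be sketched with a toy pipeline in plain Python. This is only an illustration under simplifying assumptions: it uses neither Hadoop nor Spark, the three stages and the names run_with_disk and run_in_memory are hypothetical, and JSON files in a temporary directory stand in for the cluster's disk.

```python
import json
import tempfile
from pathlib import Path


def double(values):
    return [v * 2 for v in values]


def add_one(values):
    return [v + 1 for v in values]


# A hypothetical three-stage job.
STAGES = [double, add_one, double]


def run_with_disk(values):
    """MapReduce-style: every stage reads its input from disk and writes its output back."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        path = workdir / "stage_0.json"
        path.write_text(json.dumps(values))
        disk_writes = 1
        for i, stage in enumerate(STAGES, start=1):
            data = json.loads(path.read_text())       # read previous result from disk
            path = workdir / f"stage_{i}.json"
            path.write_text(json.dumps(stage(data)))  # persist this stage's result
            disk_writes += 1
        return json.loads(path.read_text()), disk_writes


def run_in_memory(values):
    """Spark-style: intermediate results stay in RAM between stages."""
    data = values
    for stage in STAGES:
        data = stage(data)
    return data


if __name__ == "__main__":
    print(run_with_disk([1, 2, 3]))
    print(run_in_memory([1, 2, 3]))
```

Both functions compute the same result; the disk-based version simply pays for a read and a write around every stage, which is where the extra latency comes from as the number of stages grows.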
Apache Spark's processing speed delivers near Real-Time Analytics, making it a suitable tool for IoT sensors, credit card processing systems, marketing campaigns, security analytics, machine learning, social media sites, and log monitoring. However, when the data does not fit in memory, Spark's performance can degrade.
Apache Spark comes with built-in APIs for Scala, Java, and Python, and it also includes Spark SQL for SQL users. Apache Spark also has simple building blocks, which make it easy for users to write user-defined functions. You can also use Apache Spark in interactive mode to get immediate feedback on queries.
Hadoop MapReduce, on the other hand, is written in Java and is generally harder to program: developers have to handle low-level APIs to process data.
In other words, a lot of coding! Unlike Apache Spark, Hadoop MapReduce cannot deliver near real-time analytics from the data. Considering the above-stated factors, it can be concluded that Apache Spark is easier to use than Hadoop MapReduce.
With Apache Spark, you can do more than just plain data processing. Apache Spark can process graphs, and it also comes with its own machine learning library, MLlib.
Due to its high-performance capabilities, Apache Spark is very helpful for Batch Processing as well as near Real-Time Processing. With its built-in machine learning library, Apache Spark is a “one size fits all” platform: it can be used to perform all tasks instead of splitting them across different platforms. MLlib can be used for classification, regression and building machine learning pipelines.
Hadoop MapReduce is a good tool for Batch Processing. It operates in sequential steps: reading data from the cluster, performing its operation on the data, and writing the results back to the cluster. If you want features like Real-Time and Graph Processing, you must use other tools as well, such as Mahout and Samsara for machine learning.
Hadoop is highly scalable: you can add any number of nodes to the cluster. Yahoo reportedly has more than 42,000 nodes.
Apache Spark, however, relies on Random Access Memory (RAM) for optimal performance. The largest Spark cluster has only 8,000 nodes. Since Big Data keeps on growing, cluster sizes should increase in order to maintain throughput expectations. Both platforms offer scalability through HDFS.
Hadoop supports Kerberos and LDAP for authentication. It also uses a traditional file permission model.
Spark's security model is currently sparse, but it allows authentication via a shared secret. Additionally, Spark can run on YARN, which makes Kerberos authentication available.
Both Hadoop MapReduce and Apache Spark are open-source platforms, so the software itself is free. However, you still have to invest in hardware and personnel, or outsource the development.
Business requirements should guide you on whether to choose Hadoop MapReduce or Apache Spark. If you want to process huge volumes of data, consider using Hadoop MapReduce.
Hadoop MapReduce requires more disk space, but it is less expensive than Apache Spark, because hard disk space is cheaper than RAM. Spark requires a lot of RAM to run, which increases the cost of the cluster.
Top 5 companies which use Spark
eBay uses Apache Spark to provide targeted offers, enhance customer experience, and optimize overall performance. Apache Spark is leveraged at eBay through Hadoop YARN. eBay's Spark users leverage Hadoop clusters in the range of 2,000 nodes, 20,000 cores and 100 TB of RAM through YARN.
Conviva, one of the largest streaming video companies, uses Apache Spark to learn about network conditions in real time. With this information, the video player is able to manage live video traffic coming from close to 4 billion video feeds every month and ensure maximum play-through, which helps Conviva provide its customers with a great video viewing experience.
Netflix uses Apache Spark for real-time stream processing to provide online recommendations to its customers. Streaming devices at Netflix send events which capture all member activities and play a vital role in personalization. Netflix processes 450 billion events per day, which flow to server-side applications and are directed to Apache Kafka.
Pinterest is using Apache Spark to discover trends in high value user engagement data so that it can react to developing trends in real-time by getting an in-depth understanding of user behaviour on the website.
TripAdvisor, a leading travel website that helps users plan a perfect trip, is using Apache Spark to speed up its personalized customer recommendations. TripAdvisor uses Apache Spark to help millions of travellers by comparing hundreds of websites to find the best hotel prices for its customers.
Top 5 companies which use Hadoop MapReduce
- Amazon Web Services
- IBM
- Cloudera
- British Airways
- Expedia
Amazon Elastic MapReduce (Amazon EMR) provides a managed, easy-to-use analytics platform built around the powerful Hadoop framework. It lets you focus on your map/reduce queries and take advantage of the broad ecosystem of Hadoop tools, while deploying to a high-scale, secure infrastructure platform.
IBM InfoSphere BigInsights makes it simpler for people to use Hadoop and build big data applications. It enhances this open-source technology to withstand the demands of the enterprise, adding administrative, discovery, development, provisioning, and security features, along with analytical capabilities from IBM Research.
Cloudera develops open-source software for a world dependent on Big Data. With Cloudera, businesses and other organizations can now interact with the world’s largest datasets.
British Airways deployed Hadoop in April 2015 as a data archive for legal cases. Previously these were stored on an enterprise data warehouse which was costly for the airline.
Since deploying Hortonworks Data Platform (HDP) 2.2, British Airways has gained ROI within a year and is able to deliver 75% more free space for new projects, translating directly into cost reductions for the airline.
Expedia makes use of Hadoop clusters using Amazon Elastic MapReduce (Amazon EMR) to analyze high volumes of data coming from Expedia’s global network of websites. These include clickstream, user interaction, and supply data. Highly valuable for allocating marketing spend, this data is merged from web bookings, marketing departments and marketing spend logs to analyze whether the outlay has equated to increased bookings. The firm has seen costs drop and can process and analyze higher volumes of data.
Conclusion and the Big Question
The following are the limitations of both Hadoop MapReduce and Apache Spark:
- No Support for Real-time Processing: Hadoop MapReduce is only good for Batch Processing. Apache Spark only supports near Real-Time Processing.
- Requirement of Trained Personnel: The two platforms can only be used by users with technical expertise.
Finally, the big question: can we use them together? The answer is yes: Hadoop and Spark together build a very powerful system to address all the Big Data requirements. Apache Spark was not developed to replace Hadoop; rather, it was developed to complement it. Spark comes to Hadoop's rescue for real-time, streaming, graph, interactive and iterative requirements.
And you: when do you use Spark over Hadoop, and when do you use them together?