Real-time data processing is the current state of the art in business platform data engineering. Long gone are the days of batch processing and monolithic ETL engines switched on at midnight. Today’s demands come from the sheer number of mobile users and connected devices. Mobile phones and tablets have become central to the customer experience as companies create mobile-only or mobile-first interfaces to their commercial systems and business processes. Similarly, there are other “things” in the “Internet of Things” (IoT), such as rental bikes and scooters, keyless home locks, and smart home thermostats.
It used to be that users would physically go to a store and pay cash for something. Then they started using credit cards, which later became networked. Eventually they started buying things on websites, then on their phones, and now they are back at the cash register using mobile pay. These are just a few of the many sources of “events” that can interact with a business.
Today, every large-scale business platform (or one aspiring to be) must be not only an internet company but also a data company and an IoT company. Imagine how millions of McDonald’s mobile application users order online and pick up their food curbside from any of the roughly 14,000 restaurants across the United States. Uber, Lyft, and Via are all IoT companies in the way they track riders and drivers and make real-time decisions on pairing them for rides.
How do they do it? The secret is not really much of a secret if you work in the industry. Unlike 20 years ago, the technologies that power these companies are a dime a dozen. The open-source movement and the worldwide proliferation of the Internet have made the technology accessible to anyone willing to learn and try it out.
In our last post in this series, we talked at a high level about the reasons for scaling business platforms, how to find and measure areas for growth, and the technologies most companies use to scale. This post focuses specifically on real-time data processing with a well-known technology called Apache Spark (not to be confused with the Capital One Spark Card or the Spark email application).
In the last 10 years or so, “Big Data” has been used and abused by companies and government institutions across the world. Sometimes the misuse doesn’t hurt anyone except the people who invested millions of dollars in something they didn’t understand. It all started with Google’s MapReduce, BigTable, and Google File System (GFS) ideas, which were published as papers for the world to see. Some went on to make things like DynamoDB and Cassandra (which we’ll cover in another article in this series). Others went on to build Hadoop MapReduce, HBase, and HDFS, which collectively became known as the “Hadoop” distribution, later commercialized by Cloudera, Hortonworks, and cloud providers such as AWS and Azure.
This first generation of big data dealt with all the “Five Vs of Big Data,” as most people in the industry know them (or should):

- Volume: the sheer amount of data being generated
- Velocity: the speed at which data arrives and must be processed
- Variety: structured, semi-structured, and unstructured formats
- Veracity: the trustworthiness and quality of the data
- Value: the business insight the data ultimately yields
So where does Spark come in? While Hadoop and other contenders like MapR were initially able to do a lot of great things as batch processes, they were hard for the average technologist to use. (Not everyone in the technology community is a data geek.) Tools built on top of Hadoop, such as Hive and Pig, made things slightly easier, but executing those scripts still took a long time.
Spark came along and introduced the idea of a resilient distributed dataset (RDD): an in-memory representation of some other data source, which could be HDFS, HBase, or an external system. In Spark, computations were easier to express using a domain-specific language built on Scala for data processing and engineering. It allowed folks to write algorithms like Google’s PageRank (the reason Google developed much of the technology that started the Big Data movement) at mind-blowing scale.
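To give a feel for the RDD API, here is a minimal word-count sketch in Scala. It assumes Spark is on the classpath, and the input path is purely illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark on all local cores; on a cluster this
    // would point at the cluster manager instead.
    val conf = new SparkConf().setAppName("word-count").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Lazily builds an RDD; nothing is read until an action runs.
    // The path here is an illustrative placeholder.
    val lines = sc.textFile("hdfs:///data/logs.txt")

    val counts = lines
      .flatMap(_.split("\\s+"))   // one record per word
      .map(word => (word, 1))     // pair each word with a count of 1
      .reduceByKey(_ + _)         // sum counts per word across partitions

    counts.take(10).foreach(println) // action: triggers the computation
    sc.stop()
  }
}
```

Note that the transformations (`flatMap`, `map`, `reduceByKey`) only describe the computation; Spark does no work until an action such as `take` forces evaluation, which is what lets it optimize and distribute the whole pipeline.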
Spark itself is very powerful, and later versions have amassed even more capability as additional libraries, such as Spark SQL, MLlib, GraphX, and Structured Streaming, merged into the project.
Apache Spark is among the most open, free, and powerful frameworks for distributed computing, data analysis, data processing, machine learning, graph processing, and data stream processing. Developers from more than 300 companies actively develop Apache Spark, including heavy hitters such as Microsoft, Apple, Netflix, Uber, Facebook, Amazon, Intel, Alibaba, eBay, and one of our favorites, DataStax, just to name a few. An order of magnitude more companies actively use Spark on a daily basis. It has widely become the de facto data framework for both big and fast data.
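Since this post is about real-time processing, here is a hedged sketch of the streaming side using Spark’s Structured Streaming API. It assumes the `spark-sql-kafka` connector is on the classpath; the broker address and topic name are illustrative placeholders, not a real deployment:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ClickCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("click-counts").getOrCreate()

    // Read an unbounded stream of events from Kafka.
    // Broker and topic names below are illustrative.
    val clicks = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "clicks")
      .load()

    // Count events per one-minute window as they arrive,
    // instead of waiting for a midnight batch job.
    val counts = clicks
      .groupBy(window(col("timestamp"), "1 minute"))
      .count()

    counts.writeStream
      .outputMode("complete")
      .format("console") // print running counts; a real job would write to a sink
      .start()
      .awaitTermination()
  }
}
```

The appeal is that the streaming query uses the same DataFrame operations as a batch query; Spark treats the stream as a continuously growing table and keeps the windowed counts up to date as new events land.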
In our next article, we’ll cover how Akka, the actor-model framework for Scala and Java (with a C# port, Akka.NET), can be used to deliver the “fast” in the whole real-time world. If you want me or our company to come and talk to your company about data modernization, real-time data platforms, or Apache Spark, feel free to email me or my team at Anant.