Diapers & Beer: How Spark Helps Businesses Access Machine Learning

Apache Spark™ is a leading platform for large-scale data mining, batch processing, and stream processing. Touted as a “lightning-fast unified analytics engine,” Spark brings machine learning to data analytics, helping businesses uncover patterns that older reporting tools tend to miss. Best of all, Spark is bundled with many other software solutions, so this powerful tool may already be part of your data analytics infrastructure.

The Platform That’s Changing Data Analytics

Since its inception at the AMPLab at U.C. Berkeley in 2009, Spark has become one of the world’s key distributed processing frameworks for big data. It’s used by banks, telecommunications companies, gaming companies, governments, and nearly all major tech giants, including Apple, Facebook, and Microsoft. As an open source engine that’s often OEMed, Spark is available in almost any big data implementation, where it can significantly improve data mining. For example, most Hadoop distributions, including those from Cloudera and Hortonworks, bundle Spark and run Spark jobs on Hadoop YARN. TESCHGlobal’s own standard Big Data Sandbox includes Cloudera CDH, which comes bundled with Apache Spark.

What sets Spark apart is its ability to magnify the power of data analytics. It features multiple libraries with many of the most successful data mining algorithms available in modern computing, including AI-assisted models capable of mining structured and unstructured data in sources ranging from traditional databases to new data lakes and data vaults. In batch mode, Spark runs analytics on historical data to produce reports and also allows users to dynamically explore their own datasets. Spark Streaming analyzes data in near real time and feeds interactive dashboards, enabling users to identify trends in their business processes. Though Spark can access data from OLAP warehouses, its capabilities go far beyond 1990s-era mining and reporting technologies.

Machine Learning Feeds Business Growth

Machine learning (ML) data analytics is a new paradigm that can help businesses find meaningful patterns in their data. Spark enables companies to use supervised ML, in which a model is trained on labeled examples, to accomplish a specific task based on defined business rules. For example, by coupling loan application business rules with Spark’s ML library, users can create and train a model that uses correlation rules and decision trees to approve or reject an online loan application in seconds.
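To make the loan example concrete, here is a minimal sketch of the idea behind a decision tree: finding the single question that best separates approved from rejected applications. It is written in plain Python rather than Spark so it runs anywhere; in Spark, the equivalent job would typically be handled by a class such as DecisionTreeClassifier in pyspark.ml.classification. The applicant data, feature names, and helper functions below are all invented for illustration.

```python
from collections import Counter

# Hypothetical training data: [credit_score, debt_to_income_ratio] per past
# application, with the decision a loan officer made. All values are made up.
rows = [[720, 0.20], [680, 0.35], [590, 0.45],
        [610, 0.50], [750, 0.15], [560, 0.40]]
labels = ["approve", "approve", "reject", "reject", "approve", "reject"]


def gini(group):
    """Gini impurity of a list of labels (0.0 means every label is the same)."""
    n = len(group)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(group).values())


def split(rows, labels, feature, threshold):
    """Partition the labels by whether the row's feature value is <= threshold."""
    left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
    return left, right


def train_stump(rows, labels):
    """Train a one-question tree: pick the split with the lowest weighted impurity."""
    best = None  # (impurity, feature index, threshold)
    for feature in range(len(rows[0])):
        for threshold in sorted({r[feature] for r in rows}):
            left, right = split(rows, labels, feature, threshold)
            if not left or not right:
                continue  # a split that leaves one side empty tells us nothing
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    _, feature, threshold = best
    left, right = split(rows, labels, feature, threshold)
    return {
        "feature": feature,
        "threshold": threshold,
        "left": Counter(left).most_common(1)[0][0],    # majority label at or below
        "right": Counter(right).most_common(1)[0][0],  # majority label above
    }


def predict(stump, row):
    """Answer the stump's one question for a new application."""
    side = "left" if row[stump["feature"]] <= stump["threshold"] else "right"
    return stump[side]


stump = train_stump(rows, labels)
decision_a = predict(stump, [700, 0.30])
decision_b = predict(stump, [600, 0.20])
print(stump, decision_a, decision_b)
```

A full decision tree repeats this split search on each resulting group until the groups are pure or a depth limit is reached. Once trained, scoring a new application is only a handful of comparisons, which is why decisions can come back in seconds.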

Spark also enables unsupervised ML, which is used for data exploration and pattern identification. Market-basket analysis is a typical unsupervised ML use case for Spark. Spark’s automated analysis of supermarket checkout data may uncover the fact that customers who buy beer also buy chips, which is hardly an unexpected discovery. It may also reveal that on Thursdays, customers often purchase diapers and beer together, an initially surprising result that, on reflection, makes some sense as young parents stock up for a weekend at home. Such information could be used for many purposes, from planning store layouts and limiting special discounts to just one of a set of items that tend to be purchased together, to offering coupons for a matching product when one of them is sold alone. Spark’s ML data analytics unlocks incredible potential for business innovation and growth.
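The measures behind a market-basket finding are simple enough to sketch in a few lines. The example below is plain Python on a handful of made-up baskets, not Spark code; on real checkout data, Spark’s FPGrowth class in pyspark.ml.fpm is one way to mine the same kind of rules at scale. The function name pair_rules, the sample baskets, and the min_support default are assumptions made for illustration. Support is the share of baskets containing both items, confidence is how often the second item appears given the first, and lift compares that confidence with how common the second item is overall (above 1 suggests the items go together more than chance would predict).

```python
from collections import Counter
from itertools import combinations

# Hypothetical checkout baskets, invented for this sketch.
baskets = [
    ["beer", "chips", "diapers"],
    ["beer", "diapers"],
    ["beer", "chips"],
    ["milk", "bread"],
    ["diapers", "beer", "milk"],
    ["bread", "chips"],
    ["milk", "diapers"],
    ["beer", "chips", "bread"],
]


def pair_rules(baskets, min_support=0.3):
    """Score 'if X then Y' rules for every item pair that meets min_support."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # de-duplicate and fix the pair ordering
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), together in pair_counts.items():
        support = together / n
        if support < min_support:
            continue  # too rare to act on
        for x, y in ((a, b), (b, a)):
            confidence = together / item_counts[x]
            lift = confidence / (item_counts[y] / n)
            rules.append({"if": x, "then": y, "support": support,
                          "confidence": confidence, "lift": lift})
    # Strongest associations first
    return sorted(rules, key=lambda r: r["lift"], reverse=True)


rules = pair_rules(baskets)
for r in rules:
    print(f'{r["if"]} -> {r["then"]}: support={r["support"]:.2f}, '
          f'confidence={r["confidence"]:.2f}, lift={r["lift"]:.2f}')
```

A retailer would typically filter on all three numbers: enough support for the rule to matter commercially, enough confidence to be reliable, and a lift above 1 to rule out pairs that only appear together because both items are popular.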

Implementing Machine Learning Data Analytics

To access the value of ML data analytics, businesses need an infrastructure that includes a storage method such as a data vault or data lake, a data management tool such as Talend, and a modern data analytics engine like Spark. They also need expertise in defining a data analytics model for leveraging the power of their ML data analytics engine. With Spark’s ML data analytics in place, businesses are prepared to become even more data-driven.

Like to learn more? Connect with us! We’ve implemented Spark’s ML data analytics platform for customers across industries, helping them determine what’s right for them.