Guest Post: Databricks Leaders on Today's Real Time Analytics Challenges

Databricks, a company founded by the creators of the popular open-source Big Data processing engine Apache Spark, has gained much momentum as Spark has gathered big backers and widespread development. But the folks at Databricks have a landscape view of the Big Data arena that extends well beyond Spark. The company is particularly focused on tools and strategies for performing analytics on streaming data.

We caught up with Databricks’ Kavitha Mariappan and Jules Damji for a guest post on today’s real time analytics challenges. Here are their thoughts:

Real Time Analytics: Choosing the Right Application

By Kavitha Mariappan & Jules Damji, Databricks

Everywhere you look today, there are data-generating devices such as smartphones, smart meters, sensors, RFID chips, or other connected devices. These devices produce massive streams of data that pose a critical problem for data professionals and executives: how do we make sense of it all in real-time? Streaming analytics is the answer.

When Moments Count

A system in a hospital intensive care unit monitors life-support metrics. Suddenly, it detects an anomaly; the patient’s heart hasn’t stopped, but experience shows that it may happen soon — possibly in seconds. Alerted to the danger, doctors and nurses take immediate action and prevent the patient from going into cardiac arrest. This is the vision of real-time analytics — synthesizing data from live sources to make immediate decisions, when moments count.

Saving patient lives is a dramatic application, but there are many other ways that businesses and governments can use real-time analytics. Organizations use real-time analytics to manage customer interactions, monitor patients, track vehicles, detect network breaches and reduce fraud losses. Here are a few examples:

— Credit Card Fraud. Shopify offers an e-commerce platform-as-a-service for online stores and retail point-of-sale systems; the company handles more than $12 billion in gross sales at a rate of 14,000 events per second. Shopify uses real-time analytics to process transactions, filtering the riskiest and routing them to case management software for investigation.

— Credit Card Operations. To detect unusual behavior, ING clusters stores according to usage patterns, then monitors the flow of transactions to identify stores whose group assignment changes. Analysts investigate these anomalies to determine the causes of the unusual behavior.

— Traffic Monitoring and Smart Cities. The City of Madrid uses real-time monitoring to predict the ebb and flow of traffic. Sensors deployed at strategic choke points feed models that predict congestion and recommend alternate routes for public buses.

The digital economy accelerates the pace of business and creates a demand to do things faster and better. As a result, businesses and governments recognize the need for real-time analytics. Market research firm MarketsandMarkets estimates total spending on software and services for real-time analytics at $502 million for 2015 and predicts a 31% growth rate through 2020.

The Key to Real-Time Applications

Conventional analytics platforms process data in fixed and finite batches. That’s fine for many applications, but not if you want to deliver real-time analytics. For that, you need an analytics platform that can extract timely and useful information from streams of data continuously generated by sensors, vehicles, ATMs, networked devices or other sources, typically in high volume.

To support real-time applications, an analytics platform must have several distinct capabilities:

— “Windowing” and flow metrics. Business users rarely need to see every event in a stream, but they do care about measures of the overall flow. A streaming analytics platform must be able to define moving summary windows over the data stream so that users can visualize trends (a brief code sketch follows this list).

— The ability to create alerts. Users may also want to see individual events in the stream that meet predefined characteristics or are distinctly different from other events. Alerts use fixed user-specified rules to identify unusual events; for example, if a cash transaction exceeds $1,000 or a sensor temperature is greater than 150 degrees. 

— Anomaly detection. Users may want to identify events that are atypical when you consider a very large number of characteristics. Anomaly detection uses machine learning to examine many features of the event; for example, an algorithm that analyzes events recorded by hundreds of sensors in a jet engine and flags any that are unlike the others.

— Complex processing for real-time decisions. To predict cardiac arrest or the probability of fraud, the system may have to compute complicated measures, join data from streams and other sources, and do so in a defined window of time.

To deliver real-time applications, you’re going to need a platform that is designed to support streaming analytics.
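
To make the first two capabilities concrete, here is a minimal sketch of windowed flow metrics and a rule-based alert in PySpark Structured Streaming. It is an illustration rather than a reference implementation: the Kafka broker, topic name, schema, and thresholds are all hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Hypothetical schema for JSON sensor events
schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read a stream of events from a (hypothetical) Kafka topic
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "sensor-events")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# "Windowing" and flow metrics: average temperature per sensor over
# five-minute windows, tolerating events that arrive up to ten minutes late
flow = (events
        .withWatermark("event_time", "10 minutes")
        .groupBy(F.window("event_time", "5 minutes"), "sensor_id")
        .agg(F.avg("temperature").alias("avg_temp")))

# Alerts: individual events that break a fixed, user-specified rule
alerts = events.filter(F.col("temperature") > 150)

flow.writeStream.outputMode("append").format("console").start()
alerts.writeStream.format("console").start()
spark.streams.awaitAnyTermination()
```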

Explosion in Options

With all of the buzz about real-time applications, it’s not surprising that there are so many commercial and open source platforms for streaming analytics. A recent Forrester report on Big Data Streaming Analytics identifies fifteen vendors with offerings in the category. Established technology vendors like Cisco Systems, IBM, Oracle, SAP, and TIBCO Software all have offerings, as do several startups.

For organizations that prefer rapid innovation, transparency, and flexibility, open source software is a better option for streaming analytics. Within the Apache Software Foundation ecosystem, there are five projects for streaming analytics: Apache Spark, Apache Storm, Apache Flink, Apache Samza and Apache Apex.

Apache Spark is a distributed engine for in-memory data processing. Spark Streaming is a module of Spark that ingests streaming data sources, performs complex transformations, and writes to file systems, databases, and BI systems. Structured Streaming, introduced in Apache Spark 2.0, lets users query streams and static data through a single interface, write end-to-end continuous applications, join streaming data with static data, and score events in a stream with machine learning models. Spark is currently the most active project in the Big Data ecosystem. Many companies offer commercial support for Spark, including Amazon, Cloudera, Databricks, Google, Hortonworks, IBM, MapR, and Microsoft.
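
As a small illustration of the stream-static join and scoring capabilities described above, the following PySpark sketch enriches a hypothetical transaction stream with static store data. The topic, schema, and storage path are assumptions, not a recipe from Databricks.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("stream-static-join").getOrCreate()

txn_schema = StructType([
    StructField("store_id", StringType()),
    StructField("amount", DoubleType()),
])

# Streaming transactions from a (hypothetical) Kafka topic
txns = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "transactions")
        .load()
        .select(F.from_json(F.col("value").cast("string"), txn_schema).alias("t"))
        .select("t.*"))

# Static reference table, read once from storage (path is hypothetical)
stores = spark.read.parquet("/data/stores")  # e.g., columns: store_id, risk_tier

# Stream-static join: each incoming transaction picks up store attributes.
# A fitted ML pipeline could then score the result via model.transform(enriched).
enriched = txns.join(stores, "store_id")

enriched.writeStream.format("console").start().awaitTermination()
```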

Apache Storm is an open source computing system for stream processing. Storm represents applications as a directed acyclic graph, called a topology. Users identify streaming data sources and data transformations, then map a flow of data through the graph. Storm processes events through the topology continuously, distributing workload as it operates. Hortonworks and MapR provide commercial support for Storm.

Apache Samza is a computational framework that offers fault-tolerant, durable and scalable stream processing with a simple API. A team at LinkedIn developed Samza together with the Apache Kafka message broker to support stream processing use cases.

Apache Flink is a distributed dataflow engine whose runtime supports batch and stream processing, as well as iterative algorithms. Flink includes libraries for graph analysis, machine learning, complex event processing and SQL. Data Artisans, a startup located in Berlin, Germany, offers commercial support for Flink.

Apache Apex is the open source version of a streaming and batch engine originally developed by DataTorrent. The Malhar function library supplements the core Apex runtime engine and supports operators for business logic. Most Apex contributors are DataTorrent employees.

Vendors in closely related categories also target the streaming analytics market. In-memory databases tend to position themselves as streaming analytics platforms; general-purpose in-memory processing platforms make similar claims. Real-time message brokers, such as Apache Kafka and Amazon Kinesis, offer basic analytics capabilities, which they may extend in the future.

What to Consider

There are ten things to consider when you evaluate a streaming analytics platform:

    1. Analytic operators
    2. High throughput and low latency
    3. Fault tolerance guarantees
    4. Processing semantics
    5. Integration with sources, sinks, and storage systems
    6. Ease of programming and debugging
    7. Unified batch and streaming interface
    8. SQL interface
    9. Commercial support
    10. Managed services

Let’s look at each of these in more detail.

Analytic operators. Pick a streaming analytics platform that supports high-level functions for the analytics you need without complex custom programming. Look for the ability to join streams, aggregate events, filter events, account for late or out-of-order events, support watermarking, perform windowing operations on both event time and processing time, and monitor streams and generate alerts.

High throughput. For many applications, high throughput – the volume of data handled per unit of time – is more important than low latency. Throughput is of particular importance for IoT applications. Choose a streaming analytics platform that can handle all of your data within your SLA.

Low latency. Latency is the interval between the time a message arrives in the system and the time the system completes processing it. Lower latency is better, but don’t put it on a pedestal at the expense of every other consideration. The OpsClarity 2016 State of Fast Data and Streaming Applications Survey found that most organizations define latency requirements in minutes or even hours, not fractions of a second.

Fault tolerance guarantees. A streaming analytics platform must recover from partial or total system failure without losing information; some platforms lose part of their results, or give a different answer, if machines fail. After a failure, the streaming engine should recover seamlessly. Most systems checkpoint their internal state so that processing can be replayed after recovery.

Processing semantics. Additionally, the platform should support “exactly-once” processing semantics: each message received contributes to the results exactly once, so messages can neither be lost nor duplicated. Without this guarantee, events may be dropped or double-counted. Imagine the problems that would cause in a billing application!
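
As one concrete example of how these two criteria surface in practice, the sketch below uses Spark Structured Streaming’s built-in, replayable “rate” source, a checkpoint directory, and a file sink; the output and checkpoint paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("exactly-once-sketch").getOrCreate()

# The built-in "rate" source is replayable, so failed batches can be re-run
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Checkpointing offsets and state to durable storage lets the engine resume
# after a crash; the file sink commits each batch atomically, so results are
# neither lost nor duplicated across restarts
query = (events.writeStream
         .format("parquet")
         .option("path", "/out/events")                # hypothetical output path
         .option("checkpointLocation", "/chk/events")  # hypothetical checkpoint dir
         .start())
query.awaitTermination()
```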

Integration with sources, sinks, and storage systems. Weigh the engine’s integration with storage systems (such as S3, HDFS, and NoSQL databases) and input sources (such as Kafka, Kinesis, message queues, and network sockets) as heavily as the other criteria on this list. Ensure that integration with these systems supports transactional updates. Some projects ship bundled connectors; others rely on third-party connectors.

Ease of programming and debugging. Your team needs to move quickly to build, test, debug and deploy real-time applications. Look for a streaming analytics platform with programming APIs in the popular languages that data scientists and data engineers already use, such as Python and Scala.

Unified batch and streaming interface. In real-world applications, organizations almost always combine streaming analytics with analytics on stored data to form a complete view. Choose a platform that delivers both streaming and static analysis and can combine them in a single API. Your streaming analytics platform should produce the same results whether you process the data in batch mode or streaming mode; some platforms can’t do that.
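
A minimal sketch of what a unified interface looks like in PySpark, assuming a hypothetical sensor schema and input paths: the same function defines the query for both the batch and the streaming DataFrame.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("unified-api").getOrCreate()

schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("temperature", DoubleType()),
])

def avg_by_sensor(df):
    # Identical logic, whether df is a static or a streaming DataFrame
    return df.groupBy("sensor_id").agg(F.avg("temperature").alias("avg_temp"))

# Batch: stored history (hypothetical path)
historical = avg_by_sensor(spark.read.schema(schema).json("/data/history"))
historical.show()

# Streaming: live arrivals in the same directory layout (hypothetical path)
live = avg_by_sensor(spark.readStream.schema(schema).json("/data/incoming"))
live.writeStream.outputMode("complete").format("console").start().awaitTermination()
```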

SQL interface. SQL plays a central role in business analytics. Choose a streaming analytics platform that supports an ANSI SQL interface, including the ability to issue SQL queries against streaming data.
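
For instance, in PySpark a streaming DataFrame can be registered as a temporary view and queried with ordinary SQL; this sketch uses the built-in “rate” demo source, so the names here are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-streams").getOrCreate()

# The built-in rate source emits (timestamp, value) rows for demonstration
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
stream.createOrReplaceTempView("events")

# Ordinary SQL over a live stream; the result is itself a streaming DataFrame
evens = spark.sql("SELECT timestamp, value FROM events WHERE value % 2 = 0")
evens.writeStream.format("console").start().awaitTermination()
```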

Commercial support. When you license commercial software, the vendor usually provides technical assistance and maintenance. Open source software is free of licensing fees, but most organizations still prefer commercially supported open source distributions. Choose a streaming analytics platform backed by a vendor with the knowledge, skills, and scale to help your team.

Managed services. Streaming applications tend to be difficult to set up and maintain, and require continuous monitoring. Choose a platform that is available as a managed service in the cloud.

Choose Wisely

The sheer volume of streaming data will grow rapidly in the next few years, so get started now. Define your use cases, understand your needs, narrow your options and run a Proof-of-Concept. Above all, don’t be fooled by raw performance hype; analytic features, ease of programming, commercial support and the availability of managed services are your keys to success.

Kavitha heads up Databricks’ end-to-end global marketing efforts. She brings more than 20 years of extensive industry experience in product and outbound marketing, product management, and business development to Databricks. Prior to Databricks, Kavitha was the VP of Marketing at Maginatics (acquired by EMC), where she built and led the team responsible for all aspects of marketing and communications. Her previous professional experience includes leadership roles at Riverbed Technology and Cisco Systems, Inc. Kavitha has a Bachelor of Engineering in Communication Engineering from the Royal Melbourne Institute of Technology, Australia.

Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies building large-scale distributed systems.

