Posted in Information Technology

The Role of Predictive Analytics in Cloud Operations

In the era of complex multi-cloud technologies, infrastructure operations have become challenging and resource-intensive for the average cloud customer seeking to optimize cloud investments and resource performance. Organizations upgrade their infrastructure to operate at scale while reducing expenses and minimizing performance and security issues. However, diagnosing, resolving and optimizing that infrastructure has emerged as a challenge, given the vast, dynamic and interconnected nature of the underlying hardware resources. Predictive analytics aims to modernize and simplify infrastructure operations by leveraging the vast deluge of data that IT operations generate.

Predictive analytics refers to the practice of using data to determine the future patterns and behavior of a system. Predictive analytics tools may use advanced machine learning algorithms and statistical analysis techniques to build an accurate model of the system. Applying prediction models to historical and present data can unveil insights into future trends in system behavior, and this information allows decisions about the system to be made proactively. Predictive analytics can also surface correlations between behaviors that might otherwise be overlooked or perceived as isolated. Algorithmic filtering reduces the noisy data and false alarms that keep IT Ops teams rifling through vast datasets to find the most useful insights. This capability offers immense opportunities in IT infrastructure operations, where an isolated anomaly in network traffic can translate into a large-scale data leak and remain hidden from sight until it's too late to react. Predictive analytics is a building block of a modern AIOps architecture, which includes data ingestion, auto-discovery, correlation, visualization and automated actions.
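As a minimal, hypothetical illustration of this kind of algorithmic filtering (not any particular vendor's implementation), consider flagging anomalous samples in a metric series against a rolling baseline; the window and threshold values here are arbitrary assumptions:

import statistics

def flag_anomalies(values, window=30, threshold=3.0):
    """Flag points that deviate strongly from the recent rolling baseline.

    values: ordered metric samples (e.g. requests/sec per minute).
    window and threshold are illustrative tuning assumptions.
    """
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mean = statistics.mean(baseline)
        stdev = statistics.pstdev(baseline) or 1e-9  # avoid divide-by-zero on flat data
        if abs(values[i] - mean) / stdev > threshold:
            anomalies.append(i)  # index of the suspicious sample
    return anomalies

In practice, production systems combine many such signals and learned models rather than a single static threshold, but the principle of letting the algorithm separate signal from noise is the same.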

For cloud operations in particular, predictive analytics has a key role to play in the following domain applications:

Optimizing the Cloud Infrastructure

Many organizations use multiple cloud environments and, possibly, a range of siloed infrastructure monitoring, management and troubleshooting solutions. To gain visibility into a hybrid multi-cloud environment, Cloud Ops teams using traditional analytics practices may rely on manual processes and miss correlations across the infrastructure. Data applications and IT workloads are increasingly dynamic, and unpredictable changes in network traffic, infrastructure performance and scalability requirements impact IT operations decisions in real time. Making the right decisions proactively requires Cloud Ops to collect the necessary information from various sources and to correlate it across siloed IT environments. Predictive analytics allows users to focus on the knowledge gleaned from data instead of collecting, processing and analyzing information from multiple cloud environments independently. Regardless of the complexity of the cloud network, the machine learning algorithms that power predictive analytics provide the necessary abstraction between the complex underlying infrastructure and the analysis of its data. Cloud Ops teams are ultimately able to use the collective insights to make the right decisions proactively regarding resource provisioning, storage capacity, server instance selection and load balancing, among other key cloud operations decisions.

Application Assurance and Uptime

Software applications are increasingly an integral component of business processes. When apps and IT services go down, business processes risk interruption. For this reason, IT shops continuously monitor an array of application and network performance metrics that correlate with business process performance. Any anomaly in IT performance impacts business operations. With predictive analytics solutions in place, IT can proactively prepare for possible downtime or infrastructure performance issues. The organization can establish pre-defined policies and apply corrective actions automatically well before application assurance and uptime are compromised. As a result, the organization reduces its dependence on IT to troubleshoot issues and improves its Mean-Time-To-Detect (MTTD) and Mean-Time-To-Repair (MTTR). Predictive analytics algorithms further cut through the noise to ensure that only breaches of the metric thresholds that matter cause the organization to adapt its business processes.

Application Discovery and Insights

Enterprise networks are typically distributed across regions and contain diverse infrastructure components, often operating in disparate silos. A holistic knowledge of the infrastructure and of application discovery requires organizations to understand how those components interact and relate to each other, especially since network performance issues can spread across dependencies that are otherwise hidden from view. With predictive analytics solutions in place, organizations can collect data from across the network, analyze multiple data sources and understand how one infrastructure system can impact another. In hybrid IT environments, application and infrastructure discovery is a greater challenge given the limited visibility and control available to customers of cloud-based services. Any lack of automated correlation between network incidents limits an organization's ability to steer cloud operations in real time while responding to potential application and infrastructure performance issues.

Audit, Compliance and Security

Strictly regulated industries are often required to comply with regulations covering application uptime, assurance, MTTR, and end-user experience and satisfaction, among other parameters. Compliance becomes increasingly complex when these organizations have limited visibility and control over their IT network. Performing audit activities at scale may require organizations to invest greater resources in IT. Regular business may not justify the increased operational overhead, and organizations may be forced to cut corners in the audit, compliance and security of sensitive data, apps and the IT network. Organizations using advanced artificial intelligence technologies can automate these functions and glean insights that translate into regulatory compliance for hybrid cloud IT environments without breaking the bank.

Security is another key enabler of regulatory compliance, and it requires more than automation to accurately identify the root cause of network traffic anomalies. Security infringements in the form of data leaks tend to remain under the radar until unauthorized data transfers or anomalous network behavior are identified. By then it may be too late for organizations to respond without incurring data loss, non-compliance and, potentially, the loss of the ability to operate in security-sensitive industry segments such as healthcare, defense and finance. In complex cloud infrastructure environments, the role of predictive analytics is to unify the knowledge from diverse, disparate and distributed networks and empower organizations to make better, faster and well-informed decisions.

 

Posted in Information Technology, News

List of Websites Where You Can Download Free Things

Download free software, movies, music, templates, WordPress themes, Blogger themes and icon packs with these websites. These are totally free websites for any internet user. You can download material of your choice without paying anything.

1. Filehippo.com – Download all types of Windows PC software, for all versions

2. Savefrom.net – Download any YouTube video in HD

3. Pdfgallery.net – Download Pdf Book Collection

4. Imgur.com – Download Trending Photos from Internet

5. Evoji.com – Download Android APKs directly to your PC

6. Frepowerpointtemplates.com – Download Free Powerpoint templates and themes

7. Btemplates.com – Download Free Blogger templates for your Blogger

8. Fabthemes.com – Download Free WordPress themes for WordPress Blog

9. Soundcloud.com – Download Latest music mix tracks.

10. Flaticon.com – Free Vector Icons in SVG, PNG, EPS and Icon Font for photoshop and illustrator.

11. Opendrivers.com – Download all type of drivers for your PC

12. Extratorrent.cc – Download Movies, Music, Pictures and Software for free.

13. Google Takeout – Download all your Google profile, search and email data.

14. Google Fonts – Download Free Fonts Database from Google.

15. Download Rainmeter Skins – Download Rainmeter skins for free for the Rainmeter software.

16. Wikimedia – Download Wikipedia Data in Your Computer.

17. Github – Download Source of Open Source Software.

18. Hdwallpapers – Download HD Wallpapers for your PC and Mobile.

19. Web Capture – Download Screenshot of any Website for free.

20. Download Copy of Facebook Data – Download your whole Facebook data, including your Facebook messages.

Posted in Information Technology

Billions of Messages a Day

Original writer: Justin C., Software Engineer

Faced with the challenges of scaling out its engineering organization, Yelp transitioned to a service oriented architecture (SOA). Services improved developer productivity but introduced new communications challenges. To solve these problems, Yelp built a real-time streaming data platform.

We built a unified system for producer and consumer applications to stream information between each other efficiently and scalably. It does this by connecting applications via a common message bus and a standardized message format. This allows us to stream database changes and log events into any service or system that needs them, for example: Amazon Redshift, Salesforce, and Marketo.

The Challenges with Scaling Out

In 2011, Yelp had more than a million lines of code in a single monolithic repo, “yelp-main”. We decided to break the monolith apart into a service oriented architecture (SOA), and by 2014 had more than 150 production services, with over 100 services owning data. Breaking apart “yelp-main” allowed Yelp to scale both development and the application, especially when coupled with our platform-as-a-service, PaaSTA.

Services don’t solve everything. Particularly when dealing with communication and data, services introduce new challenges.

Service to Service Communication

Service-to-service Communication Scales Poorly

Metcalfe’s Law says that the value of a communications network is proportional to the square of the number of connected compatible communications devices. Translating to a SOA, the value of a network of services is proportional to the square of the number of connected services. The trouble is, the way that service-to-service communication is typically implemented isn’t very developer-efficient.

Implementing RESTful HTTP connections between every pair of services scales poorly. HTTP connections are usually implemented in an ad hoc way, and they’re also almost exclusively unidirectional. Fully connecting Yelp’s 150 production services would require 22,350 unidirectional service-to-service HTTP connections (150 × 149). If we were to make a communications analogy, this would mean every time you wanted to visit a new website, you’d first have to have a direct link installed between your computer and the site. That’s woefully inefficient.

Failing at Failure

Aside from complexity, consistency is problematic. Consider this database transaction and service notification:

session.begin()
business = Business()
session.add(business)
session.commit()
# If this call fails, the service never learns that the business was created.
my_service_client.notify_business_changed(business.id)

If the service call fails, the service may never be notified about the business creation. This could be refactored like:

session.begin()
business = Business()
session.add(business)
# Notify before committing...
my_service_client.notify_business_changed(business.id)
session.commit()  # ...but if the commit fails, we notified about a business that doesn't exist.

Then the commit could fail, in which case the service would be notified that a business was created that doesn’t exist.

Workarounds exist. The service could poll for new businesses, or use a messaging queue and call back to make sure the business was actually added. None of this is as easy as it initially appears. In a large SOA, it wouldn’t be strange to find multiple notification implementations, with varying degrees of correctness.

Working with Data Across Services is Hard

~86 million is a magic number

Yelp passed 100 million reviews in March 2016. Imagine asking two questions. First, “Can I pull the review information from your service every day?” Now rephrase it: “I want to make more than 1,000 requests per second to your service, every second, forever. Can I do that?” At scale, with more than 86 million objects, these are the same thing: there are 86,400 seconds in a day, so pulling 86 million objects daily works out to roughly 1,000 requests per second. At scale the reasonable becomes unreasonable. Bulk data applications become service scalability problems.

Joins get pretty ugly. The N+1 Query Problem tends to turn into the N Service Calls problem: instead of making N extra queries, the code makes N service calls. The N+1 Query Problem is already well understood, and most ORMs implement the eager loading solution out of the box. There isn’t a ready solution for service joins.

Without a ready solution, developers tend to design ad hoc bulk data APIs. These APIs tend to be inconsistent because developers are distributed across teams and services. Pagination is particularly prone to inconsistency issues, with no clear standard. Popular public APIs use everything from custom response metadata to HTTP Link headers.

To join across services scalably you need to upgrade your service stack. Every data-owning service and client library will need work.

Possible Solutions?

The first solution that developers usually come to is implementing bulk data APIs. Of course, implementing a bulk data API for every data type stored by every service can be very time consuming. Somewhat naturally, a generalized bulk data API comes up, where the API can take arbitrary SQL, execute it, and return the results. Unfortunately, this is a pretty major violation of service boundaries. It’s equivalent to connecting to a service’s database to create new data, resulting in a distributed monolith. And it’s brittle. Every caller needs to know a lot about the internal representation of data inside the services that it needs data from, and needs to respond in lockstep to data changes in the service, tightly coupling the caller and service.

A potential solution for bulk data sharing is periodically snapshotting the service database, and sharing the snapshots. This approach shares the brittleness of the bulk data API, with the added challenge that differential updates can be difficult to implement correctly, and full updates can be very expensive. Snapshots are further complicated by having some data that is meaningless without the underlying code. Concretely, boolean flags stored in bitfields or categorical data stored as integer enums are common examples of data that isn’t inherently meaningful without context.

A Generalized Solution

Now that you have the context of the problem, we’ll explore how it can be solved at a high level using a message bus and standardized data formatting. We’ll also discuss the system architecture when those two components are integrated, and what can be accomplished with that architecture.

The Message Bus

Architecturally, a message bus seemed like a good starting point for addressing these issues.

Message Bus

A bus would reduce the connection complexity from n^2 to n, and in our case from more than 22,000 connections to just 150.

Apache Kafka, a distributed, partitioned, replicated commit log service, is ideal for this application. Aside from being both fast and reliable, it has a feature called log compaction that’s very useful in this context. Log compaction prunes topics with a simple guarantee – the most recent message for a given key is guaranteed to remain in the topic. This yields an interesting property: if you were to write every change that happens in a database table into a topic, keyed by the table’s primary key, replaying the topic would yield the current state of the database table.

Log compaction retains at least the most recent message for every key.

Stream-table duality is well-covered by Jay Kreps in The Log: What every software engineer should know about real-time data’s unifying abstraction, and in the Kafka Streams docs. Exploiting this duality using log compaction allows us to solve many of our bulk data problems. We can provide streaming differential updates and with them guarantee that a new consumer, replaying a topic from the beginning, will eventually reconstruct the current state of a database table. In Yelp’s data pipeline, this property enables engineering teams to initially populate and stream data changes to Redshift clusters.
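To make the stream-table duality concrete, here is a minimal sketch (not Yelp’s actual code) of materializing a table from a compacted change topic: fold each message into a map keyed by primary key, treating a null payload as a tombstone, i.e. a delete.

def materialize(messages):
    """Replay a log-compacted topic into the current table state.

    messages: iterable of (primary_key, row_or_None) in log order.
    A value of None is a tombstone, meaning the row was deleted.
    """
    table = {}
    for key, row in messages:
        if row is None:
            table.pop(key, None)   # delete
        else:
            table[key] = row       # insert or update
    return table

# Replaying from the beginning and then continuing to consume new messages
# keeps the materialized view in sync with the source table.
state = materialize([(1, {"name": "Cafe A"}), (2, {"name": "Cafe B"}),
                     (1, {"name": "Cafe A, renamed"}), (2, None)])
# state == {1: {"name": "Cafe A, renamed"}}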

Decoupled Data Formats

Selecting how data will be transported is only part of the solution. Equally important is determining how the transported data will be formatted. All messages are “just bytes” to Kafka, so the message format can really be anything. The obvious answer to this is JSON, since it has performant parsing implementations in most languages, is very broadly supported, and is easy to work with. However, JSON has one core issue: it’s brittle. Developers can change the contents, type, or layout of their JSON data at any time, and in a distributed application it’s hard to know the impact of data changes. Unfortunately, JSON data changes often are first detected as production errors, necessitating either a hotfix or rollback, and causing all kinds of problems downstream.

Yelp’s data processing infrastructure is tree-like. Our core data processing tends to produce intermediate outputs that are consumed, reprocessed, and refined by multiple layers and branches. Upstream data problems can cause lots of downstream problems and backfilling, across many different teams, especially if they’re not caught early. This is a problem we wanted to address when we moved to a streaming architecture.

Apache Avro, a data serialization system, has some really nice properties, and is ultimately what we selected. Avro is a space-efficient binary serialization format that integrates nicely with dynamic languages like Python, without requiring code generation. The killer feature of Avro, for our system, is that it supports schema evolution. That means that a reader application and a writer application can use different schema versions to consume and produce data, as long as the two are compatible. This decouples consumers and producers nicely – producers can iterate on their data format, without requiring changes in consumer applications.
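As a rough illustration of schema evolution (shown here with the fastavro library for brevity; Yelp’s actual tooling differs), a consumer can keep decoding with the reader schema it was built against while a producer starts writing with a newer, compatible schema:

import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

# Version 1: the schema the consumer was built against.
reader_schema = parse_schema({
    "type": "record", "name": "Business", "namespace": "example",
    "fields": [{"name": "id", "type": "int"},
               {"name": "name", "type": "string"}],
})

# Version 2: the producer added a field with a default -- a compatible change.
writer_schema = parse_schema({
    "type": "record", "name": "Business", "namespace": "example",
    "fields": [{"name": "id", "type": "int"},
               {"name": "name", "type": "string"},
               {"name": "rating", "type": "double", "default": 0.0}],
})

buf = io.BytesIO()
schemaless_writer(buf, writer_schema, {"id": 1, "name": "Cafe A", "rating": 4.5})
buf.seek(0)

# The consumer decodes v2 data with its v1 reader schema; the new field is resolved away.
record = schemaless_reader(buf, writer_schema, reader_schema)
# record == {"id": 1, "name": "Cafe A"}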

We built an HTTP schema store called the “Schematizer,” that catalogs all of the schemas in Yelp’s data pipeline. This enables us to transport data without schemas. Instead, all of our avro-encoded data payloads are packed in an envelope with some metadata, including a message uuid, encryption details, a timestamp, and the identifier for the schema the payload was encoded with. This allows applications to dynamically retrieve schemas to decode data at runtime.
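A minimal sketch of that envelope-and-lookup flow might look like the following; the field names and the schematizer_client interface are hypothetical, for illustration only.

import io
import time
import uuid

from fastavro import schemaless_reader

def pack_envelope(schema_id, avro_payload_bytes):
    """Wrap an Avro-encoded payload with routing metadata (illustrative field names)."""
    return {
        "uuid": uuid.uuid4().bytes,
        "timestamp": int(time.time()),
        "schema_id": schema_id,          # which registered schema encoded the payload
        "payload": avro_payload_bytes,
    }

_schema_cache = {}

def decode(envelope, schematizer_client, reader_schema):
    """Decode a payload, fetching the writer schema from the schema store at most once."""
    schema_id = envelope["schema_id"]
    if schema_id not in _schema_cache:
        _schema_cache[schema_id] = schematizer_client.get_schema(schema_id)  # hypothetical client call
    writer_schema = _schema_cache[schema_id]
    return schemaless_reader(io.BytesIO(envelope["payload"]), writer_schema, reader_schema)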

High Level Architecture

If we standardize the transport and formatting of data, we can build universal applications that don’t care about the data itself.

Data Pipeline Source/Target Architecture

Messages generated by our logging system are treated exactly the same as messages generated from database replication or from a service event. Circling back to Metcalfe’s Law, this architecture increases the value of Yelp’s streaming data infrastructure so that it scales quadratically with the number of universal consumer or producer applications that we build, yielding strong network effects. Concretely, as a service author, it means that if you publish an event today, you can ingest that event into Amazon Redshift and our data lake, index it for search, cache it in Cassandra, or send it to Salesforce or Marketo without writing any code. That same event can be consumed by any other service, or by any future application we build, without modification.

Yelp’s Real-Time Data Pipeline

The data pipeline’s high level architecture gives us a framework in which to build streaming applications. The remaining sections will discuss the core of Yelp’s real-time data pipeline, focusing on the invariants that the system provides, and the system-level properties that result. Following posts in the series will discuss specific applications in depth.

A Protocol for Communication

Yelp’s Real-Time Data Pipeline is, at its core, a communications protocol with some guarantees. In practice, it’s a set of Kafka topics, whose contents are regulated by our Schematizer service. The Schematizer service is responsible for registering and validating schemas, and assigning Kafka topics to those schemas. With these simple functions, we’re able to provide a set of powerful guarantees.

Guaranteed Format

All messages are guaranteed to be published with a pre-defined schema, and the schemas are guaranteed to be registered with the schema store. Data Pipeline producers and consumers deal with data at the schema level, and topics are abstracted away. Schema registration is idempotent, and registered schemas are immutable.

Schemas control format of topics

Any consumer, when first encountering data written with any arbitrary schema, can fetch that schema exactly once, and decode any data written with it.

Guaranteed Compatibility

One of the Schematizer’s core functions is assigning topics to schemas. In doing so, the Schematizer guarantees that if a consumer starts reading messages from a topic using any active schema assigned to that topic, it will be able to continue doing so forever, despite upstream schema changes. In other words, every active schema assigned to a topic is guaranteed to be compatible with every other active schema assigned to the same topic. Applications won’t break because of schema changes.

Consumers load schemas dynamically as they receive messages

At runtime applications will fetch schemas used to write data messages dynamically, as messages encoded with previously unseen schemas appear in the topic. A producer can change the data format it’s producing without any action from any downstream consumers. The consumers will automatically retrieve the new writer schemas, and continue decoding the data with the reader schema they’ve been using. Producer and consumer data evolution is decoupled.

Guaranteed Registration

Data producers and consumers are required to register whenever they produce or consume data with a schema. We know what teams and applications are producing and consuming data across Yelp, which schemas they’re using, and with what frequency.

This allows producers to coordinate breaking data changes with their consumers, and allows for automated alerting of consumers in the event of a data fault. Producers are given the tools that they need to coordinate incompatible schema changes in advance. Registration further enables the deprecation and inactivation of outdated schemas. We can detect when a schema no longer has producers, and can coordinate the migration of consumers to more recent active schema versions out-of-band. Registration simplifies the compatibility story, since we can artificially constrain the number of active schemas in a topic – compatible schema changes typically need to be compatible with only a single existing schema.

Guaranteed Documentation and Data Ownership

The Schematizer requires documentation on all schema fields, and requires that all schemas assign a team that owns the data. Any schemas without this information will fail validation. That documentation and ownership information is then exposed through a web interface called Watson, where additional wiki-like documentation and comments can be added.

Watson Business Documentation

In many cases, we’ve extended this capability to systems that generate messages and schemas automatically. For example, schemas derived from database tables are documented by extracting docstring and ownership information from the corresponding models in our codebase. Automated tests prevent adding new data models without documentation and owners, or modifying existing data models without adding documentation.

Watson enables users to publicly ask the data owners questions, and to browse and contact data producers and consumers. The Schematizer has the concept of data sources and data targets, where it can track, for example, that a schema originates from a MySQL database table, and the data is streamed into a Redshift table. It’s able to produce documentation views dynamically for these data sources and targets. Effectively, adding documentation to code automatically documents Redshift tables, MySQL tables, and Kafka topics.

Guaranteed Data Availability

As mentioned above, one of the major issues with data transfer between services is dealing efficiently with bulk data. Using Kafka log compaction and keyed messages, we’re able to guarantee that the most recent message for each key is retained.

This guarantee is particularly useful in the context of database change capture. When materializing a table from a topic containing captured database changes, it guarantees that if a consumer replays the topic from the beginning and catches up to real time, it will have a complete view of the current state of the data in that table. The same system that provides differential updates can thus be used to reconstruct a complete snapshot. All Aboard the Databus! describes the utility of streaming database change capture, which is effectively a single universal producer application in our unified infrastructure.

source: https://engineeringblog.yelp.com/2017/11/breaking-down-the-monolith-with-aws-step-functions.html

Posted in Information Technology

Breaking down the monolith with AWS Step Functions

As we’ve discussed in earlier blog posts, Yelp Engineering has been working hard to break down our largest monolithic code base (yelp-main) for the past few years. We’ve made great progress but some of our oldest, most critical code remains within yelp-main. A great example of an older, more established system is our monthly subscription billing cycle. The system is core to how Yelp collects revenue and has proven technically challenging and risky to transition.

The Revenue engineering team knows these older systems should be moved into services, but the challenge of extracting tangled, business-critical code has proven expensive and dangerous. Luckily a new framework was announced by Amazon Web Services at the end of 2016, AWS Step Functions, that’s allowed us to make this transition a reality. This post covers how we’re leveraging Step Functions to achieve escape velocity from yelp-main, better represent our business processes, and build a more reliable, observable system along the way.

A subscription billing primer

To set the stage a bit, here’s a look at what made this process so technically challenging. The subscription billing process lies at the center of a large nightly chain of batches. This pipeline takes hours to run each night and its ownership spans several teams at Yelp. Subscription billing consists of three conceptual jobs in the center of this pipeline:

  1. Billing accounts (how much does each account owe?)
  2. Invoicing (rolling up these line items into a single bill)
  3. Collections (actually collecting payment for the invoices)

Each of these three steps runs over all relevant payment accounts before the next step proceeds. This fact has two implications for the stability and performance of these jobs:

  • Each of these steps is doing significant work, so billing across more than 100k accounts takes hours to run in the best case.
  • Since these steps operate over all accounts at once, one broken or slow account can block the entire rest of the pipeline. Making sure these steps complete cleanly and correctly is incredibly important!

As the number of advertisers at Yelp has grown, it has been a challenge to keep this process scaling successfully. We’ve introduced concurrency frameworks into our batch processing libraries, been very conservative when changing the code, and spent a ton of time maintaining the status quo.

A few key limitations have kept us from moving this code into services:

  • These processes did not have clear APIs in our monolith. They were invoked by the daily billing pipeline and ran over all accounts in semi-parallel fashion.
  • The data backing these various processing steps is entrenched in yelp-main. Moving the backing data out felt impossibly expensive, but if the data was left in yelp-main it wasn’t clear what value a service would provide.
  • Some steps – like marking invoices as paid – currently leverage very stringent ACID guarantees (via MySQL transactions) to ensure our ledgers are consistent with payments we’ve collected from advertisers. Moving to a service would require devising an alternative way to maintain the same consistency guarantees.

These collective challenges prevented extraction of subscription billing. We saw no cost-effective way for a small team to decouple the system without unacceptable consequences. Luckily for us, a solution was right around the corner.

AWS Step Functions steps into the breach

AWS Step Functions released to the public in December 2016, offering state machine-based workflow coordination as a service. It implements basic primitives for you, including retries, branching, and timeouts. Step Functions delegates tasks to your code for more complex side effects, like writing to a database, or calling an internal service. These are called “activity tasks” when run from your own servers, instances, or containers. Step Functions can also dispatch tasks to AWS Lambda functions.
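An activity task worker is essentially a poll loop against the Step Functions API. The sketch below is our own illustration using boto3; the activity ARN and the do_billing_work function are placeholders, not part of any real workflow.

import boto3

sfn = boto3.client("stepfunctions")
ACTIVITY_ARN = "arn:aws:states:us-west-2:123456789012:activity:bill-account"  # placeholder

def run_worker():
    while True:
        # Long-polls for work; returns an empty taskToken if nothing is pending.
        task = sfn.get_activity_task(activityArn=ACTIVITY_ARN, workerName="billing-worker-1")
        if not task.get("taskToken"):
            continue
        try:
            # do_billing_work is a placeholder for your side-effecting code;
            # it should return a JSON string to hand back to the state machine.
            result = do_billing_work(task["input"])
            sfn.send_task_success(taskToken=task["taskToken"], output=result)
        except Exception as exc:
            sfn.send_task_failure(taskToken=task["taskToken"],
                                  error="BillingError", cause=str(exc))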

A sample workflow from the Step Functions UI

The flexibility of the state machine description language makes it applicable for a variety of use cases, but the core appeal for us came down to a few important features:

  • There are few limitations on how long an activity task can run. If you aren’t ready to break up your code into many bite-sized activity tasks, you can run a few, very large activity tasks
  • Activity tasks are codebase agnostic. This means your workflow can seamlessly coordinate multiple activity tasks that live across services.
  • Retries and timeouts allow you to flexibly ensure individual activity tasks are robust and complete successfully.
  • Concurrent executions can be run in parallel at significant scale. Up to a maximum of 1 million executions can be run at once.

Step Functions has a lot of potential as a framework that can support monolithic code that wants to act in a service-like way. It seemed like a great match for our workflow-like subscription billing process, so we launched an initial project to integrate it.

Our first pass – bookkeeping on Step Functions

The diagram of the subscription billing process showed three fundamental steps: bookkeeping, invoicing, and collections. For our first stab at this project, we decided to tackle moving the first step of this process behind a Step Functions workflow. This offered a relatively well-scoped amount of work and let us work on the interface of this process without scoping in the task of migrating invoicing and collections.

Our very first step was to choose the workflow’s interface. The old batch code looked roughly like this:

def runner(all_accounts):
    # Split the full set of accounts into chunks and hand each chunk to a worker process.
    for chunk in all_accounts:
        dispatch_work(chunk)

def worker(chunk):
    # Each worker bills its chunk of accounts serially.
    for account in chunk:
        bill_account(account)

Our batch parallelism framework divided the set of all accounts into sections, and dispatched each chunk of accounts to a different worker process. We wanted to take advantage of the significant parallelism available in Step Functions, so we determined that a single workflow should perform subscription billing for a single account. To avoid a regression from our legacy system’s end-to-end performance, we could simply fire off many of these workflows concurrently.
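Kicking off one workflow per account then becomes a single API call per account. Here is a rough sketch with boto3; the state machine ARN and input shape are placeholders, not Yelp’s actual interface.

import json
import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = "arn:aws:states:us-west-2:123456789012:stateMachine:subscription-billing"  # placeholder

def start_billing(account_ids, billing_date):
    for account_id in account_ids:
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            # Execution names must be unique, which also guards against accidental re-runs.
            name="bill-{}-{}".format(account_id, billing_date),
            input=json.dumps({"account_id": account_id, "billing_date": billing_date}),
        )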

Concurrency variation over time in the old framework (note the degradation over time)

Concurrency staying stable over time with Step Functions

Once the interface was determined, we built an API for billing a single account which only required an account ID and a date for billing (in case we needed to re-run past days workflows). Now note the simplicity and beauty of this: we just made running subscription billing feel like talking to a service, even though we have yet to migrate the actual behavior out of our monolith. This isn’t quite as good as actual service migration, but it’s a big step in the right direction for very little effort! We had originally considered just wrapping the yelp-main function in a very small monolith-based API, but using Step Functions let us keep all API management code cleanly separated, which would prove essential as the workflow evolved.

Finally, we needed to implement our workflow’s state machine. We started by keeping it simple: one large activity matching our old monolithic functionality! We knew this probably wouldn’t be a permanent solution, but it made migration incredibly straightforward and let us quickly and easily establish a baseline implementation for our Step Functions workflow.

An initial (and very simple!) billing workflow

Polishing for observability and performance

We tested this in production and saw the pieces working together just as intended. We kept a careful eye on two historic pain points for subscription billing: making sure any issues in the pipeline were highly visible to on-call engineers, and keeping the pipeline highly performant.

If there were any transient issues while billing an account, the retries we built into our workflow would simply re-run the bill_account activity. If an account had fundamental issues that caused billing to fail repeatedly, we’d eventually exhaust our retries or timeouts and Step Functions would mark the whole execution as failed. This execution failure was so important to us that we added explicit states in the workflow to represent success and failure. These activities were solely there to push the success/failure fact into Yelp monitoring systems (like SignalFX) along with basic identifying information like account ID and execution ARN.
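Expressed in Amazon States Language, a workflow of that shape looks roughly like the sketch below; the state names, retry settings and resource ARNs are illustrative placeholders, not Yelp’s actual definition.

import json

billing_workflow = {
    "StartAt": "BillAccount",
    "States": {
        "BillAccount": {
            "Type": "Task",
            "Resource": "arn:aws:states:us-west-2:123456789012:activity:bill-account",
            # Transient failures are retried with backoff before giving up.
            "Retry": [{"ErrorEquals": ["States.ALL"],
                       "IntervalSeconds": 60, "MaxAttempts": 3, "BackoffRate": 2.0}],
            # Once retries are exhausted, route to an explicit failure-reporting state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ReportFailure"}],
            "Next": "ReportSuccess",
        },
        "ReportSuccess": {
            "Type": "Task",
            "Resource": "arn:aws:states:us-west-2:123456789012:activity:report-success",
            "End": True,
        },
        "ReportFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:us-west-2:123456789012:activity:report-failure",
            "Next": "BillingFailed",
        },
        "BillingFailed": {"Type": "Fail", "Error": "BillingFailed",
                          "Cause": "Retries exhausted while billing the account"},
    },
}

definition = json.dumps(billing_workflow)  # passed to create_state_machine / update_state_machine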

Billing with decoupled error handling

This set-up ensures that one-off errors are cleanly handled by retries but also ensures that on-call engineers stay aware of any issues that cause billing to fail systematically. That increased awareness also means if one account cannot be billed, we can continue billing other accounts in parallel while our on-call engineers are notified — no more blocking the whole pipeline for a single bad account!

We also saw performance wins. Execution concurrency worked well, even outperforming our own previous batch-based parallelism by avoiding some previous limitations. We further increased this advantage by revisiting our workflow design and breaking bill_account into a few parallel steps for different types of subscription products. It was the work of a couple hours to design a new workflow where these steps were done in parallel, and the associated changes to the activities were quite easy.

Billing with parallel bookkeeping tasks

The result was a faster billing process with nearly no extra engineering effort. The role of Step Functions as a highly-scalable coordinator of these complex workflows worked very nicely. We proved our hypothesis that workflows would be easy to refactor down the line.

Results and future plans

We have installed this new subscription billing process alongside the old one and are rolling over all accounts to be processed by the Step Functions workflow. So far, we’ve found it to be very stable and capable of significant parallelism (after some adjustment of default API limits). We have gained a clear API to bill a single account, and the whole process is more resilient and observable.

To recap, the development process consisted of the following steps:

  • Start with a single “bill account” function called from our monolithic batch process.
  • Wrap a Step Functions API around the “bill account” function. Trigger concurrent executions for improved performance.
  • Extract retries and failure handling from “bill account”. Move these into their own activities in Step Functions and build high quality metrics watching how often they are executed.
  • Use this improved observability to make even more fundamental changes to the workflow (like breaking out parallel tasks). Functions get simpler and more decoupled while the overall workflow gets faster and easier to understand.
  • Rinse and repeat until satisfied with the workflow’s design

Looking forward, we aim to continue incorporating invoice and collection steps into this workflow. Each of these is even more complicated internally, and we aim to simplify the number of dependent systems by leveraging the retries, timeouts, and parallelism built right into Step Functions. Look for us at re:Invent 2017, speaking in breakout session CMP319 on this project alongside the Step Functions team.

 

source:

https://engineeringblog.yelp.com/2017/11/breaking-down-the-monolith-with-aws-step-functions.html

Posted in Information Technology

AWS Step Function vs AWS Simple Workflow (SWF)

Asynchronous tasks are essential to most applications, but it can be a challenge to maintain state with these processes. There are two native options to address this challenge on AWS, though, for most workloads, one is superior to the other.

A developer conducts these types of multistep tasks for regular data updates, to process incoming events or to implement business functions in a certain sequence. Once a workflow starts, the developer has to keep track of several elements: the status of the steps involved, the conditions that trigger a particular path within the workflow, as well as error handling, retry logic and scale.

For AWS workflows, developers can use Step Functions or Simple Workflow Service (SWF) to manage all of the complexities. Let’s compare the capabilities of both to see why Step Functions is likely the better fit for most workloads.

Step Functions vs. SWF

Step Functions coordinates multiple AWS Lambda functions or other AWS resources based on user-defined workflow steps. It keeps track of workflow state in the cloud, as well as flow conditions based on inputs and outputs from each step. Step Functions also helps define error handling, parallel or sequential branching, schedules and retry behavior for all of the workflow steps.

Step Functions is a managed service, so users don’t have to deploy or maintain any infrastructure for either the workflow management or the tasks themselves.

SWF also manages workflow state in the cloud. However, unlike Step Functions, a user has to manage the infrastructure that runs the workflow logic and tasks. Tasks can run inside EC2 instances or on any server, including on premises and on other clouds.

The table below highlights some key differences between Step Functions and SWF for AWS workflows:

Step Functions vs. Simple Workflow Service

Most developers will find the Step Functions learning curve to be reasonable. It’s easy to define and operate AWS workflows, and the growing list of integrations with other native services expands its use cases, such as long-running AWS workflows, compute-intensive tasks and hybrid cloud deployments.

An organization could potentially opt for SWF as part of a hybrid architecture or if it didn’t want to expose internal application components over an HTTPS endpoint, which Step Functions requires to integrate with custom code. But even with those considerations, Step Functions is often the better choice because of lower costs and lack of management requirements.

It also appears that AWS is no longer as focused on SWF, which debuted nearly five years before Step Functions. The Flow Framework recently stopped active development for Ruby, which leaves Java as the only supported framework, and there haven’t been any notable upgrades to SWF in the past three years.

source:

https://engineeringblog.yelp.com/2017/11/breaking-down-the-monolith-with-aws-step-functions.html
