“Just-in-Time means making only what is needed, when it is needed, and in the amount needed […] it is necessary to create a detailed production plan […] to eliminate waste, inconsistencies, and unreasonable requirements, resulting in improved productivity.” – Just in Time, Philosophy of Complete Elimination of Waste by Toyota.
“Each unpredictable feature demanded by customers is considered an opportunity […] this requires rapid adjustment of production capability. Dynamic and flexible network utilizations in functional modules can maximize the strength of each resource and the overall risk and costs are reduced.” – Flexible Manufacturing System for Mass Customization Manufacturing by Guixiu Qiao, Roberto Lu and Charles McLean.
“Providing capacity in a more expedient fashion allows us to deploy a functioning and consumable business service more quickly […] at the core of our self-service functionality is a hosting automation […] On-demand self-service is a critical aspect of our cloud environment; however, without underlying business logic, controls, and transparency, an unconstrained on-demand enterprise private cloud will quickly exceed its capacity by doling out allocations beyond its supply.” – Implementing On-Demand Services by Intel.
“Elasticity is commonly understood as the ability of a system to automatically provision and deprovision computing resources on demand as workloads change […] in a way that the end-user does not experience any performance variability.” – Elasticity in Cloud Computing: What It Is, and What It Is not by Nikolas Roman Herbst, Samuel Kounev and Ralf Reussner.
These past few months I’ve followed a few discussions on virtualization and scalability.
There is such a thing as becoming a victim of success when pent up demand strikes and a business fails to scale accordingly.
Capacity management has typically prompted over-engineering decisions and long lead times taking a year or more in the telecoms industry. This can result in concerns about delayed breakeven points, underutilizing precious resources as well as limited offerings due to the higher cost of oversubscribing.
Lean means staying nimble at any size, streamlining and keeping lead times as short as possible by design. Effective and efficient capacity management relies on understanding economies of scale and scope. The first relates to achieving larger scales triggering more efficient utilization levels and, therefore, lower and more competitive average costs.
Scope means taking advantage of synergies and common infrastructure and platforms to deliver a variety of services, application multi-tenancy being an example in NFV’s (Network Functions Virtualization’s) context.
Active portfolio management follows: complementary application lifecycles can share resources and raise overall utilization levels in the process. Moreover, some applications can be deconstructed and modularized so that specific subsets become standalone services available to (or reused by) other applications. These can be decoupled to join a common pool and scale independently.
In some discussions we refer to growth models where “scale” follows a “vertical” approach while “scope” adds breadth with new functions and is, therefore, a horizontal expansion model. This breakdown allows for plotting and segmenting growth/de-growth scenarios in a simple matrix. I am experimenting with new ways of helping visualize these concepts. This is work in progress and the final result will look different from the early drafts posted here, though I think they can be used for the time being.
One other thought… elasticity relates to following demand curves: supply meets demand by dynamically adapting capacity. This entails provisioning, deprovisioning and a virtuous circle by means of gracefully tearing down resources, which are freed up and exposed for other applications to leverage. Elastic computing seems to make us think of unlimited just-in-time capacity, but there are upper and lower boundaries involving diminishing returns. It just so happens that virtualization has pushed the envelope by considerably widening and shifting these constraints.
It is worth reflecting on Gordon Moore’s law in this context: many incremental and disruptive innovations yield exponential performance improvements in today’s cloud age. That can be coupled with NFV’s (Network Functions Virtualization’s) shift from lengthy lead times, cumbersome operations and costly dedicated hardware to automated systems working with a wide supply of more affordable COTS (Commercial Off The Shelf) hardware and open source solutions.
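A minimal sketch of capacity following a demand curve within boundaries. The function name, the demand figures, the per-instance capacity and the floor and ceiling values are all assumptions made up for illustration; none of them come from a real system.

```python
import math


def target_instances(demand, capacity_per_instance, min_instances=1, max_instances=10):
    """Instance count that follows demand, clamped to a floor and a ceiling."""
    needed = math.ceil(demand / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))


# Hypothetical demand curve, in arbitrary load units per time slot.
demand_curve = [50, 400, 1200, 5000, 300, 0]
plan = [target_instances(d, capacity_per_instance=200) for d in demand_curve]
print(plan)
```

Capacity tracks demand in the middle of the curve, stops at the ceiling when demand spikes, and never drops below the floor even when demand is zero.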
Let’s now focus on the notion of service decomposition and how that impacts scaling.
This exercise often starts with deconstructing monolithic systems typically relying on vertically integrated architectures, then looking at the actual services involved, dependencies, flows… and figuring out what is best to keep integrated vs. modularized, centralized vs. distributed.
This also entails looking at opportunities to streamline development, such as code reuse and processes worth exposing by means of APIs (Application Programming Interfaces). Note that many applications do not need to duplicate assets and can become distributed systems consuming resources and processes running elsewhere.
In this section’s graphic, the application is a VNF (Virtual Network Function) which has been decomposed and right-sized to run in three different VMs (Virtual Machines) of different volumes instead of procuring a single physical server for just this application.
Lighter gray blocks at the back end present a pool of services available to that and other applications. As an example, when decoupling an application’s logic from the app’s data we get to leverage DaaS (Database as a Service) as one of the shared services.
These are the “scaling” terms provided by ETSI (European Telecommunications Standards Institute) NFV reference documents:
- Scaling up: extending a resource (compute, memory, storage) to a given VM.
- Scaling down: decreasing resource allocation.
- Scaling out: creating a new instance, adding VMs.
- Scaling in: removing VMs.
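The four terms above can be sketched as operations on a toy model of a VNF. This is only an illustration: the `VM` and `VNF` classes and their method names are hypothetical and do not correspond to any ETSI-defined interface or product API.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    vcpus: int
    memory_gb: int


@dataclass
class VNF:
    vms: list = field(default_factory=list)

    def scale_up(self, index, extra_vcpus=0, extra_memory_gb=0):
        """Scaling up: extend the resources of one existing VM."""
        vm = self.vms[index]
        vm.vcpus += extra_vcpus
        vm.memory_gb += extra_memory_gb

    def scale_down(self, index, fewer_vcpus=0, less_memory_gb=0):
        """Scaling down: decrease the allocation of one VM (never below 1)."""
        vm = self.vms[index]
        vm.vcpus = max(1, vm.vcpus - fewer_vcpus)
        vm.memory_gb = max(1, vm.memory_gb - less_memory_gb)

    def scale_out(self, vcpus, memory_gb):
        """Scaling out: add a new VM instance."""
        self.vms.append(VM(vcpus, memory_gb))

    def scale_in(self):
        """Scaling in: remove a VM, always keeping at least one."""
        if len(self.vms) > 1:
            self.vms.pop()


vnf = VNF([VM(vcpus=2, memory_gb=4)])
vnf.scale_up(0, extra_vcpus=2)    # one VM, now 4 vCPUs
vnf.scale_out(vcpus=2, memory_gb=4)  # two VMs
vnf.scale_down(0, fewer_vcpus=1)  # first VM back to 3 vCPUs
vnf.scale_in()                    # one VM again
```

Up and down change the size of an existing VM, while out and in change the number of VMs.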
Circling back with service decomposition: there are scaling scenarios where there is no need to go through the trouble of scaling out an entire application, but just a specific service at stake, such as one of the VMs or the database in the previous example.
In some other scenarios scaling can prompt application updates and/or upgrades to enable new functionality. Suitable “upgrade windows” can be hard to find when multiple services are in demand and expected to remain always on. A stateless architecture means that the session’s state is kept outside of the application, in the shared database in this example. Traffic can be redirected to an application’s mated pair: a second instance which was kept in active standby mode until the maintenance event.
This also means going beyond 1+1 models where everything is duplicated (mated pair concept) for failover sake. There often are more efficient n+k systems in HA (High Availability) environments. Note that, paradoxically enough, rolling out upgrades happens to be a primary source of maintenance issues thereafter, adding to the need for sustaining service continuity at all times coupled with zero touch and zero downtime.
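A back-of-the-envelope comparison shows why n+k can be more efficient than 1+1. The figures below (8 active instances, 2 shared spares) are assumptions chosen for illustration only.

```python
def instances_one_plus_one(n_active):
    """1+1: every active instance has its own dedicated standby."""
    return 2 * n_active


def instances_n_plus_k(n_active, k_spares):
    """n+k: n active instances share a pool of k spares."""
    return n_active + k_spares


n = 8  # hypothetical number of active instances
full_duplication = instances_one_plus_one(n)
pooled = instances_n_plus_k(n, k_spares=2)
saving = 1 - pooled / full_duplication
print(full_duplication, pooled, saving)
```

The trade-off is that a pool of k spares only covers up to k simultaneous failures, so the right value of k depends on the availability target.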
Zero touch is delivered by automation, which relies on continuous system monitoring, engineering triggers and preceding work with recipes, templates and/or playbooks (these are alternative terms based on different technologies) detailing what needs to happen to execute a lifecycle event. Scaling is the subject of this post; onboarding, backup, healing and termination are other lifecycle events, just to name a few more.
Programmability drives flexible automation, which is data driven and based on analytics. Predictive analytics goes a step further to project and address trends so that actions can be taken in advance. In our Lean NFV Ops demonstration we purposely stimulate network traffic with a load generator to exemplify this. We run scenarios illustrating both (a) fully automated scaling and (b) autonomation by switching to manual controls that put the operations team in charge at every step.
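The difference between the two modes can be sketched in a few lines. This is not how the demo system is implemented: the linear extrapolation stands in for predictive analytics, and the function names, thresholds and `approve` callback are hypothetical.

```python
def projected_load(samples, steps_ahead=3):
    """Naive forecast: extend the average step between samples (needs >= 1 sample)."""
    if len(samples) < 2:
        return samples[-1]
    slope = (samples[-1] - samples[0]) / (len(samples) - 1)
    return samples[-1] + slope * steps_ahead


def decide(samples, capacity, mode="automated", approve=None):
    """Return the action to take before the forecast load exceeds capacity."""
    forecast = projected_load(samples)
    if forecast <= capacity:
        return "no_action"
    if mode == "automated":
        return "scale_out"  # (a) fully automated: act right away
    # (b) autonomation: the operations team confirms or holds the action
    return "scale_out" if approve(forecast) else "hold"


traffic = [100, 150, 200, 250]  # hypothetical load samples
print(decide(traffic, capacity=300))
print(decide(traffic, capacity=300, mode="autonomation", approve=lambda f: False))
```

In both modes the trigger is the projected trend rather than the current reading, which is what lets action be taken in advance.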
Autonomic computing is powered by machine learning. Research on NFV autonomics points to the ability to self-configure, especially under unplanned conditions. Looking into automation and distribution modes helps define maturity levels for NFV, that being a topic for another article.
Let’s zoom out to discuss scaling in the context of the platform.
ETSI NFV defines MANO as the Management and Orchestration system. “Managing” refers to addressing the application’s lifecycle needs, scaling being one of them. The notion of “orchestrating” focuses on the underlying resources to be consumed.
The MANO layer is thought out as NFV’s Innovation Platform, which I show in purple color: the thickness of that layer conveys the degree to which an application uses more (right) or less (left) of MANO’s capabilities. This is an application multi-tenant environment where VNF1 shows a monolithic app example in contrast to VNFn which is meant to take full advantage of MANO’s automation.
This cross-section shows a horizontal architecture as the platform supports multiple applications as well as back end systems. Horizontal and vertical solutions scale differently. A common platform presents à la carte features and starts small, growing and scaling to enable homogeneous end to end management across the applications, while the monolithic approach moves forward with siloed operations on an application by application basis.
One more example, growing by adding interdependent services is a discouraging endeavor when reconfiguring multiple functions becomes overwhelming. SFC (Service Function Chaining) comes to the rescue in a virtual environment by providing network programmability and dynamic automation to create networks connecting new services. NFV’s scaling needs make a good case for SDN (Software Defined Networking), the technology behind SFC.
Now moving to what’s under the hood.
NFVI stands for Network Functions Virtualization Infrastructure. Most typically, what we can see and touch is a data center environment providing resources consumed by the applications such as compute, memory, storage and networking to begin with.
The visual in this section shows a conceptual server farm right under the platform. Blue nodes on the left and brown ones on the right are physically placed at different geographic locations, yet form part of the same NFVI orchestrated by MANO. The gray one is being added: scaling out of the existing infrastructure. The green node lies outside and can be leveraged when bursting:
- Scaling out: adding more servers (gray cube).
- Scaling up: leveraging clusters and/or distributed computing to share the load (blue and brown cubes).
- Bursting: tapping into third party infrastructure to address capacity spikes (green cube).
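Bursting in particular can be sketched as a simple overflow rule: serve demand from the operator’s own infrastructure first and tap third party capacity only for the excess. The function and the numbers are hypothetical, in arbitrary capacity units.

```python
def place(demand, own_capacity, burst_capacity):
    """Split demand between own infrastructure, third party burst and unserved."""
    local = min(demand, own_capacity)
    burst = min(demand - local, burst_capacity)
    unserved = demand - local - burst
    return {"local": local, "burst": burst, "unserved": unserved}


print(place(demand=120, own_capacity=100, burst_capacity=50))
print(place(demand=200, own_capacity=100, burst_capacity=50))
```

When the unserved figure stays above zero, bursting alone is not enough and scaling out the infrastructure itself becomes the option to consider.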
Note that, in this context, scaling up can also mean upgrading servers to handle larger workloads. This can also be about using an existing chassis while replacing a server with a new node featuring more processing, data acceleration, lower energy needs, etc.
Early on we talked about COTS being easier to scale out when compared to proprietary dedicated hardware. It has partly to do with standardization, centralized management and consolidation, the existing supply chain for x86 systems and node automation.
We can also factor consumption based models where a given application’s business case is not impacted by up-front CAPEX (Capital Expenditures). Instead, the application business case accounts for resource usage levels which, once again, benefits from economies of scale and scope. The notion of elasticity makes infrastructure planning transparent to the application.
Capacity and performance management skills remain of the essence: the move to applications based on stateless architectures means that scaling distributed applications places a greater emphasis on API behavior by addressing capacity and speed in terms of RPS (Requests Per Second). And, nonetheless, the telecommunications industry is known to require high capacity, low latency SFC, which is driving data plane acceleration solutions.
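As a worked example of sizing in terms of RPS, the sketch below estimates how many instances of a stateless service are needed for a given peak. The formula is a common rule of thumb rather than anything prescribed by NFV specifications, and the figures (9,000 RPS peak, 1,200 RPS per instance, 75% target utilization) are assumptions.

```python
import math


def instances_for_rps(peak_rps, rps_per_instance, target_utilization=0.75):
    """Instances needed so that each one runs at or below the target utilization."""
    usable_rps = rps_per_instance * target_utilization
    return math.ceil(peak_rps / usable_rps)


print(instances_for_rps(peak_rps=9000, rps_per_instance=1200))
print(instances_for_rps(peak_rps=9500, rps_per_instance=1200))
```

Leaving headroom below 100% utilization matters because latency typically degrades before an instance is fully saturated.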
We can now zoom out.
Scaling is not a new thing or need. Conventional architectures can scale; they just don’t do it fast or effectively enough in a cost-effective fashion. Taking months and years to get the job done risks missing markets and taxing resources which would have been needed to create innovative services.
Admittedly, one of the objectives behind writing this was wrestling with jargon by outlining “scaling” terms in context, whether related to application, platform or infrastructure. Hopefully, that goal was accomplished. Otherwise, please let me know.
One other thought… NFV is a change agent. Hence, cool technical wizardry alone does not suffice. We are discussing emerging technologies causing interest in connecting dots across behavioral economics (and not just the business case) and organizational cultures and decision making in the telecoms sector. Understanding the human factor matters.
As usual, I will be glad to continue the conversation by exchanging emails, over LinkedIn or in person if you happen to be around at IDF15, Intel Developer Forum, in San Francisco’s Moscone Center on August 18-20.
“The automatic telephone switchboard was introduced in 1892 along with dial telephones. By 1929, 31.9% of the Bell system was automatic. Automatic telephone switching originally used vacuum tube amplifiers and electro-mechanical switches, which consumed a large amount of electricity. Call volume eventually grew so fast that it was feared the telephone system would consume all electricity production, prompting Bell Labs to begin research on the transistor. The logic performed by telephone switching relays was the inspiration for the digital computer.” – “Automation” by Wikipedia.
We kept extremely busy in Q1 to deliver the Lean NFV Ops demo at Mobile World Congress back in March. I am glad to share that the project’s success led to a hectic roadshow in Q2: our live demo system has been showcased at a number of industry and private events as well as in customer workshops worldwide.
Each conversation with network operators, partners, analysts and public officials has delivered a wealth of insights: most validating the project’s objectives while some challenging us to do even more to take things to the next level.
Q3 is about furthering the Lean NFV Ops conversation and we will soon make available a brief paper and a full length video sharing design principles. Stay tuned. First, though, I would like to start with a brief discussion on S2O (Self-Service Ops) given a recent batch of questions on what that entails.
This is just a quick note: all conversations regarding Lean NFV Ops involve data driven automation and the human factor. This is a live demonstration system that couples (a) flexible “automation” involving correlated metrics, predictive analytics, directories, policies and research findings on “autonomics” (machine learning) with (b) visibility and controls where “autonomation” engages human intelligence in terms of situational awareness, supervision, root cause analysis, programmability… and new skills involving workstyles and organizational behaviors. There you have it: managed to get “automation”, “autonomics” and “autonomation” in just one paragraph : )
S2O, this post’s focus subject, reflects the fact that a number of CSPs (Communication Service Providers) are developing B2B (Business to Business) markets by providing services to other network operators under the carrier’s carrier model, MVNOs (Mobile Virtual Network Operators) and enterprise verticals and customers of all sizes. However, we are also learning about lengthy, resource-consuming operations that trigger costlier services than planned and/or limited offerings constrained by what can effectively be managed under the current PMO (Present Mode of Operations).
Thinking of Network Functions Virtualization (NFV) means shifting to an FMO (Future Mode of Operations) based on cloud economics. More specifically, this means enabling business models such as Infrastructure and Platform as a Service (IaaS and PaaS) which are driven by self-service interactions.
This 10+ minute video shows the first version of the Lean NFV Ops demo where our emphasis was on communicating what NFV can deliver to CSPs’ in-house ops teams. The above graphic portrays the S2O use case where:
- B2B: A CSP is in business with several customers (other carriers, MVNOs, enterprises, public administration).
- XaaS: A given CSP’s customer works with the same toolset leveraged by the CSP’s own in-house ops team and benefits from the “X” (anything) as a Service model.
- DevOps: That CSP customer’s own IT team embraces self-service by deploying apps and creating service chains at multiple sites, scaling and reconfiguring systems as needed.
Left: Screen capture of the demo’s NFV Ops Center – S2O View. Right: Screen captures of support systems involved: Motive Dynamic Operations, CloudBand Management System, Nuage Networks, Bell Labs Analytics.
In a nutshell: a significant share of operations has been outsourced by the CSP to the business customer under the S2O use case. This is a mutually beneficial arrangement as follows:
- The CSP’s business customer is empowered to best conduct timely operations as they see fit.
- The CSP leverages automation to reap self-service efficiencies whether that involves in-house teams or those engaged by business customers themselves.
S2O prompts CX (Customer Experience) implications encompassing fulfillment and assurance, as well as consumption based pricing models, in a highly dynamic environment, which makes Lean NFV Ops’ end-to-end system engineering approach of the essence.
As usual, I will be happy to address your comments, exchange emails or trade messages over LinkedIn. Our team will be doing demos at IDF 2015 (Intel Developer Forum) in San Francisco on August 18-20 at Alcatel-Lucent’s booth. Hope to see you there : )
“This interactive demonstration shows the positive impact of agile service launch subject to Reliability, Availability, Serviceability (RAS) scenarios. It features an application centered system involving sophisticated Virtual Network Functions (VNF) and integrates Operations Support System (OSS), NFV’s Management and Orchestration (MANO) as well as Software Defined Networking (SDN) under a modular and scalable approach.”
“In addition to Alcatel-Lucent’s portfolio, which is represented by Motive Dynamic Operations (MDO), CloudBand Management Platform (CBMS) and Cloud Node, Nuage Networks, Virtual Evolved Packet Core (vEPC), Virtual IP Multimedia Subsystem (vIMS) our conversation illustrates Ecosystem examples involving third party partners, findings from Bell Labs Research and presents opportunities for following up with hands-on activities at the Cloud Innovation Center (CIC).”
00:00 – Hi, my name is Jose. We are going to discuss operations in the context of NFV, Network Functions Virtualization. We will do that for the purpose of delivering service agility because launching new applications in the marketplace should be as easy as getting them deployed with just one click.
00:30 – This is a real environment, this is not a proof of concept. These are products that are either available today or in production in 2015. Namely Motive Dynamic Operations (MDO), the OSS, Nuage Networks’ SDN (Software Defined Networking) framework, the CloudBand platform, which manages the lifecycle of the VNFs (Virtual Network Functions) as well as orchestrating the underlying cloud infrastructure. Last but not least, we will also discuss findings from Bell Labs’ research. To complete the environment that we are operating with today, you will see a fully virtualized RAN (Radio Access Network) as well as the mobile core with the vEPC (virtual Evolved Packet Core) and vIMS (virtual IP Multimedia Subsystem), all working together to deliver this VoLTE (Voice over Long Term Evolution) live video session.
01:20 – We are going to follow two basic principles in this demonstration. Principle number one: these are very sophisticated systems and we are bringing them together, therefore, there is no denying that we need to abstract out complexity to deliver simplicity, that way we can manage operations. Principle number two: no matter what we do in the background operationally speaking, the user experience, the video in this case, should continue to play completely unscathed. At the end of this demonstration we will review these two principles to check how we did.
01:50 – Deploying any application should be as easy as… and here is the virtualization catalog that we use in our labs at the Cloud Innovation Center, it should be as easy as selecting what I need and launching the application to the NFV Operations Center. The heavy lifting is actually performed by CloudBand, the MANO (Management and Orchestration) platform. It understands the application requirements, the lifecycle, and will make sure that things talk to the right components to spin up virtual machines and onboard the service.
02:20 – Moreover, now we need for traffic to flow through this new application, this new service. I am now talking to Nuage Networks’ SDN (Software Defined Networking) framework to get that going in a split second. So, I am now working on SFC (Service Function Chaining). And there you are.
02:45 – Now, let’s continue to test more things in the marketplace in real time. I am now delivering yet another application: a content filtering service. Maybe I should also deploy a WebRTC (Web Real Time Communications) server. And here it is. By the way, all the virtual machines in green color are carrying load this minute, the virtual machines shown in blue are on standby. These others are mated pairs for reliability so that we can work in HA, this is a High Availability environment. Moreover, virtual machines laid horizontally are services and products from third party partners also onboarded on the CloudBand platform.
03:25 – As you see, we need to do some more service chaining, and we are now working again with Nuage Networks’ SDN. I am going to do the chaining for this one application. Note that this is fully programmable, everything is fully automated.
03:40 – Let’s discuss what happens when a network operator becomes victim of success. That would be a situation where this video service becomes very popular because it works well. There is [unplanned] pent up demand with more subscribers using the service. Therefore traffic grows. Let’s simulate that kind of situation. These are load generators which I am going to work with to conduct a stress test. As you can see, traffic is ramping up already. The question now is, will we have enough capacity available to meet new demand? Things are not looking that good… but as we detect this trend thanks to Bell Labs analytics, the platform starts spinning up new virtual machines and onboarding necessary services so that we can get some relief. [As a result] now we are working with new subscribers without a glitch.
04:40 – The opposite is also true. Let’s say that there is no longer that much demand for this one service. There aren’t so many subscribers. Traffic is no longer flowing through our system at the same scale. Let’s simulate that. Traffic is going down this minute. The very same way we were scaling and creating more capacity before, we are now going to take down all of those added systems so that we can make the underlying resources available for the next batch of successful applications to utilize. As you see, the ones in red are continuously being monitored so that we can clean up and, once again, gracefully terminate those services.
05:20 – We can do all of these things because we are working in a data center environment. These are CloudBand’s Cloud Nodes. This is COTS (Commercial Off The Shelf) infrastructure, these are not dedicated servers. Therefore, we can continue to spin up new virtual machines and onboard applications. We can continue to reuse these resources [compute, memory, storage, networking] at very high utilization levels over and over.
05:50 – If you are successful, in addition to experiencing demand and coping with capacity… at some point you will be facing updates, upgrades… maintenance events. Let’s simulate that too. This is a RAS (Reliability, Availability, Serviceability) test. We could start by opening a maintenance window, the more applications we have, the harder it is to find those at the right time without disrupting the video experience, the user experience. We could trigger a network failure instead, some issue that impacts QoS (Quality of Service) or, perhaps, a cloud failure that could involve a corrupted virtual machine. Let’s cause that last one.
06:30 – The machine that has been compromised has been flagged [in red]. The load has already been placed on the mated pair. There was [service continuity] no disruption of any kind as far as the user experience is concerned. We have been able to do that thanks to smart placement combined with a distributed architecture. The data center that you see on the left, DC number one, is based at a central location where we have consolidated assets for the purpose of delivering cost efficiencies. [On the right] data center number two is at a distributed location closer to the network’s edge for performance sake instead.
07:10 – Everything that we have been discussing up to this point is available from Alcatel-Lucent’s portfolio in 2015. In the next few minutes, I will share with you research findings from Bell Labs projects. These relate to analytics for smart load placement and autonomics, that is machine learning for NFV.
07:30 – You were able to notice that as I moved the load to the other data center, the service was not disrupted but I lost HA (High Availability) [by operating in a simplex environment instead]. Now I need to look for the best placement for the new mated pair that will become my new backup should something happen to the virtual machine that’s carrying the load right now. The question is: where should I do that?
07:55 – Bell Labs’ recommendations engine is checking cloud requirements and conditions, it couples that with equivalent network requirements and conditions, it understands what any given application needs in the lifecycle. It reads the contract because it does not make sense for me to deploy something in a more expensive environment, which would defeat my business case and cloud economics. By the same token, I cannot deploy the load in an inferior environment, which would not meet the SLA (Service Level Agreement). Additional policies: these could be engineering events or any other kind of rules. This could be weather conditions because I wouldn’t like to move the load to a data center that is going to be compromised by terrible weather for that matter.
08:45 – If I like this recommendation which prompts me to move the load from “cloud one” to the “Barcelona data center” I could just click “accept” and move forward. What if there was a better option? I am going to ask the recommendations engine to present another option. In this other case it says that I should be moving the load to a different data center closer to my next destination, so that the service is provided closer to my location.
09:10 – In any case, at any given point of time, I should be able to do RCA (Root Cause Analysis). For that purpose we get to display fine grained, correlated analytics. We built a dynamic dashboard that we can always check to assess the current situation and do troubleshooting accordingly. The various metrics come from, and are fed by, the different solutions that you see represented in the smaller screens on each side of the NFV Ops Center. If this was a false alarm I would then click on “stand down” and nothing would be executed. The reality is that false alarms can happen. If I need to buy more time to get more data, to do further analysis, I would then click on “standby” instead.
10:10 – There is research on autonomics as I was sharing before. This means that the recommendations engine, time after time, learns from these behaviors and it becomes more predictive and, eventually, it gives you even better custom recommendations further optimizing system performance as well as any other kind of efficiencies.
10:30 – I am going to accept the recommendation that works best for me, which is the first one. In the background, what you would see are the very same things that we saw early on: virtual machines being spun up, applications being onboarded, networks being created… with all of that happening literally in just minutes. This is very different from PMO (Present Mode of Operations) where it takes filling out forms, scheduling meetings, talking to a lot of people. Then it takes maybe hours, if not days, perhaps, weeks before we get anything done. Here things are programmable, fully automated, and things happen in real time as you can see by means of this demonstration.
11:10 – We have also brought to you a single pane of glass to abstract out complexity. When drilling down, it pays to go to the UIs (User Interfaces) of the specific solutions. This [single pane of glass] is not an Alcatel-Lucent product, this is just illustrating a requirement from many of our customers who are asking for the APIs (Application Programming Interfaces) from these various solutions to build their own dashboards and their own screens.
11:30 – Well, this completes the demonstration. As I was saying early on: this is 100% real, this is no PoC (Proof of Concept). All of the products, with the exception of the Bell Labs research which we just discussed, are currently available or in production in 2015, this year. Thank you.
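As a footnote to the transcript, the placement logic described at 07:55 (check the contract’s cost limit, meet the SLA, apply policies such as weather) can be sketched as a filter followed by a ranking. This is not the Bell Labs recommendations engine: the `Site` fields, site names, limits and ranking rule are all hypothetical and only illustrate the kind of constraints mentioned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    cost_per_hour: float
    latency_ms: float
    weather_alert: bool = False


def recommend(sites, max_cost, max_latency_ms):
    """Keep sites that fit the contract, the SLA and the policies; cheapest first."""
    eligible = [
        s for s in sites
        if s.cost_per_hour <= max_cost       # contract: do not defeat the business case
        and s.latency_ms <= max_latency_ms   # SLA: no inferior environment
        and not s.weather_alert              # policy: avoid sites at risk
    ]
    return sorted(eligible, key=lambda s: (s.cost_per_hour, s.latency_ms))


sites = [
    Site("dc-a", cost_per_hour=1.0, latency_ms=40),
    Site("dc-b", cost_per_hour=0.8, latency_ms=25),
    Site("dc-c", cost_per_hour=0.5, latency_ms=90),
    Site("dc-d", cost_per_hour=0.6, latency_ms=20, weather_alert=True),
    Site("dc-e", cost_per_hour=2.5, latency_ms=10),
]
options = recommend(sites, max_cost=1.5, max_latency_ms=50)
print([s.name for s in options])
```

Returning an ordered list rather than a single answer mirrors the demo’s flow, where the operator can accept the first recommendation or ask for another option.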